tarekmasryo/road-accident-risk-ps5e10

Road accident risk regression (PS S5E10): LightGBM + residual XGBoost + NNLS blend for stable OOF RMSE.

🚦 Road Accident Risk: Residual-Boosted Risk Model

Playground Series S5E10 (Kaggle)


🎯 What is this?

Predict accident_risk for each road segment as a calibrated score in [0,1].

Goals:

  • Stable CV, not leaderboard luck
  • Interpretable signals (why is this road risky?)
  • Zero leakage

🧠 Modeling Pipeline (3 stages)

1. LightGBM (main learner)

  • Train LightGBM directly on accident_risk
  • Bag multiple random seeds → smoother OOF preds
  • Output: oof_lgb, pred_lgb
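The seed-bagging idea in Stage 1 can be sketched as follows. This is a minimal illustration, not the notebook's code: `bagged_oof` and the `LeastSquares` stand-in learner are hypothetical names, the data is synthetic, and a plain shuffled K-fold is used instead of the stratified split described later.

```python
import numpy as np


def bagged_oof(make_model, X, y, X_test, n_folds=5, seeds=(0, 1, 2)):
    """Average out-of-fold and test predictions over several seeds."""
    oof = np.zeros(len(y))
    pred = np.zeros(len(X_test))
    for seed in seeds:
        rng = np.random.default_rng(seed)
        # Plain shuffled K-fold; each seed produces a different split.
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        for val_idx in folds:
            train_mask = np.ones(len(y), dtype=bool)
            train_mask[val_idx] = False
            model = make_model(seed).fit(X[train_mask], y[train_mask])
            # Each row is scored only by models that never saw it.
            oof[val_idx] += model.predict(X[val_idx]) / len(seeds)
            pred += model.predict(X_test) / (len(seeds) * n_folds)
    return oof, pred


class LeastSquares:
    """Tiny stand-in learner so the sketch runs without LightGBM."""

    def __init__(self, seed=0):
        self.seed = seed  # unused; mirrors a seeded learner's constructor

    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        return np.column_stack([X, np.ones(len(X))]) @ self.coef_


# Synthetic data standing in for the competition features and target.
rng = np.random.default_rng(42)
X = rng.random((200, 3))
y = np.clip(0.2 + 0.5 * X[:, 0] + 0.05 * rng.standard_normal(200), 0, 1)
X_test = rng.random((50, 3))

oof_lgb, pred_lgb = bagged_oof(LeastSquares, X, y, X_test)
rmse = float(np.sqrt(np.mean((oof_lgb - y) ** 2)))
print(f"OOF RMSE: {rmse:.4f}")
```

With LightGBM installed, the stand-in could be swapped for something like `lambda seed: lightgbm.LGBMRegressor(random_state=seed)`; the hyperparameters the notebook actually uses are not stated here.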

2. XGBoost residual (prior-corrected)

  • Build an interpretable safety prior risk_prior ∈ [0,1] from:

    • high curvature
    • high speed limit
    • night lighting
    • bad weather
  • Train XGBoost on the residual:
    residual_target = accident_risk - risk_prior

  • At inference:
    pred = risk_prior + predicted_residual

  • Output: oof_xgb, pred_xgb

Why? Stage 2 is only learning what the simple prior missed.
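A rough sketch of the residual setup, under loud assumptions: the weights inside `build_risk_prior`, the 70 mph speed scale, the 0/1 encoding of night and weather, and the least-squares stand-in for XGBoost are all illustrative and not taken from the notebook. Only `residual_target = accident_risk - risk_prior` and `pred = risk_prior + predicted_residual` come from the description above.

```python
import numpy as np


def build_risk_prior(curvature, speed_limit, is_night, bad_weather):
    """Hand-written danger score in [0, 1]; the weights are illustrative only."""
    speed_scaled = np.clip(speed_limit / 70.0, 0.0, 1.0)  # assumed max of 70
    raw = (0.4 * curvature + 0.3 * speed_scaled
           + 0.15 * is_night + 0.15 * bad_weather)
    return np.clip(raw, 0.0, 1.0)


# Synthetic road segments (curvature assumed to lie in [0, 1]).
rng = np.random.default_rng(0)
n = 300
curvature = rng.random(n)
speed_limit = rng.choice([25, 35, 45, 60, 70], size=n).astype(float)
is_night = rng.integers(0, 2, size=n).astype(float)
bad_weather = rng.integers(0, 2, size=n).astype(float)
accident_risk = np.clip(
    0.1 + 0.5 * curvature * speed_limit / 70.0 + 0.1 * is_night
    + 0.03 * rng.standard_normal(n),
    0.0, 1.0,
)

risk_prior = build_risk_prior(curvature, speed_limit, is_night, bad_weather)
residual_target = accident_risk - risk_prior  # what the prior got wrong

# Stand-in residual learner (least squares) in place of XGBoost.
A = np.column_stack([curvature, speed_limit / 70.0, is_night, bad_weather,
                     np.ones(n)])
coef, *_ = np.linalg.lstsq(A, residual_target, rcond=None)
predicted_residual = A @ coef

# Inference: add the learned correction back onto the prior.
pred_xgb = risk_prior + predicted_residual

rmse_prior = float(np.sqrt(np.mean((risk_prior - accident_risk) ** 2)))
rmse_pred = float(np.sqrt(np.mean((pred_xgb - accident_risk) ** 2)))
print(f"prior only: {rmse_prior:.4f}  prior + residual: {rmse_pred:.4f}")
```

In the real pipeline the stand-in would be replaced by an XGBoost regressor (for example `xgboost.XGBRegressor`) fitted on `residual_target` inside the same CV folds, so that `oof_xgb` stays out-of-fold.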

3. NNLS blend (non-negative)

  • Fit Non-Negative Least Squares (NNLS) on [oof_lgb, oof_xgb]
  • Get blend weights ≥ 0 (no negative canceling)
  • Apply same weights to test preds
  • Clip final predictions to [0,1]
  • Output: final_test → submission.csv
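The blend step can be sketched with `scipy.optimize.nnls`. The arrays below are synthetic placeholders for the real OOF and test predictions, and the weights are left unnormalized because the description above does not say whether they are rescaled to sum to 1.

```python
import numpy as np
from scipy.optimize import nnls

# Synthetic stand-ins for the target and the two models' predictions.
rng = np.random.default_rng(0)
y = rng.random(500)
oof_lgb = y + 0.05 * rng.standard_normal(500)
oof_xgb = y + 0.08 * rng.standard_normal(500)
pred_lgb = rng.random(100)
pred_xgb = rng.random(100)

# Solve min ||A w - y||_2 subject to w >= 0.
A = np.column_stack([oof_lgb, oof_xgb])
weights, residual_norm = nnls(A, y)

# Apply the same weights to OOF and test predictions, then clip to [0, 1].
final_oof = np.clip(A @ weights, 0.0, 1.0)
final_test = np.clip(np.column_stack([pred_lgb, pred_xgb]) @ weights, 0.0, 1.0)


def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))


rmse_lgb, rmse_xgb, rmse_blend = rmse(oof_lgb, y), rmse(oof_xgb, y), rmse(final_oof, y)
print("weights:", weights)
print(f"lgb {rmse_lgb:.4f}  xgb {rmse_xgb:.4f}  blend {rmse_blend:.4f}")
```

Because both single-model solutions (weights of 1 and 0) are feasible for NNLS, the blended OOF error on the fitting data cannot be worse than either model alone; whether that carries over to the test set is what the CV setup is meant to check.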

Result:

  • Lower OOF RMSE
  • More consistent folds
  • Predictions always in a valid range

🔬 Features & CV

Feature engineering

  • curv_speed = curvature × speed_limit
  • acc_per_lane = num_reported_accidents / num_lanes
  • critical_zone = high curvature & high speed
  • risk_prior = human-readable baseline danger score
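As a sketch, the first three features could be built with pandas like this. The column names follow the formulas above; the `critical_zone` thresholds (curvature above 0.7, speed limit of at least 60) are assumptions chosen for illustration, and `num_lanes` is assumed to be at least 1.

```python
import pandas as pd


def add_features(df):
    """Return a copy of df with the engineered columns added."""
    out = df.copy()
    out["curv_speed"] = out["curvature"] * out["speed_limit"]
    out["acc_per_lane"] = out["num_reported_accidents"] / out["num_lanes"]
    # Thresholds are assumed; the README only says "high curvature & high speed".
    out["critical_zone"] = (
        (out["curvature"] > 0.7) & (out["speed_limit"] >= 60)
    ).astype(int)
    return out


# Three made-up road segments.
df = pd.DataFrame({
    "curvature": [0.2, 0.8, 0.9],
    "speed_limit": [35, 60, 45],
    "num_reported_accidents": [1, 4, 0],
    "num_lanes": [2, 4, 1],
})
features = add_features(df)
print(features[["curv_speed", "acc_per_lane", "critical_zone"]])
```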

Cross-validation

  • Stratified K-Fold on binned target quantiles
  • Keeps each fold balanced (safe vs dangerous segments)
  • All metrics are out-of-fold (OOF)
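One way to stratify a continuous target is to bin it into quantiles and pass the bin labels to scikit-learn's `StratifiedKFold`. The sketch below assumes 10 bins and 5 folds on a synthetic target; the notebook's actual settings are not given here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic skewed target in [0, 1] standing in for accident_risk.
rng = np.random.default_rng(0)
y = rng.beta(2, 5, size=1000)

n_bins, n_folds = 10, 5  # assumed values
# Quantile bins: each bin holds roughly the same number of rows.
bins = pd.qcut(y, q=n_bins, labels=False, duplicates="drop")

skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
fold_id = np.empty(len(y), dtype=int)
# Features are not needed for splitting, so a dummy array is passed as X.
for k, (_, val_idx) in enumerate(skf.split(np.zeros(len(y)), bins)):
    fold_id[val_idx] = k

fold_sizes = np.bincount(fold_id, minlength=n_folds)
fold_means = np.array([y[fold_id == k].mean() for k in range(n_folds)])
print("fold sizes:", fold_sizes)
print("fold target means:", fold_means.round(4))
```

Each fold then sees a similar mix of low-risk and high-risk segments, which is what keeps the per-fold RMSE comparable.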

📂 Output

The notebook will:

  1. Train Stage 1 → Stage 2 → Stage 3
  2. Blend predictions
  3. Write submission.csv under artifacts/ (Kaggle: /kaggle/working/artifacts/)

No external data. No test target leakage.
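A minimal sketch of the final write step, using only the standard library. The helper names and the `id` column are assumptions (the real header should match sample_submission.csv); the Kaggle branch simply checks whether /kaggle/working exists.

```python
import csv
import tempfile
from pathlib import Path


def artifacts_dir():
    """artifacts/ under /kaggle/working on Kaggle, else under the current dir."""
    kaggle_working = Path("/kaggle/working")
    base = kaggle_working if kaggle_working.is_dir() else Path(".")
    return base / "artifacts"


def write_submission(ids, preds, out_dir):
    """Write id/prediction pairs to <out_dir>/submission.csv, clipped to [0, 1]."""
    out_path = Path(out_dir) / "submission.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "accident_risk"])  # "id" is an assumed column name
        for row_id, p in zip(ids, preds):
            writer.writerow([row_id, min(max(float(p), 0.0), 1.0)])
    return out_path


# Demo in a temporary directory so nothing is written into the repo.
with tempfile.TemporaryDirectory() as tmp:
    path = write_submission([1, 2, 3], [-0.2, 0.5, 1.3], Path(tmp) / "artifacts")
    lines = path.read_text().splitlines()
print(lines)
```

In the notebook's setting the call would be closer to `write_submission(test_ids, final_test, artifacts_dir())`, where `test_ids` is a hypothetical name for the test set's identifier column.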


📁 Repo layout

.
├── road-accident-risk-ps5e10.ipynb
├── data/
│   └── raw/                     # (optional) place train/test CSVs here for local runs
├── artifacts/                   # saved outputs (e.g., submission.csv)
├── repo_utils/
│   └── pathing.py               # local data/raw + Kaggle /kaggle/input fallback
├── CASE_STUDY.md
├── requirements.txt
└── .gitignore

📦 Data loading (local + Kaggle)

The notebook resolves files in this order:

  1. DATA_PATH env var (full file path)
  2. Local data/raw/<filename>
  3. Kaggle /kaggle/input/<dataset>/<filename>

For local runs, place these files under data/raw/:

  • train.csv
  • test.csv
  • sample_submission.csv
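The resolution order above could be implemented roughly as follows. This is a sketch, not the contents of repo_utils/pathing.py: `resolve_data_file` is a hypothetical name, DATA_PATH is assumed to point at the single requested file, and the Kaggle dataset folder is found by scanning first-level folders because its name is not given here.

```python
import os
import tempfile
from pathlib import Path


def resolve_data_file(filename, local_dir="data/raw",
                      kaggle_root="/kaggle/input", env=None):
    """Return the first existing candidate, following the README's order."""
    env = os.environ if env is None else env

    # 1. DATA_PATH env var, treated as the full path to the requested file.
    env_path = env.get("DATA_PATH")
    if env_path and Path(env_path).is_file():
        return Path(env_path)

    # 2. Local data/raw/<filename>.
    local = Path(local_dir) / filename
    if local.is_file():
        return local

    # 3. Kaggle /kaggle/input/<dataset>/<filename>; every dataset folder is tried.
    root = Path(kaggle_root)
    if root.is_dir():
        for candidate in sorted(root.glob(f"*/{filename}")):
            if candidate.is_file():
                return candidate

    raise FileNotFoundError(
        f"{filename} not found via DATA_PATH, {local_dir}, or {kaggle_root}"
    )


# Demo in a temporary directory: local lookup first, then a DATA_PATH override.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "data" / "raw"
    raw_dir.mkdir(parents=True)
    (raw_dir / "train.csv").write_text("id,accident_risk\n")
    override = Path(tmp) / "elsewhere.csv"
    override.write_text("id,accident_risk\n")

    local_name = resolve_data_file("train.csv", local_dir=raw_dir, env={}).name
    env_name = resolve_data_file(
        "train.csv", local_dir=raw_dir, env={"DATA_PATH": str(override)}
    ).name
print(local_name, env_name)
```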

🚀 Run locally

python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt

Open road-accident-risk-ps5e10.ipynb and run top-to-bottom.


📤 Outputs

  • artifacts/submission.csv (Kaggle: /kaggle/working/artifacts/submission.csv)