Rohan-Bharatia/WHSDSC-2026
Source code for the 2026 Wharton High School Data Science Competition
WHSDSC-2026
WHL Analytics & Season Simulation Engine
This repository contains the full modeling and simulation engine built for the 2026 Wharton High School Data Science Competition, representing Cherry Hill High School East.
The project builds a predictive model for hockey games using shift-level data, generates ELO ratings, and runs large-scale season simulations to estimate standings, scoring output, and championship probabilities.
What This Project Does
The pipeline performs the following steps:
- Loads raw shift data into a structured SQLite database
- Builds matchup features from line combinations
- Constructs dynamic ELO ratings from game results
- Trains LightGBM goal models for home and away teams
- Simulates full seasons thousands of times
- Aggregates standings statistics across simulations
The result is a probabilistic forecast of:
- Average rank
- Average points
- Goals for / against
- Championship probability
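As a rough illustration of the first pipeline step, the sketch below loads a few shift rows into an in-memory SQLite database and aggregates them per game and team. The table name, column names, and sample rows are hypothetical; the repository's actual schema may differ.

```python
import sqlite3

# Hypothetical shift-level rows:
# (game_id, team, line, toi_seconds, goals_for, xg_for)
SAMPLE_SHIFTS = [
    (1, "Team A", "L1", 540, 1, 0.82),
    (1, "Team A", "L2", 480, 0, 0.41),
    (1, "Team B", "L1", 560, 2, 1.10),
]


def load_shifts(conn, rows):
    """Rebuild the (assumed) shifts table from scratch and insert the rows."""
    conn.execute("DROP TABLE IF EXISTS shifts")
    conn.execute(
        """
        CREATE TABLE shifts (
            game_id     INTEGER NOT NULL,
            team        TEXT    NOT NULL,
            line        TEXT    NOT NULL,
            toi_seconds INTEGER NOT NULL,
            goals_for   INTEGER NOT NULL,
            xg_for      REAL    NOT NULL
        )
        """
    )
    conn.executemany("INSERT INTO shifts VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def team_totals(conn):
    """Aggregate shift rows to one row per (game, team)."""
    query = """
        SELECT game_id, team, SUM(toi_seconds), SUM(goals_for), SUM(xg_for)
        FROM shifts
        GROUP BY game_id, team
        ORDER BY game_id, team
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load_shifts(conn, SAMPLE_SHIFTS)
totals = team_totals(conn)
```

Dropping and recreating the table on every run mirrors the "rebuild the database" idea: each run starts from a clean state rather than appending to old data.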
Installation
- Install Python (v3.10+)
Confirm the installation with:
    python --version
- (Recommended) Create a virtual environment:
    python -m venv .venv
Windows:
    .venv\Scripts\activate
macOS / Linux:
    source .venv/bin/activate
- Install dependencies:
    pip install -r requirements.txt
Running the Full Simulation Pipeline
Run:
    python main.py --seasons 1000
If --seasons is omitted, the default is 1000 simulated seasons.
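A command-line flag with this behavior could be parsed as in the sketch below. This is an assumption about how main.py might handle the option, not a copy of its code; the function name parse_args is hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the --seasons option, defaulting to 1000 when it is omitted."""
    parser = argparse.ArgumentParser(description="Run season simulations.")
    parser.add_argument(
        "--seasons",
        type=int,
        default=1000,
        help="number of seasons to simulate (default: 1000)",
    )
    return parser.parse_args(argv)


# Explicit argument lists keep the example independent of sys.argv.
args = parse_args(["--seasons", "5000"])
default_args = parse_args([])
```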
During execution, the program will:
- Rebuild the database
- Train goal models
- Build ELO ratings
- Run parallel season simulations
- Print aggregated standings
Example output:
1. Team A Avg Rank: 2.14 Avg Pts: 94.3 GF: 243.2 GA: 201.8 Title%: 31.4%
...
A full example output from 5000 simulated seasons is available in output.txt.
Jupyter Notebook Version
A notebook version of the pipeline is available at:
notebook/WHL_Simulation.py
To run it:
    jupyter notebook
Then open the notebook in your browser.
The notebook walks through:
- Data loading
- Feature construction
- Model training
- Simulation
- Standings aggregation
This is ideal for experimentation and visualization.
Modeling Approach
Goal Prediction
Two LightGBM regression models are trained:
- Home goals model
- Away goals model
Features include:
- Expected goals (xG)
- Time on ice
- Rate metrics (per-60 stats)
- ELO ratings
- ELO differential
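The feature list above can be sketched as a small function that turns per-team inputs into one feature row per matchup. The dictionary keys and the helper names per_60 and build_features are hypothetical, chosen only for illustration; the resulting row is the kind of input a LightGBM regressor could be trained on.

```python
def per_60(count, toi_seconds):
    """Scale a raw count to a per-60-minutes rate; 0.0 when there is no ice time."""
    if toi_seconds <= 0:
        return 0.0
    return count * 3600.0 / toi_seconds


def build_features(home, away):
    """Build one matchup feature row.

    home/away are dicts with the (assumed) keys: xg, toi_seconds, elo.
    """
    return {
        "home_xg": home["xg"],
        "away_xg": away["xg"],
        "home_toi": home["toi_seconds"],
        "away_toi": away["toi_seconds"],
        "home_xg_per60": per_60(home["xg"], home["toi_seconds"]),
        "away_xg_per60": per_60(away["xg"], away["toi_seconds"]),
        "home_elo": home["elo"],
        "away_elo": away["elo"],
        "elo_diff": home["elo"] - away["elo"],
    }


features = build_features(
    {"xg": 2.5, "toi_seconds": 3600, "elo": 1540.0},
    {"xg": 2.0, "toi_seconds": 3600, "elo": 1490.0},
)
```

Per-60 rates normalize for ice time, so a line that generates 1.0 xG in 30 minutes is rated the same as one that generates 2.0 xG in 60.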
ELO Rating System
An internal ELO system updates team strength using aggregated game performance.
Ratings are used both:
- As model features
- As dynamic inputs during simulation
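For readers unfamiliar with ELO, here is a minimal sketch of the standard update rule applied to a single game result. The K-factor of 20, the optional home-advantage offset, and the win/tie/loss scoring are assumptions; the project's own system may weight games differently, for example by using aggregated performance rather than the final result alone.

```python
def expected_score(rating_a, rating_b):
    """Standard ELO expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home, away, home_goals, away_goals, k=20.0, home_adv=0.0):
    """Return updated (home, away) ratings after one game.

    k and home_adv are illustrative defaults, not values from the project.
    """
    if home_goals > away_goals:
        actual = 1.0
    elif home_goals < away_goals:
        actual = 0.0
    else:
        actual = 0.5
    expected = expected_score(home + home_adv, away)
    delta = k * (actual - expected)
    # Zero-sum update: whatever the home team gains, the away team loses.
    return home + delta, away - delta


new_home, new_away = update_elo(1500.0, 1500.0, home_goals=3, away_goals=1)
```

With two equally rated teams the expected score is 0.5, so a home win moves each rating by k * 0.5 = 10 points in opposite directions.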
Season Simulation
Each simulated season:
- Uses trained goal models
- Simulates each matchup
- Updates standings
- Tracks overtime results
- Repeats across N iterations
Parallel processing is used for performance.
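The loop described above can be sketched in pure Python as follows. Several pieces are assumptions made for illustration: goals are drawn from a Poisson distribution, the expected_goals callable stands in for the trained goal models, tied games are settled by a coin flip in overtime, and points follow a 2 / 1 / 0 scheme (win / overtime loss / regulation loss). The sketch runs a single season sequentially and does not show the parallel execution.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson sample with Knuth's algorithm (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_season(schedule, expected_goals, seed):
    """Simulate one season.

    schedule: list of (home, away) pairs.
    expected_goals: callable (home, away) -> (home_rate, away_rate);
        a hypothetical stand-in for the trained goal models.
    """
    rng = random.Random(seed)
    table = {}
    for home, away in schedule:
        for team in (home, away):
            table.setdefault(team, {"pts": 0, "gf": 0, "ga": 0, "otl": 0})
        lam_home, lam_away = expected_goals(home, away)
        goals_home = sample_poisson(rng, lam_home)
        goals_away = sample_poisson(rng, lam_away)
        overtime = goals_home == goals_away
        if overtime:
            # Assumption: overtime is a coin flip and the winner scores once.
            if rng.random() < 0.5:
                goals_home += 1
            else:
                goals_away += 1
        winner, loser = (home, away) if goals_home > goals_away else (away, home)
        table[winner]["pts"] += 2
        if overtime:
            table[loser]["pts"] += 1
            table[loser]["otl"] += 1
        table[home]["gf"] += goals_home
        table[home]["ga"] += goals_away
        table[away]["gf"] += goals_away
        table[away]["ga"] += goals_home
    return table


teams = ["Team A", "Team B", "Team C"]
schedule = [(h, a) for h in teams for a in teams if h != a]


def flat_rates(home, away):
    """Placeholder scoring rates with a small home edge."""
    return 3.1, 2.8


season = simulate_season(schedule, flat_rates, seed=42)
```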
Output Metrics
For each team, the engine computes:
- Average Rank
- Average Points
- Average Goals For
- Average Goals Against
- Championship Probability
These are computed across all simulated seasons.
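A minimal sketch of this aggregation is shown below. It assumes each simulated season is a dict mapping team name to points, goals for, and goals against, that every team appears in every season, that ties in points are broken by goal differential, and that "championship" means finishing first in the standings. None of these details are confirmed by the repository.

```python
from collections import defaultdict


def aggregate_seasons(seasons):
    """Average rank, points, and goals per team, plus first-place share in percent."""
    if not seasons:
        raise ValueError("need at least one simulated season")
    sums = defaultdict(lambda: {"rank": 0, "pts": 0, "gf": 0, "ga": 0, "titles": 0})
    for table in seasons:
        # Sort by points, then goal differential (assumed tiebreaker), then name.
        ordered = sorted(
            table,
            key=lambda t: (-table[t]["pts"], table[t]["ga"] - table[t]["gf"], t),
        )
        for rank, team in enumerate(ordered, start=1):
            row, acc = table[team], sums[team]
            acc["rank"] += rank
            acc["pts"] += row["pts"]
            acc["gf"] += row["gf"]
            acc["ga"] += row["ga"]
            acc["titles"] += 1 if rank == 1 else 0
    n = len(seasons)
    return {
        team: {
            "avg_rank": acc["rank"] / n,
            "avg_pts": acc["pts"] / n,
            "avg_gf": acc["gf"] / n,
            "avg_ga": acc["ga"] / n,
            "title_pct": 100.0 * acc["titles"] / n,
        }
        for team, acc in sums.items()
    }


# Two hand-written seasons: Team A wins the first, Team B the second.
example_seasons = [
    {"Team A": {"pts": 10, "gf": 20, "ga": 15}, "Team B": {"pts": 8, "gf": 15, "ga": 20}},
    {"Team A": {"pts": 6, "gf": 12, "ga": 14}, "Team B": {"pts": 9, "gf": 14, "ga": 12}},
]
summary = aggregate_seasons(example_seasons)
```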
Reproducibility
- Global seed is fixed (np.random.seed(42))
- Each simulated season uses deterministic seed offsets
- Database rebuild ensures clean state
This guarantees reproducible results across runs.
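One way to realize deterministic seed offsets is sketched below. It uses the standard library's random module instead of NumPy to stay self-contained, and the names BASE_SEED, season_seed, and simulate_one are hypothetical. The "base seed plus season index" scheme is an assumption about what the offsets look like.

```python
import random

BASE_SEED = 42  # matches the fixed global seed mentioned above


def season_seed(index, base_seed=BASE_SEED):
    """Derive a per-season seed as a fixed offset from the base seed."""
    return base_seed + index


def simulate_one(index):
    """Stand-in for one season: draw a few values from a per-season generator."""
    rng = random.Random(season_seed(index))
    return [rng.randint(0, 6) for _ in range(4)]


# Running the seasons in a different order yields the same per-season results,
# because each season owns its generator instead of sharing global state.
run_a = [simulate_one(i) for i in range(5)]
run_b = [simulate_one(i) for i in reversed(range(5))][::-1]
```

Giving each season its own generator matters when seasons run in parallel: the outcome of season i then depends only on its index, not on which worker handled it or in what order.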
License
This project is licensed under the MIT License.
See the LICENSE file for details.