Rohan-Bharatia/WHSDSC-2026

Source code for the 2026 Wharton High School Data Science Competition

๐Ÿ’ WHSDSC-2026 ๐Ÿ“Š

WHL Analytics & Season Simulation Engine

This repository contains the full modeling and simulation engine built for the 2026 Wharton High School Data Science Competition, as the entry representing Cherry Hill High School East.

The project builds a predictive model for hockey games using shift-level data, generates ELO ratings, and runs large-scale season simulations to estimate standings, scoring output, and championship probabilities.

🚀 What This Project Does

The pipeline performs the following steps:

  1. Loads raw shift data into a structured SQLite database
  2. Builds matchup features from line combinations
  3. Constructs dynamic ELO ratings from game results
  4. Trains LightGBM goal models for home and away teams
  5. Simulates full seasons thousands of times
  6. Aggregates standings statistics across simulations
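As an illustration of step 1, the sketch below loads a few shift rows into SQLite and aggregates them per team. It is a minimal example, not the repository's loader: the table name `shifts`, its columns, and the sample rows are all hypothetical, and an in-memory database stands in for the real file.

```python
import sqlite3

# Hypothetical shift rows: (game_id, team, line, toi_seconds, xg_for)
SAMPLE_SHIFTS = [
    (1, "Team A", "L1", 45.0, 0.12),
    (1, "Team B", "L1", 45.0, 0.05),
    (1, "Team A", "L2", 38.0, 0.03),
]

def build_database(rows, path=":memory:"):
    """Rebuild the (hypothetical) shifts table from scratch for a clean state."""
    conn = sqlite3.connect(path)
    conn.execute("DROP TABLE IF EXISTS shifts")
    conn.execute(
        "CREATE TABLE shifts ("
        "game_id INTEGER, team TEXT, line TEXT, "
        "toi_seconds REAL, xg_for REAL)"
    )
    conn.executemany("INSERT INTO shifts VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn

conn = build_database(SAMPLE_SHIFTS)

# Total time on ice and expected goals per team in each game.
totals = conn.execute(
    "SELECT team, SUM(toi_seconds), SUM(xg_for) FROM shifts "
    "GROUP BY game_id, team ORDER BY team"
).fetchall()
```

Dropping and recreating the table on every run mirrors the "rebuild the database" behavior described in this README, so stale rows never leak between runs.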

The result is a probabilistic forecast of:

  • Average rank
  • Average points
  • Goals for / against
  • Championship probability

โš™๏ธ Installation

  1. Install Python (v3.10+)

Confirm the installation with:

python --version

  2. (Recommended) Create a virtual environment:

python -m venv .venv

Windows:

.venv\Scripts\activate

macOS / Linux:

source .venv/bin/activate

  3. Install dependencies:

pip install -r requirements.txt

โ–ถ๏ธ Running the Full Simulation Pipeline

Run:

python main.py --seasons 1000

If --seasons is omitted, the default is 1000 simulated seasons.
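A flag with this behavior can be declared with `argparse` roughly as follows. This is a sketch of the general pattern, assuming nothing about how main.py is actually written; `build_parser` is a hypothetical helper name.

```python
import argparse

def build_parser():
    """Build a parser with a --seasons flag that defaults to 1000."""
    parser = argparse.ArgumentParser(
        description="Run the season simulation pipeline."
    )
    parser.add_argument(
        "--seasons",
        type=int,
        default=1000,
        help="Number of seasons to simulate (default: 1000).",
    )
    return parser

# Parsing an explicit argument list keeps the example independent of sys.argv.
args_default = build_parser().parse_args([])
args_custom = build_parser().parse_args(["--seasons", "5000"])
```

Here `args_default.seasons` falls back to the default, while `args_custom.seasons` takes the value given on the command line.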

During execution, the program will:

  • Rebuild the database
  • Train goal models
  • Build ELO ratings
  • Run parallel season simulations
  • Print aggregated standings

Example output:

  1. Team A           Avg Rank: 2.14  Avg Pts: 94.3  GF: 243.2  GA: 201.8  Title%: 31.4%
...

A full example output from a run of 5000 simulated seasons is available in output.txt.

📓 Jupyter Notebook Version

A notebook version of the pipeline is available at:

notebook/WHL_Simulation.py

To run it:

jupyter notebook

Then open the notebook in your browser.

The notebook walks through:

  • Data loading
  • Feature construction
  • Model training
  • Simulation
  • Standings aggregation

The notebook is well suited to experimentation and visualization.

🧠 Modeling Approach

🎯 Goal Prediction

Two LightGBM regression models are trained:

  • Home goals model
  • Away goals model

Features include:

  • Expected goals (xG)
  • Time on ice
  • Rate metrics (per-60 stats)
  • ELO ratings
  • ELO differential
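The features listed above could be assembled along these lines. This is a sketch only: the dictionary keys, the function names, and the sample numbers are hypothetical, and it stops at building the feature row rather than training the LightGBM models.

```python
def per_60(stat_total, toi_seconds):
    """Scale a raw total to a per-60-minutes rate; 0.0 when there is no ice time."""
    if toi_seconds <= 0:
        return 0.0
    return stat_total * 3600.0 / toi_seconds

def build_features(home, away):
    """Combine per-team inputs into one matchup feature row (hypothetical layout)."""
    return {
        "home_xg_per60": per_60(home["xg"], home["toi_seconds"]),
        "away_xg_per60": per_60(away["xg"], away["toi_seconds"]),
        "home_toi": home["toi_seconds"],
        "away_toi": away["toi_seconds"],
        "home_elo": home["elo"],
        "away_elo": away["elo"],
        # Positive values favor the home team.
        "elo_diff": home["elo"] - away["elo"],
    }

home = {"xg": 2.4, "toi_seconds": 3600.0, "elo": 1540.0}
away = {"xg": 1.8, "toi_seconds": 3600.0, "elo": 1480.0}
features = build_features(home, away)
```

Rate metrics normalize for ice time, so a line that played 30 minutes can be compared fairly with one that played 50.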

📈 ELO Rating System

An internal ELO system updates team strength using aggregated game performance.

Ratings are used in two ways:

  • As model features
  • As dynamic inputs during simulation
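For readers unfamiliar with ELO, the sketch below shows the textbook update rule. It is not necessarily the variant used in this project, which is described as working from aggregated game performance; the K-factor of 20 and the 400-point scale are conventional defaults assumed here for illustration.

```python
def expected_score(rating_a, rating_b):
    """Textbook ELO expectation for side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(home, away, home_score, k=20.0):
    """Return updated (home, away) ratings.

    home_score is 1.0 for a home win, 0.5 for a draw, 0.0 for a home loss.
    k is an assumed K-factor, not a value taken from the project.
    """
    delta = k * (home_score - expected_score(home, away))
    # The update is zero-sum: what one team gains, the other loses.
    return home + delta, away - delta

new_home, new_away = update_elo(1500.0, 1500.0, home_score=1.0)
```

With two equally rated teams the expected score is 0.5, so a home win moves each rating by half the K-factor in opposite directions.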

🎲 Season Simulation

Each simulated season:

  • Uses trained goal models
  • Simulates each matchup
  • Updates standings
  • Tracks overtime results
  • Repeats across N iterations

Parallel processing is used for performance.
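One simulated season can be sketched as below. Several details are assumptions made for illustration rather than facts about this engine: goals are drawn from a Poisson distribution, tied games are settled in overtime by a coin flip, a win is worth 2 points and an overtime loss 1, and a hypothetical `expected_goals` lookup stands in for the trained goal models.

```python
import math
import random
from collections import defaultdict

def sample_poisson(rng, lam):
    """Draw from Poisson(lam) with Knuth's method; fine for small means."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_season(schedule, expected_goals, seed):
    """Play every (home, away) matchup once and return the standings table."""
    rng = random.Random(seed)
    table = defaultdict(lambda: {"pts": 0, "gf": 0, "ga": 0})
    for home, away in schedule:
        lam_home, lam_away = expected_goals[(home, away)]
        home_goals = sample_poisson(rng, lam_home)
        away_goals = sample_poisson(rng, lam_away)
        overtime = home_goals == away_goals
        if overtime:
            # Assumed tie-break: a coin flip awards the overtime goal.
            if rng.random() < 0.5:
                home_goals += 1
            else:
                away_goals += 1
        winner, loser = (home, away) if home_goals > away_goals else (away, home)
        table[winner]["pts"] += 2
        if overtime:
            table[loser]["pts"] += 1
        table[home]["gf"] += home_goals
        table[home]["ga"] += away_goals
        table[away]["gf"] += away_goals
        table[away]["ga"] += home_goals
    return dict(table)

schedule = [("Team A", "Team B"), ("Team B", "Team A")]
expected_goals = {("Team A", "Team B"): (3.1, 2.6), ("Team B", "Team A"): (2.9, 2.8)}
season = simulate_season(schedule, expected_goals, seed=42)
```

Because each season owns its random generator, seasons are independent of one another and can be distributed across worker processes.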

📊 Output Metrics

For each team, the engine computes:

  • Average Rank
  • Average Points
  • Average Goals For
  • Average Goals Against
  • Championship Probability

These are computed across all simulated seasons.
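The aggregation can be sketched as follows. The input layout (one dictionary per season mapping team to points, goals for, and goals against) is hypothetical, as are the tie-break order and the choice to count a first-place finish as the championship; the engine may define these differently.

```python
from collections import defaultdict

def rank_teams(season):
    """Order teams by points, then goal differential, then name (assumed tie-breaks)."""
    return sorted(
        season,
        key=lambda t: (-season[t]["pts"], -(season[t]["gf"] - season[t]["ga"]), t),
    )

def aggregate(seasons):
    """Average each team's rank, points, and goals over all simulated seasons."""
    totals = defaultdict(lambda: {"rank": 0, "pts": 0, "gf": 0, "ga": 0, "titles": 0})
    for season in seasons:
        for position, team in enumerate(rank_teams(season), start=1):
            entry = totals[team]
            entry["rank"] += position
            entry["pts"] += season[team]["pts"]
            entry["gf"] += season[team]["gf"]
            entry["ga"] += season[team]["ga"]
            if position == 1:
                entry["titles"] += 1
    n = len(seasons)
    return {
        team: {
            "avg_rank": e["rank"] / n,
            "avg_pts": e["pts"] / n,
            "avg_gf": e["gf"] / n,
            "avg_ga": e["ga"] / n,
            "title_pct": 100.0 * e["titles"] / n,
        }
        for team, e in totals.items()
    }

seasons = [
    {"A": {"pts": 10, "gf": 20, "ga": 15}, "B": {"pts": 8, "gf": 15, "ga": 20}},
    {"A": {"pts": 6, "gf": 12, "ga": 18}, "B": {"pts": 9, "gf": 18, "ga": 12}},
]
summary = aggregate(seasons)
```

In this toy input each team finishes first once, so both end up with an average rank of 1.5 and a title share of 50%.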

🔬 Reproducibility

  • Global seed is fixed (np.random.seed(42))
  • Each simulated season uses deterministic seed offsets
  • Database rebuild ensures clean state

Together, these measures make results reproducible across runs that use the same data and the same number of simulated seasons.
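The seed-offset idea can be illustrated with a small sketch. It assumes each season's seed is the base seed plus the season index, which is one common scheme and not necessarily the one used here. The standard library's `random.Random` and a thread pool stand in for NumPy and the project's actual parallel backend, and `simulate_one` is a hypothetical placeholder for a full season.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BASE_SEED = 42

def simulate_one(season_index):
    """Placeholder season: draws five scores from a generator seeded per season."""
    rng = random.Random(BASE_SEED + season_index)
    return [rng.randint(0, 6) for _ in range(5)]

def run_all(n_seasons, workers=4):
    """Run seasons in parallel; map() returns results in season order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate_one, range(n_seasons)))

parallel = run_all(8, workers=4)
serial = run_all(8, workers=1)
```

Since every season derives its randomness only from its own index, the results do not depend on how many workers run or in what order they finish.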

๐Ÿ“ License

This project is licensed under the MIT License.
See the LICENSE file for details.