# Tennis Match Outcome Prediction (ATP & WTA)
This repository contains a complete set of machine learning models for predicting tennis match outcomes across four markets: Match Winner, Total Games, Games Handicap, and First Set Winner. Models are trained separately for ATP (men) and WTA (women) using advanced feature engineering and the CatBoost algorithm.
The project demonstrates:

- Feature engineering from historical match data
- Model training with temporal validation and isotonic calibration
- Comprehensive evaluation reports and diagnostic plots
- A clean, modular codebase for reproducibility
## Repository Structure

```text
.
├── .env.example              # Example environment variables (not used in current code)
├── metrics_models.txt        # Summary of model performance
├── ATP/                      # Saved ATP models and evaluation reports
│   ├── WINNER/model/winner_atp
│   ├── TOTAL/model/total_games_atp
│   ├── HANDICAP/model/games_diff_atp
│   └── FIRSTSET/model/first_set_winner_atp
├── WTA/                      # Saved WTA models and reports
│   └── ...
├── scr2/                     # Source code
│   ├── feature_selection.py      # Feature importance analysis & selection
│   ├── retrain_final_models.py   # Retrain all final models using selected features
│   ├── ATP/                  # Training scripts for ATP
│   │   ├── train_winner_atp_clean.py
│   │   ├── train_total_games_atp_clean.py
│   │   └── ...
│   └── WTA/                  # Training scripts for WTA
│       └── ...
└── data/                     # Data directory (not included)
    └── dbexample.db          # Empty database placeholder
```
> **Note:** The actual data files (`.csv`, `.db`) are not included in this repository. You must provide your own historical tennis data (see Data Preparation).
## Requirements

- Python 3.9+
- Required packages: `catboost`, `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `joblib`

Install dependencies:

```bash
pip install -r requirements.txt
```

(Create a `requirements.txt` with the packages listed above if you don't have one.)
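If you need to create the file yourself, a minimal unpinned `requirements.txt` covering the packages listed above looks like this (pin versions as needed for reproducibility):

```text
catboost
pandas
numpy
scikit-learn
matplotlib
seaborn
joblib
```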
## Data Preparation

The training scripts expect CSV files with the following columns (exact names may vary per script; see individual scripts for details):

- `date` – match date (used for the temporal split)
- `gender` – `'ATP'` or `'WTA'`
- `player1_id`, `player2_id` – unique player identifiers
- `surface` – court surface (e.g., `'Clay'`, `'Hard'`, `'Grass'`)
- `winner_1` – binary target: 1 if player1 won, else 0
- `total_games` – total games played in the match
- `games_diff` – games won by player1 minus games won by player2
- `first_set_bin` – binary target: 1 if player1 won the first set, else 0

plus numerous feature columns (e.g., player form, head-to-head, surface statistics).

Place your prepared data files in the `data/` folder. The main training scripts expect files like:

- `clean_multimarket_features.csv`
- `winner_features.csv`

Refer to the code comments in each training script for the exact required column set.
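The temporal split used throughout the project can be sketched as follows. This is an illustrative helper (the function name `temporal_split` is hypothetical, not from the repository); the actual scripts may implement the split differently:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, valid_frac: float = 0.2):
    """Sort matches chronologically and hold out the most recent
    `valid_frac` share for validation (no future leakage into training)."""
    df = df.sort_values("date").reset_index(drop=True)
    cutoff = int(len(df) * (1 - valid_frac))
    return df.iloc[:cutoff], df.iloc[cutoff:]

# Usage with a prepared CSV (file name taken from the list above):
# df = pd.read_csv("data/clean_multimarket_features.csv", parse_dates=["date"])
# train, valid = temporal_split(df)
```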
## How to Reproduce the Models

### 1. Feature Selection

Run `feature_selection.py` to identify the most important features for each target and gender. This script handles missing values, removes constant and highly correlated features, and computes feature importances using CatBoost.

```bash
python scr2/feature_selection.py
```

Outputs (saved in `data/selected_features/`):

- `selected_features__.txt` – list of selected feature names
- `feature_importance__.csv` – feature importance values
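The constant-feature and correlation filtering described above can be sketched in plain pandas. This is a simplified illustration (the function name and the 0.95 threshold are assumptions; see `feature_selection.py` for the repository's actual logic, which also computes CatBoost importances):

```python
import numpy as np
import pandas as pd

def filter_features(X: pd.DataFrame, corr_threshold: float = 0.95) -> list:
    """Drop constant columns, then drop one feature from each
    highly correlated pair (|corr| > corr_threshold)."""
    # 1. Remove constant features: zero variance carries no signal.
    X = X.loc[:, X.nunique() > 1]
    # 2. Inspect the upper triangle of the absolute correlation matrix
    #    so each pair is considered exactly once.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return [c for c in X.columns if c not in to_drop]
```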
### 2. Train the Final Models

Run `retrain_final_models.py` to train all eight models (4 markets × 2 genders) using the top 11 features from the previous step. The script automatically saves models, calibrators, evaluation reports, and diagnostic plots.

```bash
python scr2/retrain_final_models.py
```

Alternatively, you can train individual models using the scripts in `scr2/ATP/` and `scr2/WTA/`, for example:

```bash
python scr2/ATP/train_winner_atp_clean.py
```
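The train-then-calibrate pipeline used for the classification markets can be sketched as below. The repository trains CatBoost models; this snippet substitutes scikit-learn's `GradientBoostingClassifier` so it runs without `catboost` installed, and the function name `train_and_calibrate` is hypothetical:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for CatBoostClassifier
from sklearn.isotonic import IsotonicRegression

def train_and_calibrate(X_train, y_train, X_valid, y_valid):
    """Fit a classifier, then fit an isotonic calibrator on its
    validation-set probabilities (the model / isotonic_calibrator.pkl pairing)."""
    model = GradientBoostingClassifier(random_state=42)
    model.fit(X_train, y_train)
    raw_p = model.predict_proba(X_valid)[:, 1]
    calibrator = IsotonicRegression(out_of_bounds="clip")
    calibrator.fit(raw_p, y_valid)
    return model, calibrator

# With CatBoost you would save model.save_model("model.cbm") and
# joblib.dump(calibrator, "isotonic_calibrator.pkl"), matching the layout below.
```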
### 3. Model Outputs

Each trained model is stored in a dedicated folder, e.g. `ATP/WINNER/model/winner_atp/`, containing:

- `model.cbm` – trained CatBoost model
- `isotonic_calibrator.pkl` – isotonic calibrator (for classification models)
- `evaluation_report.json` – metrics on the validation set
- `feature_importance.png` – top-20 feature importance plot
- `learning_curve.png` – training/validation loss
- `roc_curve.png` / `residuals.png` – diagnostic plots
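Applying a saved model together with its calibrator might look like this. The helper `predict_calibrated` is illustrative (not part of the repository); it works with any fitted classifier exposing `predict_proba`, which CatBoost models do:

```python
import numpy as np

def predict_calibrated(model, calibrator, X) -> np.ndarray:
    """Return calibrated P(player1 wins) for a feature matrix X."""
    raw_p = model.predict_proba(X)[:, 1]
    return calibrator.predict(raw_p)

# Loading the saved ATP Winner artifacts (paths from the repository layout):
# import joblib
# from catboost import CatBoostClassifier
# model = CatBoostClassifier()
# model.load_model("ATP/WINNER/model/winner_atp/model.cbm")
# calibrator = joblib.load("ATP/WINNER/model/winner_atp/isotonic_calibrator.pkl")
# p = predict_calibrated(model, calibrator, X_new)
```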
## Model Performance

All models were validated on the most recent 20% of matches (temporal split). Probabilities are calibrated using isotonic regression where indicated.

### ATP Models

| Model | Metric | Value | Notes |
|-------|--------|-------|-------|
| Winner | ROC-AUC | 0.7513 | LogLoss 0.5869, Brier 0.2019 |
| | ECE (cal) | 0.0125 | Isotonic calibration |
| Total Games | MAE | 2.006 | R² 0.808, RMSE 3.750 |
| | ECE (cal) | 0.0125 | Isotonic calibration |
| Games Diff | MAE | 3.217 | R² 0.223, RMSE 4.903 |
| | ECE (cal) | 0.0117 | Isotonic calibration |
| First Set | ROC-AUC | 0.6988 | LogLoss 0.6261, Brier 0.2189 |
| | ECE (raw) | 0.0063 | No calibration needed |

### WTA Models

| Model | Metric | Value | Notes |
|-------|--------|-------|-------|
| Winner | ROC-AUC | 0.7130 | LogLoss 0.6107, Brier 0.2130 |
| | ECE (cal) | 0.0144 | Isotonic calibration |
| Total Games | MAE | 1.566 | R² 0.707, RMSE 3.131 |
| | ECE (cal) | 0.0144 | Isotonic calibration |
| Games Diff | MAE | 3.450 | R² 0.183, RMSE 5.291 |
| | ECE (cal) | 0.0206 | Isotonic calibration |
| First Set | ROC-AUC | 0.7130 | LogLoss 0.6107, Brier 0.2130 |
| | ECE (raw) | 0.0065* | Expected <1% (similar to ATP) |

ECE = Expected Calibration Error; cal = after isotonic calibration; raw = before calibration.
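For reference, the Expected Calibration Error reported above is the bin-weighted average gap between predicted probability and observed frequency. A minimal implementation (the repository's exact binning scheme may differ):

```python
import numpy as np

def expected_calibration_error(y_true, p_pred, n_bins: int = 10) -> float:
    """ECE: sum over probability bins of (bin weight) * |accuracy - confidence|."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    # Assign each prediction to one of n_bins equal-width probability bins.
    bins = np.clip((p_pred * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p_pred[mask].mean())
    return ece
```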
**Key Observations:**

- The ATP Winner and ATP Total Games models perform at a high level compared to public benchmarks.
- Calibration is excellent (ECE < 2% across all models), which is crucial for reliable probability estimates.
- Games Diff (handicap) is the weakest group (R² ~0.2), reflecting the high variance of game differentials.
## License

This project is open-source under the MIT License.
## Author

Kirill Chernyshev / https://github.com/Gotodataru
## Contributing

Issues and pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

**Disclaimer:** This project is for educational and research purposes only. Betting involves financial risk. Past performance does not guarantee future results. Use these models at your own discretion.
