deaneeth/telco-churn-prediction-mlops
Production-ready ML pipeline for telco customer churn prediction using advanced ensemble methods (XGBoost, CatBoost, Random Forest). Handles class imbalance, provides business insights, and includes modular MLOps architecture. Built with scikit-learn, featuring comprehensive EDA, feature engineering, and business impact analysis.
# Telco Customer Churn Prediction - MLOps Project
**Note:** This project requires scikit-learn 1.7.0 and imbalanced-learn 0.14.0. Using different versions may cause compatibility issues.
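The version pins from the note above can be captured in a `requirements.txt` fragment like the following (the repository's actual file may list additional dependencies):

```text
scikit-learn==1.7.0
imbalanced-learn==0.14.0
```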
## Project Overview
This project implements an end-to-end machine learning pipeline for predicting customer churn in a telecommunications company. By identifying customers likely to cancel their services, the company can implement targeted retention strategies.
### Key Features
- Data Analysis: Exploratory data analysis with visualizations
- Machine Learning Pipeline: Modular preprocessing and model training
- Multiple Models: Comparison of models from baseline to advanced ensemble methods
- Production-Ready: FastAPI implementation for real-time inference
- Business Insights: ROI calculations and customer segmentation
## Business Context
This project helps telecommunications companies:
- Identify high-risk customers before they churn
- Segment customers for targeted retention efforts
- Estimate the financial impact of retention campaigns
- Deploy real-time predictions for new customer data
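As a rough illustration of the financial-impact estimate mentioned above, a back-of-envelope retention ROI calculation might look like this (all numbers are hypothetical, not taken from the project):

```python
# Back-of-envelope retention ROI with illustrative, made-up numbers.
targeted = 1000          # customers flagged as high churn risk
precision = 0.5          # fraction of flagged customers who would actually churn
save_rate = 0.30         # fraction of true churners retained by the campaign
monthly_value = 70.0     # average monthly revenue per customer (USD)
horizon_months = 12      # revenue horizon for a retained customer
cost_per_offer = 50.0    # cost of one retention offer (USD)

retained = targeted * precision * save_rate
revenue_saved = retained * monthly_value * horizon_months
campaign_cost = targeted * cost_per_offer
roi = revenue_saved / campaign_cost

print(f"retained={retained:.0f}  revenue_saved=${revenue_saved:,.0f}  ROI={roi:.2f}x")
```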
## Project Structure
Main components:
- Notebooks: Step-by-step workflow from EDA to business impact analysis
- Source Code: Modular Python code for all pipeline components
- Models: Trained machine learning models
- Data: Raw and processed datasets
- Reports: Generated outputs and business insights
For a detailed directory structure of the project, see PROJECT_STRUCTURE.md.
## Installation
1. Clone the repository

   ```bash
   git clone https://github.com/deaneeth/telco-churn-prediction-mlops.git
   cd telco-churn-prediction-mlops
   ```

2. Create a virtual environment and install dependencies

   ```bash
   python -m venv venv
   .\venv\Scripts\activate   # Windows
   # OR
   source venv/bin/activate  # macOS/Linux
   pip install -r requirements.txt
   ```
## Dataset
The dataset contains customer information including:
- Demographics (gender, age, partner status)
- Account details (tenure, contract type, payment method)
- Services subscribed (phone, internet, add-ons)
- Financial information (monthly charges, total charges)
- Target variable: customer churn status
## Workflow
1. Data Exploration: Analysis of customer churn patterns and feature relationships
2. Preprocessing: Data cleaning, feature engineering, and preprocessing pipeline
3. Model Development: Training multiple models and optimizing performance
4. Evaluation: Comprehensive metrics and model comparison
5. Production Pipeline: Packaging for deployment
6. Business Analysis: Customer segmentation and ROI calculations
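The preprocessing and model-development steps above can be sketched as a scikit-learn pipeline. This is a minimal illustration on made-up data: the column names are placeholders, and `class_weight="balanced"` stands in for the project's imbalanced-learn resampling step.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for the telco dataset; column names are illustrative.
df = pd.DataFrame({
    "tenure": [1, 40, 3, 60, 5, 24, 2, 50],
    "MonthlyCharges": [70.0, 20.0, 90.0, 25.0, 85.0, 55.0, 95.0, 30.0],
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "Two year",
                 "Month-to-month", "One year", "Month-to-month", "Two year"],
    "Churn": [1, 0, 1, 0, 1, 0, 1, 0],
})

numeric = ["tenure", "MonthlyCharges"]
categorical = ["Contract"]

# Scale numeric features, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# class_weight="balanced" is a simple stand-in for resampling-based
# imbalance handling (e.g. SMOTE via imbalanced-learn).
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X, y = df.drop(columns="Churn"), df["Churn"]
model.fit(X, y)
print(model.predict(X[:2]))
```

Keeping preprocessing inside the pipeline ensures the same transforms are applied at training and inference time.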
## Results
### Model Evaluation Results
| Model | ROC-AUC | PR-AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|
| LogisticRegression | 0.847129 | 0.666142 | 0.500840 | 0.796791 | 0.615067 |
| RandomForest | 0.841550 | 0.650611 | 0.518389 | 0.791444 | 0.626455 |
| XGBoost | 0.846979 | 0.660561 | 0.685714 | 0.513369 | 0.587156 |
| CatBoost | 0.845748 | 0.666034 | 0.513605 | 0.807487 | 0.627859 |
| StackingEnsemble | 0.842646 | 0.644170 | 0.667832 | 0.510695 | 0.578788 |
| DecisionTree | 0.627193 | 0.352190 | 0.461972 | 0.438503 | 0.449931 |
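The columns in the table correspond to standard scikit-learn metrics. Given held-out labels and predicted scores, they can be computed as below (the labels and scores here are illustrative, not from the project):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Illustrative test-set labels and predicted churn probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.8, 0.35, 0.2, 0.9, 0.6, 0.7])
y_pred = (y_score >= 0.5).astype(int)  # default 0.5 decision threshold

print("ROC-AUC  ", roc_auc_score(y_true, y_score))
print("PR-AUC   ", average_precision_score(y_true, y_score))
print("Precision", precision_score(y_true, y_pred))
print("Recall   ", recall_score(y_true, y_pred))
print("F1       ", f1_score(y_true, y_pred))
```

Note that ROC-AUC and PR-AUC are computed from the raw scores, while precision, recall, and F1 depend on the chosen threshold.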
**Improvement note:** These results can be further improved by implementing:
- Advanced feature engineering techniques (RFM analysis, customer behavior patterns)
- Deep learning approaches (Neural networks for complex pattern recognition)
- Hyperparameter tuning with more extensive search space
- Better handling of class imbalance using advanced sampling techniques
This project will be enhanced with these improvements in the near future.
## Key Insights
- CatBoost achieved the highest F1 score (0.627859), with the best balance of precision and recall
- LogisticRegression shows strong performance, with the highest ROC-AUC (0.847129)
- XGBoost provides the highest precision (0.685714) but lower recall
- The ensemble and boosted models generally outperform the Decision Tree baseline
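The precision/recall differences above are partly a matter of where the decision threshold sits on the predicted probabilities. A quick sweep over thresholds (with illustrative scores, not the project's predictions) shows the tradeoff:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Illustrative labels and predicted churn probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.45, 0.2, 0.9, 0.55, 0.7, 0.35, 0.15])

# Lower thresholds flag more customers: recall rises, precision falls.
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

For a retention campaign, the threshold can be tuned to the relative cost of missing a churner versus contacting a loyal customer.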
## Feature Importance
- Contract type, tenure, and service issues are the strongest predictors
- Month-to-month contracts with technical issues show the highest churn rates
- Targeted interventions show a 3-5x return on investment
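Importances like those above can be read off a trained tree ensemble via `feature_importances_`. A toy example on synthetic data (feature names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; in the project, importances come from the trained models.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "tenure": rng.integers(0, 72, 200),
    "MonthlyCharges": rng.uniform(20, 120, 200),
    "month_to_month": rng.integers(0, 2, 200),
})
# Make churn depend mostly on contract type and tenure.
y = ((X["month_to_month"] == 1) & (X["tenure"] < 24)).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, score in sorted(zip(X.columns, clf.feature_importances_),
                          key=lambda t: -t[1]):
    print(f"{name:15s} {score:.3f}")
```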
## Deployment
The model will be deployed as a real-time prediction service using FastAPI.
Deployment instructions will be added in an upcoming update.

## Future Work
- Model monitoring and retraining pipeline
- Feature store for reproducibility
- Advanced model interpretability
- A/B testing framework for retention strategies