deaneeth/telco-churn-prediction-mlops

Production-ready ML pipeline for telco customer churn prediction using advanced ensemble methods (XGBoost, CatBoost, Random Forest). Handles class imbalance, provides business insights, and includes modular MLOps architecture. Built with scikit-learn, featuring comprehensive EDA, feature engineering, and business impact analysis.

📊 Telco Customer Churn Prediction - MLOps Project

Python · scikit-learn · imbalanced-learn

โš ๏ธ Note: This project requires scikit-learn 1.7.0 and imbalanced-learn 0.14.0. Using different versions may cause compatibility issues.

📌 Project Overview

This project implements an end-to-end machine learning pipeline for predicting customer churn in a telecommunications company. By identifying customers likely to cancel their services, the company can implement targeted retention strategies.

Key Features

  • Data Analysis: Exploratory data analysis with visualizations
  • Machine Learning Pipeline: Modular preprocessing and model training
  • Multiple Models: Comparison of models from baseline to advanced ensemble methods
  • Production-Ready: FastAPI implementation for real-time inference
  • Business Insights: ROI calculations and customer segmentation

๐Ÿข Business Context

This project helps telecommunications companies:

  • 🔍 Identify high-risk customers before they churn
  • 🧩 Segment customers for targeted retention efforts
  • 💰 Estimate the financial impact of retention campaigns
  • ⚡ Deploy real-time predictions for new customer data

๐Ÿ“ Project Structure

Main components:

  • 📓 Notebooks: Step-by-step workflow from EDA to business impact analysis
  • 💻 Source Code: Modular Python code for all pipeline components
  • 🤖 Models: Trained machine learning models
  • 📊 Data: Raw and processed datasets
  • 📈 Reports: Generated outputs and business insights

For a detailed directory structure of the project, see PROJECT_STRUCTURE.md.

🚀 Installation

  1. Clone the repository

    git clone https://github.com/deaneeth/telco-churn-prediction-mlops.git
    cd telco-churn-prediction-mlops
  2. Create a virtual environment and install dependencies

    python -m venv venv
    .\venv\Scripts\activate  # Windows
    # OR
    source venv/bin/activate  # macOS/Linux
    pip install -r requirements.txt

📋 Dataset

The dataset contains customer information including:

  • 👤 Demographics (gender, age, partner status)
  • 📝 Account details (tenure, contract type, payment method)
  • 📞 Services subscribed (phone, internet, add-ons)
  • 💵 Financial information (monthly charges, total charges)
  • 🎯 Target variable: customer churn status
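In the public IBM Telco CSV, `TotalCharges` is stored as text and is blank for brand-new customers, so light cleaning is needed before modeling. A minimal, illustrative sketch in plain Python (the imputation rule and tenure buckets below are assumptions, not the project's exact choices):

```python
def clean_record(row: dict) -> dict:
    """Coerce numeric fields and derive a tenure bucket for one customer row.

    `TotalCharges` arrives as a string and is blank when tenure == 0,
    so it is imputed to 0.0 here (an assumption, not the project's rule).
    """
    tenure = int(row["tenure"])
    total = row["TotalCharges"].strip()
    return {
        "tenure": tenure,
        "MonthlyCharges": float(row["MonthlyCharges"]),
        "TotalCharges": float(total) if total else 0.0,
        # Simple engineered feature: coarse tenure bucket in months.
        "tenure_bucket": "0-12" if tenure <= 12 else "13-48" if tenure <= 48 else "49+",
        "Churn": 1 if row["Churn"] == "Yes" else 0,
    }

example = {"tenure": "2", "MonthlyCharges": "53.85", "TotalCharges": "108.15", "Churn": "Yes"}
print(clean_record(example)["tenure_bucket"])  # 0-12
```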

🔄 Workflow

  1. 🔍 Data Exploration: Analysis of customer churn patterns and feature relationships
  2. 🧹 Preprocessing: Data cleaning, feature engineering, and preprocessing pipeline
  3. 🛠️ Model Development: Training multiple models and optimizing performance
  4. 📊 Evaluation: Comprehensive metrics and model comparison
  5. 📦 Production Pipeline: Packaging for deployment
  6. 💼 Business Analysis: Customer segmentation and ROI calculations
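Steps 2-4 typically live in a single scikit-learn Pipeline so that exactly the same preprocessing runs at training and inference time. A minimal sketch on toy data (the column layout, values, and model choice are illustrative, not the project's actual pipeline):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in data: [contract type, tenure in months]; labels are churn flags.
X = np.array([["Month-to-month", 2.0],
              ["Two year", 60.0],
              ["Month-to-month", 5.0],
              ["One year", 24.0]], dtype=object)
y = np.array([1, 0, 1, 0])

preprocess = ColumnTransformer([
    ("contract", OneHotEncoder(handle_unknown="ignore"), [0]),  # categorical column
    ("tenure", StandardScaler(), [1]),                          # numeric column
])
pipeline = Pipeline([("prep", preprocess), ("model", LogisticRegression())])
pipeline.fit(X, y)
print(pipeline.predict(np.array([["Month-to-month", 3.0]], dtype=object)))
```

Bundling preprocessing and model into one object is also what makes step 5 straightforward: the fitted pipeline can be serialized and served as a single artifact.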

📈 Results

Model Evaluation Results

| Model              | ROC-AUC  | PR-AUC   | Precision | Recall   | F1       |
|--------------------|----------|----------|-----------|----------|----------|
| LogisticRegression | 0.847129 | 0.666142 | 0.500840  | 0.796791 | 0.615067 |
| RandomForest       | 0.841550 | 0.650611 | 0.518389  | 0.791444 | 0.626455 |
| XGBoost            | 0.846979 | 0.660561 | 0.685714  | 0.513369 | 0.587156 |
| CatBoost           | 0.845748 | 0.666034 | 0.513605  | 0.807487 | 0.627859 |
| StackingEnsemble   | 0.842646 | 0.644170 | 0.667832  | 0.510695 | 0.578788 |
| DecisionTree       | 0.627193 | 0.352190 | 0.461972  | 0.438503 | 0.449931 |
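As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so each F1 value can be recomputed from its own row. For the CatBoost row:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# CatBoost row from the results table above.
print(round(f1(0.513605, 0.807487), 6))  # 0.627859
```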

🚀 Improvement Note: These results can be further improved by implementing:

  • Advanced feature engineering techniques (RFM analysis, customer behavior patterns)
  • Deep learning approaches (Neural networks for complex pattern recognition)
  • Hyperparameter tuning with more extensive search space
  • Better handling of class imbalance using advanced sampling techniques

This project will be enhanced with these improvements in the near future.

💡 Key Insights

  • 🥇 CatBoost achieved the highest F1 score (0.627859), with the best balance of precision and recall
  • 📊 LogisticRegression shows strong performance, with the highest ROC-AUC (0.847129)
  • 🎯 XGBoost provides the highest precision (0.685714), but at the cost of lower recall
  • 📈 The ensemble models clearly outperform the Decision Tree baseline

🔑 Feature Importance

  • 📝 Contract type, tenure, and service issues are the strongest predictors
  • ⚠️ Month-to-month contracts with technical issues show the highest churn rates
  • 💰 Targeted interventions show 3-5x return on investment
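The 3-5x figure is a campaign-level estimate from the business-impact analysis; the underlying arithmetic is simple enough to sketch (all inputs below are illustrative placeholders, not the project's numbers):

```python
def campaign_roi(n_targeted: int, offer_cost: float,
                 accept_rate: float, saved_clv: float) -> tuple:
    """Return (net_profit, roi_multiple) for a retention campaign.

    n_targeted  : customers contacted
    offer_cost  : cost per contacted customer
    accept_rate : fraction who accept the offer and stay
    saved_clv   : lifetime value retained per saved customer
    """
    cost = n_targeted * offer_cost
    revenue = n_targeted * accept_rate * saved_clv
    return revenue - cost, revenue / cost

profit, multiple = campaign_roi(1000, 50, 0.30, 700)
print(profit, round(multiple, 2))  # 160000.0 4.2
```

Targeting only the customers the model flags as high-risk raises `accept_rate`'s effective payoff, which is why a well-calibrated churn model drives the ROI multiple.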

🚀 Deployment

The model will be deployed as a real-time prediction service using FastAPI; deployment instructions will be added once the service is published.
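In the meantime, the core request-handling logic such a service would wrap can be sketched with the standard library alone (the field names, scoring stub, and 0.5 threshold are assumptions, not the project's published API):

```python
import json

REQUIRED_FIELDS = {"tenure", "MonthlyCharges", "Contract"}  # assumed request schema

def predict_handler(body: str, score_fn, threshold: float = 0.5) -> dict:
    """Validate a JSON request body and return a churn decision.

    `score_fn` stands in for the trained model's predict_proba; the real
    service would load the model once at startup and expose this logic
    behind a FastAPI POST route.
    """
    payload = json.loads(body)
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return {"error": f"missing fields: {sorted(missing)}"}
    proba = score_fn(payload)
    return {"churn_probability": proba, "churn": proba >= threshold}

def toy_score(payload: dict) -> float:
    # Placeholder rule, not the trained model: month-to-month = high risk.
    return 0.8 if payload["Contract"] == "Month-to-month" else 0.2

print(predict_handler('{"tenure": 2, "MonthlyCharges": 70.0, "Contract": "Month-to-month"}', toy_score))
# {'churn_probability': 0.8, 'churn': True}
```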

🔮 Future Work

  • 🔄 Model monitoring and retraining pipeline
  • 🗃️ Feature store for reproducibility
  • 🔍 Advanced model interpretability
  • 🧪 A/B testing framework for retention strategies