deaneeth/churn-prediction-production-pipeline
Production-grade machine learning pipeline for customer churn prediction with modular components for data preprocessing, model training, and streaming inference. This third repo in a series builds on previous work to create a deployment-ready prediction system with comprehensive configuration, evaluation metrics, and scalable architecture.
Customer Churn Prediction Production Pipeline
Welcome to the production pipeline phase of the Customer Churn Prediction project! This is the third repository in a series focused on building a complete churn prediction system.
This repository follows the work from Customer Churn Prediction – EDA & Data Preprocessing Pipeline and Customer Churn Prediction – Model Training & Evaluation Pipeline.
Here, we bring together all learnings from previous phases to create a comprehensive end-to-end machine learning pipeline for customer churn prediction. The pipeline is designed with production deployment in mind, implementing a robust ML workflow from data ingestion to model deployment and streaming inference.
Overview
This project provides a production-ready pipeline for predicting customer churn based on historical data. It includes comprehensive data preprocessing, feature engineering, model training, evaluation, and inference capabilities.
The modular architecture allows for easy maintenance, scaling, and adaptation to similar prediction problems.
Development Status
| Pipeline | Status | Description |
|---|---|---|
| Data Pipeline | Fully Functional | Complete data preprocessing pipeline ready for production use |
| Training Pipeline | Under Development | Model training and evaluation - coming soon |
| Streaming Inference Pipeline | Under Development | Real-time prediction service - coming soon |
Note: The training and streaming inference pipelines are currently under active development and will be updated to fully functional status soon. The data pipeline is production-ready and can be used independently.
Steps Followed from the Previous Repositories
If you're new to this series, it's recommended to explore the previous repositories first:
- Customer Churn Prediction – EDA & Data Preprocessing Pipeline
  - Exploratory data analysis and visualization
  - Data cleaning and preprocessing techniques
  - Feature engineering fundamentals
  - Handling missing values and outliers
- Customer Churn Prediction – Model Training & Evaluation Pipeline
  - Model selection and training workflows
  - Hyperparameter tuning strategies
  - Cross-validation approaches
  - Performance evaluation metrics
- Current Repository: Production Pipeline
  - End-to-end production architecture
  - Streaming inference capability
  - Model versioning and monitoring
  - Deployment-ready code structure
This repository combines the learnings from both previous phases and adds production-level architecture for deployment-ready inference capabilities. Note that this is part of an ongoing series, with more advanced implementations planned for future repositories.
Project Structure
churn-prediction-production-pipeline/
├── artifacts/                 # Model artifacts and processed data
│   ├── data/                  # Split datasets (X_train, X_test, Y_train, Y_test)
│   ├── encode/                # Encoding artifacts for categorical features
│   ├── models/                # Trained model files (configured in config.yaml)
│   ├── evaluation/            # Model evaluation reports
│   └── predictions/           # Prediction outputs
├── data/                      # Data directory
│   ├── raw/                   # Raw dataset (ChurnModelling.csv)
│   ├── imputed/               # Temporary storage for imputed data
│   └── processed/             # Fully processed datasets
├── pipelines/                 # End-to-end pipelines
│   ├── data_pipeline.py       # Data preprocessing pipeline
│   ├── training_pipeline.py   # Model training pipeline
│   └── streaming_inference_pipeline.py  # Inference pipeline
├── src/                       # Core functionality modules
│   ├── data_ingestion.py      # Data loading utilities
│   ├── data_splitter.py       # Train-test splitting
│   ├── feature_binning.py     # Feature discretization
│   ├── feature_encoding.py    # Categorical encoding
│   ├── feature_scaling.py     # Feature normalization
│   ├── handle_missing_values.py  # Imputation strategies
│   ├── model_building.py      # Model architecture
│   ├── model_evaluation.py    # Performance metrics
│   ├── model_inference.py     # Prediction service
│   ├── model_training.py      # Training utilities
│   └── outlier_detection.py   # Outlier handling
├── utils/                     # Helper utilities
│   └── config.py              # Configuration management
├── config.yaml                # Configuration parameters
├── Makefile                   # Automation commands
└── requirements.txt           # Dependencies

Features
Comprehensive Data Preprocessing Pipeline
- Missing value imputation
- Outlier detection and handling
- Feature binning and encoding
- Feature scaling
Flexible Model Training
- Multiple algorithm support
- Cross-validation
- Hyperparameter tuning
Robust Model Evaluation
- Performance metrics calculation
- Model comparison
Production-Ready Inference Pipeline
- Streaming prediction capability
- Model versioning
Configurable Pipeline
- YAML-based configuration
- Easy parameter tuning
Requirements
- Python 3.11+ (compatible with Python 3.11, 3.12, and 3.13)
- Pandas >= 1.5.0
- NumPy >= 1.21.0
- Scikit-learn >= 1.1.0
- XGBoost >= 1.6.0
- LightGBM >= 3.3.0
- FastAPI >= 0.95.0 (for API deployment)
- Groq >= 0.11.0 (for advanced imputation)
- Additional packages listed in requirements.txt
Installation

Installation Steps
1. Clone this repository:

   git clone https://github.com/deaneeth/churn-prediction-production-pipeline.git
   cd churn-prediction-production-pipeline

2. Create a virtual environment (optional but recommended):

   # For Unix/Mac
   python -m venv venv
   source venv/bin/activate

   # For Windows
   python -m venv .venv
   .venv\Scripts\activate

3. Install the required packages:

   pip install -r requirements.txt

Using the Makefile (Windows)
The project includes a Makefile for common operations:
# Install dependencies and set up environment
make install
# Run the data pipeline
make data-pipeline
# Run the training pipeline
make train-pipeline
# Run the streaming inference pipeline
make streaming-inference
# Run all pipelines in sequence
make run-all
# Get help on available commands
make help

Usage
Data Preprocessing Pipeline
from pipelines.data_pipeline import data_pipeline
# Run the complete data preprocessing pipeline
data = data_pipeline(data_path="data/raw/ChurnModelling.csv")

Model Training Pipeline
from pipelines.training_pipeline import train_model
# Train and evaluate the model
model, metrics = train_model(model_type="random_forest")

Inference Pipeline
from pipelines.streaming_inference_pipeline import predict
# Make predictions on new data
predictions = predict(input_data)

Pipeline Components
Data Pipeline
The data pipeline handles:
- Data ingestion from CSV files
- Missing value imputation (mean, mode, custom strategies)
- Outlier detection using IQR or Z-score methods
- Feature binning for numeric variables
- Encoding of categorical variables
- Feature scaling
- Train-test splitting
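As a rough illustration of these steps, here is a minimal scikit-learn sketch. The repo implements each step in its own src/ module; the toy data and column names below are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, StandardScaler

# Toy frame standing in for ChurnModelling.csv (columns are illustrative)
df = pd.DataFrame({
    "CreditScore": [619.0, 608.0, np.nan, 699.0, 850.0],
    "Geography": ["France", "Spain", "France", np.nan, "Spain"],
    "Exited": [1, 0, 1, 0, 0],
})

# Missing value imputation: mean for numeric, mode for categorical
df["CreditScore"] = SimpleImputer(strategy="mean").fit_transform(df[["CreditScore"]]).ravel()
df["Geography"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["Geography"]]).ravel()

# Outlier handling via the IQR rule (clipping rather than dropping rows)
q1, q3 = df["CreditScore"].quantile([0.25, 0.75])
iqr = q3 - q1
df["CreditScore"] = df["CreditScore"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Feature binning for a numeric variable
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
df["ScoreBin"] = binner.fit_transform(df[["CreditScore"]]).ravel()

# Categorical encoding and feature scaling
geo = OneHotEncoder().fit_transform(df[["Geography"]]).toarray()
df["CreditScore"] = StandardScaler().fit_transform(df[["CreditScore"]]).ravel()

# Train-test split
X = np.hstack([df[["CreditScore", "ScoreBin"]].to_numpy(), geo])
X_train, X_test, y_train, y_test = train_test_split(
    X, df["Exited"], test_size=0.4, random_state=42
)
print(X_train.shape, X_test.shape)
```

In the actual pipeline these stages run in sequence inside `data_pipeline`, with the intermediate artifacts written to the directories shown in the project structure.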
Training Pipeline
The training pipeline implements:
- Model selection from multiple algorithms
- Model training with cross-validation
- Hyperparameter tuning (optional)
- Performance evaluation
- Model persistence
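A hedged sketch of that workflow with scikit-learn; the synthetic data, parameter grid, and output file name are illustrative, not the repo's actual settings (those live in config.yaml):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the preprocessed training data
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Cross-validated baseline score for the selected algorithm
base = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(base, X, y, cv=5, scoring="accuracy")

# Optional hyperparameter tuning over a small grid
search = GridSearchCV(
    base,
    {"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
search.fit(X, y)

# Persist the best model (the real pipeline writes to artifacts/models/)
joblib.dump(search.best_estimator_, "model.joblib")
print(round(cv_scores.mean(), 3), search.best_params_)
```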
Inference Service
The inference pipeline provides:
- Loading of trained models
- Data preprocessing for new inputs
- Prediction generation
- Result formatting
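A minimal sketch of that flow, assuming a joblib-persisted scikit-learn model; the file name, `predict` signature, and result format are assumptions for illustration:

```python
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for a previously persisted model
# (the real pipeline loads from the path configured under artifacts/models/)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
joblib.dump(RandomForestClassifier(random_state=0).fit(X, y), "churn_model.joblib")

def predict(input_rows: np.ndarray) -> list[dict]:
    """Load the trained model, score the inputs, and format the results."""
    model = joblib.load("churn_model.joblib")
    probs = model.predict_proba(input_rows)[:, 1]
    return [
        {"churn": bool(p >= 0.5), "probability": round(float(p), 4)}
        for p in probs
    ]

results = predict(X[:3])
print(results)
```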
Configuration
All pipeline parameters are configured in config.yaml. Key configuration sections include:
| Section | Description |
|---|---|
| Data Paths | Locations of raw data, processed data, and artifacts |
| Columns | Target variable, feature columns, columns to drop |
| Data Preprocessing | Strategies for handling missing values, outliers, etc. |
| Feature Engineering | Binning, encoding, scaling parameters |
| Training | Model type, training strategy, hyperparameter tuning |
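As a hedged illustration, YAML configuration like this is typically loaded with PyYAML; the keys below mirror the table's sections, but the actual structure of config.yaml may differ (the repo reads it through utils/config.py):

```python
import yaml  # PyYAML; listed among the project's dependencies

# Inline sample mirroring the assumed shape of config.yaml
raw = """
data_paths:
  raw: data/raw/ChurnModelling.csv
  artifacts: artifacts/
columns:
  target: Exited
training:
  model_type: random_forest
"""

config = yaml.safe_load(raw)
print(config["training"]["model_type"])
```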
Model Performance
The pipeline includes robust evaluation metrics for model performance, including:
| Metric | Description |
|---|---|
| Accuracy | Overall prediction correctness |
| Precision, Recall, F1-score | Class-specific performance metrics |
| ROC AUC | Classification quality at various thresholds |
| Confusion Matrix | Detailed breakdown of predictions vs. actual values |
Performance metrics are calculated during model training and can be accessed through the training pipeline output.
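A toy example of computing the metrics in the table with scikit-learn; the labels and probabilities are made up for illustration:

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

# Made-up ground truth and predicted churn probabilities
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.1, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold at 0.5

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)          # uses probabilities, not labels
cm = confusion_matrix(y_true, y_pred)        # rows: actual, columns: predicted

print(f"accuracy={acc:.3f} roc_auc={auc:.3f}")
print(cm)
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
```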
Deployment
This project is designed to be deployed in a production environment. The inference pipeline supports streaming predictions for real-time applications.
Streaming Inference Pipeline
The streaming inference pipeline provides real-time prediction capabilities:
- FastAPI Integration: Ready for RESTful API deployment
- Batch Processing: Support for both single requests and batch predictions
- Probability Output: Returns both predictions and probability scores
- Real-time Processing: Designed for low-latency inference
- Configurable: Easily adjusted through the config.yaml settings
Example of deploying the streaming API:
uvicorn pipelines.streaming_inference_pipeline:app --reload --port 8000

After deployment, predictions can be obtained by sending POST requests with customer data to the /predict endpoint.
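A hedged example of what such a request might look like from Python. The payload field names follow the common ChurnModelling dataset schema, and the endpoint's exact request format is an assumption; the network call itself is commented out since it requires the server to be running.

```python
import json
import urllib.request

# Hypothetical customer record; field names are assumptions based on the dataset
payload = {
    "CreditScore": 650,
    "Geography": "France",
    "Gender": "Female",
    "Age": 40,
    "Tenure": 3,
    "Balance": 60000.0,
    "NumOfProducts": 2,
    "IsActiveMember": 1,
    "EstimatedSalary": 50000.0,
}
body = json.dumps(payload).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8000/predict",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# with urllib.request.urlopen(req) as resp:  # requires the running server
#     print(json.load(resp))
print(body.decode("utf-8"))
```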
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Created with ❤️ by deaneeth