
deaneeth/churn-prediction-production-pipeline

Production-grade machine learning pipeline for customer churn prediction with modular components for data preprocessing, model training, and streaming inference. This third repo in a series builds on previous work to create a deployment-ready prediction system with comprehensive configuration, evaluation metrics, and scalable architecture.

🚀 Customer Churn Prediction Production Pipeline


Welcome to the production pipeline phase of the Customer Churn Prediction project! This is the third repository in a series focused on building a complete churn prediction system.

This repository follows the work from Customer Churn Prediction – EDA & Data Preprocessing Pipeline and Customer Churn Prediction – Model Training & Evaluation Pipeline.

Here, we bring together all learnings from previous phases to create a comprehensive end-to-end machine learning pipeline for customer churn prediction. The pipeline is designed with production deployment in mind, implementing a robust ML workflow from data ingestion to model deployment and streaming inference.

๐Ÿ” Overview

This project provides a production-ready pipeline for predicting customer churn based on historical data. It includes comprehensive data preprocessing, feature engineering, model training, evaluation, and inference capabilities.

The modular architecture allows for easy maintenance, scaling, and adaptation to similar prediction problems.

🚧 Development Status

| Pipeline | Status | Description |
|----------|--------|-------------|
| 🔄 Data Pipeline | ✅ Fully Functional | Complete data preprocessing pipeline ready for production use |
| 🧠 Training Pipeline | 🚧 Under Development | Model training and evaluation - coming soon |
| 🔮 Streaming Inference Pipeline | 🚧 Under Development | Real-time prediction service - coming soon |

📅 Note: The training and streaming inference pipelines are currently under active development and will be updated to fully functional status soon. The data pipeline is production-ready and can be used independently.

📌 Steps Followed from the Previous Repositories

If you're new to this series, it's recommended to explore the previous repositories first:

  1. 📊 Customer Churn Prediction – EDA & Data Preprocessing Pipeline

    • Exploratory data analysis and visualization
    • Data cleaning and preprocessing techniques
    • Feature engineering fundamentals
    • Handling missing values and outliers
  2. 🧠 Customer Churn Prediction – Model Training & Evaluation Pipeline

    • Model selection and training workflows
    • Hyperparameter tuning strategies
    • Cross-validation approaches
    • Performance evaluation metrics
  3. 🚀 Current Repository: Production Pipeline

    • End-to-end production architecture
    • Streaming inference capability
    • Model versioning and monitoring
    • Deployment-ready code structure

🔄 This repository combines the learnings from both previous phases and adds production-level architecture for deployment-ready inference capabilities. Note that this is part of an ongoing series, with more advanced implementations planned for future repositories.

๐Ÿ“ Project Structure

churn-prediction-production-pipeline/
├── artifacts/                                     # Model artifacts and processed data
│   ├── data/                                      # Split datasets (X_train, X_test, Y_train, Y_test)
│   ├── encode/                                    # Encoding artifacts for categorical features
│   ├── models/                                    # Trained model files (configured in config.yaml)
│   ├── evaluation/                                # Model evaluation reports
│   └── predictions/                               # Prediction outputs
├── data/                                          # Data directory
│   ├── raw/                                       # Raw dataset (ChurnModelling.csv)
│   ├── imputed/                                   # Temporary storage for imputed data
│   └── processed/                                 # Fully processed datasets
├── pipelines/                                     # End-to-end pipelines
│   ├── data_pipeline.py                           # Data preprocessing pipeline
│   ├── training_pipeline.py                       # Model training pipeline
│   └── streaming_inference_pipeline.py            # Inference pipeline
├── src/                                           # Core functionality modules
│   ├── data_ingestion.py                          # Data loading utilities
│   ├── data_splitter.py                           # Train-test splitting
│   ├── feature_binning.py                         # Feature discretization
│   ├── feature_encoding.py                        # Categorical encoding
│   ├── feature_scaling.py                         # Feature normalization
│   ├── handle_missing_values.py                   # Imputation strategies
│   ├── model_building.py                          # Model architecture
│   ├── model_evaluation.py                        # Performance metrics
│   ├── model_inference.py                         # Prediction service
│   ├── model_training.py                          # Training utilities
│   └── outlier_detection.py                       # Outlier handling
├── utils/                                         # Helper utilities
│   └── config.py                                  # Configuration management
├── config.yaml                                    # Configuration parameters
├── Makefile                                       # Automation commands
└── requirements.txt                               # Dependencies

✨ Features

🧹 Comprehensive Data Preprocessing Pipeline

  • Missing value imputation
  • Outlier detection and handling
  • Feature binning and encoding
  • Feature scaling
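
The repository's `src/handle_missing_values.py` implements the imputation step; as a rough illustration of the idea only, here is a minimal pure-Python sketch (the function name and row format are hypothetical, not the project's API): numeric columns are filled with the column mean, categorical columns with the column mode.

```python
from statistics import mean, mode

def impute_missing(rows, numeric_cols, categorical_cols):
    """Fill None values: column mean for numeric, column mode for categorical."""
    fills = {}
    for col in numeric_cols:
        fills[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        fills[col] = mode(r[col] for r in rows if r[col] is not None)
    return [
        {col: (fills[col] if val is None and col in fills else val)
         for col, val in row.items()}
        for row in rows
    ]

customers = [
    {"age": 42, "geography": "France"},
    {"age": None, "geography": "Spain"},
    {"age": 38, "geography": None},
]
filled = impute_missing(customers, ["age"], ["geography"])
```

The actual pipeline also supports custom strategies (see the configuration section), which a sketch like this would dispatch on per column.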

🧠 Flexible Model Training

  • Multiple algorithm support
  • Cross-validation
  • Hyperparameter tuning

📊 Robust Model Evaluation

  • Performance metrics calculation
  • Model comparison

🔄 Production-Ready Inference Pipeline

  • Streaming prediction capability
  • Model versioning

โš™๏ธ Configurable Pipeline

  • YAML-based configuration
  • Easy parameter tuning

📋 Requirements

  • Python 3.11+ (compatible with Python 3.11, 3.12, and 3.13)
  • Pandas >= 1.5.0
  • NumPy >= 1.21.0
  • Scikit-learn >= 1.1.0
  • XGBoost >= 1.6.0
  • LightGBM >= 3.3.0
  • FastAPI >= 0.95.0 (for API deployment)
  • Groq >= 0.11.0 (for advanced imputation)
  • Additional packages listed in requirements.txt

๐Ÿ› ๏ธ Installation

๐Ÿ“‹ Installation Steps

  1. Clone this repository:
git clone https://github.com/deaneeth/churn-prediction-production-pipeline.git
cd churn-prediction-production-pipeline
  2. Create a virtual environment (optional but recommended):
# For Unix/Mac
python -m venv .venv
source .venv/bin/activate

# For Windows
python -m venv .venv
.venv\Scripts\activate
  3. Install the required packages:
pip install -r requirements.txt

🔧 Using the Makefile (Windows)

The project includes a Makefile for common operations:

# Install dependencies and set up environment
make install

# Run the data pipeline
make data-pipeline

# Run the training pipeline
make train-pipeline

# Run the streaming inference pipeline
make streaming-inference

# Run all pipelines in sequence
make run-all

# Get help on available commands
make help

๐Ÿ“ Usage

๐Ÿ”„ Data Preprocessing Pipeline

from pipelines.data_pipeline import data_pipeline

# Run the complete data preprocessing pipeline
data = data_pipeline(data_path="data/raw/ChurnModelling.csv")

🧪 Model Training Pipeline

from pipelines.training_pipeline import train_model

# Train and evaluate the model
model, metrics = train_model(model_type="random_forest")

🔮 Inference Pipeline

from pipelines.streaming_inference_pipeline import predict

# Make predictions on new data
predictions = predict(input_data)

🔧 Pipeline Components

🔍 Data Pipeline

The data pipeline handles:

  • 📥 Data ingestion from CSV files
  • 🧩 Missing value imputation (mean, mode, custom strategies)
  • 🔎 Outlier detection using IQR or Z-score methods
  • 📊 Feature binning for numeric variables
  • 🔄 Encoding of categorical variables
  • ⚖️ Feature scaling
  • ✂️ Train-test splitting

🧠 Training Pipeline

The training pipeline implements:

  • 🤖 Model selection from multiple algorithms
  • 🔄 Model training with cross-validation
  • 🎛️ Hyperparameter tuning (optional)
  • 📏 Performance evaluation
  • 💾 Model persistence
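
The cross-validation step above splits the training data into k folds, trains on k-1 of them, and evaluates on the held-out fold, rotating the held-out part. A minimal sketch of the index bookkeeping (pure Python; the real pipeline presumably delegates this to scikit-learn):

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    # Distribute samples as evenly as possible across folds.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(10, n_splits=5))
```

Each sample appears in exactly one test fold, so averaging the per-fold scores gives an estimate that uses every row for both training and evaluation.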

🔮 Inference Service

The inference pipeline provides:

  • 📤 Loading of trained models
  • 🔍 Data preprocessing for new inputs
  • 🔮 Prediction generation
  • 📋 Result formatting

โš™๏ธ Configuration

All pipeline parameters are configured in config.yaml. Key configuration sections include:

Section Description
๐Ÿ“‚ Data Paths Locations of raw data, processed data, and artifacts
๐Ÿ“Š Columns Target variable, feature columns, columns to drop
๐Ÿงน Data Preprocessing Strategies for handling missing values, outliers, etc.
๐Ÿ”ง Feature Engineering Binning, encoding, scaling parameters
๐Ÿง  Training Model type, training strategy, hyperparameter tuning
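
A fragment like the following illustrates the shape such a config.yaml might take; the keys and column names below are illustrative guesses, not the repository's actual schema:

```yaml
# Illustrative config.yaml fragment - actual keys may differ
data_paths:
  raw: data/raw/ChurnModelling.csv
  processed: data/processed/
  artifacts: artifacts/
columns:
  target: Exited
  drop: [RowNumber, CustomerId, Surname]
preprocessing:
  missing_values:
    numeric_strategy: mean
    categorical_strategy: mode
  outliers:
    method: iqr
    threshold: 1.5
training:
  model_type: random_forest
  cross_validation_folds: 5
```

Keeping these parameters in YAML means a strategy change (say, iqr to zscore) is a one-line edit rather than a code change.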

📈 Model Performance

The pipeline computes a robust set of evaluation metrics for model performance:

| Metric | Description |
|--------|-------------|
| ✅ Accuracy | Overall prediction correctness |
| 📊 Precision, Recall, F1-score | Class-specific performance metrics |
| 📉 ROC AUC | Classification quality at various thresholds |
| 🔢 Confusion Matrix | Detailed breakdown of predictions vs. actual values |

Performance metrics are calculated during model training and can be accessed through the training pipeline output.
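
All of these scalar metrics derive from the confusion matrix. A small worked example with made-up counts (a sketch, not the project's `model_evaluation.py`):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0   # flagged churners that really churned
    recall = tp / (tp + fn) if tp + fn else 0.0      # real churners that were caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 80 churners flagged correctly, 20 missed, 30 false alarms
m = classification_metrics(tp=80, fp=30, fn=20, tn=870)
```

For churn, recall usually matters more than raw accuracy: with a 10% churn rate, a model that predicts "no churn" for everyone is 90% accurate yet catches nothing.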

🚀 Deployment

This project is designed to be deployed in a production environment. The inference pipeline supports streaming predictions for real-time applications.

🔄 Streaming Inference Pipeline

The streaming inference pipeline provides real-time prediction capabilities:

  • FastAPI Integration: Ready for RESTful API deployment
  • Batch Processing: Support for both single requests and batch predictions
  • Probability Output: Returns both predictions and probability scores
  • Real-time Processing: Designed for low-latency inference
  • Configurable: Easily adjusted through the config.yaml settings

Example of deploying the streaming API:

uvicorn pipelines.streaming_inference_pipeline:app --reload --port 8000

After deployment, predictions can be obtained by sending POST requests with customer data to the /predict endpoint.
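
A client-side sketch of such a request, using only the standard library; the feature names are assumed from the ChurnModelling dataset and the endpoint schema is hypothetical:

```python
import json
import urllib.request

def build_predict_request(customer, host="http://localhost:8000"):
    """Construct (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps(customer).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_predict_request({
    "CreditScore": 650, "Geography": "France", "Age": 42,
    "Balance": 125000.0, "NumOfProducts": 2, "IsActiveMember": 1,
})
# urllib.request.urlopen(req) would send it once the server is running
```

The response would then carry both the churn prediction and its probability score, per the pipeline features listed above.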

👥 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Created with โค๏ธ by deaneeth