Banking Customer Churn Prediction

Predicting customer churn using machine learning to enable proactive retention strategies and reduce revenue loss for banking institutions.

Business Problem

Customer churn is a critical challenge in the banking sector, with acquiring new customers costing 5-25 times more than retaining existing ones. This project addresses the need for early identification of at-risk customers, enabling banks to:

Reduce Revenue Loss: Predict churners before they leave, allowing timely intervention
Optimize Marketing Spend: Target retention campaigns to high-risk customers
Improve Customer Lifetime Value: Implement personalized retention strategies
Resource Allocation: Focus relationship managers on accounts most likely to churn

Key Findings

Model Performance

Model	Training Accuracy	Testing Accuracy	F1 Score	Precision	Recall
Decision Tree	89.99%	84.65%	0.85	0.85	0.85
Random Forest	90.47%	85.00%	0.85	0.85	0.85

Random Forest demonstrated superior generalization with balanced performance across all metrics.

Critical Business Insights

Customer Churn Distribution

79.6% customers retained vs 20.4% churned — SMOTE applied to balance dataset.

Geography & Product Impact on Churn

Germany shows 2× higher churn rate; two-product customers are most loyal.

Product Strategy: Customers with exactly 2 products have the lowest churn rate. Single-product customers show significantly higher attrition.

Feature Importance Analysis

Feature Importance — Top Predictors of Churn

Age, Account Balance, and Geography dominate churn prediction.

Top Predictors:

Age (log-transformed): Older customers show higher churn propensity
Account Balance: Zero-balance accounts are strong churn indicators
Geography: Regional factors significantly impact retention
Number of Products: Product engagement inversely correlates with churn

Methodology

Data Overview

Dataset: 10,000 customer records from banking institution
Features: 14 variables including demographics, account details, and banking behavior
Target: Binary classification (Churned: Yes/No)
No Data Leakage: Zero duplicate records, no missing values

Preprocessing Pipeline

Feature Engineering
- Categorized NumOfProducts: Single, Two, More than 2
- Binary transformation of Balance: Zero vs Non-zero
- Removed irrelevant identifiers (RowNumber, CustomerId, Surname)
Data Transformation
- Log-normal transformation on Age to correct right skewness
- One-Hot Encoding for categorical variables
- Label encoding for target variable
Class Balancing
- Applied SMOTE (Synthetic Minority Over-sampling Technique)
- Balanced training set: 6,356 samples per class

Model Development

Decision Tree

GridSearchCV hyperparameter tuning (5-fold CV)
Optimized parameters: entropy criterion, max_depth=9, min_samples_leaf=2

Random Forest (Selected Model)

Ensemble of 50 decision trees
Parameters: max_depth=8, min_samples_leaf=3, class_weight='balanced'
Superior generalization and robustness to overfitting

Model Evaluation

Confusion Matrix & ROC-AUC Curve

Strong separation with ROC-AUC = 0.86 indicating robust model discrimination.

Confusion Matrix Analysis:

True Negatives: 1,471 (correctly identified non-churners)
False Positives: 136 (acceptable over-prediction for proactive intervention)
False Negatives: 164 (critical misses requiring model refinement)
True Positives: 229 (successfully identified churners)

ROC-AUC Score: 0.86 indicating strong discriminative ability

Business Recommendations

Immediate Actions

Product Cross-Selling: Incentivize single-product customers to adopt a second product through bundled offerings
Zero-Balance Monitoring: Implement automated alerts for accounts approaching zero balance
Regional Strategy: Investigate and address specific pain points in German market
Age-Targeted Retention: Design age-specific retention programs for high-risk segments

Long-Term Strategy

Predictive Scoring: Deploy model to generate monthly churn risk scores
Intervention Campaigns: Allocate retention budget based on predicted churn probability
A/B Testing: Validate intervention effectiveness on model-identified high-risk customers
Continuous Learning: Retrain model quarterly with new data to maintain accuracy

Technologies Used

Data Processing: Pandas, NumPy
Machine Learning: Scikit-learn, Imbalanced-learn (SMOTE)
Visualization: Matplotlib, Seaborn
Model Selection: GridSearchCV with 5-fold cross-validation

Results Summary

The Random Forest classifier achieves 85% accuracy in predicting customer churn, with balanced precision-recall tradeoff suitable for business deployment. Feature analysis reveals actionable insights:

Product engagement is the most controllable predictor
Geographic disparities require localized strategies
Zero-balance accounts need proactive engagement
Age-based segmentation enables targeted interventions

Expected Impact: Implementing model-driven retention strategies could reduce churn by 30-40%, translating to significant revenue protection and improved customer lifetime value.

Dataset

Source: Banking Customer Churn Dataset
Records: 10,000 customers
Period: Cross-sectional snapshot
Geography: France, Germany, Spain

Author: Shree Koshti
Contact: Email | LinkedIn | GitHub

This project demonstrates the application of machine learning to solve real-world business problems in the banking sector, providing actionable insights for customer retention strategy.

Shreek195/bank-customer-churn-prediction-hpo