# Credit Risk Modeling Using AutoML
This project demonstrates the use of AutoML and Explainable AI (XAI) to predict credit default risk for a large-scale dataset. The goal is to build an interpretable machine learning model using AutoGluon, XGBoost, and LightGBM with SHAP for model interpretability.
## Dataset

The dataset used in this project is the UCI Credit Card Default Dataset, available from the UCI Machine Learning Repository.

- Number of records: 30,000
- Number of features: 23 features covering credit, demographics, and payment history.
- Target: whether the client will default on their payment next month (`1` = Default, `0` = No Default).
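As a quick sanity check on the label encoding, here is a minimal sketch using a hypothetical in-memory sample (the column names follow the UCI file, but the rows are illustrative only, not real records):

```python
import pandas as pd

# Hypothetical sample rows with a few of the dataset's 23 features;
# the real file has 30,000 records.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000],
    "AGE": [24, 26, 34, 37],
    "PAY_0": [2, -1, 0, 0],
    "default": [1, 0, 0, 0],  # 1 = Default, 0 = No Default
})

# Inspect the class balance (the real dataset is imbalanced,
# with roughly 22% defaults).
counts = df["default"].value_counts()
print(counts)
```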
## Technologies
- LightGBM: gradient-boosted decision trees used for model training.
- XGBoost: Another powerful gradient boosting model.
- AutoGluon: AutoML library for easy model training and hyperparameter tuning.
- SHAP: SHapley Additive exPlanations for model interpretability.
- Apache Spark: Distributed processing for handling large-scale datasets.
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/alihassanml/Credit-Risk-Modeling-Using-AutoML.git
   cd Credit-Risk-Modeling-Using-AutoML
   ```

2. Create a virtual environment (recommended):

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows, use venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Running the Project

1. **Prepare data:** The dataset is already included. If you want to download it manually, visit the UCI Credit Card Default Dataset page.

2. **Preprocessing and feature engineering:** Run the following script to clean and process the data:

   ```bash
   python preprocess.py
   ```

3. **Train models:** The model training pipeline is automated using AutoGluon and LightGBM. To train the models, run:

   ```bash
   python train_model.py
   ```

4. **Model explainability:** After training, SHAP is used to generate interpretability plots:

   ```bash
   python explain_model.py
   ```

5. **Evaluate the model:** You can evaluate the trained model by running:

   ```bash
   python evaluate_model.py
   ```
## Directory Structure

```
.
├── data/                      # Store raw and processed data here
├── src/                       # Source code for preprocessing, training, etc.
│   ├── preprocess.py          # Data preprocessing and feature engineering
│   ├── train_model.py         # AutoML model training (AutoGluon, LightGBM, XGBoost)
│   ├── evaluate_model.py      # Evaluate the trained model
│   ├── explain_model.py       # Model explainability using SHAP
│   └── feature_engineering.py # Custom feature engineering (AutoGluon setup)
├── requirements.txt           # Python dependencies
├── README.md                  # Project documentation
└── outputs/                   # Model outputs and visualizations
```
## Key Features
- AutoML Model Selection: Uses AutoGluon to automatically select the best model and hyperparameters.
- Interpretable Results: Leverages SHAP for explainability, visualizing high-risk patterns for default prediction.
- Scalable: Utilizes Apache Spark for large dataset processing and distributed computing.
## Example Workflow

1. **Preprocess data:** Clean the data and generate derived features (e.g., `AVG_PAY_DELAY`, `MAX_PAY_DELAY`, `PAYMENT_RATIO`).
2. **Train model:** Use AutoGluon to automatically select the best models (e.g., LightGBM, XGBoost), fine-tune hyperparameters, and perform cross-validation.
3. **Evaluate model:** Evaluate the models using metrics such as Accuracy, AUC, Precision, and Recall.
4. **Model explainability:** Use SHAP to interpret model predictions, highlighting which features are most indicative of credit default risk.
## Example Code Snippets

- **Feature Engineering** (`feature_engineering.py`): includes custom feature generation such as `AVG_PAY_DELAY`, `BILL_TOTAL`, and `PAYMENT_RATIO`.
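  The derived features could be computed along these lines; this is a sketch, not the project's exact implementation, and the formulas (mean/max of the `PAY_*` status columns, payments over total bills) are assumptions. The sample rows are illustrative:

  ```python
  import pandas as pd

  # Illustrative rows using the UCI dataset's repayment-status (PAY_*),
  # bill (BILL_AMT*), and payment (PAY_AMT*) columns.
  df = pd.DataFrame({
      "PAY_0": [2, -1], "PAY_2": [2, 0], "PAY_3": [-1, 0],
      "BILL_AMT1": [3913, 2682], "BILL_AMT2": [3102, 1725],
      "PAY_AMT1": [0, 0], "PAY_AMT2": [689, 1000],
  })

  pay_cols = [c for c in df.columns
              if c.startswith("PAY_") and not c.startswith("PAY_AMT")]
  bill_cols = [c for c in df.columns if c.startswith("BILL_AMT")]
  amt_cols = [c for c in df.columns if c.startswith("PAY_AMT")]

  df["AVG_PAY_DELAY"] = df[pay_cols].mean(axis=1)   # average repayment delay
  df["MAX_PAY_DELAY"] = df[pay_cols].max(axis=1)    # worst repayment delay
  df["BILL_TOTAL"] = df[bill_cols].sum(axis=1)      # total billed amount
  # Ratio of total payments to total bills (guard against zero bills).
  df["PAYMENT_RATIO"] = df[amt_cols].sum(axis=1) / df["BILL_TOTAL"].replace(0, 1)
  ```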
- **Model Training** (`train_model.py`):

  ```python
  import pandas as pd
  from autogluon.tabular import TabularPredictor

  # Load preprocessed data
  df = pd.read_csv("processed_data.csv")

  # Train AutoML model
  predictor = TabularPredictor(label="default").fit(df)
  ```
- **Model Explainability with SHAP** (`explain_model.py`):

  ```python
  import shap
  import pandas as pd
  from autogluon.tabular import TabularPredictor

  df = pd.read_csv("processed_data.csv").drop(columns=["default"])

  # AutoGluon saves the predictor to a directory, not a single .pkl file
  predictor = TabularPredictor.load("AutogluonModels/")

  # TreeExplainer needs a tree model rather than the ensemble wrapper;
  # extracting the fitted LightGBM model uses AutoGluon internals and
  # may vary across versions.
  lgbm = predictor._trainer.load_model("LightGBM").model

  # Create SHAP explainer and plot the summary
  explainer = shap.TreeExplainer(lgbm)
  shap_values = explainer.shap_values(df)
  shap.summary_plot(shap_values, df)
  ```
## Evaluation Metrics

- Accuracy: the proportion of predictions that are correct overall.
- AUC (Area Under the ROC Curve): evaluates classification performance across all decision thresholds.
- Precision/Recall/F1 Score: measure performance on the positive (default) class, which matters most for imbalanced datasets.
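These metrics can be computed with scikit-learn; a minimal sketch, where `y_true` and `y_prob` stand in for the real labels and the model's predicted default probabilities (the values here are illustrative):

```python
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_recall_fscore_support)

y_true = [0, 0, 1, 1, 0, 1]               # ground-truth default labels
y_prob = [0.1, 0.4, 0.8, 0.3, 0.2, 0.9]   # predicted default probabilities
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold at 0.5

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)  # threshold-independent
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"Accuracy={acc:.2f}  AUC={auc:.2f}  "
      f"Precision={prec:.2f}  Recall={rec:.2f}  F1={f1:.2f}")
```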
## Limitations

- The dataset may not represent all types of credit risk situations.
- Model performance is sensitive to hyperparameter choices and requires careful tuning.
## License
This project is licensed under the MIT License - see the LICENSE file for details.