FajrinCd/home-credit-default-risk-complete
Complete implementation of Home Credit Default Risk Kaggle competition. Predicts credit default using Python, scikit-learn, SHAP, and ensemble models. Includes data preprocessing, feature engineering, and submission pipeline.
Home Credit Default Risk — Complete Version
Table of Contents
- Project Description
- Key Features
- Dependencies
- Installation and Setup
- Usage
- Project Structure
- Data Folder
- Results and Models
- Contributing
- License
Project Description
This project is a complete implementation of the Kaggle competition Home Credit Default Risk.
It predicts the probability of a client defaulting on a loan based on application and credit history data.
The pipeline covers the full data science process, from data loading to submission file creation, with a focus on preprocessing, feature engineering, modeling, and evaluation.
The code has no Colab-specific dependencies, so it runs unchanged in Google Colab, Jupyter Notebook, or on a local machine. Dataset paths can be adjusted manually.
Key Features
- Data Loading – Loads all Home Credit datasets (application_train, application_test, etc.).
- Data Exploration – Inspects missing values, data types, duplicates, and anomalies.
- Preprocessing – Handles missing values, outliers (IQR capping), normalization, and one-hot encoding.
- Feature Engineering – Creates new features (e.g., income-credit ratio) and aggregates related data.
- Modeling – Trains multiple models (Logistic Regression, Decision Tree, Random Forest, etc.) with hyperparameter tuning and ensemble (Voting Classifier).
- Evaluation – Uses ROC-AUC, ROC curves, and SHAP for interpretability.
- Submission – Generates a final submission.csv for Kaggle upload.
- Business Recommendations – Includes actionable insights for model deployment in real-world use cases.
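As an illustration of the preprocessing and feature engineering steps listed above, here is a minimal sketch of IQR capping followed by an income-to-credit ratio. It is not taken from the project's script: the helper names, the `INCOME_CREDIT_RATIO` column name, and the fence multiplier of 1.5 are assumptions, and the toy values are made up. `AMT_INCOME_TOTAL` and `AMT_CREDIT` are columns of the competition's application table.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def add_income_credit_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a hypothetical INCOME_CREDIT_RATIO column added."""
    out = df.copy()
    # Treat a zero credit amount as missing so the ratio is NaN, not inf.
    credit = out["AMT_CREDIT"].replace(0, float("nan"))
    out["INCOME_CREDIT_RATIO"] = out["AMT_INCOME_TOTAL"] / credit
    return out


# Toy data (made up); the last income is an obvious outlier.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100_000.0, 120_000.0, 90_000.0, 110_000.0, 5_000_000.0],
    "AMT_CREDIT": [400_000.0, 600_000.0, 300_000.0, 550_000.0, 500_000.0],
})
toy["AMT_INCOME_TOTAL"] = cap_outliers_iqr(toy["AMT_INCOME_TOTAL"])
toy = add_income_credit_ratio(toy)
print(toy)
```

Capping before computing the ratio keeps a single extreme income from dominating the derived feature. If capping is applied to real data, the fences would normally be computed on the training set and reused for the test set.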
Dependencies
Install the required Python libraries before running the code:
pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn shap
Installation and Setup
- Clone this repository
  git clone https://github.com/FajrinCd/home-credit-default-risk-complete.git
  cd home-credit-default-risk-complete
- Download the dataset
  Get it from Kaggle: Home Credit Default Risk
  and extract all CSV files into a local folder (e.g., /path/to/home-credit-data/).
- Run the project
  You can run it in Google Colab, Jupyter Notebook, or any Python environment.
Usage
1. Run the Main Code
- Open home_credit_model.py (or your chosen filename).
- Enter the dataset path when prompted (e.g., /path/to/home-credit-data/).
- The code will process data, train models, and generate the final submission.csv.
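The path handling described in the steps above could look roughly like the sketch below. It assumes the script turns the entered folder into a path and reads the two application files from it. The function name `load_application_tables` is hypothetical, and the demo writes tiny stand-in CSVs to a temporary folder instead of prompting for input.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_application_tables(data_dir: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Load application_train.csv and application_test.csv from data_dir."""
    base = Path(data_dir).expanduser()
    train_path = base / "application_train.csv"
    test_path = base / "application_test.csv"
    # Fail early with a readable message if a file is missing.
    missing = [p.name for p in (train_path, test_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {base}: {', '.join(missing)}")
    return pd.read_csv(train_path), pd.read_csv(test_path)


# Demo with stand-in files; real usage would pass the folder typed at the prompt.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]}).to_csv(
        Path(tmp) / "application_train.csv", index=False
    )
    pd.DataFrame({"SK_ID_CURR": [3]}).to_csv(
        Path(tmp) / "application_test.csv", index=False
    )
    train_df, test_df = load_application_tables(tmp)

print(train_df.shape, test_df.shape)
```

Checking for the files before reading them turns a mistyped path into one clear error rather than a failure partway through the pipeline.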
2. Outputs
- Intermediate files: application_train_merged.csv, application_test_merged.csv, application_train_featured.csv, application_test_featured.csv
- Final submission: submission.csv (for Kaggle upload)
- Visualizations: ROC curve, feature importance, SHAP summary plot
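For reference, the competition expects a submission with two columns: `SK_ID_CURR` and a predicted default probability in `TARGET`. The sketch below shows one way to assemble and validate such a file. The helper name `build_submission` and the IDs and probabilities are illustrative assumptions, not output from this project.

```python
import io

import numpy as np
import pandas as pd


def build_submission(ids, probabilities) -> pd.DataFrame:
    """Pair each SK_ID_CURR with its predicted default probability."""
    probs = np.asarray(probabilities, dtype=float)
    if len(ids) != len(probs):
        raise ValueError("ids and probabilities must have the same length")
    if ((probs < 0) | (probs > 1)).any():
        raise ValueError("probabilities must lie in [0, 1]")
    return pd.DataFrame({"SK_ID_CURR": list(ids), "TARGET": probs})


# Made-up IDs and probabilities, for illustration only.
submission = build_submission([100001, 100005], [0.08, 0.61])

# Written to an in-memory buffer here; a real run would pass "submission.csv".
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters: without it pandas writes an extra unnamed index column that is not part of the expected format.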
3. Example Run (Terminal)
python home_credit_model.py
Project Structure
home-credit-default-risk-complete/
├── home_credit_model.py # Main script (complete pipeline)
├── requirements.txt # List of dependencies
├── README.md # Project documentation
└── data # Folder containing all CSV datasets
- requirements.txt – Lists all dependencies (see Dependencies).
- data/ (optional) – Contains all Home Credit datasets. You can download them from the official Kaggle Competition Page.
Data Folder
This folder stores all datasets used in the Home Credit Default Risk project.
Each file provides information about client applications, credit history, and payment performance.
| File | Short Description |
|---|---|
| application_train.csv / application_test.csv | Main loan application data. TARGET indicates loan status (0 = repaid, 1 = defaulted). |
| bureau.csv | Clients’ credit history from other financial institutions. |
| bureau_balance.csv | Monthly data about previous credits from the bureau dataset. |
| previous_application.csv | Clients’ previous loan applications at Home Credit. |
| POS_CASH_balance.csv | Monthly records of previous point-of-sale or cash loans. |
| credit_card_balance.csv | Monthly balance data for previous credit cards. |
| installments_payments.csv | Payment history for previous loans (both made and missed). |
Tip: Store all datasets inside the data/ folder for easier access, or adjust the file paths in your code if they are stored elsewhere.
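The secondary tables link back to the main application data through the client key `SK_ID_CURR`, so they are typically summarized to one row per client before merging. The sketch below shows that pattern for bureau.csv on toy data. The aggregate column names `BUREAU_LOAN_COUNT` and `BUREAU_CREDIT_SUM_MEAN` are hypothetical, and the project's own aggregations may differ.

```python
import pandas as pd

# Toy stand-ins for application_train.csv and bureau.csv (values are made up).
application = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "AMT_CREDIT": [400_000.0, 250_000.0, 900_000.0],
})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0],
})

# Collapse bureau records to one row per client.
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .agg(
        BUREAU_LOAN_COUNT=("SK_ID_BUREAU", "count"),
        BUREAU_CREDIT_SUM_MEAN=("AMT_CREDIT_SUM", "mean"),
    )
    .reset_index()
)

# A left join keeps clients without any bureau history.
merged = application.merge(bureau_agg, on="SK_ID_CURR", how="left")
merged["BUREAU_LOAN_COUNT"] = merged["BUREAU_LOAN_COUNT"].fillna(0).astype(int)
print(merged)
```

Client 3 has no bureau records, so the count is filled with 0 while the mean stays missing and is left for the imputation step to handle.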
Results and Models
- Best Model: Random Forest or Voting Classifier (ROC-AUC of about 0.75 or higher, depending on tuning).
- Evaluation: ROC curves visualize performance; SHAP highlights key predictive features like EXT_SOURCE_2.
- Recommendations: Deploy as part of a credit risk monitoring dashboard using external score features for segmentation.
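To make the ensemble and its metric concrete, here is a minimal sketch of a soft-voting classifier scored with ROC-AUC. It runs on synthetic, imbalanced data from scikit-learn rather than the Home Credit tables, so the score it prints says nothing about this project's results. The choice of base models and every hyperparameter here are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a minority positive class.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=6,
    weights=[0.92, 0.08], random_state=42,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Soft voting averages the predicted probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

# ROC-AUC needs scores for the positive class, not hard labels.
valid_proba = ensemble.predict_proba(X_valid)[:, 1]
auc = roc_auc_score(y_valid, valid_proba)
print(f"Validation ROC-AUC: {auc:.3f}")
```

Soft voting is used because ROC-AUC ranks clients by predicted probability, and hard voting would only produce class labels. The stratified split keeps the share of defaults similar in both partitions.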
Contributing
Contributions are welcome!
- Fork this repository and create a new branch for your feature or fix.
- Submit a pull request with a clear description of your changes.
- Report bugs or suggestions via GitHub Issues.
License
This project is licensed under the MIT License.
You are free to use, modify, and distribute this project for educational or non-commercial purposes.
For questions or feedback, open an issue on GitHub or contact the maintainer: dgartup@gmail.com
Happy coding! 🚀