Credit Scoring

A machine learning credit scoring system deployed as a REST API using FastAPI.

This project demonstrates an end-to-end credit risk modeling pipeline, including feature engineering, feature selection, model training, and real-time scoring via API.


Project Overview

Financial institutions use credit scoring models to estimate the probability that a borrower will default.

This project simulates a production-style credit scoring system that:

  • conducts Exploratory Data Analysis (EDA)
  • performs feature selection using Information Value (IV)
  • filters correlated variables
  • bins features and applies the Weight of Evidence (WOE) transformation with monotonic constraints
  • trains machine learning models with hyperparameters tuned by Optuna
  • analyzes model performance and outputs
  • deploys the model as a FastAPI scoring service

The API receives customer information and returns a credit risk score.

[Figure: development pipeline]

Tech Stack

  • Python (pandas, NumPy, Matplotlib)
  • OptBinning
  • scikit-learn
  • Optuna
  • FastAPI
  • Uvicorn

Project Structure

credit-score
├── app                     # FastAPI application for model serving
│   ├── main.py             # API entry point and endpoint definitions
│   └── scoring.py          # Inference pipeline: load artifacts, transform input, generate score
│
├── data
│   ├── artifacts           # Serialized objects used for model inference
│   │   ├── feature_use.pkl     # List of features used by the trained model
│   │   ├── logistic_model.pkl  # Trained logistic regression model
│   │   └── optbinning.pkl      # Optimal binning object for WOE transformation
│   │
│   ├── interim             # Intermediate datasets during data preparation
│   │   ├── 10K_Lending_Club_Loans_final_features.csv
│   │   └── 10K_Lending_Club_Loans.csv
│   │
│   ├── predicted           # Model prediction outputs
│   │   ├── data10K_Lending_Club_Loans_predicted_bin.csv
│   │   └── data10K_Lending_Club_Loans_predicted.csv
│   │
│   ├── processed           # Cleaned and transformed datasets
│   │   └── 10K_Lending_Club_Loans_optbinning.csv
│   │
│   └── raw                 # Original dataset before preprocessing
│       └── 10K_Lending_Club_Loans.csv
│
├── Dockerfile              # Container configuration for deploying the API
│
├── notebook                # Research and experimentation notebooks
│   ├── 01_data_exploration.ipynb   # Exploratory Data Analysis (EDA)
│   ├── 02_feature_selection.ipynb  # Feature importance & IV analysis
│   ├── 03_model_training.ipynb     # Model development and training
│   ├── 04_model_analysis.ipynb     # Model evaluation and diagnostics
│   └── 05_deployment.ipynb         # Testing inference and API deployment
│
├── README.md               # Project documentation
├── requirements.txt        # Python dependencies
│
└── src                     # Core machine learning pipeline implementation
    ├── __init__.py
    │
    ├── data_eda            # Data exploration utilities
    │   ├── data_eda.py         # Functions for statistical exploration
    │   └── plot.py             # Visualization functions for EDA
    │
    ├── features            # Feature engineering and transformation
    │   └── features_eng.py     # Feature creation, binning, WOE transformation
    │
    ├── models              # Model training and evaluation logic
    │   ├── model_analysis.py   # Model diagnostics (ROC, KS, confusion matrix)
    │   └── model_training.py   # Training pipeline for logistic regression model
    │
    ├── pipeline            # End-to-end ML workflow orchestration
    │
    └── utils               # Shared utilities
        └── logger.py           # Logging configuration for pipeline tracking

Installation

Clone the repository

git clone https://github.com/sirawitjariya-png/credit-score.git

cd credit-score

Create virtual environment

python -m venv venv

Activate environment

source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Dataset

10K_Lending_Club_Loans.csv

Target Variable

is_bad is the target variable for the credit scoring model:

  • 1 = bad loan (default)
  • 0 = good loan (fully paid)




Dataset Feature Description

Column Name             Description
loan_amnt               The amount of the loan applied for by the borrower.
funded_amnt             The total amount funded by investors for the loan.
term                    Loan repayment period, typically 36 or 60 months.
int_rate                Interest rate assigned to the loan, reflecting borrower risk level.
installment             Monthly payment amount owed by the borrower.
grade                   Lending Club credit grade ranging from A (low risk) to G (high risk).
sub_grade               A more granular sub-category of the credit grade (e.g., A1, B3).
emp_title               Borrower's job title.
emp_length              Length of employment (e.g., <1 year, 3 years, 10+ years).
home_ownership          Housing status such as RENT, OWN, or MORTGAGE.
annual_inc              Annual income reported by the borrower.
verification_status     Indicates whether the borrower's income has been verified.
purpose                 Purpose of the loan (e.g., debt consolidation, credit card, car).
title                   Borrower-provided title describing the loan purpose.
zip_code                Borrower ZIP code (partially masked).
addr_state              State where the borrower resides.
dti                     Debt-to-income ratio calculated as total monthly debt divided by income.
delinq_2yrs             Number of delinquent accounts in the last two years.
earliest_cr_line        Date of the borrower’s earliest credit account.
inq_last_6mths          Number of credit inquiries made in the last six months.
mths_since_last_delinq  Months since the borrower's last delinquency.
mths_since_last_record  Months since the borrower's last public record (e.g., bankruptcy).
open_acc                Number of open credit accounts.
pub_rec                 Number of derogatory public records.
revol_bal               Total revolving credit balance.
revol_util              Revolving credit utilization rate (used credit / credit limit).
total_acc               Total number of credit accounts ever opened.
initial_list_status     Loan listing status (f = fractional, w = whole).
policy_code             Internal policy code used by Lending Club.

Run Data Exploration (Optional)

notebook/01_data_exploration.ipynb

This step explores:

  • missing values
  • feature distributions
  • target imbalance
  • correlation analysis
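
The checks listed above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame is made-up stand-in data that reuses a few column names from the dataset, and in practice the frame would be loaded from data/raw/10K_Lending_Club_Loans.csv.

```python
import pandas as pd

# Tiny stand-in for data/raw/10K_Lending_Club_Loans.csv (values are made up).
df = pd.DataFrame(
    {
        "loan_amnt": [4000, 12000, 8000, 15000, 6000, 20000],
        "annual_inc": [50000.0, None, 72000.0, 38000.0, 91000.0, 64000.0],
        "revol_util": [12.1, 55.0, None, 80.3, 23.4, 47.9],
        "is_bad": [0, 0, 1, 0, 0, 1],
    }
)

# Share of missing values per column, highest first.
missing_share = df.isna().mean().sort_values(ascending=False)

# Target imbalance: fraction of good (0) and bad (1) loans.
target_share = df["is_bad"].value_counts(normalize=True)

# Pairwise Pearson correlation between the features (target excluded).
corr = df.drop(columns="is_bad").corr()

print(missing_share)
print(target_share)
print(corr.round(2))
```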

Run Feature Selection and Pre-processing

notebook/02_feature_selection.ipynb

This stage performs:

  • Information Value (IV) calculation
  • correlation filtering
  • feature importance analysis
  • Weight of Evidence (WOE) transformation
  • enforcing monotonic WOE trends across bins
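
As an illustration of the WOE and IV calculations named above, here is a minimal pandas sketch for a feature that is already binned. It assumes the convention WOE = ln(share of goods / share of bads) per bin and IV = sum of (share of goods - share of bads) × WOE. Sign conventions differ between references, and the project itself relies on OptBinning rather than this hand-written helper. The function name `woe_iv_table` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def woe_iv_table(feature: pd.Series, target: pd.Series) -> pd.DataFrame:
    """Per-bin WOE and IV for an already-binned feature and a 0/1 target."""
    counts = pd.crosstab(feature, target)
    good = counts[0]  # is_bad == 0
    bad = counts[1]   # is_bad == 1
    good_dist = good / good.sum()
    bad_dist = bad / bad.sum()
    # A bin with zero goods or zero bads would give an infinite WOE;
    # real pipelines merge or smooth such bins first.
    woe = np.log(good_dist / bad_dist)
    iv = (good_dist - bad_dist) * woe
    return pd.DataFrame({"good": good, "bad": bad, "woe": woe, "iv": iv})


# Made-up example: 100 loans spread over three grades.
grade = pd.Series(["A"] * 44 + ["B"] * 36 + ["C"] * 20, name="grade")
is_bad = pd.Series(
    [0] * 40 + [1] * 4 + [0] * 30 + [1] * 6 + [0] * 10 + [1] * 10, name="is_bad"
)

table = woe_iv_table(grade, is_bad)
total_iv = table["iv"].sum()
print(table)
print("IV:", round(total_iv, 4))
```

In this toy data the WOE falls steadily from grade A to grade C, which is the kind of monotonic trend the binning step aims for.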

Output:

data/processed/
└── 10K_Lending_Club_Loans_optbinning.csv
data/artifacts/
└── optbinning.pkl
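
The correlation filtering step can be sketched as below. This is one common approach and an assumption about how the filter works, not the notebook's confirmed logic: for each pair of features whose absolute correlation exceeds a threshold, the feature with the lower IV is dropped. The helper name `drop_correlated`, the 0.7 threshold, the IV values, and the data are all hypothetical.

```python
import pandas as pd


def drop_correlated(df: pd.DataFrame, iv: dict, threshold: float = 0.7) -> list:
    """Among highly correlated features, keep the ones with the higher IV."""
    corr = df.corr().abs()
    # Visit features from highest to lowest IV so stronger features win ties.
    ordered = sorted(df.columns, key=lambda col: iv[col], reverse=True)
    kept = []
    for col in ordered:
        # Keep the feature only if it is not too correlated with any kept one.
        if all(corr.loc[col, other] <= threshold for other in kept):
            kept.append(col)
    return kept


# Made-up data: installment is almost a linear function of loan_amnt.
features = pd.DataFrame(
    {
        "loan_amnt": [4000, 12000, 8000, 15000, 6000, 20000],
        "installment": [125, 370, 240, 460, 190, 610],
        "annual_inc": [50000, 48000, 72000, 38000, 91000, 64000],
    }
)
iv_values = {"loan_amnt": 0.05, "installment": 0.12, "annual_inc": 0.08}

kept = drop_correlated(features, iv_values)
print(kept)
```

Here loan_amnt is dropped because it is nearly collinear with installment, which has the higher (hypothetical) IV.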

Train the Credit Scoring Model

notebook/03_model_training.ipynb

This stage performs:

  • Hyperparameter tuning with Optuna for:
    • Logistic Regression
    • XGBoost
    • LightGBM
    • CatBoost
  • Best model selection

Output:

data/predicted/
├── data10K_Lending_Club_Loans_predicted.csv
└── data10K_Lending_Club_Loans_predicted_bin.csv
data/artifacts/
├── feature_use.pkl
└── logistic_model.pkl

Deploy with Docker

Build the image

docker build -t credit-scoring .

Check the image

docker images

Run the container

docker run -p 8000:8000 credit-scoring

Running the API

Start the FastAPI server

The API runs at http://127.0.0.1:8000

Interactive documentation is available at http://127.0.0.1:8000/docs


API Endpoints

Health Check

GET /

Response

{
    "message": "Credit Scoring API is running"
}

Credit Score Prediction

POST /score

Example Request

{
    "data": [
        {
            "loan_amnt": 4000,
            "term": " 60 months",
            "int_rate": "7.29%",
            "grade": "A",
            "annual_inc": 50000.0,
            "verification_status": "not verified",
            "purpose": "medical",
            "inq_last_6mths": 0.0,
            "revol_util": 12.1,
            "total_acc": 44.0,
            "installment": 79.76
        }
    ]
}

Example Response

{
    "scores": [0.09566960804156632]
}
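
A minimal client for the scoring endpoint, using only the Python standard library, might look like the sketch below. The URL assumes the API is running locally on port 8000 as described above. The helper names `build_score_request` and `score` are hypothetical, and the response is assumed to carry a "scores" list as in the example response.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/score"

# Same fields as the example request above.
payload = {
    "data": [
        {
            "loan_amnt": 4000,
            "term": " 60 months",
            "int_rate": "7.29%",
            "grade": "A",
            "annual_inc": 50000.0,
            "verification_status": "not verified",
            "purpose": "medical",
            "inq_last_6mths": 0.0,
            "revol_util": 12.1,
            "total_acc": 44.0,
            "installment": 79.76,
        }
    ]
}


def build_score_request(body: dict, url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request for the /score endpoint without sending it."""
    data = json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def score(body: dict, url: str = API_URL, timeout: float = 10.0) -> list:
    """Send the request and return the list under the "scores" key."""
    request = build_score_request(body, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["scores"]
```

With the server running, calling `score(payload)` should return one score per record in "data".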

Author

Sirawit Jariyapongpaiboon

Data Scientist specializing in machine learning and credit risk modeling.

GitHub: https://github.com/sirawitjariya-png
