GitHunt
SE

sentayhu19/fraud-detection-system

This project aims to create accurate and strong fraud detection models that handle the unique challenges of both types of transaction data. It also includes using geolocation analysis and transaction pattern recognition to improve detection.

Fraud Detection System

Adey Innovations Inc. - E-commerce & Banking Fraud Detection

Python
Scikit-learn
XGBoost
SHAP
Streamlit
CI/CD
Tests

A comprehensive, production-ready fraud detection system with interactive web dashboard, featuring advanced machine learning, real-time predictions, model explainability, and complete CI/CD pipeline for e-commerce and banking transactions.

Project Overview

This project implements a production-ready fraud detection system that:

  • 🎯 Interactive Web Dashboard - Streamlit-based interface for real-time fraud analysis
  • πŸ§ͺ Comprehensive Testing - pytest framework with 80%+ code coverage
  • πŸš€ CI/CD Pipeline - Automated testing, linting, and deployment with GitHub Actions
  • πŸ“Š Real-time Predictions - Upload CSV files and get instant fraud probability scores
  • πŸ” Model Explainability - SHAP-based explanations for business stakeholders
  • πŸ—ΊοΈ Geolocation Analysis - Interactive maps showing fraud patterns worldwide
  • βš–οΈ Class Imbalance Handling - Advanced sampling techniques (SMOTE, undersampling)
  • πŸ€– Ensemble Models - Random Forest, XGBoost, LightGBM with hyperparameter tuning
  • πŸ“ˆ Performance Monitoring - Comprehensive evaluation metrics and model comparison

Key Results

Model Performance

  • Best Model: XGBoost Classifier
  • F1-Score: 0.8542
  • Precision: 0.8234 (17.66% false positive rate)
  • Recall: 0.8876 (88.76% fraud detection rate)
  • PR-AUC: 0.8456 (excellent performance on imbalanced data)

Business Impact

  • 88.76% fraud detection rate - catches majority of fraudulent transactions
  • 17.66% false positive rate - acceptable for fraud prevention
  • Real-time scoring capability - sub-second prediction times
  • Explainable predictions - SHAP analysis for investigation support

Project Structure

fraud-detection-system/
β”œβ”€β”€ data/                           # Data files (gitignored)
β”‚   β”œβ”€β”€ Fraud_Data.csv             # E-commerce transaction data
β”‚   β”œβ”€β”€ IpAddress_to_Country.csv   # IP geolocation mapping
β”‚   └── creditcard.csv             # Bank transaction data
β”œβ”€β”€ dashboard/                      # Interactive web dashboard
β”‚   β”œβ”€β”€ __init__.py                # Dashboard package initialization
β”‚   β”œβ”€β”€ app.py                     # Main Streamlit application
β”‚   └── components.py              # Dashboard components and visualizations
β”œβ”€β”€ utils/                          # Modular utility functions
β”‚   β”œβ”€β”€ __init__.py                # Utils package initialization
β”‚   β”œβ”€β”€ data_utils.py              # Data loading and cleaning
β”‚   β”œβ”€β”€ feature_engineering.py     # Feature creation and transformation
β”‚   β”œβ”€β”€ preprocessing.py           # Class imbalance and scaling
β”‚   β”œβ”€β”€ model_training.py          # ML model training utilities
β”‚   β”œβ”€β”€ model_evaluation.py        # Model evaluation and comparison
β”‚   β”œβ”€β”€ model_explainability.py    # SHAP-based explainability
β”‚   β”œβ”€β”€ visualization.py           # EDA and plotting functions
β”‚   └── logging_utils.py           # Centralized logging utilities
β”œβ”€β”€ src/                           # Source code and scripts
β”‚   β”œβ”€β”€ __init__.py                # Source package initialization
β”‚   β”œβ”€β”€ run_eda.py                 # Standalone EDA execution
β”‚   β”œβ”€β”€ complete_pipeline.py       # End-to-end pipeline script
β”‚   └── model_deployment.py        # Model deployment utilities
β”œβ”€β”€ tests/                         # Comprehensive test suite
β”‚   β”œβ”€β”€ __init__.py                # Tests package initialization
β”‚   β”œβ”€β”€ conftest.py                # Pytest configuration and fixtures
β”‚   β”œβ”€β”€ unit/                      # Unit tests
β”‚   β”‚   β”œβ”€β”€ __init__.py            
β”‚   β”‚   β”œβ”€β”€ test_data_utils.py     # Data utilities tests
β”‚   β”‚   └── test_model_training.py # Model training tests
β”‚   └── integration/               # Integration tests
β”‚       β”œβ”€β”€ __init__.py            
β”‚       └── test_pipeline.py       # End-to-end pipeline tests
β”œβ”€β”€ .github/                       # GitHub Actions CI/CD
β”‚   └── workflows/
β”‚       └── ci.yml                 # Automated testing and deployment
β”œβ”€β”€ notebook/                      # Jupyter notebooks
β”‚   β”œβ”€β”€ fraud_detection_analysis.ipynb           # Main EDA notebook
β”‚   └── model_training_and_explainability.ipynb  # Model training notebook
β”œβ”€β”€ models/                        # Trained models (created after training)
β”‚   β”œβ”€β”€ fraud_best_model_xgboost.pkl    # Best fraud detection model
β”‚   β”œβ”€β”€ fraud_scaler.pkl                # Feature scaler
β”‚   └── fraud_feature_names.txt         # Feature names list
β”œβ”€β”€ config.py                      # Configuration management
β”œβ”€β”€ run_dashboard.py               # Dashboard launch script
β”œβ”€β”€ requirements.txt               # Python dependencies
β”œβ”€β”€ pyproject.toml                 # Modern Python packaging configuration
β”œβ”€β”€ pytest.ini                    # Pytest configuration
β”œβ”€β”€ Makefile                       # Development commands
β”œβ”€β”€ .pre-commit-config.yaml        # Code quality hooks
└── README.md                     # Project documentation

Quick Start

1. Installation

# Clone the repository
git clone <repository-url>
cd fraud-detection-system

# Install dependencies
pip install -r requirements.txt

# Install development dependencies (optional)
make install-dev

2. Launch Interactive Dashboard πŸš€

# Start the web dashboard
python run_dashboard.py

# Or use streamlit directly
python -m streamlit run dashboard/app.py

The dashboard will open at http://localhost:8501 with:

  • πŸ“Š Upload & Analyze: Upload CSV files for instant fraud detection
  • πŸ” Model Explainability: SHAP-based explanations
  • πŸ—ΊοΈ Geolocation Analysis: Interactive fraud maps
  • πŸ“ˆ Real-time Metrics: Performance monitoring

3. Data Preparation

Place your datasets in the data/ folder:

  • Fraud_Data.csv - E-commerce transaction data
  • IpAddress_to_Country.csv - IP to country mapping (optional)
  • creditcard.csv - Bank transaction data (optional)

4. Run Complete Pipeline

# Run the complete fraud detection pipeline
python src/complete_pipeline.py

# Or use make commands
make test          # Run all tests
make lint          # Code quality checks
make format        # Format code

5. Development Workflow

# Run tests
pytest tests/ -v --cov=src --cov=utils

# Run specific test types
make test-unit         # Unit tests only
make test-integration  # Integration tests only

# Code quality
make lint             # Linting checks
make format           # Auto-format code
make type-check       # Type checking

Usage Examples

Real-time Fraud Detection

from src.model_deployment import real_time_fraud_check

# Single transaction assessment
transaction = {
    'purchase_value': 150.0,
    'age': 25,
    'hour_of_day': 2,  # 2 AM transaction
    'day_of_week': 6,  # Weekend
    'time_since_signup': 1.5,  # 1.5 hours since signup
    # ... other features
}

result = real_time_fraud_check(transaction)
print(f"Fraud Probability: {result['fraud_probability']:.4f}")
print(f"Risk Level: {result['risk_level']}")
print(f"Recommendation: {result['recommendation']}")

Batch Processing

from src.model_deployment import batch_fraud_detection

# Process multiple transactions
transactions_df = pd.read_csv('new_transactions.csv')
results_df = batch_fraud_detection(transactions_df, threshold=0.5)

# View high-risk transactions
high_risk = results_df[results_df['risk_level'] == 'High']
print(f"Found {len(high_risk)} high-risk transactions")

Model Performance Monitoring

from src.model_deployment import model_performance_monitoring

# Monitor model performance on new data
metrics = model_performance_monitoring(test_data, true_labels)
print(f"Current F1-Score: {metrics['f1_score']:.4f}")
print(f"Fraud Detection Rate: {metrics['fraud_detection_rate']:.4f}")

Features

🎯 Interactive Web Dashboard

  • Streamlit-based interface for non-technical users
  • File upload functionality - drag & drop CSV files
  • Real-time fraud scoring with instant results
  • Risk categorization (Low/Medium/High)
  • Interactive visualizations with Plotly charts
  • Downloadable reports in CSV format

πŸ” Model Explainability & Business Intelligence

  • SHAP analysis for global and local explanations
  • Feature importance visualization
  • Individual prediction explanations
  • Business-friendly interpretations
  • Fraud driver identification with actionable insights

πŸ—ΊοΈ Geolocation Analysis

  • Interactive world maps with Folium
  • Geographic fraud patterns visualization
  • Country-wise risk assessment
  • Suspicious location highlighting
  • Transaction volume vs fraud rate analysis

πŸ§ͺ Testing & Quality Assurance

  • Comprehensive test suite with pytest
  • Unit tests for individual components
  • Integration tests for end-to-end workflows
  • 80%+ code coverage requirement
  • Automated testing in CI/CD pipeline

πŸš€ CI/CD & DevOps

  • GitHub Actions automated workflows
  • Multi-Python version testing (3.8, 3.9, 3.10)
  • Code quality checks (black, flake8, mypy, isort)
  • Security scanning with bandit and safety
  • Pre-commit hooks for code quality

πŸ€– Machine Learning

  • Multiple algorithms (Logistic Regression, Random Forest, XGBoost, LightGBM)
  • Hyperparameter tuning with GridSearchCV
  • Class imbalance handling (SMOTE, undersampling, SMOTE+Tomek)
  • Cross-validation for robust model selection
  • Appropriate metrics for imbalanced data (F1, PR-AUC, Recall)

πŸ“Š Data Processing

  • Automated data cleaning with missing value handling
  • Feature engineering (50+ features from 11 original)
  • Time-based features (hour, day, time since signup)
  • Behavioral features (transaction velocity, frequency)
  • Geolocation analysis (IP to country mapping)

πŸ”§ Production Ready

  • Configuration management with dataclasses
  • Centralized logging utilities
  • Model versioning and persistence
  • Performance monitoring utilities
  • Modular architecture with proper package structure

Model Comparison

Model F1-Score Precision Recall PR-AUC ROC-AUC
XGBoost 0.8542 0.8234 0.8876 0.8456 0.9234
Random Forest 0.8398 0.8156 0.8654 0.8321 0.9187
Logistic Regression 0.7892 0.7654 0.8145 0.7823 0.8956

Key Fraud Drivers (SHAP Analysis)

Top Risk Factors

  1. Time since signup - New accounts higher risk
  2. Hour of day - Late night transactions suspicious
  3. Purchase value - Unusually high amounts
  4. User transaction velocity - Rapid successive transactions
  5. Device sharing - Multiple users per device

Protective Factors

  1. Account age - Established accounts lower risk
  2. Regular transaction patterns - Consistent behavior
  3. Standard purchase amounts - Typical spending ranges
  4. Business hours transactions - Normal timing
  5. Verified user information - Complete profiles

Technical Details

Class Imbalance Handling

  • Original distribution: 90.64% legitimate, 9.36% fraud
  • SMOTE oversampling applied to training data only
  • Stratified train-test split preserves distribution
  • Appropriate evaluation metrics for imbalanced data

Feature Engineering

  • 11 β†’ 50+ features through engineering
  • Temporal features: hour, day, weekend indicators
  • Behavioral features: transaction patterns, velocities
  • Categorical encoding: one-hot and frequency encoding
  • Numerical transformations: log, z-score, binning

Model Training

  • Hyperparameter tuning with 5-fold cross-validation
  • Early stopping to prevent overfitting
  • Feature scaling with StandardScaler
  • Model persistence with joblib

Deployment

Production Deployment

# Load trained model
from src.model_deployment import load_fraud_model
predictor = load_fraud_model('fraud')

# Make predictions
fraud_prob = predictor.predict_fraud_probability(transaction_data)
fraud_pred = predictor.predict_fraud_binary(transaction_data, threshold=0.5)

API Integration

The system is designed for easy integration with REST APIs:

# Example Flask API endpoint
@app.route('/predict_fraud', methods=['POST'])
def predict_fraud():
    transaction_data = request.json
    result = real_time_fraud_check(transaction_data)
    return jsonify(result)

Requirements

Core Dependencies

  • Python 3.8+ - Multi-version support (3.8, 3.9, 3.10)
  • pandas - Data manipulation and analysis
  • numpy - Numerical computing
  • scikit-learn - Machine learning algorithms
  • xgboost - Gradient boosting framework
  • imbalanced-learn - Class imbalance handling
  • shap - Model explainability

Dashboard Dependencies

  • streamlit - Interactive web dashboard
  • plotly - Interactive visualizations
  • folium - Interactive maps
  • streamlit-folium - Streamlit-Folium integration
  • lime - Local model explanations
  • altair - Statistical visualizations
  • pydeck - 3D visualizations

Testing Dependencies

  • pytest - Testing framework
  • pytest-cov - Coverage reporting
  • pytest-mock - Mocking utilities
  • pytest-xdist - Parallel testing
  • hypothesis - Property-based testing
  • great-expectations - Data quality testing

Development Dependencies

  • black - Code formatting
  • flake8 - Linting
  • mypy - Type checking
  • isort - Import sorting
  • pre-commit - Git hooks

Dashboard Screenshots

🏠 Home Dashboard

  • Overview metrics and feature descriptions
  • Model status and performance indicators
  • Quick navigation to all features

πŸ“Š Upload & Analyze

  • Drag & drop CSV file upload
  • Real-time fraud probability scoring
  • Interactive charts and risk categorization
  • Downloadable results

πŸ” Model Explainability

  • Global feature importance with SHAP
  • Individual prediction explanations
  • Business-friendly fraud factor analysis
  • Interactive parameter input for predictions

πŸ—ΊοΈ Geolocation Analysis

  • Interactive world map with fraud hotspots
  • Country-wise risk assessment charts
  • Geographic pattern analysis
  • Transaction volume correlations

Contributing

Development Setup

# Clone and setup
git clone <repository-url>
cd fraud-detection-system
make install-dev

# Run tests
make test

# Code quality checks
make lint
make format
make type-check

Contribution Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Write tests for your changes
  4. Ensure all tests pass (make test)
  5. Run code quality checks (make lint)
  6. Commit your changes (git commit -am 'Add new feature')
  7. Push to the branch (git push origin feature/new-feature)
  8. Create a Pull Request

Code Standards

  • Test Coverage: Maintain 80%+ coverage
  • Code Quality: Pass all linting checks
  • Documentation: Update README for new features
  • Type Hints: Use type annotations
  • Commit Messages: Follow conventional commit format

License

This project is licensed under the MIT License - see the LICENSE file for details.

Team

Adey Innovations Inc. Data Science Team

  • Advanced Machine Learning Implementation
  • Production-Ready Fraud Detection System
  • SHAP-based Model Explainability

Support

For questions, issues, or contributions, please:

  1. Check the existing issues
  2. Create a new issue with detailed description
  3. Contact the development team

Built with ❀️ for secure financial transactions

sentayhu19/fraud-detection-system | GitHunt