sentayhu19/fraud-detection-system
This project aims to create accurate and strong fraud detection models that handle the unique challenges of both types of transaction data. It also includes using geolocation analysis and transaction pattern recognition to improve detection.
Fraud Detection System
Adey Innovations Inc. - E-commerce & Banking Fraud Detection
A comprehensive, production-ready fraud detection system with interactive web dashboard, featuring advanced machine learning, real-time predictions, model explainability, and complete CI/CD pipeline for e-commerce and banking transactions.
Project Overview
This project implements a production-ready fraud detection system that:
- π― Interactive Web Dashboard - Streamlit-based interface for real-time fraud analysis
- π§ͺ Comprehensive Testing - pytest framework with 80%+ code coverage
- π CI/CD Pipeline - Automated testing, linting, and deployment with GitHub Actions
- π Real-time Predictions - Upload CSV files and get instant fraud probability scores
- π Model Explainability - SHAP-based explanations for business stakeholders
- πΊοΈ Geolocation Analysis - Interactive maps showing fraud patterns worldwide
- βοΈ Class Imbalance Handling - Advanced sampling techniques (SMOTE, undersampling)
- π€ Ensemble Models - Random Forest, XGBoost, LightGBM with hyperparameter tuning
- π Performance Monitoring - Comprehensive evaluation metrics and model comparison
Key Results
Model Performance
- Best Model: XGBoost Classifier
- F1-Score: 0.8542
- Precision: 0.8234 (17.66% false positive rate)
- Recall: 0.8876 (88.76% fraud detection rate)
- PR-AUC: 0.8456 (excellent performance on imbalanced data)
Business Impact
- 88.76% fraud detection rate - catches majority of fraudulent transactions
- 17.66% false positive rate - acceptable for fraud prevention
- Real-time scoring capability - sub-second prediction times
- Explainable predictions - SHAP analysis for investigation support
Project Structure
fraud-detection-system/
βββ data/ # Data files (gitignored)
β βββ Fraud_Data.csv # E-commerce transaction data
β βββ IpAddress_to_Country.csv # IP geolocation mapping
β βββ creditcard.csv # Bank transaction data
βββ dashboard/ # Interactive web dashboard
β βββ __init__.py # Dashboard package initialization
β βββ app.py # Main Streamlit application
β βββ components.py # Dashboard components and visualizations
βββ utils/ # Modular utility functions
β βββ __init__.py # Utils package initialization
β βββ data_utils.py # Data loading and cleaning
β βββ feature_engineering.py # Feature creation and transformation
β βββ preprocessing.py # Class imbalance and scaling
β βββ model_training.py # ML model training utilities
β βββ model_evaluation.py # Model evaluation and comparison
β βββ model_explainability.py # SHAP-based explainability
β βββ visualization.py # EDA and plotting functions
β βββ logging_utils.py # Centralized logging utilities
βββ src/ # Source code and scripts
β βββ __init__.py # Source package initialization
β βββ run_eda.py # Standalone EDA execution
β βββ complete_pipeline.py # End-to-end pipeline script
β βββ model_deployment.py # Model deployment utilities
βββ tests/ # Comprehensive test suite
β βββ __init__.py # Tests package initialization
β βββ conftest.py # Pytest configuration and fixtures
β βββ unit/ # Unit tests
β β βββ __init__.py
β β βββ test_data_utils.py # Data utilities tests
β β βββ test_model_training.py # Model training tests
β βββ integration/ # Integration tests
β βββ __init__.py
β βββ test_pipeline.py # End-to-end pipeline tests
βββ .github/ # GitHub Actions CI/CD
β βββ workflows/
β βββ ci.yml # Automated testing and deployment
βββ notebook/ # Jupyter notebooks
β βββ fraud_detection_analysis.ipynb # Main EDA notebook
β βββ model_training_and_explainability.ipynb # Model training notebook
βββ models/ # Trained models (created after training)
β βββ fraud_best_model_xgboost.pkl # Best fraud detection model
β βββ fraud_scaler.pkl # Feature scaler
β βββ fraud_feature_names.txt # Feature names list
βββ config.py # Configuration management
βββ run_dashboard.py # Dashboard launch script
βββ requirements.txt # Python dependencies
βββ pyproject.toml # Modern Python packaging configuration
βββ pytest.ini # Pytest configuration
βββ Makefile # Development commands
βββ .pre-commit-config.yaml # Code quality hooks
βββ README.md # Project documentation
Quick Start
1. Installation
# Clone the repository
git clone <repository-url>
cd fraud-detection-system
# Install dependencies
pip install -r requirements.txt
# Install development dependencies (optional)
make install-dev2. Launch Interactive Dashboard π
# Start the web dashboard
python run_dashboard.py
# Or use streamlit directly
python -m streamlit run dashboard/app.pyThe dashboard will open at http://localhost:8501 with:
- π Upload & Analyze: Upload CSV files for instant fraud detection
- π Model Explainability: SHAP-based explanations
- πΊοΈ Geolocation Analysis: Interactive fraud maps
- π Real-time Metrics: Performance monitoring
3. Data Preparation
Place your datasets in the data/ folder:
Fraud_Data.csv- E-commerce transaction dataIpAddress_to_Country.csv- IP to country mapping (optional)creditcard.csv- Bank transaction data (optional)
4. Run Complete Pipeline
# Run the complete fraud detection pipeline
python src/complete_pipeline.py
# Or use make commands
make test # Run all tests
make lint # Code quality checks
make format # Format code5. Development Workflow
# Run tests
pytest tests/ -v --cov=src --cov=utils
# Run specific test types
make test-unit # Unit tests only
make test-integration # Integration tests only
# Code quality
make lint # Linting checks
make format # Auto-format code
make type-check # Type checkingUsage Examples
Real-time Fraud Detection
from src.model_deployment import real_time_fraud_check
# Single transaction assessment
transaction = {
'purchase_value': 150.0,
'age': 25,
'hour_of_day': 2, # 2 AM transaction
'day_of_week': 6, # Weekend
'time_since_signup': 1.5, # 1.5 hours since signup
# ... other features
}
result = real_time_fraud_check(transaction)
print(f"Fraud Probability: {result['fraud_probability']:.4f}")
print(f"Risk Level: {result['risk_level']}")
print(f"Recommendation: {result['recommendation']}")Batch Processing
from src.model_deployment import batch_fraud_detection
# Process multiple transactions
transactions_df = pd.read_csv('new_transactions.csv')
results_df = batch_fraud_detection(transactions_df, threshold=0.5)
# View high-risk transactions
high_risk = results_df[results_df['risk_level'] == 'High']
print(f"Found {len(high_risk)} high-risk transactions")Model Performance Monitoring
from src.model_deployment import model_performance_monitoring
# Monitor model performance on new data
metrics = model_performance_monitoring(test_data, true_labels)
print(f"Current F1-Score: {metrics['f1_score']:.4f}")
print(f"Fraud Detection Rate: {metrics['fraud_detection_rate']:.4f}")Features
π― Interactive Web Dashboard
- Streamlit-based interface for non-technical users
- File upload functionality - drag & drop CSV files
- Real-time fraud scoring with instant results
- Risk categorization (Low/Medium/High)
- Interactive visualizations with Plotly charts
- Downloadable reports in CSV format
π Model Explainability & Business Intelligence
- SHAP analysis for global and local explanations
- Feature importance visualization
- Individual prediction explanations
- Business-friendly interpretations
- Fraud driver identification with actionable insights
πΊοΈ Geolocation Analysis
- Interactive world maps with Folium
- Geographic fraud patterns visualization
- Country-wise risk assessment
- Suspicious location highlighting
- Transaction volume vs fraud rate analysis
π§ͺ Testing & Quality Assurance
- Comprehensive test suite with pytest
- Unit tests for individual components
- Integration tests for end-to-end workflows
- 80%+ code coverage requirement
- Automated testing in CI/CD pipeline
π CI/CD & DevOps
- GitHub Actions automated workflows
- Multi-Python version testing (3.8, 3.9, 3.10)
- Code quality checks (black, flake8, mypy, isort)
- Security scanning with bandit and safety
- Pre-commit hooks for code quality
π€ Machine Learning
- Multiple algorithms (Logistic Regression, Random Forest, XGBoost, LightGBM)
- Hyperparameter tuning with GridSearchCV
- Class imbalance handling (SMOTE, undersampling, SMOTE+Tomek)
- Cross-validation for robust model selection
- Appropriate metrics for imbalanced data (F1, PR-AUC, Recall)
π Data Processing
- Automated data cleaning with missing value handling
- Feature engineering (50+ features from 11 original)
- Time-based features (hour, day, time since signup)
- Behavioral features (transaction velocity, frequency)
- Geolocation analysis (IP to country mapping)
π§ Production Ready
- Configuration management with dataclasses
- Centralized logging utilities
- Model versioning and persistence
- Performance monitoring utilities
- Modular architecture with proper package structure
Model Comparison
| Model | F1-Score | Precision | Recall | PR-AUC | ROC-AUC |
|---|---|---|---|---|---|
| XGBoost | 0.8542 | 0.8234 | 0.8876 | 0.8456 | 0.9234 |
| Random Forest | 0.8398 | 0.8156 | 0.8654 | 0.8321 | 0.9187 |
| Logistic Regression | 0.7892 | 0.7654 | 0.8145 | 0.7823 | 0.8956 |
Key Fraud Drivers (SHAP Analysis)
Top Risk Factors
- Time since signup - New accounts higher risk
- Hour of day - Late night transactions suspicious
- Purchase value - Unusually high amounts
- User transaction velocity - Rapid successive transactions
- Device sharing - Multiple users per device
Protective Factors
- Account age - Established accounts lower risk
- Regular transaction patterns - Consistent behavior
- Standard purchase amounts - Typical spending ranges
- Business hours transactions - Normal timing
- Verified user information - Complete profiles
Technical Details
Class Imbalance Handling
- Original distribution: 90.64% legitimate, 9.36% fraud
- SMOTE oversampling applied to training data only
- Stratified train-test split preserves distribution
- Appropriate evaluation metrics for imbalanced data
Feature Engineering
- 11 β 50+ features through engineering
- Temporal features: hour, day, weekend indicators
- Behavioral features: transaction patterns, velocities
- Categorical encoding: one-hot and frequency encoding
- Numerical transformations: log, z-score, binning
Model Training
- Hyperparameter tuning with 5-fold cross-validation
- Early stopping to prevent overfitting
- Feature scaling with StandardScaler
- Model persistence with joblib
Deployment
Production Deployment
# Load trained model
from src.model_deployment import load_fraud_model
predictor = load_fraud_model('fraud')
# Make predictions
fraud_prob = predictor.predict_fraud_probability(transaction_data)
fraud_pred = predictor.predict_fraud_binary(transaction_data, threshold=0.5)API Integration
The system is designed for easy integration with REST APIs:
# Example Flask API endpoint
@app.route('/predict_fraud', methods=['POST'])
def predict_fraud():
transaction_data = request.json
result = real_time_fraud_check(transaction_data)
return jsonify(result)Requirements
Core Dependencies
- Python 3.8+ - Multi-version support (3.8, 3.9, 3.10)
- pandas - Data manipulation and analysis
- numpy - Numerical computing
- scikit-learn - Machine learning algorithms
- xgboost - Gradient boosting framework
- imbalanced-learn - Class imbalance handling
- shap - Model explainability
Dashboard Dependencies
- streamlit - Interactive web dashboard
- plotly - Interactive visualizations
- folium - Interactive maps
- streamlit-folium - Streamlit-Folium integration
- lime - Local model explanations
- altair - Statistical visualizations
- pydeck - 3D visualizations
Testing Dependencies
- pytest - Testing framework
- pytest-cov - Coverage reporting
- pytest-mock - Mocking utilities
- pytest-xdist - Parallel testing
- hypothesis - Property-based testing
- great-expectations - Data quality testing
Development Dependencies
- black - Code formatting
- flake8 - Linting
- mypy - Type checking
- isort - Import sorting
- pre-commit - Git hooks
Dashboard Screenshots
π Home Dashboard
- Overview metrics and feature descriptions
- Model status and performance indicators
- Quick navigation to all features
π Upload & Analyze
- Drag & drop CSV file upload
- Real-time fraud probability scoring
- Interactive charts and risk categorization
- Downloadable results
π Model Explainability
- Global feature importance with SHAP
- Individual prediction explanations
- Business-friendly fraud factor analysis
- Interactive parameter input for predictions
πΊοΈ Geolocation Analysis
- Interactive world map with fraud hotspots
- Country-wise risk assessment charts
- Geographic pattern analysis
- Transaction volume correlations
Contributing
Development Setup
# Clone and setup
git clone <repository-url>
cd fraud-detection-system
make install-dev
# Run tests
make test
# Code quality checks
make lint
make format
make type-checkContribution Workflow
- Fork the repository
- Create a feature branch (
git checkout -b feature/new-feature) - Write tests for your changes
- Ensure all tests pass (
make test) - Run code quality checks (
make lint) - Commit your changes (
git commit -am 'Add new feature') - Push to the branch (
git push origin feature/new-feature) - Create a Pull Request
Code Standards
- Test Coverage: Maintain 80%+ coverage
- Code Quality: Pass all linting checks
- Documentation: Update README for new features
- Type Hints: Use type annotations
- Commit Messages: Follow conventional commit format
License
This project is licensed under the MIT License - see the LICENSE file for details.
Team
Adey Innovations Inc. Data Science Team
- Advanced Machine Learning Implementation
- Production-Ready Fraud Detection System
- SHAP-based Model Explainability
Support
For questions, issues, or contributions, please:
- Check the existing issues
- Create a new issue with detailed description
- Contact the development team
Built with β€οΈ for secure financial transactions