niyatipatel2005/financial-fraud-detection

Machine Learning project using PySpark & Streamlit to detect financial fraud

Financial-Fraud-Detection using PySpark & MLlib

A scalable solution for detecting fraudulent financial transactions using PySpark, MLlib, and a simple web interface for real-time prediction.


๐Ÿ“ Description

This project aims to identify fraudulent transactions from a financial dataset using Machine Learning techniques and PySpark. Built for Big Data environments, the system is efficient and scalable.

๐Ÿ” Key Features

  1. Data Preprocessing: Cleaned large transaction dataset using PySpark.
  2. Fraud Classification: Trained a Random Forest Classifier with high precision and recall.
  3. Model Evaluation: Evaluated with AUC, precision, recall, and accuracy metrics.
  4. Feature Importance: Identified most influential features (e.g., oldbalanceOrg, amount).
  5. Predictions Export: Saved model predictions for visualization or dashboarding.
  6. Web App: Deployed a simple Streamlit-based interface to upload transaction CSV files and check them for fraud.
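
As a rough illustration of the metrics named in feature 3, the sketch below derives precision, recall, and accuracy from confusion counts in plain Python. The class name, function names, and the counts are made up for illustration and are not results from this project. Because fraud is rare, accuracy alone can look high even when many fraudulent transactions are missed, which is why precision and recall are reported alongside it.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts for a binary fraud classifier (fraud is the positive class)."""
    tp: int  # fraud correctly flagged
    fp: int  # legitimate transactions wrongly flagged
    tn: int  # legitimate transactions correctly passed
    fn: int  # fraud that was missed


def precision(c: ConfusionCounts) -> float:
    """Share of flagged transactions that really were fraud."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    """Share of actual fraud that was flagged."""
    actual_fraud = c.tp + c.fn
    return c.tp / actual_fraud if actual_fraud else 0.0


def accuracy(c: ConfusionCounts) -> float:
    """Share of all transactions classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


# Made-up counts for illustration only (not results from this project).
example = ConfusionCounts(tp=80, fp=20, tn=9880, fn=20)
print(f"precision={precision(example):.3f}")
print(f"recall={recall(example):.3f}")
print(f"accuracy={accuracy(example):.3f}")
```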

🚀 Built With

📌 Languages

  • Python

🧰 Libraries & Frameworks

  • PySpark – Big data processing and ML model training
  • Flask – Web app interface for fraud detection
  • Pandas – Data manipulation
  • Matplotlib – Visualizations
  • Streamlit – Dashboard

๐Ÿ› ๏ธ Getting Started

โœ… Prerequisites

  • Python 3.10+
  • Apache Spark & PySpark
  • pip for installing packages
  • Git (for cloning the repo)

🔧 Installation

git clone https://github.com/niyatipatel2005/financial-fraud-detection.git
cd financial-fraud-detection
pip install -r requirements.txt

Download the dataset

Download the financial fraud dataset from Kaggle:

https://www.kaggle.com/code/eryash15/financial-fraud-detection-using-pyspark-mllib/input

📊 Running the Project

1. Train the Model

Run the main notebook or training script to preprocess the dataset and train the model using Logistic Regression and Random Forest:

# Inside your Jupyter Notebook or script
Financial_Fraud_Detection.ipynb
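
For readers who want a script-level picture of the training step, here is a minimal sketch of a Random Forest pipeline in PySpark MLlib. It is not the notebook's actual code. The column names (including the `isFraud` label), the 80/20 split, and the hyperparameters are assumptions inferred from the feature list in this README, and the function names are hypothetical.

```python
# Column names are assumptions based on the feature list in this README.
CATEGORICAL_COLUMN = "type"  # indexed into "type_index"
NUMERIC_COLUMNS = [
    "amount", "oldbalanceOrg", "newbalanceOrig",
    "oldbalanceDest", "newbalanceDest",
]
FEATURE_COLUMNS = NUMERIC_COLUMNS + ["type_index"]
LABEL_COLUMN = "isFraud"  # assumed name of the label column


def build_pipeline(num_trees: int = 50, seed: int = 42):
    """Return an unfitted pipeline: index type -> assemble -> random forest."""
    # Imported here so this module can be imported without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    indexer = StringIndexer(inputCol=CATEGORICAL_COLUMN, outputCol="type_index")
    assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features")
    forest = RandomForestClassifier(
        labelCol=LABEL_COLUMN, featuresCol="features",
        numTrees=num_trees, seed=seed,
    )
    return Pipeline(stages=[indexer, assembler, forest])


def train_and_score(csv_path: str) -> float:
    """Fit on 80% of the rows and return area under ROC on the other 20%."""
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fraud-detection-sketch").getOrCreate()
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = build_pipeline().fit(train)
    evaluator = BinaryClassificationEvaluator(
        labelCol=LABEL_COLUMN, metricName="areaUnderROC"
    )
    return evaluator.evaluate(model.transform(test))
```

Calling `train_and_score` with the path to the downloaded CSV would return a single AUC value; swapping `RandomForestClassifier` for `LogisticRegression` in `build_pipeline` gives the second model mentioned above.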

2. Launch the Web App

cd fraud_detection_dashboard
streamlit run app.py

Then open http://localhost:8501 in your browser to test fraud predictions by uploading CSV files.
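
Before uploading, it can help to check that a CSV has the columns the model needs. The sketch below is one way to do that with the standard library. The required column set is an assumption inferred from the feature list in this README (the real `app.py` may expect a different schema), and `missing_columns` is a hypothetical helper.

```python
import csv
import io

# Assumed input columns, inferred from the feature list in this README;
# the real app.py may expect a different schema.
REQUIRED_COLUMNS = {
    "type", "amount", "oldbalanceOrg", "newbalanceOrig",
    "oldbalanceDest", "newbalanceDest",
}


def missing_columns(csv_text: str) -> set[str]:
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    return REQUIRED_COLUMNS - {name.strip() for name in header}


# A deliberately incomplete file: three of the six columns are missing.
sample = "type,amount,oldbalanceOrg\nTRANSFER,181.0,181.0\n"
print(sorted(missing_columns(sample)))
# → ['newbalanceDest', 'newbalanceOrig', 'oldbalanceDest']
```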

๐Ÿ–ผ๏ธ Dashboard Screenshot

Dashboard UI

🧠 Feature Importance Output

Feature           Importance
oldbalanceOrg     0.5027
newbalanceDest    0.1565
amount            0.1473
newbalanceOrig    0.0973
oldbalanceDest    0.0643
type_index        0.0316
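
A ranking like the one above can be produced by pairing each importance score with its feature name and sorting. In Spark MLlib, a fitted random forest model exposes a `featureImportances` vector whose positions follow the order of the assembled feature columns. The sketch below uses a plain list as a stand-in for that vector; the feature order is an assumption, and the values are the ones reported above (they sum to 0.9997 after rounding).

```python
# Feature names in the order they would be passed to the feature assembler
# (this order is an assumption), with the importance values reported above.
feature_names = [
    "amount", "oldbalanceOrg", "newbalanceOrig",
    "oldbalanceDest", "newbalanceDest", "type_index",
]
importances = [0.1473, 0.5027, 0.0973, 0.0643, 0.1565, 0.0316]

# Pair each name with its score and sort from most to least important.
ranked = sorted(
    zip(feature_names, importances),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name:<18}{score:.4f}")
```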

๐Ÿ“ Project Structure

financial-fraud-detection/
โ”‚
โ”œโ”€โ”€ fraudTrain.csv                   # Sample dataset 
โ”œโ”€โ”€ rf_predictions_output.csv        # Model output 
โ”œโ”€โ”€ fraud_detection_model.ipynb      # Model training notebook
โ”œโ”€โ”€ fraud_detection_dashboard/
โ”‚   โ”œโ”€โ”€ app.py                       # streamlit-based dashboard
โ”‚   โ””โ”€โ”€ rf-predictions_output.csv
โ”œโ”€โ”€ requirements.txt
โ””โ”€โ”€ README.md

Output Images:

Output_1

Output_2

Authors

License

This project is licensed under the [NAME HERE] License; see the LICENSE.md file for details.

Acknowledgments

We would like to thank the following for their support and inspiration throughout the project:

  • Kaggle for providing the original dataset for financial fraud detection.
  • The open-source community for tools like Apache Spark, PySpark, and MLlib that made scalable machine learning possible.
  • Our mentors and professors for their guidance and motivation.
  • GitHub and VS Code for being the backbone of our development workflow.
  • The creators of educational blogs, tutorials, and Stack Overflow discussions that helped us overcome technical hurdles.
