niyatipatel2005/financial-fraud-detection
Machine Learning project using PySpark & Streamlit to detect financial fraud
Financial-Fraud-Detection using PySpark & MLlib
A scalable solution for detecting fraudulent financial transactions using PySpark, MLlib, and a simple web interface for real-time prediction.
Description
This project aims to identify fraudulent transactions from a financial dataset using Machine Learning techniques and PySpark. Built for Big Data environments, the system is efficient and scalable.
Key Features
- Data Preprocessing: Cleaned a large transaction dataset using PySpark.
- Fraud Classification: Trained a Random Forest Classifier with high precision and recall.
- Model Evaluation: Evaluated with AUC, precision, recall, and accuracy metrics.
- Feature Importance: Identified the most influential features (e.g., oldbalanceOrg, amount).
- Predictions Export: Saved model predictions for visualization or dashboarding.
- Web App: Deployed a simple Streamlit-based interface to upload transactions and check for fraud.
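The preprocessing and classification features above can be sketched as a single MLlib pipeline. This is a minimal sketch, not the notebook's actual code: the feature names are taken from the feature importance output in this README, while the `type` input column, the `isFraud` label column, and the `num_trees` and `seed` defaults are assumptions for illustration.

```python
# Feature columns as listed in this README's feature importance output.
FEATURE_COLUMNS = [
    "type_index",
    "amount",
    "oldbalanceOrg",
    "newbalanceOrig",
    "oldbalanceDest",
    "newbalanceDest",
]
CATEGORY_COLUMN = "type"  # assumed name of the raw transaction-type column
LABEL_COLUMN = "isFraud"  # assumed name of the 0/1 fraud label column


def build_pipeline(num_trees=50, seed=42):
    """Return an unfitted Pipeline: index type, assemble features, random forest."""
    # Imported here so the constants above can be reused without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Turn the categorical transaction type into a numeric index column.
    indexer = StringIndexer(inputCol=CATEGORY_COLUMN, outputCol="type_index")
    # Pack all numeric inputs into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features")
    forest = RandomForestClassifier(
        labelCol=LABEL_COLUMN,
        featuresCol="features",
        numTrees=num_trees,
        seed=seed,
    )
    return Pipeline(stages=[indexer, assembler, forest])
```

Given a Spark DataFrame `train_df` with those columns, `build_pipeline().fit(train_df)` returns a fitted model, and calling its `transform` method on held-out data adds `prediction` and `probability` columns that can then be evaluated or exported.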
Built With
Languages
- Python
Libraries & Frameworks
- PySpark - Big data processing and ML model training
- Streamlit - Web app interface and dashboard for fraud detection
- Pandas - Data manipulation
- Matplotlib - Visualizations
Getting Started
Prerequisites
- Python 3.10+
- Apache Spark & PySpark
- pip for installing packages
- Git (for cloning the repo)
Installation
git clone https://github.com/niyatipatel2005/financial-fraud-detection.git
cd financial-fraud-detection
pip install -r requirements.txt
Download the dataset
Download the financial fraud dataset from Kaggle:
https://www.kaggle.com/code/eryash15/financial-fraud-detection-using-pyspark-mllib/input
Running the Project
1. Train the Model
Run the main notebook or training script to preprocess the dataset and train the model using Logistic Regression and Random Forest:
# Inside your Jupyter Notebook or script
Financial_Fraud_Detection.ipynb
2. Launch the Web App
cd fraud_detection_dashboard
streamlit run app.py
Then open http://localhost:8501 in your browser to test fraud predictions by uploading CSV files.
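Before uploading a CSV file, it can help to confirm that it carries the columns the model was trained on. The check below is a hypothetical helper, not part of `app.py`: the column names are taken from the feature importance output in this README, and the raw `type` column is an assumption (the model's `type_index` feature is presumably derived from it).

```python
import csv
import io

# Columns the upload is assumed to need; "type" is a guess at the raw
# column behind the model's "type_index" feature.
REQUIRED_COLUMNS = {
    "type",
    "amount",
    "oldbalanceOrg",
    "newbalanceOrig",
    "oldbalanceDest",
    "newbalanceDest",
}


def missing_columns(csv_text: str) -> set[str]:
    """Return the required column names absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return REQUIRED_COLUMNS - present
```

An empty result means the header is complete; any names returned are the columns to add before uploading.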
Dashboard Screenshot
Feature Importance Output
| Feature | Importance |
|---|---|
| oldbalanceOrg | 0.5027 |
| newbalanceDest | 0.1565 |
| amount | 0.1473 |
| newbalanceOrig | 0.0973 |
| oldbalanceDest | 0.0643 |
| type_index | 0.0316 |
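To read the table at a glance, the scores can be ranked and accumulated. The sketch below uses only the values listed above; the function name is illustrative. In MLlib, scores like these are exposed by a fitted random forest model's `featureImportances` attribute.

```python
# Importance scores copied from the table above.
FEATURE_IMPORTANCES = {
    "oldbalanceOrg": 0.5027,
    "newbalanceDest": 0.1565,
    "amount": 0.1473,
    "newbalanceOrig": 0.0973,
    "oldbalanceDest": 0.0643,
    "type_index": 0.0316,
}


def cumulative_importance(importances):
    """Rank features by score and attach the running total to each row."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0.0
    for name, score in ranked:
        running += score
        rows.append((name, score, round(running, 4)))
    return rows


for name, score, total in cumulative_importance(FEATURE_IMPORTANCES):
    print(f"{name:<16}{score:>8.4f}{total:>8.4f}")
```

On these numbers, the top three features (oldbalanceOrg, newbalanceDest, amount) account for roughly 0.81 of the total importance, and the listed scores sum to 0.9997.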
Project Structure
financial-fraud-detection/
│
├── fraudTrain.csv # Sample dataset
├── rf_predictions_output.csv # Model output
├── fraud_detection_model.ipynb # Model training notebook
├── fraud_detection_dashboard/
│   ├── app.py # streamlit-based dashboard
│   └── rf-predictions_output.csv
├── requirements.txt
└── README.md
Output Images:
Authors
- Niyati Patel - https://github.com/niyatipatel2005
License
This project is licensed under the [NAME HERE] License - see the LICENSE.md file for details
Acknowledgments
We would like to thank the following for their support and inspiration throughout the project:
- Kaggle for providing the original dataset for financial fraud detection.
- The open-source community for tools like Apache Spark, PySpark, and MLlib that made scalable machine learning possible.
- Our mentors and professors for their guidance and motivation.
- GitHub and VS Code for being the backbone of our development workflow.
- The creators of educational blogs, tutorials, and Stack Overflow discussions that helped us overcome technical hurdles.


