higorfct/MLOps-with-Naive-Bayes
MLOps project with MLflow
MLOps Project with Gaussian Naive Bayes
This repository presents a complete MLOps (Machine Learning Operations) pipeline, covering everything from data preparation to deployment and monitoring of a machine learning model in production.
How to use
- Clone the repository and install dependencies:
    git clone https://github.com/higorfct/MLOps-with-Naive-Bayes.git
    cd MLOps-with-Naive-Bayes
    pip install -r requirements.txt
Objective
Automate and scale the lifecycle of a Gaussian Naive Bayes (GNB) model, including its `var_smoothing` hyperparameter, using modern MLOps practices such as model versioning, experiment tracking, deployment via API, and reproducibility of results.
We apply this MLOps pipeline to the Credit.csv dataset from a German bank to predict whether a customer is a good or bad payer.
Why Gaussian Naive Bayes?
GNB was chosen because it is simple, fast, and effective for classification tasks with approximately continuous variables. It serves as a strong baseline model for problems with data that follow (or approximate) a normal distribution.
Additionally:
- Low computational cost
- Easy interpretability
- Suitable for initial pipeline deployment
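To make the "simple and fast" claim concrete, here is a minimal from-scratch sketch of how a Gaussian Naive Bayes classifier scores a sample. It is not the code used in this repository (which relies on Scikit-learn); the function names and the toy data are made up for illustration, and the smoothing rule is an assumption modeled on how `var_smoothing` is commonly described.

```python
import math
from collections import defaultdict


def _mean(values):
    return sum(values) / len(values)


def _variance(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def fit_gnb(X, y, var_smoothing=1e-9):
    """Estimate a prior plus per-feature mean and variance for each class."""
    columns = list(zip(*X))
    # Smoothing term: a small fraction of the largest overall feature variance,
    # added to every class variance so that none of them is exactly zero.
    epsilon = var_smoothing * max(_variance(col) for col in columns)

    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        cols = list(zip(*rows))
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            "means": [_mean(c) for c in cols],
            "variances": [_variance(c) + epsilon for c in cols],
        }
    return model


def predict_gnb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(stats):
        total = stats["log_prior"]
        for x, mu, var in zip(row, stats["means"], stats["variances"]):
            # Log of the univariate normal density N(x | mu, var).
            total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return total

    return max(model, key=lambda label: log_posterior(model[label]))


# Toy data (hypothetical): [loan amount in thousands, loan duration in months]
X = [[1.0, 12], [1.5, 10], [2.0, 14], [8.0, 48], [9.0, 60], [7.5, 54]]
y = ["good", "good", "good", "bad", "bad", "bad"]

model = fit_gnb(X, y)
print(predict_gnb(model, [1.2, 11]))  # a point near the "good" cluster
print(predict_gnb(model, [8.5, 50]))  # a point near the "bad" cluster
```

Training only needs one pass to compute means and variances, which is where the low computational cost comes from. The per-class means and variances are also directly inspectable, which supports the interpretability point.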
Technologies and Tools Used
- Python
- MLflow: experiment tracking, model registry, deployment
- `mlflow.sklearn`: integration between Scikit-learn and MLflow
- Scikit-learn: training, evaluation (accuracy, F1, AUC), data splitting
- Pandas: data manipulation and analysis
- NumPy: numerical operations
- Matplotlib: data visualization (ROC, confusion matrix)
- Seaborn: advanced statistical visualizations (heatmaps)
- Requests: HTTP requests for API consumption
These technologies cover the entire ML pipeline: data prep → training → evaluation → visualization → deployment → monitoring.
MLOps Pipeline Steps
Data Analysis and Preparation
- Data cleaning and transformation
- Train/test split
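The preparation step could be sketched roughly as follows. The column names, the values, and the 30% test size are assumptions made for illustration; the real Credit.csv schema is not shown here, so a tiny stand-in DataFrame is used instead of reading the file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for Credit.csv; column names and values are hypothetical.
df = pd.DataFrame({
    "duration": [6, 48, 12, 42, 24, 36, 18, 30, 9, 15],
    "credit_amount": [1000, 6000, 2100, 7900, 4800, 9000, 2800, 6900, 3000, 5200],
    "purpose": ["radio", "radio", "education", "furniture", "car",
                "education", "furniture", "car", "radio", "car"],
    "class": ["good", "bad", "good", "good", "bad",
              "good", "good", "good", "good", "bad"],
})

# Cleaning: drop exact duplicates and rows with missing values.
df = df.drop_duplicates().dropna()

# Transformation: one-hot encode the categorical column and map the target to 0/1.
X = pd.get_dummies(df.drop(columns="class"), columns=["purpose"], dtype=float)
y = (df["class"] == "good").astype(int)

# Stratified split keeps the good/bad ratio similar in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` is one simple way to keep the split reproducible across runs.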
Model Training
- Gaussian Naive Bayes with `var_smoothing`
- Logging experiments in MLflow (parameters, metrics, artifacts)
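A training run with MLflow logging might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the credit dataset, and the helper names (`train_and_evaluate`, `log_to_mlflow`) and the experiment name `credit-gnb` are hypothetical. The MLflow import is kept inside the logging helper so the training part can run on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def train_and_evaluate(X_train, y_train, X_test, y_test, var_smoothing=1e-9):
    """Fit a GaussianNB model and compute metrics on the test split."""
    model = GaussianNB(var_smoothing=var_smoothing)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # AUC is computed from the probability of the positive class, not hard labels.
    y_proba = model.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_proba),
    }
    return model, metrics


def log_to_mlflow(model, params, metrics, experiment="credit-gnb"):
    """Log parameters, metrics and the model artifact; returns the run ID."""
    import mlflow
    import mlflow.sklearn

    mlflow.set_experiment(experiment)
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(model, "model")
        return run.info.run_id


# Synthetic stand-in data (not the credit dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
model, metrics = train_and_evaluate(X_train, y_train, X_test, y_test)
print(metrics)

# Uncomment when MLflow is installed and a tracking location is available:
# run_id = log_to_mlflow(model, {"var_smoothing": 1e-9}, metrics)
```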
Versioning and Registration
- Model registered in MLflow Registry
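Registration usually starts from the run that logged the model. The sketch below assumes the model was logged under the artifact path `model` and uses a hypothetical registry name, `credit-gnb`; neither is confirmed by this repository.

```python
def model_uri_for_run(run_id, artifact_path="model"):
    """Build the 'runs:/' URI that points at a model logged in a given run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_run_model(run_id, registered_name="credit-gnb"):
    """Register the logged model and return the new version number."""
    import mlflow  # imported here so the URI helper works without MLflow

    result = mlflow.register_model(model_uri_for_run(run_id), registered_name)
    return result.version


# "abc123" is a placeholder, not a real run ID.
print(model_uri_for_run("abc123"))  # runs:/abc123/model
```

Each call to `register_run_model` with the same name creates a new version, which is what makes it possible to compare or roll back models later.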
Model Deployment
- Deployed locally as an MLflow service with HTTP endpoint for REST predictions
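Once a model is served locally, for example with something like `mlflow models serve -m <model-uri> -p 5000`, predictions can be requested over HTTP. The sketch below assumes the `/invocations` endpoint and the `dataframe_split` JSON format used by recent MLflow versions; the port, column names, and values are hypothetical, and older MLflow releases may expect a different payload.

```python
import json


def build_payload(columns, rows):
    """Build a request body in MLflow's 'dataframe_split' format."""
    return {
        "dataframe_split": {
            "columns": list(columns),
            "data": [list(row) for row in rows],
        }
    }


def predict(payload, url="http://127.0.0.1:5000/invocations"):
    """POST the payload to the scoring server and return the parsed response."""
    import requests  # imported here so the payload helper has no dependencies

    response = requests.post(url, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()


# Hypothetical feature names; they must match the columns used in training.
payload = build_payload(["duration", "credit_amount"], [[12, 2500.0]])
print(json.dumps(payload))

# Uncomment when the model server is running:
# print(predict(payload))
```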
Monitoring and Re-evaluation
- Periodic revalidation with new data
- Production metrics logging
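One simple way to implement periodic revalidation is to score each new labeled batch and compare it against the baseline from the original test set. The sketch below is an assumption about how that could work, not the repository's actual monitoring code; the tolerance of 0.05, the helper names, and the sample batch are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def has_degraded(baseline, current, tolerance=0.05):
    """Flag the model when the metric drops more than `tolerance` below baseline."""
    return current < baseline - tolerance


def log_production_metric(name, value, step):
    """Record a production metric in MLflow under a dedicated run."""
    import mlflow  # imported here so the checks above run without MLflow

    with mlflow.start_run(run_name="monitoring"):
        mlflow.log_metric(name, value, step=step)


BASELINE_ACCURACY = 0.6967  # test accuracy reported in the results table

# Hypothetical new batch: true labels and the deployed model's predictions.
new_labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
new_predictions = [1, 1, 1, 0, 0, 1, 1, 1, 0, 1]

current = accuracy(new_labels, new_predictions)
print(current, has_degraded(BASELINE_ACCURACY, current))

# Uncomment when MLflow is available:
# log_production_metric("prod_accuracy", current, step=1)
```

A degradation flag like this could be used to trigger retraining or a manual review.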
Results
The model achieved the following results on the test dataset:
| Metric | Value |
|---|---|
| Accuracy | 0.6967 |
| AUC | 0.6600 |
| F1 Score | 0.7719 |
| Log Loss | 10.9332 |
| Precision | 0.7739 |
| Recall | 0.7700 |
Interpretation:
- Accuracy (~69.7%): ~7 out of 10 predictions correct
- AUC (0.66): moderate class discrimination ability
- F1 Score (0.77): good balance between precision & recall
- Precision (0.77): 77% of positive predictions were correct
- Recall (0.77): 77% of actual positive cases detected
- Log Loss (10.93): high value, indicating probability estimates could be improved
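The reported numbers can be sanity-checked with a little arithmetic. The first part recomputes F1 from the reported precision and recall. The second part explores one possible explanation for the large log loss, and it rests on assumptions not stated in this README: a test set of 300 rows with 91 errors (which would give the reported accuracy), log loss computed on hard 0/1 predictions instead of probabilities, and clipping at machine epsilon. It does not prove how the metric was actually computed.

```python
import math
import sys

# Part 1: F1 is the harmonic mean of precision and recall.
precision, recall = 0.7739, 0.7700
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7719, matching the table

# Part 2 (assumptions): 300 test rows, 91 of them misclassified.
errors, n_test = 91, 300
print(round((n_test - errors) / n_test, 4))  # 0.6967, matching the reported accuracy

# If hard labels are passed to a log loss function, every wrong prediction has
# probability ~0 for the true class; after clipping at machine epsilon it costs
# -log(epsilon), while correct predictions cost essentially nothing.
penalty_per_error = -math.log(sys.float_info.epsilon)
hard_label_log_loss = errors / n_test * penalty_per_error
print(round(hard_label_log_loss, 4))  # 10.9332
```

If this explanation applies, recomputing the metric from the model's predicted probabilities (for example the output of `predict_proba`) would give a more informative picture of calibration.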
Conclusions
- The Gaussian Naive Bayes algorithm proved to be a solid baseline model for the credit classification problem, reaching balanced performance in precision and recall.
- However, the AUC and Log Loss suggest that the model still struggles to provide highly reliable probability estimates.
- For future iterations, improvements could include:
- Applying cross-validation for more robust training.
- Testing ensemble methods (Random Forest, Gradient Boosting) or regularized models (Logistic Regression with penalty).
- From an MLOps perspective, the project successfully demonstrates:
- Full lifecycle automation
- Experiment tracking and versioning with MLflow
- Reproducibility and deployment of models into production-like environments
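The cross-validation improvement suggested above could be combined with a search over `var_smoothing`. This is a sketch on synthetic stand-in data; the candidate values, the choice of five folds, and AUC as the scoring metric are assumptions, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (not the credit dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical candidate values for the smoothing hyperparameter.
param_grid = {"var_smoothing": [1e-11, 1e-9, 1e-7, 1e-5]}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(GaussianNB(), param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 4))
```

The best parameters and the cross-validated score could then be logged to MLflow in the same way as a single training run.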
This pipeline provides a scalable and reproducible foundation for deploying and monitoring ML models in real-world financial applications.