Student Performance Prediction – End-to-End ML Project

This project predicts students’ math performance based on demographic and academic attributes such as gender, parental education, lunch type, test preparation, and reading/writing scores.
The dataset is sourced from Kaggle – Students Performance in Exams.

Live demo: deployed on AWS Elastic Beanstalk

Objective

To build a machine learning pipeline that can:

Clean and preprocess the data
Perform feature engineering and transformation
Train multiple ML models and optimize hyperparameters
Save and deploy the best-performing model using a scalable and reusable pipeline

Project Structure

ML_Project_1/
│
├── artifacts/                     # Stores serialized models and processed data
│   ├── preprocessor.pkl
│   ├── train.csv
│   ├── test.csv
│   ├── raw.csv
│
├── logs/                          # Application logs
│
├── notebook/                      # Jupyter notebooks for EDA & experimentation
│   ├── 1. EDA STUDENT PERFORMANCE.ipynb
│   ├── 2. MODEL TRAINING.ipynb
│   └── data/data.csv
│
├── src/
│   ├── components/                # Core ML components
│   │   ├── data_ingestion.py
│   │   ├── data_transformation.py
│   │   ├── model_trainer.py             (to be added)
│   │   └── model_hyperparameter_tuning.py (to be added)
│   │
│   ├── pipeline/                  # Training & prediction pipelines
│   │   ├── train_pipeline.py
│   │   ├── predict_pipeline.py
│   │   └── __init__.py
│   │
│   ├── logs/                      # Logging config files
│   ├── exception.py               # Custom exception handling
│   ├── logger.py                  # Logging utilities
│   ├── utils.py                   # Helper functions (e.g., model saving/loading)
│   └── __init__.py
│
├── venv/                          # Virtual environment
│
├── .gitignore
├── requirements.txt               # Required dependencies
├── setup.py                       # Package configuration
└── README.md

Tech Stack

Language: Python 3.10+
Libraries: numpy, pandas, scikit-learn, matplotlib, seaborn, joblib
Frameworks: Flask (for deployment)
Cloud: AWS EC2, Azure Container Instances (planned)
Version Control: Git + GitHub
Environment: Virtual Environment / Conda

Key Modules

1️⃣ Data Ingestion

Loads raw data from notebook/data/data.csv
Splits data into train/test sets
Saves the raw, train, and test datasets in artifacts/ (raw.csv, train.csv, test.csv)
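
The ingestion steps above could be sketched roughly as follows. This is a minimal illustration, not the contents of data_ingestion.py: the function name `ingest`, the 80/20 split, and the fixed random seed are assumptions.

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def ingest(source_csv, artifacts_dir="artifacts", test_size=0.2, seed=42):
    """Read the raw CSV, split it, and write raw/train/test CSVs.

    The split ratio and seed are illustrative defaults (assumptions).
    """
    out = Path(artifacts_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Keep an untouched copy of the input next to the splits.
    df = pd.read_csv(source_csv)
    df.to_csv(out / "raw.csv", index=False)

    # A fixed random_state makes the split reproducible across runs.
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=seed)
    train_df.to_csv(out / "train.csv", index=False)
    test_df.to_csv(out / "test.csv", index=False)
    return out / "train.csv", out / "test.csv"
```

With the layout shown in Project Structure, a call would look like `ingest("notebook/data/data.csv")`.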

2️⃣ Data Transformation

Handles missing values with SimpleImputer
Encodes categorical features using OneHotEncoder
Scales features using StandardScaler
Saves preprocessor object (preprocessor.pkl)
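
One possible way to assemble such a preprocessor is sketched below. The column names, the imputation strategies, and the helpers `build_preprocessor` and `save_object` are assumptions for illustration. The sketch scales only the numeric columns, and it uses the standard-library pickle module, although joblib would work equally well.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names; adjust them to match the actual CSV header.
NUMERIC_COLS = ["reading_score", "writing_score"]
CATEGORICAL_COLS = ["gender", "lunch", "test_preparation_course"]


def build_preprocessor(numeric_cols=NUMERIC_COLS, categorical_cols=CATEGORICAL_COLS):
    """Impute then scale numeric columns; impute then one-hot encode categoricals."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Categories unseen during fit are encoded as all zeros instead of raising.
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])


def save_object(obj, path):
    """Serialize any Python object (here: the fitted preprocessor) to disk."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
```

The preprocessor should be fitted on the training split only and then reused unchanged on the test split, so that no test-set statistics leak into training.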

3️⃣ Model Trainer (upcoming)

Trains multiple ML models (e.g., Linear Regression, Random Forest, XGBoost)
Evaluates metrics (R², RMSE, MAE)
Saves the best-performing model
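
Since this module is still upcoming, the following is only a sketch of how the comparison might work. The function name `evaluate_models` and the default model set are assumptions, XGBoost is left out to keep the example dependent on scikit-learn alone, and picking the winner by R² is one reasonable choice among several.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate_models(X_train, y_train, X_test, y_test, models=None):
    """Fit each candidate model and report R², RMSE, and MAE on the test set."""
    if models is None:
        # Illustrative defaults (assumption), not the project's final model list.
        models = {
            "Linear Regression": LinearRegression(),
            "Lasso": Lasso(),
            "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
        }

    report = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        report[name] = {
            "r2": r2_score(y_test, pred),
            # RMSE is the square root of the mean squared error.
            "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
            "mae": mean_absolute_error(y_test, pred),
        }

    # Select the model with the highest test-set R².
    best_name = max(report, key=lambda n: report[n]["r2"])
    return best_name, models[best_name], report
```

The returned fitted model can then be serialized to artifacts/ in the same way as the preprocessor.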

4️⃣ Hyperparameter Tuning (upcoming)

Uses GridSearchCV or RandomizedSearchCV for model optimization
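
As an illustration of the GridSearchCV option, a small search over a random forest might look like this. The helper name `tune_random_forest`, the parameter grid, and the 3-fold setting are assumptions chosen to keep the example fast.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def tune_random_forest(X, y, param_grid=None, cv=3):
    """Exhaustively search a small grid and return the best refit estimator."""
    if param_grid is None:
        # Hypothetical grid; widen it for a real search.
        param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

    search = GridSearchCV(
        RandomForestRegressor(random_state=42),
        param_grid=param_grid,
        scoring="r2",  # Rank candidates by cross-validated R².
        cv=cv,
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_, search.best_score_
```

RandomizedSearchCV takes parameter distributions instead of a full grid and samples a fixed number of candidates, which tends to be cheaper when the search space is large.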

5️⃣ Prediction Pipeline (upcoming)

Loads saved preprocessor and model to predict unseen data
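
A prediction step along these lines could be sketched as below. The file name model.pkl and the helpers `load_object` and `predict` are assumptions; only preprocessor.pkl appears in the project tree so far.

```python
import pickle

import pandas as pd


def load_object(path):
    """Load a pickled object such as the fitted preprocessor or model."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict(features: pd.DataFrame,
            preprocessor_path="artifacts/preprocessor.pkl",
            model_path="artifacts/model.pkl"):  # model.pkl is an assumed name
    """Apply the saved preprocessing, then the saved model, to new rows."""
    preprocessor = load_object(preprocessor_path)
    model = load_object(model_path)
    # Only transform here: the preprocessor was already fitted during training.
    return model.predict(preprocessor.transform(features))
```

The input DataFrame needs the same column names the preprocessor saw during training. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.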

Training the Pipeline

# Step 1: Activate environment (Windows)
venv\Scripts\activate
# On macOS/Linux: source venv/bin/activate

# Step 2: Install dependencies
pip install -r requirements.txt

# Step 3: Run Data Ingestion
python src/components/data_ingestion.py

# Step 4: Run Data Transformation
python src/components/data_transformation.py

# Step 5: Run Model Training (when implemented)
python src/components/model_trainer.py

Deployment (Planned)

AWS EC2

Containerize the application using Docker
Deploy the Flask app and trained ML model on an EC2 instance
Serve the production app with Gunicorn, optionally behind NGINX as a reverse proxy

Azure Container Instances

Build the Docker image and push it to Azure Container Registry (ACR)
Deploy the image using the Azure CLI or the Azure Portal
Run and scale the containerized app directly on Azure

Results

Model              R² Score  RMSE  MAE
Linear Regression  0.88      5.40  4.22
Lasso              0.83      6.52  5.16

Utilities

Custom Logging: Provides detailed tracking of every step in the ML workflow
Custom Exception Handling: Ensures robust and clean error management
Reusable Pipelines: Modularized preprocessing and model training pipelines for flexibility
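
To give a feel for what the logging and exception utilities might do, here is a hedged sketch. The names `CustomException` and `get_logger` and the log format are hypothetical and are not taken from src/exception.py or src/logger.py.

```python
import logging
import sys
import traceback


class CustomException(Exception):
    """Wraps an error and records the file and line where it originated."""

    def __init__(self, error):
        # Construct this inside an `except` block so exc_info() is populated.
        _, _, tb = sys.exc_info()
        if tb is not None:
            frame = traceback.extract_tb(tb)[-1]
            location = f"{frame.filename}:{frame.lineno}"
        else:
            location = "unknown location"
        super().__init__(f"{error} (raised at {location})")


def get_logger(name, log_file=None):
    """Return a logger that writes to log_file, or to stderr if none is given."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # Avoid attaching duplicate handlers on repeat calls.
        handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("[%(asctime)s] %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

A component would typically wrap its main body in `try`/`except Exception as exc: raise CustomException(exc) from exc`, logging progress with `get_logger(__name__).info(...)` along the way.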

Author

Mayank Meghwal
Data Scientist | Machine Learning Engineer

Email: mayankmeg207@gmail.com
GitHub: itz-Mayank

Future Enhancements

Implement CI/CD pipeline with GitHub Actions
Automate deployment using Docker and Kubernetes
Integrate model monitoring and automated retraining system
Add support for multi-cloud deployment (AWS + Azure + GCP)

License

This project is open-source and available under the MIT License.

itz-Mayank/ml_project1