Soumyapro/diabetes-prediction-knn
A machine learning project that predicts diabetes diagnosis using K-Nearest Neighbors (KNN) classification with SMOTE-based class balancing.
Diabetes Prediction - KNN Classification with SMOTE
Overview
This project builds a machine learning model to predict diabetes diagnosis using K-Nearest Neighbors (KNN) classification. It implements SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance and improve detection of positive diabetes cases.
Project Highlights
- Algorithm: K-Nearest Neighbors (KNN) with k=13
- Accuracy: 78% without SMOTE, 73% with SMOTE
- Key Achievement: 73% recall for diabetes detection (improved from 53% without SMOTE)
- False Negatives Reduced: From 35 to 20 cases with SMOTE balancing
Dataset
Source: Diabetes dataset (769 samples)
Features (8 clinical indicators):
- Pregnancies
- Glucose level
- Blood Pressure
- Skin Thickness
- Insulin level
- BMI (Body Mass Index)
- Diabetes Pedigree Function
- Age
Target: Outcome (Binary: 0 = Non-diabetic, 1 = Diabetic)
Project Structure
├── diabetes_prediction.ipynb # Main Jupyter notebook with complete analysis
├── diabetes.csv # Dataset file
└── README.md
Key Steps
1. Exploratory Data Analysis (EDA)
- Data shape and info analysis
- Missing values and duplicates check
- Distribution analysis with histograms
- Boxplots for outlier detection
- Correlation matrix heatmap
- Pair plots by outcome class
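The non-graphical checks above can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook's code: the helper name `summarize` and the demo values are hypothetical, and `Glucose` and `BMI` are assumed column names (only `Outcome` is confirmed by this README).

```python
import pandas as pd

def summarize(df: pd.DataFrame, target: str = "Outcome") -> dict:
    """Collect shape, missing values, duplicates, class counts and target correlations."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "class_counts": df[target].value_counts().to_dict(),
        "correlation_with_target": df.corr()[target].drop(target).round(3).to_dict(),
    }

# Tiny stand-in frame so the sketch runs without diabetes.csv (values are made up).
demo = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 85],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 26.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

report = summarize(demo)
for name, value in report.items():
    print(f"{name}: {value}")
```

With the real data, `summarize(pd.read_csv("diabetes.csv"))` would give the same overview. The plots in the list map onto standard calls such as `df.hist()`, `sns.boxplot(data=df)`, `sns.heatmap(df.corr(), annot=True)` and `sns.pairplot(df, hue="Outcome")`.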
2. Data Preprocessing
- Feature scaling using StandardScaler
- Train-test split (70-30)
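A possible shape for this step, using random placeholder data in place of the eight clinical features. Note the assumptions: this sketch splits first and fits the scaler on the training portion only, which keeps test-set statistics out of preprocessing; the notebook may instead scale the full dataset before splitting, as the order above suggests. `stratify=y` and `random_state=42` are also choices made here, not settings taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the 8 features and the binary Outcome column.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 8))
y = (rng.random(200) < 0.35).astype(int)

# 70-30 split; stratify keeps the class ratio similar in both parts (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Learn mean and standard deviation from the training set, then reuse them on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Scaling matters for KNN in particular because the distance computation would otherwise be dominated by features with large numeric ranges.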
3. Model Development
- KNN hyperparameter search over k = 1 to 14
- Selected k=13, which gave the highest test score
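The search over k could look roughly like the loop below. It runs on synthetic data from `make_classification`, so the best k it reports says nothing about the real dataset; the variable names and the data generator settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 features, roughly 65/35 class balance.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit one model per k in 1..14 and record its accuracy on the test set.
scores = {}
for k in range(1, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, test accuracy = {scores[best_k]:.3f}")
```

Picking k by test score, as described above, tends to give a slightly optimistic estimate because the test set influences the choice. Selecting k with `cross_val_score` on the training data would be one alternative.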
4. Class Imbalance Handling
- Applied SMOTE to balance training data
- Compared model performance before and after
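To show what SMOTE does to the training data, here is a simplified NumPy illustration of its core idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbors. This is not the imbalanced-learn implementation, and the function name `smote_like_oversample` is hypothetical. In the notebook the equivalent step is presumably `SMOTE().fit_resample(X_train, y_train)` from `imblearn.over_sampling`, applied to the training split only.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples and neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]     # indices of the k nearest neighbors
    base = rng.integers(0, len(X_min), size=n_new)   # starting sample for each new point
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Made-up imbalanced training set: 80 majority points, 20 minority points.
rng = np.random.default_rng(1)
X_major = rng.normal(0.0, 1.0, size=(80, 2))
X_minor = rng.normal(2.0, 1.0, size=(20, 2))

synthetic = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, synthetic])
y_balanced = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(synthetic)))

print(np.bincount(y_balanced))  # class counts after balancing
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies. The test set is left untouched so that evaluation reflects the real class distribution.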
5. Model Evaluation
- Confusion matrix
- Classification report (precision, recall, F1-score)
- Performance metrics comparison
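The evaluation outputs listed above come from two scikit-learn functions. The sketch below uses a handful of made-up labels rather than the project's predictions, and the class names passed to `target_names` are an assumption based on the target encoding (0 = Non-diabetic, 1 = Diabetic).

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up ground truth and predictions for ten patients.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Per-class precision, recall and F1-score in one table.
print(classification_report(y_true, y_pred, target_names=["Non-diabetic", "Diabetic"]))
```

Running the same two calls on the model's test predictions before and after SMOTE gives the side-by-side comparison.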
Results Comparison
Before SMOTE
- Recall (Diabetic Class): 53%
- Precision (Diabetic Class): 71%
- Overall Accuracy: 78%
- False Negatives: 35
After SMOTE
- Recall (Diabetic Class): 73%
- Precision (Diabetic Class): 56%
- Overall Accuracy: 73%
- False Negatives: 20
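Insight: SMOTE raises recall for the diabetic class from 53% to 73% and cuts false negatives from 35 to 20, a 43% reduction in missed cases. The cost is lower precision (71% to 56%) and lower overall accuracy (78% to 73%). In medical contexts, reducing false negatives is often prioritized so that cases don't go undetected.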
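The trade-off can be quantified directly from the figures reported above. The short calculation below reproduces the 43% reduction in false negatives and, as an extra point of comparison not reported in the notebook, derives the F1-score of the diabetic class from the rounded precision and recall values.

```python
# Diabetic-class metrics as reported in this README.
before = {"recall": 0.53, "precision": 0.71, "accuracy": 0.78, "false_negatives": 35}
after = {"recall": 0.73, "precision": 0.56, "accuracy": 0.73, "false_negatives": 20}

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Relative drop in missed diabetic cases.
fn_reduction = (before["false_negatives"] - after["false_negatives"]) / before["false_negatives"]

f1_before = f1(before["precision"], before["recall"])
f1_after = f1(after["precision"], after["recall"])

print(f"false negatives reduced by {fn_reduction:.0%}")
print(f"F1 (diabetic class): {f1_before:.2f} -> {f1_after:.2f}")
```

On these rounded inputs the F1-score moves from about 0.61 to about 0.63, so the recall gain slightly outweighs the precision loss under that measure.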
Requirements
numpy
pandas
matplotlib
seaborn
scikit-learn
imbalanced-learn
Model Performance Metrics
Confusion Matrix (with SMOTE):
True Negatives: 95
False Positives: 33
False Negatives: 20
True Positives: 53
Classification Report:
- Non-diabetic (Class 0): 74% precision, 74% recall
- Diabetic (Class 1): 56% precision, 73% recall
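As a cross-check, the per-class metrics can be recomputed by hand from the four confusion-matrix counts listed above. Recall for both classes and overall accuracy agree with the reported values (73%, 74% and about 73-74%). The precisions derived from these counts come out near 62% for class 1 and 83% for class 0, which differ from the 56% and 74% in the report, so the matrix and the report may have been recorded from different runs.

```python
# Counts from the confusion matrix with SMOTE, as listed in this README.
tn, fp, fn, tp = 95, 33, 20, 53

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Diabetic class (1): positives are the class of interest.
recall_pos = tp / (tp + fn)
precision_pos = tp / (tp + fp)

# Non-diabetic class (0): the roles of the counts are mirrored.
recall_neg = tn / (tn + fp)
precision_neg = tn / (tn + fn)

print(f"test samples: {total}, accuracy: {accuracy:.3f}")
print(f"class 1: precision {precision_pos:.3f}, recall {recall_pos:.3f}")
print(f"class 0: precision {precision_neg:.3f}, recall {recall_neg:.3f}")
```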
Key Learnings
- Class imbalance is a common problem in medical datasets and can hide poor detection of the minority class behind a good accuracy figure
- SMOTE addresses imbalance by generating synthetic minority-class samples interpolated between existing ones
- Precision and recall trade off against each other; which one to favor depends on the domain
- In healthcare, higher recall is often preferred to minimize missed diagnoses