Diabetes Prediction - KNN Classification with SMOTE

Overview

This project builds a machine learning model to predict diabetes diagnosis using K-Nearest Neighbors (KNN) classification. It implements SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance and improve detection of positive diabetes cases.

Project Highlights

Algorithm: K-Nearest Neighbors (KNN) with k=13
Accuracy: 78% without SMOTE, 73% with SMOTE
Key Achievement: 73% recall on the diabetic class with SMOTE (up from 53% without it)
False Negatives Reduced: From 35 to 20 cases with SMOTE balancing

Dataset

Source: Diabetes dataset (769 samples)

Features (8 clinical indicators):

Pregnancies
Glucose level
Blood Pressure
Skin Thickness
Insulin level
BMI (Body Mass Index)
Diabetes Pedigree Function
Age

Target: Outcome (Binary: 0 = Non-diabetic, 1 = Diabetic)

Project Structure

├── diabetes_prediction.ipynb    # Main Jupyter notebook with complete analysis
├── diabetes.csv                 # Dataset file
└── README.md

Key Steps

1. Exploratory Data Analysis (EDA)

Data shape and info analysis
Missing values and duplicates check
Distribution analysis with histograms
Boxplots for outlier detection
Correlation matrix heatmap
Pair plots by outcome class
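
The tabular checks in this step can be sketched with pandas roughly as follows. This is a minimal illustration rather than the notebook's actual code: the `basic_checks` helper and the small demo frame are hypothetical, and the column names simply mirror the feature list above.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame, target: str = "Outcome") -> dict:
    """Collect shape, missing values, duplicate rows, class balance and target correlations."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "class_counts": df[target].value_counts().to_dict(),
        # Pearson correlation of every other column with the target.
        "corr_with_target": df.corr()[target].drop(target).to_dict(),
    }


# Hypothetical demo frame; in the project this would be pd.read_csv("diabetes.csv").
demo = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 85, 137],
        "BMI": [33.6, 26.6, 23.3, 26.6, None],
        "Outcome": [1, 0, 1, 0, 1],
    }
)
report = basic_checks(demo)
print(report["shape"], report["duplicate_rows"], report["class_counts"])
```

The histograms, boxplots, heatmap and pair plots would typically build on the same DataFrame with matplotlib and seaborn.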

2. Data Preprocessing

Feature scaling using StandardScaler
Train-test split (70-30)
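
A minimal sketch of the split and scaling, assuming synthetic data from `make_classification` as a stand-in for `diabetes.csv`. The `random_state` and `stratify` settings are assumptions, not values taken from the notebook. The sketch fits the scaler on the training rows only so that test-set statistics do not leak into training; the notebook's exact order of operations may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features and a minority positive class (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=5, weights=[0.65, 0.35], random_state=42
)

# 70-30 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```

Scaling matters for KNN in particular because the algorithm relies on distances, so features with large numeric ranges would otherwise dominate the neighbor search.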

3. Model Development

KNN hyperparameter search over k = 1 to 14
Selected k=13, which gave the best test score
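
The search over k can be sketched as a simple loop. The data here is synthetic (an assumption), so the best k it reports will not necessarily be 13; the loop only illustrates the procedure of scoring each k on the held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled diabetes features (assumption).
X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

test_scores = {}
for k in range(1, 15):  # k = 1 .. 14
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    test_scores[k] = knn.score(X_test, y_test)  # mean accuracy on the test split

best_k = max(test_scores, key=test_scores.get)
print(best_k, round(test_scores[best_k], 3))
```

Because k is chosen on the same test split that is later used for reporting, the reported score can be slightly optimistic; cross-validation on the training data is a common alternative.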

4. Class Imbalance Handling

Applied SMOTE to balance training data
Compared model performance before and after

5. Model Evaluation

Confusion matrix
Classification report (precision, recall, F1-score)
Performance metrics comparison
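
The evaluation calls can be sketched with scikit-learn's metrics module. The labels below are hypothetical values for ten patients, chosen only to make the output easy to verify by hand; they are not project results.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels (1 = diabetic), not project results.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 2 1 3

# Per-class precision, recall and F1-score.
print(classification_report(y_true, y_pred, target_names=["Non-diabetic", "Diabetic"]))
```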

Results Comparison

Before SMOTE

Recall (Diabetic Class): 53%
Precision (Diabetic Class): 71%
Overall Accuracy: 78%
False Negatives: 35

After SMOTE

Recall (Diabetic Class): 73%
Precision (Diabetic Class): 56%
Overall Accuracy: 73%
False Negatives: 20

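Insight: SMOTE raises recall for the diabetic class from 53% to 73%, cutting missed cases (false negatives) from 35 to 20, a 43% reduction. The cost is lower precision (71% to 56%) and lower overall accuracy (78% to 73%). In medical contexts, reducing false negatives is often prioritized so that cases do not go undetected.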
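
The arithmetic behind these figures can be checked in a few lines. The helper names are hypothetical; the counts are the false negatives reported above and the true-positive count from the SMOTE confusion matrix listed under Model Performance Metrics.

```python
def relative_reduction(before: int, after: int) -> float:
    """Fraction by which a count dropped, relative to its starting value."""
    return (before - after) / before


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model detected."""
    return tp / (tp + fn)


fn_before, fn_after = 35, 20  # false negatives without and with SMOTE
tp_after = 53                 # true positives with SMOTE

print(f"{relative_reduction(fn_before, fn_after):.1%}")  # 42.9%
print(f"{recall(tp_after, fn_after):.1%}")               # 72.6%
```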

Requirements

numpy
pandas
matplotlib
seaborn
scikit-learn
imbalanced-learn

Model Performance Metrics

Confusion Matrix (with SMOTE):

True Negatives: 95
False Positives: 33
False Negatives: 20
True Positives: 53

Classification Report:

Non-diabetic (Class 0): 74% precision, 74% recall
Diabetic (Class 1): 56% precision, 73% recall

Key Learnings

Class imbalance is a critical issue in medical datasets
SMOTE addresses class imbalance by creating synthetic minority-class samples
The right balance between precision and recall depends on domain requirements
In healthcare, higher recall is often preferred to minimize missed diagnoses
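
To illustrate what "synthetic samples" means, the core idea of SMOTE can be sketched as interpolation between a minority-class point and one of its minority-class neighbors. This is a simplified sketch: the `synthesize` helper and the two example points are hypothetical, and the imbalanced-learn implementation additionally handles the nearest-neighbor search and sampling counts.

```python
import numpy as np


def synthesize(x: np.ndarray, neighbor: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a random point on the line segment between x and its neighbor."""
    lam = rng.uniform(0.0, 1.0)  # interpolation factor in [0, 1)
    return x + lam * (neighbor - x)


rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])         # a minority-class sample (hypothetical)
neighbor = np.array([3.0, 6.0])  # one of its minority-class neighbors (hypothetical)

new_point = synthesize(x, neighbor, rng)
print(new_point)
```

Because each new point lies between two real minority samples, the added data stays inside the region the minority class already occupies instead of simply duplicating existing rows.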

Soumyapro/diabetes-prediction-knn