Soumyapro/diabetes-prediction-knn

A machine learning project that predicts diabetes diagnosis using K-Nearest Neighbors (KNN) classification with SMOTE-based class balancing.

Diabetes Prediction - KNN Classification with SMOTE

Overview

This project builds a machine learning model to predict diabetes diagnosis using K-Nearest Neighbors (KNN) classification. It implements SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance and improve detection of positive diabetes cases.

Project Highlights

  • Algorithm: K-Nearest Neighbors (KNN) with k=13
  • Accuracy: 78% without SMOTE, 73% with SMOTE
  • Key Achievement: 73% recall for diabetes detection (improved from 53% without SMOTE)
  • False Negatives Reduced: From 35 to 20 cases with SMOTE balancing

Dataset

Source: Diabetes dataset (769 samples)

Features (8 clinical indicators):

  • Pregnancies
  • Glucose level
  • Blood Pressure
  • Skin Thickness
  • Insulin level
  • BMI (Body Mass Index)
  • Diabetes Pedigree Function
  • Age

Target: Outcome (Binary: 0 = Non-diabetic, 1 = Diabetic)

Project Structure

├── diabetes_prediction.ipynb    # Main Jupyter notebook with complete analysis
├── diabetes.csv                 # Dataset file
└── README.md

Key Steps

1. Exploratory Data Analysis (EDA)

  • Data shape and info analysis
  • Missing values and duplicates check
  • Distribution analysis with histograms
  • Boxplots for outlier detection
  • Correlation matrix heatmap
  • Pair plots by outcome class

2. Data Preprocessing

  • Feature scaling using StandardScaler
  • Train-test split (70-30)
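
The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the notebook's code: it uses synthetic stand-in data from `make_classification` rather than `diabetes.csv`, and the `random_state`, `stratify` choice, and variable names are assumptions. The sketch fits the scaler on the training split only, which avoids leaking test-set statistics; the notebook may apply the steps in a different order.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: 8 numeric features and an imbalanced binary target
# (hypothetical; the notebook reads diabetes.csv instead).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=42,
)

# 70-30 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Fit the scaler on the training part only, then reuse it on the test part.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Scaling matters for KNN in particular because the algorithm compares raw distances, so a feature with a large numeric range would otherwise dominate the neighbor search.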

3. Model Development

  • Searched the number of neighbors k over the range 1 to 14
  • Selected k=13, the value with the highest test-set score
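
A search over k of this kind might look like the sketch below. It is an assumption-laden illustration on synthetic data (the dataset, `random_state`, and the names `test_scores` and `best_k` are not from the notebook), so the k it picks will not necessarily be 13.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (hypothetical).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Train one KNN model per k in 1..14 and record its test accuracy.
test_scores = {}
for k in range(1, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    test_scores[k] = model.score(X_test, y_test)

# Keep the k with the highest test accuracy.
best_k = max(test_scores, key=test_scores.get)
print(f"best k = {best_k}, test accuracy = {test_scores[best_k]:.3f}")
```

Choosing k by test score tends to make the reported test accuracy slightly optimistic, since the same data is used for both selection and evaluation; cross-validation on the training split (for example with `GridSearchCV`) is a common alternative.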

4. Class Imbalance Handling

  • Applied SMOTE to balance training data
  • Compared model performance before and after

5. Model Evaluation

  • Confusion matrix
  • Classification report (precision, recall, F1-score)
  • Performance metrics comparison
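
The evaluation outputs listed above can be produced with scikit-learn's metrics module. The sketch below uses a small hand-made set of labels and predictions (hypothetical values, not results from this project) to show how the confusion matrix cells and the classification report are obtained.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions: 0 = Non-diabetic, 1 = Diabetic.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Per-class precision, recall, and F1-score.
print(classification_report(
    y_true, y_pred, target_names=["Non-diabetic", "Diabetic"]
))
```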

Results Comparison

Before SMOTE

  • Recall (Diabetic Class): 53%
  • Precision (Diabetic Class): 71%
  • Overall Accuracy: 78%
  • False Negatives: 35

After SMOTE

  • Recall (Diabetic Class): 73%
  • Precision (Diabetic Class): 56%
  • Overall Accuracy: 73%
  • False Negatives: 20

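Insight: SMOTE raises recall for the diabetic class from 53% to 73% and cuts false negatives from 35 to 20, about 43% fewer missed cases. The cost is lower precision (71% to 56%) and lower overall accuracy (78% to 73%). In medical contexts, reducing false negatives is often prioritized so that cases don't go undetected.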
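
The 43% figure is the relative drop in false negatives, which can be checked with a few lines of arithmetic (variable names are illustrative):

```python
# False negatives reported in the results comparison.
fn_before = 35  # without SMOTE
fn_after = 20   # with SMOTE

# Relative reduction in missed diabetic cases.
relative_reduction = (fn_before - fn_after) / fn_before
print(f"{relative_reduction:.1%}")  # 42.9%
```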

Requirements

numpy
pandas
matplotlib
seaborn
scikit-learn
imbalanced-learn

Model Performance Metrics

Confusion Matrix (with SMOTE):

True Negatives: 95
False Positives: 33
False Negatives: 20
True Positives: 53
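
As a quick check, recall and accuracy can be recomputed directly from these four counts. The sketch below uses the standard definitions; the variable names are illustrative.

```python
# Confusion matrix counts reported with SMOTE.
tn, fp, fn, tp = 95, 33, 20, 53

# Recall for the diabetic class: share of actual positives that were caught.
recall_diabetic = tp / (tp + fn)
# Recall for the non-diabetic class: share of actual negatives kept negative.
recall_non_diabetic = tn / (tn + fp)
# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"{recall_diabetic:.3f}")      # 0.726
print(f"{recall_non_diabetic:.3f}")  # 0.742
print(f"{accuracy:.3f}")             # 0.736
```

These values are close to the recall and accuracy figures quoted in this README.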

Classification Report:

  • Non-diabetic (Class 0): 74% precision, 74% recall
  • Diabetic (Class 1): 56% precision, 73% recall

Key Learnings

  1. Class imbalance is a critical issue in medical datasets
  2. SMOTE addresses class imbalance by creating synthetic minority-class samples
  3. Precision and recall trade off against each other; which one to favor depends on domain requirements
  4. In healthcare, higher recall is often preferred to minimize missed diagnoses