NhanPhamThanh-IT/K-Nearest-Neighbors-Tutorial
📘 This repository offers a complete K-Nearest Neighbors (KNN) tutorial, guiding you from core theory to hands-on practice. Learn to implement KNN from scratch with NumPy, apply it using scikit-learn, and explore visualizations, datasets, and Jupyter notebooks to fully understand, test, and optimize the algorithm.
K-Nearest Neighbors (KNN) Tutorial 📘
Comprehensive, concept-to-code walkthrough of the KNN algorithm for both classification and regression: theory, intuition, math, helper utilities, notebook experimentation, and a roadmap for extending to a full reusable implementation.
🔍 Table of Contents
- Motivation & Goals
- What Is KNN? (High-Level Intuition)
- Key Concepts (Supervised / Instance-Based / Non-Parametric)
- When to Use KNN / When Not To
- Mathematical Foundations
- Distance Metrics (Focus: L2 / Euclidean)
- Choosing K (Bias–Variance & Practical Heuristics)
- Classification vs Regression Behavior
- Handling Ties, Weights & Edge Cases
- Data Preparation (Scaling, Cleaning, Dimensionality)
- Complexity & Performance Considerations
- Repository Structure
- Installation & Environment Setup
- Quick Start Examples (From-Scratch Logic & Helper Utilities)
- Typical Workflow (End‑to‑End)
- Evaluation Metrics (Classification & Regression)
- Visualizations & Exploratory Analysis
- Extending the Project (Roadmap)
- Common Pitfalls & Debugging Checklist
- Reference Reading & Further Study
- License
1. Motivation & Goals
This tutorial aims to give you BOTH the theory (why KNN works, its trade‑offs) and the practice (how to implement core pieces, experiment in a notebook, evaluate, and optimize). It is intentionally incremental so you can:
- Start with conceptual foundations.
- Use helper utilities (distance calculation, majority/mean aggregation).
- Prototype classification & regression in a notebook.
- Later add a reusable KNN class (see roadmap).
2. What Is KNN? (High-Level Intuition)
KNN predicts the label (class or numeric value) of a query point by looking at the K most similar (closest) points in the training set. No parameters are “fit” in advance—prediction is deferred until you ask a question of the model. It is sometimes called a memory-based, lazy, or instance-based method.
3. Key Concepts
Linked docs in docs/ expand each topic:
- Supervised Learning (
supervised-learning.md): labeled examples drive learning. - Classification / Regression (
classification-problem.md,regression-problem.md). - Instance-Based (
instance-base.md): store all examples; compare at query time. - Non-Parametric (
non-parametric.md): flexibility grows with data volume.
4. When to Use / When Not To
Use KNN when:
- Decision boundary is irregular / hard to parametrize.
- Dataset is small-to-medium (memory fits; latency acceptable).
- You want a strong baseline quickly.
Avoid or be cautious when:
- Very large datasets → prediction cost O(N * d) per query.
- High-dimensional feature spaces (curse of dimensionality).
- Features on different scales (must scale/normalize first).
- Many irrelevant/noisy features (distance metric degrades).
5. Mathematical Foundations (Core Mechanics)
Given training set T = {(xᵢ, yᵢ)} for i = 1..N and a query point x*:
- Compute distance d(x*, xᵢ) for all i.
- Select K indices of smallest distances: N_K.
- Classification: predict mode({ yᵢ | i ∈ N_K }).
- Regression: predict average or weighted average of { yᵢ | i ∈ N_K }.
Pseudocode (Generic)
function predict(x*, X_train, y_train, K, distance_fn, task, weight=False):
distances = []
for i in range(len(X_train)):
d = distance_fn(x*, X_train[i])
distances.append( (d, y_train[i]) )
distances.sort by d ascending
neighbors = first K elements
if task == 'classification':
return majority_vote([label for (_, label) in neighbors])
else: # regression
if weight:
return weighted_mean(neighbors, w = 1 / (d + ε))
else:
return simple_mean(neighbors)
6. Distance Metrics (Focus: L2)
The Euclidean (L2) distance is most common (see L2-distance.md). Other options:
- Manhattan (L1)
- Minkowski (generalized p-norm)
- Cosine distance (if direction matters more than magnitude)
- Hamming distance (categorical / binary vectors)
Metric choice affects neighbor relationships—experiment.
7. Choosing K
Trade‑off:
- Small K → low bias, high variance (sensitive to noise).
- Large K → smoother, higher bias (can underfit).
Heuristics: - Try odd K for binary classification (avoid ties).
- Start with sqrt(N) as a coarse upper bound to explore.
- Use cross‑validation grid (e.g., K in {1,3,5,7,11,15,…}).
8. Classification vs Regression Behavior
- Classification: Majority vote; can introduce distance weighting (e.g., weight = 1 / (d + ε)).
- Regression: Mean vs weighted mean; handle zero distance (exact duplicate) by immediately returning that label/value or assigning infinite weight (see helper function approach).
9. Handling Ties, Weights & Edge Cases
Edge considerations:
- Ties in classification: break by (a) smallest cumulative distance, (b) deterministic ordering, or (c) reducing K by 1 recursively.
- Distance zero: identical point encountered—shortcut return.
- Missing values: impute before KNN (mean / median / KNN-imputer) else distance invalid.
- Feature scaling: always scale numerical features (e.g., StandardScaler / MinMaxScaler) to prevent dominance.
10. Data Preparation
Recommended pipeline:
- Clean anomalies / duplicates.
- Encode categorical variables (one-hot, ordinal if appropriate).
- Scale numeric features.
- Optionally reduce dimensionality (PCA) if d is large.
- Split into train / validation / test.
11. Complexity & Performance
- Training: O(1) (just store data).
- Prediction (naïve): O(N * d) per query.
- Memory: O(N * d).
Acceleration options (future roadmap): - KD-Tree / Ball Tree (works best for moderate dimensions < ~30).
- Locality Sensitive Hashing (approximate for high-d sparse data).
- Annoy / FAISS for large-scale approximate neighbor search.
12. Repository Structure
├── docs/ # Theory & concept explanations
│ ├── KNN.md
│ ├── L2-distance.md
│ ├── instance-base.md
│ ├── non-parametric.md
│ ├── classification-problem.md
│ ├── regression-problem.md
│ └── supervised-learning.md
├── notebook/
│ ├── practice.ipynb # Interactive experimentation space
│ └── utils/
│ ├── HelperFunctions.py # Plotting, L2 distance, frequency & prediction helpers
│ └── __init__.py
├── resource/
│ └── tutorial.pdf # Supplemental reading (overview)
├── LICENSE
└── README.md
13. Installation & Environment Setup
Minimal dependencies (implied by helper code & typical usage):
- Python >= 3.10
- numpy
- pandas
- matplotlib
Create & activate environment (examples):
# (Windows PowerShell) create venv
python -m venv .venv
./.venv/Scripts/Activate.ps1
pip install --upgrade pip
pip install numpy pandas matplotlib scikit-learn
Optional (for experiments):
- seaborn (nicer visuals)
- jupyter / ipykernel
pip install seaborn jupyter ipykernel
python -m ipykernel install --user --name knn-tutorial
14. Quick Start Examples
Helper: Compute L2 Distance
from notebook.utils.HelperFunctions import HelperFunctions
p1 = [1, 2, 3]
p2 = [4, 6, 8]
dist = HelperFunctions.calculateL2Distance(p1, p2)
print(dist)Majority Vote (Classification) Using a Pandas DataFrame Column
import pandas as pd
from notebook.utils.HelperFunctions import HelperFunctions
df = pd.DataFrame({
'label': ['A','B','A','C','A','B']
})
pred = HelperFunctions.predictionClassification(df, 'label')
print('Predicted label:', pred)Simple Weighted Regression Aggregation
Assume a DataFrame of neighbor targets & distances:
import pandas as pd
from notebook.utils.HelperFunctions import HelperFunctions
neighbors = pd.DataFrame({
'value': [10.0, 12.0, 15.0],
'distance': [0.5, 1.2, 2.0]
})
estimate = HelperFunctions.predictionRegression(neighbors, target='value', distance='distance', weight=True)
print('Weighted estimate:', estimate)Skeleton: Manual KNN Classification Loop (From Scratch)
import math
import numpy as np
from collections import Counter
def l2(a, b):
return math.dist(a, b)
def knn_predict(x_query, X_train, y_train, K=5):
distances = [(l2(x_query, X_train[i]), y_train[i]) for i in range(len(X_train))]
distances.sort(key=lambda t: t[0])
topk = [label for (_, label) in distances[:K]]
return Counter(topk).most_common(1)[0][0]
X_train = np.array([[1,2],[2,3],[3,3],[6,5],[7,8]])
y_train = np.array(['A','A','B','B','B'])
print(knn_predict([2.5,2.5], X_train, y_train, K=3))15. Typical Workflow (End‑to‑End)
- Load dataset (CSV or synthetic generation).
- Split train/validation/test.
- Scale numeric features (StandardScaler / MinMax).
- Loop over candidate K values; evaluate.
- Select best K; retrain (store training set + chosen hyperparams).
- Evaluate on test set; record metrics.
- Visualize decision boundaries (2D) or performance curves.
- Package as a reusable class (future step).
16. Evaluation Metrics
Classification:
- Accuracy (baseline)
- Precision / Recall / F1 (esp. for imbalance)
- Confusion Matrix
- ROC-AUC (probability-style adaptation via distance weighting / inverse aggregated scores)
Regression:
- MAE, MSE / RMSE
- R²
- Median Absolute Error (robust to outliers)
17. Visualization & Exploration Ideas
Use matplotlib (and optionally seaborn) to:
- Scatter plot colored by class.
- Plot effect of K vs validation accuracy (elbow view).
- Distance distribution histogram for query points.
- Decision boundary grid (for 2D synthetic datasets).
The HelperFunctions.drawScatterPlot provides a light starting point.
18. Extending the Project (Roadmap)
Planned / Suggested Enhancements:
- Add a
KNNClassifierandKNNRegressorclass (API similar to scikit-learn:fit,predict). - Support multiple distance metrics with strategy pattern.
- Implement KD-Tree / Ball Tree fallback based on dimension threshold.
- Add unit tests (pytest) for helper utilities & future classes.
- Add benchmarking script (timing naive vs tree-based search).
- Integrate synthetic dataset generators (circles, moons, blobs).
- Provide decision boundary visualization utility.
- Add cross-validation helper.
19. Common Pitfalls & Debugging Checklist
- Forgetting to scale features → distorted neighbor ranking.
- Using too large K → overly smooth / majority class dominance.
- Imbalanced classes → evaluate with F1, not only accuracy.
- High dimensional sparse vectors → consider approximate methods.
- Duplicated rows with contradictory labels → clean data.
- Performance bottleneck → pre-compute structures or batch queries.
20. Reference Reading & Further Study
Concept docs inside docs/ are fully self-contained. For deeper dives:
- Pattern Recognition & Machine Learning – C. M. Bishop (Ch. on non-parametric methods)
- Elements of Statistical Learning – Hastie, Tibshirani, Friedman (KNN & kernel methods)
- scikit-learn user guide (Nearest Neighbors section)
- Research approximate nearest neighbor libraries: FAISS, Annoy, HNSW.
21. License
This project is distributed under the terms of the MIT License. See LICENSE for details.
🤝 Contributions
Feel free to fork & open a pull request adding:
- A formal KNN implementation
- More distance metrics
- Performance benchmarks
- Additional instructional notebooks
✅ Summary
This repository provides conceptual foundations, helper utilities, and a sandbox (practice.ipynb) to learn and extend the K-Nearest Neighbors algorithm. Build from here toward a robust, tested mini-library and experiment with optimizations as datasets grow.
🏁 Conclusion
K-Nearest Neighbors remains one of the most intuitive algorithms in machine learning—powerful as a baseline, instructive for learning core concepts like distance, feature scaling, and bias–variance trade-offs. This tutorial equips you with:
- A solid theoretical grounding (supervised / non‑parametric / instance‑based context).
- Practical utility functions (distance, voting, weighting) to bootstrap experiments.
- A structured workflow for evaluation, tuning, and visualization.
- A forward-looking roadmap to transform learning artifacts into a reusable mini-framework.
Suggested next steps for you:
- Implement a
KNNClassifierclass withfit,predict, and optionalpredict_proba(distance-weighted vote proportions). - Add a
KNNRegressorcounterpart with optional kernel weighting (e.g., Gaussian kernel over distances). - Introduce hyperparameter search (grid / randomized) with cross-validation wrappers.
- Benchmark naive vs tree-based neighbor search on synthetic datasets of increasing size.
- Package your implementation (e.g.,
pip install -e .) and publish as a learning artifact.
If you build upon this, consider documenting performance trade-offs and adding reproducible experiment scripts. Your future self (and contributors) will thank you.
Happy learning & exploring—may your nearest neighbors always be informative! 🚀