Amey-Thakur/TSF-UNSUPERVISED-MACHINE-LEARNING
Task: From the given ‘Iris’ dataset, predict the optimum number of clusters and represent it visually.
Unsupervised Machine Learning
A clustering analysis demonstrating the application of K-Means Algorithm to identify optimum clusters in the Iris dataset and visualize the underlying patterns.
Authors · Overview · Features · Structure · Results · Quick Start · Usage Guidelines · License · About · Acknowledgments
Important
🤝🏻 Special Acknowledgement
Special thanks to Mega Satish for her meaningful contributions, guidance, and support that helped shape this work.
Overview
Unsupervised Machine Learning - Task 2 is a core Data Science exploration conducted under the Graduate Rotational Internship Program (GRIP) at The Sparks Foundation. The project focuses on the Iris Dataset to predict the optimum number of clusters and represent them visually.
By leveraging the K-Means Clustering algorithm, the system iteratively partitions data points into K groups, reassigning each point to its nearest centroid and recomputing centroids until the assignments stabilize.
Computational Objectives
The analysis is governed by strict exploratory principles ensuring cluster validity:
- Elbow Method: Determining the optimal value of $K$ by plotting the Within-Cluster Sum of Squares (WCSS).
- Cluster Centroids: Computing the central vector for each species group.
- Dimensionality Visualization: Plotting the classified clusters in a 2D feature space.
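The Elbow Method listed above can be sketched as a short loop: fit K-Means for a range of K values and record the WCSS (exposed by scikit-learn as `inertia_`). The parameter choices (`n_init=10`, `random_state=42`) are illustrative defaults, not necessarily those used in the notebook.

```python
# Sketch of the Elbow Method: fit K-Means for K = 1..10 and record
# the Within-Cluster Sum of Squares (WCSS, scikit-learn's inertia_).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # 150 samples x 4 floral features

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)  # WCSS shrinks as K grows

# WCSS decreases monotonically with K; the "elbow" is where the
# marginal drop flattens out, which for Iris is around K = 3.
print(wcss)
```

Plotting `wcss` against `range(1, 11)` reproduces the elbow curve shown in the Results section.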
Tip
Algorithm Sensitivity: K-Means clustering assumes roughly spherical clusters of similar density. Given the slight overlap between the Versicolor and Virginica classes, the algorithm's performance highlights the importance of feature scaling and centroid initialization in distance-based separation tasks.
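Because K-Means relies on Euclidean distance, features measured on larger scales dominate the objective. A minimal sketch of standardizing the features before clustering (the original notebook may cluster the raw measurements; scaling here is an illustrative precaution):

```python
# Standardize each feature to zero mean and unit variance before
# clustering, so no single dimension dominates the distance metric.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)  # per-feature z-scores

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
km.fit(X_scaled)
print(km.cluster_centers_.shape)  # one 4-D centroid per cluster
```

Setting `init="k-means++"` with several restarts (`n_init=10`) also mitigates the sensitivity to centroid initialization noted above.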
Features
| Component | Technical Description |
|---|---|
| Ingestion Pipeline | Automated data retrieval and parsing using Pandas for the Iris dataset. |
| Optimal K Selection | Implementation of the Elbow Method to minimize WCSS/Inertia. |
| Model Architecture | Application of KMeans from Scikit-Learn for centroid initialization and fitting. |
| Visualization | Scatter plotting of clusters and centroids using Matplotlib. |
| Cluster Prediction | Assigning new sample points to the nearest established cluster. |
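The Cluster Prediction component in the table maps to `KMeans.predict`, which assigns an unseen sample to the nearest learned centroid. The measurements below are hypothetical example values, not taken from the notebook:

```python
# Assign a new flower measurement to the nearest fitted centroid.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Hypothetical sample: sepal length/width, petal length/width (cm).
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
label = km.predict(new_flower)[0]
# Note: the result is an arbitrary cluster index (0-2), not a species
# name; mapping indices to species requires inspecting the clusters.
print(label)
```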
Note
Empirical Context
The dataset comprises multivariate floral attributes (Sepal/Petal dimensions). The distinct separation of species in the feature space justifies the use of K-Means Clustering to identify inherent groupings without labeled supervision. The Elbow Method empirically validates the choice of K = 3, which aligns with the three known Iris species.
Tech Stack
- Runtime: Python 3.x
- Data Manipulation: Pandas, NumPy
- Visualization: Matplotlib
- Machine Learning: Scikit-Learn (sklearn)
- Environment: Jupyter Notebook / Google Colab
Project Structure
TSF-UNSUPERVISED-MACHINE-LEARNING/
│
├── docs/ # Technical Documentation
│ └── SPECIFICATION.md # Architecture & Design Specification
│
├── Mega/ # Archival Attribution Assets
│ ├── Filly.jpg # Companion (Filly)
│ ├── Mega.png # Author Profile Image (Mega Satish)
│ └── ... # Additional Attribution Files
│
├── Source Code/ # Core Implementation
│ └── TSF_INTERNSHIP_TASK_2_UNSUPERVISED_LEARNING.ipynb # Jupyter Notebook (Analysis Kernel)
│ └── Iris.csv # Empirical Data Source
│
├── screenshots/ # Result Visualization (Empty for setup)
│
├── .gitattributes # Git configuration
├── .gitignore # Repository Filters
├── CITATION.cff # Scholarly Citation Metadata
├── codemeta.json # Machine-Readable Project Metadata
├── LICENSE # MIT License Terms
├── README.md # Project Documentation
└── SECURITY.md # Security Policy
Results
1. Optimal K Selection: Elbow Method
Determining the ideal number of clusters by minimizing WCSS.
In the plot above, a clear elbow forms at K = 3; the optimum number of clusters is therefore 3.
2. Model Inference: Cluster Visualization
Scatter plot exhibiting the separation of Iris species into 3 distinct clusters: Setosa (Blue), Versicolor (Green), and Virginica (Yellow), with Centroids marked in Red.
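A scatter plot of this kind can be sketched as follows, using the first two features for the axes and the caption's color scheme. The cluster-to-species mapping and colors are illustrative; actual cluster indices depend on initialization.

```python
# Sketch of the cluster visualization: points colored by cluster
# label, with fitted centroids overlaid in red.
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

fig, ax = plt.subplots()
for k, color in zip(range(3), ("blue", "green", "yellow")):
    pts = X[km.labels_ == k]
    ax.scatter(pts[:, 0], pts[:, 1], c=color, label=f"Cluster {k}")
ax.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
           c="red", marker="*", s=200, label="Centroids")
ax.set_xlabel("Sepal length (cm)")
ax.set_ylabel("Sepal width (cm)")
ax.legend()
fig.savefig("clusters.png")
```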

Quick Start
1. Prerequisites
- Python 3.7+: Required for runtime execution. Download Python
- Jupyter Environment: For interactive code execution (JupyterLab or Notebook).
Warning
Data Path Integrity
The analysis kernel relies on relative file paths. Ensure Iris.csv remains accessible to the notebook. Modifying the directory structure without updating the ingestion logic will result in FileNotFoundError during runtime.
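One defensive pattern for the path issue flagged above is to resolve the dataset location before ingestion, so a moved file fails with a clear message instead of a bare FileNotFoundError. This is a sketch; the notebook's actual ingestion logic may differ.

```python
# Verify the dataset path relative to the repository root before
# attempting ingestion.
import pandas as pd
from pathlib import Path

DATA_PATH = Path("Source Code") / "Iris.csv"

if DATA_PATH.exists():
    iris = pd.read_csv(DATA_PATH)
else:
    print(f"Dataset not found at {DATA_PATH.resolve()}; "
          "run the notebook from the repository root.")
```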
2. Installation
Establish the local environment by cloning the repository and installing the computational stack:
# Clone the repository
git clone https://github.com/Amey-Thakur/TSF-UNSUPERVISED-MACHINE-LEARNING.git
cd TSF-UNSUPERVISED-MACHINE-LEARNING
# Install clustering dependencies
pip install pandas numpy matplotlib seaborn scikit-learn
3. Execution
Launch the analysis kernel to reproduce the findings:
jupyter notebook "Source Code/TSF_INTERNSHIP_TASK_2_UNSUPERVISED_LEARNING.ipynb"
Tip
Interactive Cluster Analytics | Iris Species Grouping
Explore the Live Demo to visualize the K-Means Clustering process in action. The interactive result analysis showcases Optimal K Selection via the Elbow Method and high-fidelity Cluster Visualization, demonstrating the separation of Iris species into three distinct clusters based on multivariate floral attributes.
Usage Guidelines
This repository is openly shared to support learning and knowledge exchange across the academic community.
For Students
Use this project as reference material for understanding unsupervised learning pipelines, K-Means clustering, and optimum cluster prediction. The source code is available for study to facilitate self-paced learning and exploration of hyperparameter tuning (Elbow Method).
For Educators
This project may serve as a practical lab example or supplementary teaching resource for Data Science and Applied Statistics courses. Attribution is appreciated when utilizing content.
For Researchers
The documentation and architectural approach may provide insights into academic project structuring, predictive inference, and industrial internship artifacts.
License
This academic submission, developed for the Graduate Rotational Internship Program (GRIP) at The Sparks Foundation, is made available under the MIT License. See the LICENSE file for complete terms.
Note
Summary: You are free to share and adapt this content for any purpose, even commercially, as long as you provide appropriate attribution to the original authors.
Copyright © 2021 Amey Thakur & Mega Satish
About This Repository
Created & Maintained by: Amey Thakur & Mega Satish
Role: Data Science & Business Analytics Interns
Program: Graduate Rotational Internship Program (GRIP)
Organization: The Sparks Foundation
This project features Unsupervised Machine Learning - Task 2, a clustering analytics study conducted as part of the GRIP Internship. It explores the application of K-Means to solve grouping problems.
Connect: GitHub · LinkedIn · ORCID
Acknowledgments
Grateful acknowledgment to Mega Satish for her exceptional collaboration and scholarly partnership during the execution of this data science internship task. Her analytical precision, deep understanding of clustering algorithms, and constant support were instrumental in refining the models used in this study. Working alongside her was a transformative experience; her thoughtful approach to problem-solving and steady encouragement turned complex analytical challenges into meaningful learning moments. This work reflects the growth and insights gained from our side-by-side academic journey. Thank you, Mega, for everything you shared and taught along the way.
Special thanks to the mentors at The Sparks Foundation for providing this platform for rapid skill development and industrial exposure.