Exploratory Data Analysis on Agricultural Dataset
A structured and detailed Exploratory Data Analysis (EDA) project using real-world agricultural production data from Indian states and districts. This project uses Python, Pandas, NumPy, Matplotlib, and Seaborn to clean, analyze, and summarize large-scale crop production data.
Table of Contents
Overview
Dataset
Objectives
Technologies Used
Exploratory Analysis Summary
Visualizations
How to Use
Project Structure
Future Work
License
Overview
This project demonstrates how to load, clean, and explore a real agricultural dataset. It includes basic statistics, missing value handling, categorical breakdowns, group-based aggregation, and initial insights generation. It’s designed as a companion chapter for learners mastering Pandas and NumPy for data preprocessing.
Dataset
File Name: Agriculture.csv
Records: 344,208 rows × 10 columns
Features include:
State, District, Crop, Year, Season
Area, Area Units, Production, Production Units, Yield
Missing Values:
Present in Crop, Production, Area, and Yield
Duplicates:
Removed during preprocessing
Objectives
Load and inspect large CSV datasets
Handle missing and duplicate values
Understand categorical distributions (states, crops, seasons)
Aggregate production data by crop and region
Identify key insights in agricultural trends
Recommend next-step visualizations and analyses
Technologies Used
Python 3
Pandas
NumPy
Matplotlib
Seaborn
Jupyter Notebook
Exploratory Analysis Summary
Used .info(), .describe(), .value_counts(), and .groupby() to explore the dataset
Cleaned unnecessary columns like 'Unnamed: 0'
Identified top crops by record frequency and total production
Assessed missing value patterns and potential strategies for imputation
Suggested visual enhancements for count plots, bar plots, and heatmaps
Visualizations (Suggested Additions)
While this notebook focuses on tabular EDA, the following visuals are recommended for extension:
Bar plot of top 10 crops by frequency
Bar plot of total production per crop
Countplot of records per state
Heatmap of correlation matrix (e.g., between Area, Production, Yield)
Time-series trends by year
How to Use
Clone this repository:
git clone https://github.com/your-username/agriculture-eda.git
Install requirements (optional):
pip install pandas numpy matplotlib seaborn jupyter
Open the notebook:
jupyter notebook EDA_agri_dataset.ipynb
Run the notebook cell by cell to follow the EDA workflow
Project Structure
├── EDA_agri_dataset.ipynb # Main notebook with step-by-step EDA
├── Agriculture.csv # Raw dataset file
├── README.md # Project documentation
├── (Optional: requirements.txt)
Future Work
Add visualization cells using Seaborn and Matplotlib
Time series analysis for crop yields and seasonal production
Feature engineering for machine learning
Outlier detection and data normalization
Export clean data for model building
License
This project is open-source and available under the MIT License.