GitHunt
PR

premkumarkora/EDA_Agri_dataset

Exploratory Data Analysis on Agricultural Dataset

A structured and detailed Exploratory Data Analysis (EDA) project using real-world agricultural production data from Indian states and districts. This project uses Python, Pandas, NumPy, Matplotlib, and Seaborn to clean, analyze, and summarize large-scale crop production data.

Table of Contents
Overview

Dataset

Objectives

Technologies Used

Exploratory Analysis Summary

Visualizations

How to Use

Project Structure

Future Work

License

Overview

This project demonstrates how to load, clean, and explore a real agricultural dataset. It includes basic statistics, missing value handling, categorical breakdowns, group-based aggregation, and initial insights generation. It’s designed as a companion chapter for learners mastering Pandas and NumPy for data preprocessing.

Dataset

File Name: Agriculture.csv

Records: 344,208 rows × 10 columns

Features include:

State, District, Crop, Year, Season
Area, Area Units, Production, Production Units, Yield
Missing Values:
Present in Crop, Production, Area, and Yield

Duplicates:
Removed during preprocessing

Objectives

Load and inspect large CSV datasets

Handle missing and duplicate values

Understand categorical distributions (states, crops, seasons)

Aggregate production data by crop and region

Identify key insights in agricultural trends

Recommend next-step visualizations and analyses

Technologies Used

Python 3

Pandas

NumPy

Matplotlib

Seaborn

Jupyter Notebook

Exploratory Analysis Summary

Used .info(), .describe(), .value_counts(), and .groupby() to explore the dataset

Cleaned unnecessary columns like 'Unnamed: 0'

Identified top crops by record frequency and total production

Assessed missing value patterns and potential strategies for imputation

Suggested visual enhancements for count plots, bar plots, and heatmaps

Visualizations (Suggested Additions)

While this notebook focuses on tabular EDA, the following visuals are recommended for extension:

Bar plot of top 10 crops by frequency

Bar plot of total production per crop

Countplot of records per state

Heatmap of correlation matrix (e.g., between Area, Production, Yield)

Time-series trends by year

How to Use

Clone this repository:

git clone https://github.com/your-username/agriculture-eda.git

Install requirements (optional):

pip install pandas numpy matplotlib seaborn jupyter

Open the notebook:

jupyter notebook EDA_agri_dataset.ipynb

Run the notebook cell by cell to follow the EDA workflow

Project Structure

├── EDA_agri_dataset.ipynb # Main notebook with step-by-step EDA

├── Agriculture.csv # Raw dataset file

├── README.md # Project documentation

├── (Optional: requirements.txt)

Future Work

Add visualization cells using Seaborn and Matplotlib

Time series analysis for crop yields and seasonal production

Feature engineering for machine learning

Outlier detection and data normalization

Export clean data for model building

License

This project is open-source and available under the MIT License.

Contributors

Created March 28, 2025
Updated May 4, 2025