Pranshu936/Data_cleaning
🧹 Data Sweeper Pro+
Data Sweeper Pro+ is an advanced data cleaning and transformation platform built with Streamlit. It allows users to upload datasets, clean them, analyze them with interactive profiling reports, and export the cleaned data in multiple formats. The app is designed for both technical and non-technical users, offering a user-friendly interface with powerful data processing capabilities.
Features
1. File Upload
- Supports multiple file uploads in CSV and Excel formats.
- Handles large datasets efficiently.
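A minimal sketch of how mixed CSV and Excel uploads could be read, assuming each upload arrives as a file name plus a file-like object. The helper names `load_table` and `load_many` are illustrative and are not taken from `app.py`.

```python
import io
from pathlib import Path

import pandas as pd


def load_table(name: str, buffer) -> pd.DataFrame:
    """Read one uploaded file into a DataFrame, choosing the reader by extension."""
    suffix = Path(name).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(buffer)
    if suffix == ".xlsx":
        # pandas reads .xlsx through openpyxl, which is listed in requirements.txt.
        return pd.read_excel(buffer)
    raise ValueError(f"Unsupported file type: {suffix or name}")


def load_many(files: dict) -> dict:
    """Load several uploads; `files` maps a file name to a file-like object."""
    return {name: load_table(name, buf) for name, buf in files.items()}


if __name__ == "__main__":
    sample = io.BytesIO(b"id,score\n1,0.5\n2,0.9\n")
    frames = load_many({"scores.csv": sample})
    print(frames["scores.csv"].head())
```

In a Streamlit app, the objects returned by `st.file_uploader(..., accept_multiple_files=True)` expose a `name` attribute and behave like file objects, so they could be passed to a helper of this shape.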
2. Data Profiling
- Generate an interactive Profile Report using ydata-profiling to explore:
  - Missing values
  - Duplicate rows
  - Data types
  - Statistical summaries
  - Correlations
- Fully interactive HTML report embedded in the app.
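To make the contents of the report concrete, here is a pandas-only sketch that computes the same categories of facts listed above. It is not the ydata-profiling report itself, and the function name `quick_profile` is hypothetical.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> dict:
    """Collect the headline facts a profiling report surfaces, using pandas only."""
    numeric = df.select_dtypes(include="number")
    return {
        "missing_per_column": df.isna().sum().to_dict(),   # missing values
        "duplicate_rows": int(df.duplicated().sum()),      # fully repeated rows
        "dtypes": df.dtypes.astype(str).to_dict(),         # data types
        "summary": numeric.describe(),                     # statistical summaries
        "correlations": numeric.corr(),                    # pairwise Pearson correlations
    }


if __name__ == "__main__":
    demo = pd.DataFrame({"a": [1, 2, 2, None], "b": [2.0, 4.0, 4.0, 8.0]})
    print(quick_profile(demo)["missing_per_column"])
```

For the full interactive report, ydata-profiling provides `ProfileReport(df, title=...)`, and its `to_html()` output can be embedded in a Streamlit page with `streamlit.components.v1.html`. Whether the app uses exactly that route is an assumption.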
3. Data Cleaning
- Remove duplicate rows.
- Handle missing values with strategies like:
  - Drop rows
  - Fill with mean/median
  - KNN imputation
- Normalize numerical columns.
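The cleaning steps above can be sketched as follows. The README does not say which normalization is applied, so min-max scaling to the range 0 to 1 is an assumption here, and the function names are illustrative rather than the app's own.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame, strategy: str = "drop", n_neighbors: int = 5) -> pd.DataFrame:
    """Apply one missing-value strategy: 'drop', 'mean', 'median', or 'knn'."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    if strategy == "drop":
        return out.dropna()
    if strategy in {"mean", "median"}:
        stats = out[numeric_cols].mean() if strategy == "mean" else out[numeric_cols].median()
        out[numeric_cols] = out[numeric_cols].fillna(stats)
        return out
    if strategy == "knn":
        # Imported lazily so the other strategies work without scikit-learn.
        from sklearn.impute import KNNImputer

        # Note: columns that are entirely missing are dropped by the imputer's
        # defaults, so this sketch assumes every numeric column has some data.
        imputer = KNNImputer(n_neighbors=n_neighbors)
        out[numeric_cols] = imputer.fit_transform(out[numeric_cols])
        return out
    raise ValueError(f"Unknown strategy: {strategy}")


def min_max_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Scale each numeric column to [0, 1] (assumed normalization method)."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    col_min = out[numeric_cols].min()
    col_range = out[numeric_cols].max() - col_min
    # A constant column has zero range; it becomes NaN instead of raising.
    out[numeric_cols] = (out[numeric_cols] - col_min) / col_range.replace(0, float("nan"))
    return out


def clean(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Remove duplicates, handle missing values, then normalize numeric columns."""
    deduped = df.drop_duplicates().reset_index(drop=True)
    return min_max_normalize(handle_missing(deduped, strategy))
```

Only numeric columns are imputed and scaled in this sketch; text columns pass through unchanged unless the `drop` strategy removes their rows.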
4. Data Transformations
Column Operations:
- Select specific columns to keep or reorder them.
Data Type Conversion:
- Convert columns to desired data types: string, integer, float, or datetime.
Feature Engineering:
- Add new columns based on existing ones (e.g., sum of two columns).
- Extract date parts (e.g., year from a date column).
- Apply custom formulas for advanced transformations.
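A hedged sketch of these transformations in pandas. All function names are hypothetical, and using `DataFrame.eval` for custom formulas is an assumption about how such formulas might be evaluated, not a description of the app's code.

```python
import pandas as pd


def select_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Keep only the given columns, in the given order."""
    return df[list(columns)]


def convert_column(df: pd.DataFrame, column: str, target: str) -> pd.DataFrame:
    """Convert one column to 'string', 'integer', 'float', or 'datetime'."""
    out = df.copy()
    if target == "string":
        out[column] = out[column].astype("string")
    elif target == "integer":
        # Unparseable values become missing; nullable Int64 can hold them.
        # Non-integral values such as 1.5 will raise on the final cast.
        out[column] = pd.to_numeric(out[column], errors="coerce").astype("Int64")
    elif target == "float":
        out[column] = pd.to_numeric(out[column], errors="coerce").astype("float64")
    elif target == "datetime":
        out[column] = pd.to_datetime(out[column], errors="coerce")
    else:
        raise ValueError(f"Unsupported target type: {target}")
    return out


def add_sum(df: pd.DataFrame, left: str, right: str, new_name: str) -> pd.DataFrame:
    """Add a column holding the sum of two existing columns."""
    out = df.copy()
    out[new_name] = out[left] + out[right]
    return out


def add_year(df: pd.DataFrame, date_column: str, new_name: str) -> pd.DataFrame:
    """Add a column holding the year extracted from a date column."""
    out = df.copy()
    out[new_name] = pd.to_datetime(out[date_column], errors="coerce").dt.year
    return out


def add_formula(df: pd.DataFrame, new_name: str, expression: str) -> pd.DataFrame:
    """Add a column computed from an expression over column names, e.g. 'qty * 2'."""
    out = df.copy()
    out[new_name] = out.eval(expression)
    return out
```

Each helper returns a new DataFrame rather than modifying its input, which makes it easier to preview a transformation before committing to it.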
5. Visualization
- Generate interactive charts using Plotly:
  - Histograms
  - Scatter plots
  - Box plots
  - Line charts
6. Export Options
- Export cleaned data in multiple formats:
  - CSV
  - Excel
  - JSON
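The three export formats can be sketched as a single helper that serializes a DataFrame to bytes. The name `export_bytes` and the choice of a records-oriented JSON layout are assumptions made for illustration.

```python
import io
import json

import pandas as pd


def export_bytes(df: pd.DataFrame, fmt: str) -> bytes:
    """Serialize a DataFrame to bytes as 'csv', 'excel', or 'json'."""
    if fmt == "csv":
        return df.to_csv(index=False).encode("utf-8")
    if fmt == "json":
        # One JSON object per row (assumed layout).
        return df.to_json(orient="records").encode("utf-8")
    if fmt == "excel":
        buffer = io.BytesIO()
        # Writing .xlsx relies on openpyxl, which is listed in requirements.txt.
        df.to_excel(buffer, index=False, engine="openpyxl")
        return buffer.getvalue()
    raise ValueError(f"Unsupported export format: {fmt}")


if __name__ == "__main__":
    demo = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
    print(json.loads(export_bytes(demo, "json")))
```

Bytes in this form can be handed to Streamlit's `st.download_button` through its `data` argument.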
🛠️ Installation
Prerequisites:
- Python >= 3.12
- pip (Python package manager)
Step-by-Step Guide:
- Clone the repository:
      git clone https://github.com/your-repo/data-sweeper-pro.git
      cd data-sweeper-pro
- Create a virtual environment (optional but recommended):
      python -m venv myenv
      source myenv/bin/activate   # On Linux/macOS
      myenv\Scripts\activate      # On Windows
- Install dependencies:
      pip install -r requirements.txt
- Run the app:
      streamlit run app.py
- Open the app in your browser at http://localhost:8501.
Directory Structure
data-sweeper-pro/
├── .streamlit/
│   └── config.toml          # Streamlit theme configuration
├── app.py                   # Main Streamlit application script
├── requirements.txt         # Python dependencies list
├── large_test_data.csv      # Example large dataset for testing (optional)
└── README.md                # Project documentation (this file)
Example Use Case
- Upload a dataset (large_test_data.csv) containing missing values, duplicates, and mixed data types.
- Generate a full profile report to explore the dataset.
- Clean the data by removing duplicates, handling missing values, and normalizing numerical columns.
- Apply transformations like converting column types or creating new features.
- Visualize trends and patterns using interactive charts.
- Export the cleaned dataset as a CSV or Excel file.
🧩 Dependencies
The following Python libraries are used in this project:
streamlit==1.29.0
pandas==2.1.3
numpy==1.26.4
plotly==5.18.0
ydata-profiling==4.12.2
scikit-learn==1.3.2
openpyxl==3.1.2
scipy==1.11.4
Install them using:
pip install -r requirements.txt
Contributing
Contributions are welcome! Please follow these steps:
- Fork this repository.
- Create a new branch (git checkout -b feature-name).
- Commit your changes (git commit -m "Add feature-name").
- Push to your branch (git push origin feature-name).
- Open a pull request.