
Pranshu936/Data_cleaning


🧹 Data Sweeper Pro+

Data Sweeper Pro+ is an advanced data cleaning and transformation platform built with Streamlit. It allows users to upload datasets, clean them, analyze them with interactive profiling reports, and export the cleaned data in multiple formats. The app is designed for both technical and non-technical users, offering a user-friendly interface with powerful data processing capabilities.


🚀 Features

1. File Upload

  • Supports multiple file uploads in CSV and Excel formats.
  • Handles large datasets efficiently.
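
A minimal sketch of how an uploaded file could be routed to the right pandas reader based on its extension. The helper name `load_table` and the sample data are hypothetical and not taken from app.py. In a Streamlit app, the objects returned by `st.file_uploader` are file-like and expose a `.name` attribute, so they could be passed in the same way as the in-memory buffer used here.

```python
import io
from pathlib import Path

import pandas as pd


def load_table(name: str, buffer) -> pd.DataFrame:
    """Read a CSV or Excel file-like object into a DataFrame (hypothetical helper)."""
    suffix = Path(name).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(buffer)
    if suffix in {".xlsx", ".xls"}:
        # Reading .xlsx files relies on openpyxl, which is listed in the dependencies.
        return pd.read_excel(buffer)
    raise ValueError(f"Unsupported file type: {suffix!r}")


# In-memory stand-in for an uploaded CSV file.
sample = io.StringIO("id,score\n1,10\n2,20\n")
df = load_table("sample.csv", sample)
print(df)
```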

2. Data Profiling

  • Generate an interactive Profile Report using ydata-profiling to explore:
    • Missing values
    • Duplicate rows
    • Data types
    • Statistical summaries
    • Correlations
  • Fully interactive HTML report embedded in the app.
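
The quantities listed above can also be computed by hand with plain pandas, which may help when interpreting the report. This sketch deliberately does not use ydata-profiling; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column and one duplicated row.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 32, 41],
    "income": [40000.0, 52000.0, 61000.0, 52000.0, np.nan],
    "city": ["Pune", "Delhi", "Delhi", "Delhi", None],
})

missing_per_column = df.isna().sum()           # missing values per column
duplicate_rows = int(df.duplicated().sum())    # rows identical to an earlier row
column_types = df.dtypes                       # inferred data types
summary = df.describe()                        # statistical summary of numeric columns
correlations = df.corr(numeric_only=True)      # pairwise correlations of numeric columns

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```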

3. Data Cleaning

  • Remove duplicate rows.
  • Handle missing values with strategies like:
    • Drop rows
    • Fill with mean/median
    • KNN imputation
  • Normalize numerical columns.
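
A rough sketch of the cleaning steps listed above, using pandas and scikit-learn's `KNNImputer`. The sample data, the choice of `n_neighbors=2`, and the use of min-max scaling for normalization are assumptions made for illustration; the app may use different settings or a different scaling method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data: one missing height and one duplicated row.
raw = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 180.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0, 80.0],
})

# 1. Remove exact duplicate rows.
deduped = raw.drop_duplicates().reset_index(drop=True)

# 2a. Fill missing values with each column's median...
median_filled = deduped.fillna(deduped.median(numeric_only=True))

# 2b. ...or estimate them from the most similar rows with KNN imputation.
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(imputer.fit_transform(deduped), columns=deduped.columns)

# 3. Min-max normalize numeric columns to the range [0, 1].
#    A constant column would divide by zero here and produce NaN.
col_min = median_filled.min()
col_max = median_filled.max()
normalized = (median_filled - col_min) / (col_max - col_min)

print(knn_filled)
print(normalized)
```

Dropping rows instead of filling them would be `deduped.dropna()`.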

4. Data Transformations

Column Operations:

  • Select specific columns to keep or reorder them.

Data Type Conversion:

  • Convert columns to desired data types: string, integer, float, or datetime.
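
As an illustration of the conversions above, the following sketch converts text columns to integer, float, and datetime types with pandas. The column names and values are hypothetical. Passing `errors="coerce"` is one possible policy for bad values: they become NaN or NaT instead of raising an error. Whether the app does the same is not stated here.

```python
import pandas as pd

# Made-up data where every column arrives as text.
df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "price": ["9.99", "n/a", "12.50"],
    "signup": ["2024-01-15", "2024-02-30", "2024-03-01"],
})

converted = df.copy()
converted["id"] = converted["id"].astype(int)                               # integer
converted["price"] = pd.to_numeric(converted["price"], errors="coerce")     # float, "n/a" -> NaN
converted["signup"] = pd.to_datetime(converted["signup"], errors="coerce")  # invalid date -> NaT
converted["id_label"] = converted["id"].astype(str)                         # back to string

print(converted.dtypes)
```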

Feature Engineering:

  • Add new columns based on existing ones (e.g., sum of two columns).
  • Extract date parts (e.g., year from a date column).
  • Apply custom formulas for advanced transformations.
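
The three operations above might look like this in pandas. The column names are invented, and `DataFrame.eval` is only one way to evaluate a formula string; how the app parses custom formulas is an assumption here.

```python
import pandas as pd

# Made-up order data.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-11-05", "2024-02-17"]),
    "price": [10.0, 4.0],
    "quantity": [3, 5],
    "shipping": [2.5, 1.0],
})

# New column as the sum of two existing columns.
df["price_plus_shipping"] = df["price"] + df["shipping"]

# Extract a date part from a datetime column.
df["order_year"] = df["order_date"].dt.year

# Evaluate a formula string; eval returns a new DataFrame with the added column.
df = df.eval("total = price * quantity + shipping")

print(df[["price_plus_shipping", "order_year", "total"]])
```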

5. Visualization

  • Generate interactive charts using Plotly:
    • Histograms
    • Scatter plots
    • Box plots
    • Line charts

6. Export Options

  • Export cleaned data in multiple formats:
    • CSV
    • Excel
    • JSON
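
A sketch of how a cleaned DataFrame could be serialized to each of these formats in memory, for example to feed a download button. The helper name `export_bytes` and the sample data are hypothetical. The Excel branch assumes openpyxl is installed, as listed in the dependencies, and is not exercised in the demo lines.

```python
import io
import json

import pandas as pd


def export_bytes(df: pd.DataFrame, fmt: str) -> bytes:
    """Serialize a DataFrame to CSV, JSON, or Excel bytes (hypothetical helper)."""
    if fmt == "csv":
        return df.to_csv(index=False).encode("utf-8")
    if fmt == "json":
        # One JSON object per row.
        return df.to_json(orient="records").encode("utf-8")
    if fmt == "excel":
        buffer = io.BytesIO()
        df.to_excel(buffer, index=False, engine="openpyxl")
        return buffer.getvalue()
    raise ValueError(f"Unsupported export format: {fmt!r}")


cleaned = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
csv_bytes = export_bytes(cleaned, "csv")
json_bytes = export_bytes(cleaned, "json")
print(csv_bytes.decode("utf-8"))
print(json.loads(json_bytes))
```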

🛠️ Installation

Prerequisites:

  • Python >= 3.12
  • pip (Python package manager)

Step-by-Step Guide:

  1. Clone the repository:

    git clone https://github.com/your-repo/data-sweeper-pro.git
    cd data-sweeper-pro
  2. Create a virtual environment (optional but recommended):

    python -m venv myenv
    source myenv/bin/activate    # On Linux/macOS
    myenv\Scripts\activate       # On Windows
  3. Install dependencies:

    pip install -r requirements.txt
  4. Run the app:

    streamlit run app.py
  5. Open the app in your browser at http://localhost:8501.


📂 Directory Structure

data-sweeper-pro/
├── .streamlit/
│   └── config.toml       # Streamlit theme configuration
├── app.py                # Main Streamlit application script
├── requirements.txt      # Python dependencies list
├── large_test_data.csv   # Example large dataset for testing (optional)
└── README.md             # Project documentation (this file)

📊 Example Use Case

  1. Upload a dataset (large_test_data.csv) containing missing values, duplicates, and mixed data types.
  2. Generate a full profile report to explore the dataset.
  3. Clean the data by removing duplicates, handling missing values, and normalizing numerical columns.
  4. Apply transformations like converting column types or creating new features.
  5. Visualize trends and patterns using interactive charts.
  6. Export the cleaned dataset as a CSV or Excel file.

🧩 Dependencies

The following Python libraries are used in this project:

streamlit==1.29.0
pandas==2.1.3
numpy==1.26.4
plotly==5.18.0
ydata-profiling==4.12.2
scikit-learn==1.3.2
openpyxl==3.1.2
scipy==1.11.4

Install them using:

pip install -r requirements.txt

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork this repository.
  2. Create a new branch (git checkout -b feature-name).
  3. Commit your changes (git commit -m "Add feature-name").
  4. Push to your branch (git push origin feature-name).
  5. Open a pull request.
