shoghli1999/management-of-scientific-data

Management of Scientific Data: reproducible EDA, profiling, and quality checks

🎓 Management of Scientific Data Project

University of Passau | Summer Semester 2024 | Team 8

Python Pandas Matplotlib ydata-profiling

📋 Overview

Practical, end-to-end data analysis on unfamiliar scientific datasets. We emphasize fast understanding and reproducibility: profiling the data, validating quality, visualizing relationships, and communicating results through shareable reports and clean scripts.

🎯 Key Features

  • Automated Profiling: one-command HTML data profile for rapid onboarding
  • Data Quality Checks: schema info, type coercion, cross-field consistency, anomaly surfacing
  • Exploratory Visualizations: quick plots to spot distributions and relationships
  • Reproducible Workflow: minimal dependencies, simple run commands, clear structure

👥 Team

Team Member
Shirin Shoghli
Sanaz Bayat
Maryam Gheibi
Mozhdeh Ramezani Dastjerdi
Shahrzad Torabi

๐Ÿ—๏ธ Project Structure

team-8-main/
โ”œโ”€ Datasets/                 # Source datasets (zipped)
โ”œโ”€ Implementations/          # Python scripts
โ”‚  โ”œโ”€ descriptive-statistics.py
โ”‚  โ”œโ”€ inconsistencies_and_plots.py
โ”‚  โ””โ”€ pandas-profiling.py
โ”œโ”€ Reports/                  # Generated reports (PDF)
โ”œโ”€ README.md                 # You are here
โ””โ”€ *.pdf / *.docx            # Submitted assignments

🚀 Quick Start

Prerequisites

  • Python 3.9+
  • pip
  • Libraries: pandas, matplotlib, ydata-profiling

Install Python packages:

pip install --upgrade pip
pip install pandas matplotlib ydata-profiling

Data

Datasets are provided under Datasets/ as ZIP archives. Extract them locally. The scripts currently use absolute paths (e.g., /home/.../24246_2_data.csv). Update the paths to match your machine before running, or export an environment variable and read it in your scripts.
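
The environment-variable option could look roughly like the sketch below. The variable name `MOSD_DATA_DIR` and the helper `resolve_csv` are hypothetical (the original scripts do not define them), and the fallback to `Datasets/` is only an assumption about where the archives were extracted.

```python
import os
from pathlib import Path

# Hypothetical variable name; the original scripts do not define it.
DATA_DIR_ENV = "MOSD_DATA_DIR"


def resolve_csv(filename: str, default_dir: str = "Datasets") -> Path:
    """Build a CSV path from MOSD_DATA_DIR, falling back to default_dir."""
    base = Path(os.environ.get(DATA_DIR_ENV, default_dir))
    return base / filename


# 24246_2_data.csv is the file name used in the example path above.
print(resolve_csv("24246_2_data.csv"))
```

With this in place, `export MOSD_DATA_DIR=/path/to/extracted/data` before running a script would replace the hardcoded absolute path.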

Running the project

  1. Descriptive statistics (quick CSV load and preview)
python Implementations/descriptive-statistics.py
  2. Data quality checks and plots
python Implementations/inconsistencies_and_plots.py
  • Expects columns like PlotID, PlotID2, Side, PredationMark. Adjust column names if your dataset differs.
  • A scatter plot window will open via matplotlib.
  3. Automated profiling report
python Implementations/pandas-profiling.py
  • Generates an HTML report using ydata-profiling. Update the output path in the script if needed.

📊 Datasets

Archive (in Datasets/)
24246_2_Dataset.zip
25807_3_Dataset.zip
31113_4_Dataset.zip

🔧 Implementation Details

1. Descriptive Statistics (Implementations/descriptive-statistics.py)

  • Loads CSV, prints preview (head) to validate extraction and encoding
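
A minimal sketch of this load-and-preview step, not the script itself: the function name `preview_csv` and the in-memory sample rows are made up for illustration, and UTF-8 is only an assumed default encoding.

```python
import io

import pandas as pd


def preview_csv(source, rows: int = 5, encoding: str = "utf-8") -> pd.DataFrame:
    """Load a CSV and return its first rows for a quick sanity check."""
    df = pd.read_csv(source, encoding=encoding)
    return df.head(rows)


# Tiny in-memory stand-in for an extracted dataset (made-up values).
sample = io.BytesIO("PlotID,Side\nA1,1\nA2,2\n".encode("utf-8"))
print(preview_csv(sample))
```

If the preview shows garbled characters or a single merged column, the encoding or delimiter passed to `pd.read_csv` is the first thing to check.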

2. Data Quality & Plots (Implementations/inconsistencies_and_plots.py)

  • df.info() for schema overview
  • Cross-field consistency checks: PlotID vs PlotID2
  • Numeric coercion with error handling for Side
  • Simple scatter plot to quickly visualize relationships
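
The checks listed above might be sketched as follows. This is an illustration under assumptions, not the contents of inconsistencies_and_plots.py: it assumes PlotID and PlotID2 are expected to match row by row, the helper names and the added flag columns are hypothetical, and the demo rows are invented.

```python
import pandas as pd


def check_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows where PlotID and PlotID2 disagree or Side is not numeric."""
    out = df.copy()
    # Cross-field consistency: assumes the two ID columns should be equal.
    out["id_mismatch"] = out["PlotID"].astype(str) != out["PlotID2"].astype(str)
    # Numeric coercion: unparseable values become NaN instead of raising.
    out["Side_num"] = pd.to_numeric(out["Side"], errors="coerce")
    out["side_invalid"] = out["Side_num"].isna() & out["Side"].notna()
    return out


def scatter(df: pd.DataFrame, x: str, y: str) -> None:
    """Open a scatter plot window for two numeric columns."""
    import matplotlib.pyplot as plt  # imported lazily so the checks run headless

    df.plot.scatter(x=x, y=y)
    plt.show()


# Made-up rows, only to exercise the checks.
demo = pd.DataFrame({
    "PlotID": ["P1", "P2", "P3"],
    "PlotID2": ["P1", "PX", "P3"],
    "Side": ["1", "two", "3"],
    "PredationMark": [0, 1, 1],
})
demo.info()  # schema overview
print(check_quality(demo)[["id_mismatch", "side_invalid"]])
```

Keeping the flags as extra columns, rather than dropping rows, leaves the anomalies visible for later inspection.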

3. Automated Profiling (Implementations/pandas-profiling.py)

  • Generates an HTML profile with ydata-profiling for distributions, missingness, correlations
  • Useful for onboarding and documentation of data assets
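
As a rough sketch of the profiling step, assuming the report is written next to the input CSV: `default_report_path` and `build_profile` are hypothetical helpers, while `ProfileReport(...).to_file(...)` is the ydata-profiling call for writing an HTML report.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def default_report_path(csv_path: Path) -> Path:
    """Place the HTML report next to the CSV: data.csv -> data_profile.html."""
    return csv_path.with_name(csv_path.stem + "_profile.html")


def build_profile(csv_path: Path, out_path: Optional[Path] = None) -> Path:
    """Write a ydata-profiling HTML report for one CSV file."""
    from ydata_profiling import ProfileReport  # heavy import, deferred

    df = pd.read_csv(csv_path)
    out_path = out_path or default_report_path(csv_path)
    ProfileReport(df, title=f"Profile of {csv_path.name}").to_file(out_path)
    return out_path
```

Calling `build_profile(Path("Datasets/24246_2_data.csv"))` would then write the report beside that file, provided the CSV has been extracted there.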

📈 Results and Deliverables

  • See Reports/ for submitted PDFs (e.g., Pandas_Profiling_Report.pdf, Report_Task_01.pdf, Report_Task_02.pdf).
  • Profiling HTML output can be regenerated locally via the profiling script.

🧪 Reproducibility Notes

  • Replace hardcoded absolute file paths in scripts with your local paths or environment variables for portability.
  • For long-term maintainability, we recommend parameterizing scripts (e.g., --csv <path> and --out <path>). If helpful, we can add a minimal CLI wrapper.
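
One possible shape for such a wrapper, using only the standard library. The `--csv` and `--out` flags come from the note above; the function name `parse_args` and the default output name `report.html` are assumptions for this sketch.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the --csv and --out options suggested above."""
    parser = argparse.ArgumentParser(description="Run a script on one CSV file.")
    parser.add_argument("--csv", type=Path, required=True, help="input CSV path")
    parser.add_argument("--out", type=Path, default=Path("report.html"),
                        help="output path (default: report.html)")
    return parser.parse_args(argv)


# Explicit argument list for demonstration; a script would call parse_args().
args = parse_args(["--csv", "data.csv", "--out", "profile.html"])
print(f"reading {args.csv}, writing {args.out}")
```

Each script could call `parse_args()` at startup and pass `args.csv` to `pd.read_csv` in place of the hardcoded path.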

📚 Course Context

  • Course: Management of Scientific Data (MOSD), University of Passau
  • Term: Summer Semester 2024

๐Ÿ› ๏ธ Technologies Used

  • Python 3.9+
  • Pandas โ€” Data manipulation
  • Matplotlib โ€” Visualization
  • ydata-profiling โ€” Automated profiling

๐Ÿ“ License

Academic coursework. No open-source license (e.g., MIT) has been added yet; until one is, all rights are reserved by the authors.