shoghli1999/management-of-scientific-data

🎓 Management of Scientific Data Project

University of Passau | Summer Semester 2024 | Team 8

Python Pandas Matplotlib ydata-profiling

📋 Overview

Practical, end-to-end data analysis on unfamiliar scientific datasets. We emphasize fast understanding and reproducibility: profiling the data, validating quality, visualizing relationships, and communicating results through shareable reports and clean scripts.

🎯 Key Features

Automated Profiling: one-command HTML data profile for rapid onboarding
Data Quality Checks: schema info, type coercion, cross-field consistency, anomaly surfacing
Exploratory Visualizations: quick plots to spot distributions and relationships
Reproducible Workflow: minimal dependencies, simple run commands, clear structure

👥 Team

Team Member
Shirin Shoghli
Sanaz Bayat
Maryam Gheibi
Mozhdeh Ramezani Dastjerdi
Shahrzad Torabi

🏗️ Project Structure

team-8-main/
├─ Datasets/                 # Source datasets (zipped)
├─ Implementations/          # Python scripts
│  ├─ descriptive-statistics.py
│  ├─ inconsistencies_and_plots.py
│  └─ pandas-profiling.py
├─ Reports/                  # Generated reports (PDF)
├─ README.md                 # You are here
└─ *.pdf / *.docx            # Submitted assignments

🚀 Quick Start

Prerequisites

Python 3.9+
pip
Libraries: pandas, matplotlib, ydata-profiling

Install Python packages:

pip install --upgrade pip
pip install pandas matplotlib ydata-profiling

Data

Datasets are provided under Datasets/ as ZIP archives. Extract them locally. The scripts currently use absolute paths (e.g., /home/.../24246_2_data.csv). Update the paths to match your machine before running, or export an environment variable and read it in your scripts.
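One way to avoid editing absolute paths in each script is to read the CSV location from an environment variable. The sketch below is only an illustration: the variable name `MSD_DATA_CSV` and the fallback path `Datasets/data.csv` are hypothetical and not used by the current scripts.

```python
import os
from pathlib import Path


def resolve_csv_path(env_var: str = "MSD_DATA_CSV",
                     default: str = "Datasets/data.csv") -> Path:
    """Return the CSV path from the environment, or a default if unset.

    Both the variable name and the default path are hypothetical
    placeholders; pick whatever suits your setup.
    """
    raw = os.environ.get(env_var, default)
    # expanduser() lets values such as "~/data/file.csv" work as expected.
    return Path(raw).expanduser()
```

With this in place, something like `export MSD_DATA_CSV=/path/to/extracted.csv` before running a script would replace the hardcoded path.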

Running the project

Descriptive statistics (quick CSV load and preview)

python Implementations/descriptive-statistics.py

Data quality checks and plots

python Implementations/inconsistencies_and_plots.py

Expects columns like PlotID, PlotID2, Side, PredationMark. Adjust column names if your dataset differs.
A scatter plot window will open via matplotlib.

Automated profiling report

python Implementations/pandas-profiling.py

Generates an HTML report using ydata-profiling. Update the output path in the script if needed.

📊 Datasets

Archive (in `Datasets/`)
`24246_2_Dataset.zip`
`25807_3_Dataset.zip`
`31113_4_Dataset.zip`

🔧 Implementation Details

1. Descriptive Statistics (`Implementations/descriptive-statistics.py`)

Loads CSV, prints preview (head) to validate extraction and encoding
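A minimal sketch of this load-and-preview step, assuming a comma-separated file in UTF-8. The function name `preview_csv` and its parameters are illustrative and may not match the actual script.

```python
import pandas as pd


def preview_csv(source, n_rows: int = 5, encoding: str = "utf-8") -> pd.DataFrame:
    """Load a CSV and print its size and first rows as a sanity check."""
    # A wrong encoding or a broken extraction usually shows up here,
    # either as a decode error or as garbled column names.
    df = pd.read_csv(source, encoding=encoding)
    print(f"{len(df)} rows, {df.shape[1]} columns")
    print(df.head(n_rows))
    return df
```

`source` can be a file path or any file-like object that `pandas.read_csv` accepts.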

2. Data Quality & Plots (`Implementations/inconsistencies_and_plots.py`)

df.info() for schema overview
Cross-field consistency checks: PlotID vs PlotID2
Numeric coercion with error handling for Side
Simple scatter plot to quickly visualize relationships
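The checks listed above could look roughly like the following. This is a sketch, not the script itself: it assumes that `PlotID` and `PlotID2` are expected to match on every row, and the helper name `check_quality` and the added flag columns are hypothetical.

```python
import pandas as pd


def check_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with flag columns for the two checks."""
    out = df.copy()

    # Cross-field consistency (assumption: the two ID columns should agree).
    # Compare as stripped strings so stray whitespace is not reported.
    left = out["PlotID"].astype(str).str.strip()
    right = out["PlotID2"].astype(str).str.strip()
    out["id_mismatch"] = left != right

    # Numeric coercion: values that cannot be parsed become NaN
    # instead of raising an exception.
    out["Side_num"] = pd.to_numeric(out["Side"], errors="coerce")

    # A value is invalid if it was present but could not be parsed.
    out["side_invalid"] = out["Side_num"].isna() & out["Side"].notna()
    return out


# Small in-memory example with one mismatched ID and one bad Side value.
sample = pd.DataFrame({
    "PlotID": ["A1", "A2", "A3"],
    "PlotID2": ["A1", "B2", "A3"],
    "Side": ["1", "x", "2"],
})
flagged = check_quality(sample)
print(flagged[flagged["id_mismatch"] | flagged["side_invalid"]])
```

Keeping the flags as columns, rather than dropping rows, leaves the decision about how to handle each anomaly to a later step.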

3. Automated Profiling (`Implementations/pandas-profiling.py`)

Generates an HTML profile with ydata-profiling for distributions, missingness, correlations
Useful for onboarding and documentation of data assets
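For orientation, generating such a report with ydata-profiling typically takes only a few lines. The sketch below assumes a DataFrame that is already loaded; the function name `write_profile`, the default output file name, and the default title are hypothetical.

```python
import pandas as pd


def write_profile(df: pd.DataFrame,
                  out_path: str = "profile_report.html",
                  title: str = "Dataset profile") -> str:
    """Write an HTML profile of df to out_path and return the path."""
    # Imported inside the function because ydata-profiling is a heavy
    # import and is only needed when a report is actually generated.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title=title)
    report.to_file(out_path)
    return out_path
```

On large datasets the full report can be slow; `ProfileReport` also accepts `minimal=True` to skip the more expensive computations.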

📈 Results and Deliverables

See Reports/ for submitted PDFs (e.g., Pandas_Profiling_Report.pdf, Report_Task_01.pdf, Report_Task_02.pdf).
Profiling HTML output can be regenerated locally via the profiling script.

🧪 Reproducibility Notes

Replace hardcoded absolute file paths in the scripts with your local paths or environment variables for portability.
For sharing or long-term maintainability, parameterize the scripts with command-line arguments such as --csv <path> and --out <path>.
The profiling output (HTML) is generated by the pandas-profiling.py script and can be shared for onboarding or QA.
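A minimal sketch of the suggested `--csv` / `--out` parameterization using the standard library's `argparse`. The scripts do not currently include this; the default output name `report.html` and the help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two options suggested above."""
    parser = argparse.ArgumentParser(
        description="Run an analysis script on a CSV file."
    )
    parser.add_argument("--csv", required=True,
                        help="Path to the input CSV file")
    # The default output name is a placeholder, not a project convention.
    parser.add_argument("--out", default="report.html",
                        help="Path for the generated output")
    return parser


# Parsing an explicit list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--csv", "data.csv"])
print(args.csv, args.out)
```

In a real script, calling `build_parser().parse_args()` with no arguments reads the options from the command line instead.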
