shoghli1999/management-of-scientific-data
Management of Scientific Data: reproducible EDA, profiling, and quality checks
Management of Scientific Data Project
University of Passau | Summer Semester 2024 | Team 8
Python Pandas Matplotlib ydata-profiling
Overview
Practical, end-to-end data analysis on unfamiliar scientific datasets. We emphasize fast understanding and reproducibility: profiling the data, validating quality, visualizing relationships, and communicating results through shareable reports and clean scripts.
Key Features
- Automated Profiling: one-command HTML data profile for rapid onboarding
- Data Quality Checks: schema info, type coercion, cross-field consistency, anomaly surfacing
- Exploratory Visualizations: quick plots to spot distributions and relationships
- Reproducible Workflow: minimal dependencies, simple run commands, clear structure
Team
| Team Member |
|---|
| Shirin Shoghli |
| Sanaz Bayat |
| Maryam Gheibi |
| Mozhdeh Ramezani Dastjerdi |
| Shahrzad Torabi |
Project Structure
team-8-main/
├─ Datasets/                         # Source datasets (zipped)
├─ Implementations/                  # Python scripts
│  ├─ descriptive-statistics.py
│  ├─ inconsistencies_and_plots.py
│  └─ pandas-profiling.py
├─ Reports/                          # Generated reports (PDF)
├─ README.md                         # You are here
└─ *.pdf / *.docx                    # Submitted assignments
Quick Start
Prerequisites
- Python 3.9+
- pip
- Libraries: pandas, matplotlib, ydata-profiling
Install Python packages:
pip install --upgrade pip
pip install pandas matplotlib ydata-profiling
Data
Datasets are provided under Datasets/ as ZIP archives. Extract them locally. The scripts currently use absolute paths (e.g., /home/.../24246_2_data.csv). Update the paths to match your machine before running, or export an environment variable and read it in your scripts.
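One way to avoid editing absolute paths is to resolve the data directory from an environment variable. The sketch below is only an illustration: the variable name `MOSD_DATA_DIR`, the helper `resolve_csv`, and the fallback to `Datasets/` are assumptions and do not exist in the repository's scripts.

```python
import os
from pathlib import Path

# Hypothetical variable name chosen for this sketch; the scripts do not define it.
ENV_VAR = "MOSD_DATA_DIR"


def resolve_csv(filename: str, default_dir: str = "Datasets") -> Path:
    """Return the path to `filename` inside the configured data directory.

    Uses the directory named by MOSD_DATA_DIR when it is set,
    otherwise falls back to `default_dir` (an assumed default).
    """
    data_dir = Path(os.environ.get(ENV_VAR, default_dir)).expanduser()
    return data_dir / filename
```

With this in place, a script could call `resolve_csv("24246_2_data.csv")` after the user runs `export MOSD_DATA_DIR=<folder with the extracted CSV files>`.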
Running the project
- Descriptive statistics (quick CSV load and preview)
python Implementations/descriptive-statistics.py
- Data quality checks and plots
python Implementations/inconsistencies_and_plots.py
  - Expects columns like PlotID, PlotID2, Side, PredationMark. Adjust column names if your dataset differs.
  - A scatter plot window will open via matplotlib.
- Automated profiling report
python Implementations/pandas-profiling.py
  - Generates an HTML report using ydata-profiling. Update the output path in the script if needed.
Datasets
| Archive (in Datasets/) |
|---|
| 24246_2_Dataset.zip |
| 25807_3_Dataset.zip |
| 31113_4_Dataset.zip |
Implementation Details
1. Descriptive Statistics (Implementations/descriptive-statistics.py)
- Loads CSV, prints preview (head) to validate extraction and encoding
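The load-and-preview step could look roughly like the sketch below. The helper name `preview_csv`, its parameters, and the UTF-8 default are illustrative assumptions, not the contents of the actual script.

```python
import pandas as pd


def preview_csv(source, n: int = 5, encoding: str = "utf-8"):
    """Load a CSV and return the full frame plus its first `n` rows.

    `source` may be a file path or a file-like object. A garbled preview or a
    UnicodeDecodeError at this point usually means the encoding is wrong.
    """
    df = pd.read_csv(source, encoding=encoding)
    return df, df.head(n)
```

Printing the returned preview together with `df.shape` is a quick way to confirm that the archive was extracted completely and that the columns were parsed as expected.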
2. Data Quality & Plots (Implementations/inconsistencies_and_plots.py)
- df.info() for schema overview
- Cross-field consistency checks: PlotID vs PlotID2
- Numeric coercion with error handling for Side
- Simple scatter plot to quickly visualize relationships
3. Automated Profiling (Implementations/pandas-profiling.py)
- Generates an HTML profile with ydata-profiling for distributions, missingness, correlations
- Useful for onboarding and documentation of data assets
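A minimal sketch of generating the HTML profile with ydata-profiling's `ProfileReport` is shown below. The helper names `default_report_path` and `write_profile`, the report title, and the output naming scheme are assumptions made for illustration; the real script may be organized differently.

```python
from pathlib import Path


def default_report_path(csv_path: str) -> Path:
    """Place the HTML report next to the CSV, e.g. data.csv -> data_profile.html."""
    csv = Path(csv_path)
    return csv.with_name(csv.stem + "_profile.html")


def write_profile(csv_path: str, out_path=None, title: str = "Dataset profile") -> Path:
    """Read a CSV and write a ydata-profiling HTML report for it."""
    # Imported here so the module still loads when the libraries are missing.
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.read_csv(csv_path)
    out = Path(out_path) if out_path else default_report_path(csv_path)
    ProfileReport(df, title=title).to_file(out)
    return out
```

On wide or large datasets the full report can take a while to build; passing `minimal=True` to `ProfileReport` is one option for a faster first pass.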
Results and Deliverables
- See Reports/ for submitted PDFs (e.g., Pandas_Profiling_Report.pdf, Report_Task_01.pdf, Report_Task_02.pdf).
- Profiling HTML output can be regenerated locally via the profiling script.
Reproducibility Notes
- Replace hardcoded absolute file paths with your local paths or environment variables.
- For production or sharing, parameterize scripts with --csv <path> and --out <path>.
- See Reports/ for PDF summaries (e.g., Pandas_Profiling_Report.pdf, Report_Task_01.pdf, Report_Task_02.pdf).
- Profiling output (HTML) is generated by the pandas-profiling.py script and can be shared for onboarding or QA.
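The suggested `--csv` and `--out` parameters could be added with the standard library's `argparse`. The sketch below is one possible shape, not existing project code: the function names and the default output file name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with the two options suggested in the notes above."""
    parser = argparse.ArgumentParser(
        description="Run an analysis script on a CSV file."
    )
    parser.add_argument("--csv", required=True, help="Path to the input CSV file")
    parser.add_argument(
        "--out",
        default="profile_report.html",  # assumed default, adjust per script
        help="Path for the generated output",
    )
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse `argv`, or the process arguments when `argv` is None."""
    return build_parser().parse_args(argv)
```

A script would then read `args = parse_args()` and use `args.csv` and `args.out` in place of the hardcoded paths.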
Course Context
- Course: Management of Scientific Data (MOSD), University of Passau
- Term: Summer Semester 2024
Technologies Used
- Python 3.9+
- Pandas: Data manipulation
- Matplotlib: Visualization
- ydata-profiling: Automated profiling
License
Academic coursework. If you plan to open-source, add a license (e.g., MIT). Until then, all rights reserved by the authors.