lucascarpantonio/Data_wrangling_with_python
This project is part of the Udacity Data Analysis Nanodegree. It focuses on gathering, cleaning, and analyzing two related datasets: weather conditions in Rome for the year 2024, and air quality measurements (PM2.5 and NO₂) in Rome for the same period. The aim is to demonstrate skills in data wrangling, API usage, and exploratory data analysis.
Data Wrangling with Python: Rome Weather & Air Quality
This repository contains the final project for the Data Wrangling with Python course.
The goal is to gather, assess, clean, and analyze two real-world datasets in order to answer a research question about the relationship between weather conditions and air quality in Rome.
Project Structure
├── data/
│   ├── raw/                              # Raw datasets before cleaning
│   │   ├── air_quality_rome_2022.csv
│   │   ├── air_quality_rome_2023.csv
│   │   └── weather_rome_2022_2024.csv
│   ├── cleaned/                          # Cleaned datasets after wrangling
│   └── merged/                           # Final dataset (merged and tidy)
├── modules/                              # Custom Python modules for data gathering
│   ├── openaq_loader.py                  # Fetch and save daily air quality data from OpenAQ API
│   └── openmeteo_weather.py              # Fetch and process daily weather data from Open-Meteo API
├── data_wrangling_project_filled.ipynb   # Main notebook with all steps
├── requirements.txt                      # Python dependencies
└── README.md                             # Project documentation
Datasets
Dataset 1: Weather Data
- Source: Open-Meteo API
- Method: Programmatic API request using a custom Python module (openmeteo_weather.py).
- Variables:
  - date: date of observation
  - temp_c: daily average temperature (°C)
  - rhum_pct: average relative humidity (%)
  - wind_speed_ms: average wind speed (m/s)
  - precip_mm: daily precipitation (mm)
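As a rough illustration of the JSON to DataFrame step, the sketch below converts a daily payload into the columns listed above. It assumes the response holds a `daily` object of parallel lists keyed by field name. The API field names in `FIELD_MAP`, the helper `daily_json_to_frame`, and the sample payload are assumptions made up for this example, not taken from openmeteo_weather.py, and no unit conversion is attempted.

```python
import pandas as pd

# Assumed mapping from API field names to this README's column names.
# The API field names are illustrative only, not taken from openmeteo_weather.py.
FIELD_MAP = {
    "time": "date",
    "temperature_2m_mean": "temp_c",
    "relative_humidity_2m_mean": "rhum_pct",
    "wind_speed_10m_mean": "wind_speed_ms",
    "precipitation_sum": "precip_mm",
}


def daily_json_to_frame(payload: dict) -> pd.DataFrame:
    """Turn a {'daily': {field: [values, ...]}} payload into a DataFrame."""
    daily = payload["daily"]
    missing = [key for key in FIELD_MAP if key not in daily]
    if missing:
        raise KeyError(f"payload is missing daily fields: {missing}")
    # Build one column per mapped field, renamed to the README's names.
    df = pd.DataFrame({new: daily[old] for old, new in FIELD_MAP.items()})
    df["date"] = pd.to_datetime(df["date"])
    return df


# Made-up sample payload, only to show the assumed shape.
sample_payload = {
    "daily": {
        "time": ["2024-01-01", "2024-01-02"],
        "temperature_2m_mean": [9.5, 10.1],
        "relative_humidity_2m_mean": [78, 81],
        "wind_speed_10m_mean": [2.4, 3.1],
        "precipitation_sum": [0.0, 4.2],
    }
}
weather_df = daily_json_to_frame(sample_payload)
print(weather_df.dtypes)
```

Keeping the conversion separate from the HTTP request makes it easy to check the column mapping without network access.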
Dataset 2: Air Quality Data
- Source: OpenAQ API
- Method: Programmatic API request, aggregated daily, and saved as CSV (openaq_loader.py).
- Variables:
  - date: date of observation
  - city: observation city (Rome)
  - pm25: daily average PM2.5 concentration (µg/m³)
  - no2: daily average NO₂ concentration (µg/m³)
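The daily aggregation could be sketched as below. This is a minimal example, assuming the raw readings arrive as one row per measurement with `datetime`, `parameter`, and `value` columns. Those input column names, the helper `aggregate_daily`, and the sample readings are hypothetical; the actual logic in openaq_loader.py may differ.

```python
import pandas as pd


def aggregate_daily(measurements: pd.DataFrame, city: str = "Rome") -> pd.DataFrame:
    """Average raw readings per day, with one column per pollutant."""
    df = measurements.copy()
    # Truncate timestamps to midnight so all readings of a day share one key.
    df["date"] = pd.to_datetime(df["datetime"]).dt.normalize()
    daily = df.pivot_table(
        index="date", columns="parameter", values="value", aggfunc="mean"
    ).reset_index()
    daily.columns.name = None  # drop the leftover 'parameter' axis label
    daily.insert(1, "city", city)
    return daily


# Made-up readings, only to illustrate the reshaping.
raw = pd.DataFrame({
    "datetime": ["2023-03-01 08:00", "2023-03-01 20:00",
                 "2023-03-01 08:00", "2023-03-02 08:00"],
    "parameter": ["pm25", "pm25", "no2", "no2"],
    "value": [10.0, 14.0, 30.0, 26.0],
})
daily_aq = aggregate_daily(raw)
print(daily_aq)
```

Days with no reading for a pollutant end up as NaN in that column, which is then visible during the assessment step.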
Project Steps
- Gathering
  - Weather data via Open-Meteo API (JSON → DataFrame).
  - Air quality data via OpenAQ API, saved as CSV locally.
- Assessing
  - Checked for missing values, duplicates, and outliers.
  - Identified data quality issues (e.g., NaN values, extreme outliers in precipitation).
  - Identified tidiness issues (pollutants stored as separate columns instead of a variable).
- Cleaning
  - Converted date columns to proper datetime format.
  - Removed duplicates and handled missing values.
  - Reshaped the air quality dataset into tidy format with pd.melt().
- Storing
  - Saved cleaned datasets to /data/cleaned/.
  - Produced a merged tidy dataset in /data/merged/.
- Analysis & Visualization
  - Explored relationships between weather variables and pollutant concentrations.
  - Created visualizations (boxplots, scatterplots, line charts).
  - Justified outliers (e.g., heavy rainfall days are meteorologically plausible).
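The cleaning, reshaping, and merging steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the helpers `clean_dates` and `tidy_air_quality`, the output column names `pollutant` and `concentration`, and the sample rows are hypothetical. Dropping rows with missing values is only one possible way to handle them.

```python
import pandas as pd


def clean_dates(df: pd.DataFrame) -> pd.DataFrame:
    """Parse dates, then drop rows without a valid date and exact duplicates."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"], errors="coerce")
    return out.dropna(subset=["date"]).drop_duplicates().reset_index(drop=True)


def tidy_air_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Melt the pm25/no2 columns into one pollutant/concentration pair per row."""
    long = pd.melt(
        df,
        id_vars=["date", "city"],
        value_vars=["pm25", "no2"],
        var_name="pollutant",          # hypothetical column name
        value_name="concentration",    # hypothetical column name
    )
    # Assumption: missing concentrations are dropped rather than imputed.
    return long.dropna(subset=["concentration"]).reset_index(drop=True)


# Made-up rows, including one duplicate and one missing value.
air = pd.DataFrame({
    "date": ["2023-03-01", "2023-03-01", "2023-03-02"],
    "city": ["Rome", "Rome", "Rome"],
    "pm25": [12.0, 12.0, None],
    "no2": [30.0, 30.0, 26.0],
})
weather = pd.DataFrame({
    "date": ["2023-03-01", "2023-03-02"],
    "temp_c": [11.2, 12.0],
    "wind_speed_ms": [2.4, 3.1],
})

air_tidy = tidy_air_quality(clean_dates(air))
# Each pollutant row picks up the weather values for the same date.
merged = air_tidy.merge(clean_dates(weather), on="date", how="inner")
print(merged)
```

In the long format every row is a single observation of one pollutant on one day, so the weather columns repeat for each pollutant after the merge. A result like `merged` could then be written out with `merged.to_csv(path, index=False)`.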
Research Question
How do weather conditions (temperature, humidity, wind speed, precipitation) affect air quality (PM2.5 and NO₂) in Rome between 2022 and 2024?
Results
- Higher wind speeds are generally associated with lower pollutant concentrations, as wind disperses particles.
- Heavy rainfall events correspond to outliers in precipitation but are meteorologically valid and often associated with cleaner air afterwards.
- NO₂ concentrations are higher in colder months, consistent with traffic and heating emissions.
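One way to check findings of this kind is a simple correlation plus an outlier flag, sketched below. The 1.5 × IQR rule (the usual boxplot whisker convention) and the Pearson coefficient are assumptions about the method, since this README does not state how outliers or associations were measured. The helper `iqr_outliers` is hypothetical and the numbers are made up for illustration; they are not project results.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up daily values, NOT taken from the project's data.
demo = pd.DataFrame({
    "wind_speed_ms": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "pm25": [22.0, 19.0, 17.0, 14.0, 12.0, 9.0],
    "precip_mm": [0.0, 0.0, 1.0, 0.5, 0.0, 45.0],
})

# Series.corr computes the Pearson coefficient by default.
wind_pm25_r = demo["wind_speed_ms"].corr(demo["pm25"])
rain_flags = iqr_outliers(demo["precip_mm"])

print(f"Pearson r (wind vs PM2.5): {wind_pm25_r:.2f}")
print(demo.loc[rain_flags, "precip_mm"])
```

A flagged value is a candidate for review rather than an error: a heavy rainfall day can exceed the upper fence and still be a valid measurement.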
Next Steps
If given more time:
- Extend analysis to multiple cities (e.g., Milan, Naples) to compare patterns.
- Integrate additional pollutants (O₃, CO).
- Apply regression or machine learning models to quantify relationships.
Requirements
Install dependencies with:
    pip install -r requirements.txt
Main packages:
- pandas
- numpy
- matplotlib
- requests
Author
Luca Scarpantonio
Data Wrangling with Python: Final Project