benitomartin/de-ch-weather
Swiss Air Quality Data Engineering Pipeline
Swiss Air Quality Index
Below, you can find the project description to understand the content, along with setup instructions. The data used has been extracted from the Google Air Quality API.
Feel free to ⭐ and clone this repo 😉
Tech Stack
- Data Analysis & Exploration: SQL/Python
- Cloud: Google Cloud Platform
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Infrastructure as Code (IaC): Terraform
- Data ingestion (Batch/Workflow Orchestration): Mage
- Data Transformation: dbt
- Data Visualization: Looker Studio
- CI/CD: dbt
Project Structure
The project has been structured with the following folders and files:
- mage: Workflow orchestration pipeline
- dbt: Data transformation and CI/CD pipeline
- looker: Report from Looker Studio
- terraform: IaC stream-based pipeline infrastructure in GCP using Terraform
- requirements.txt: Project requirements
- images: Printouts of results
Project Description
The dataset was obtained from the Google Air Quality API and contains various columns with air quality data for a selected list of Swiss cities. Some preprocessing steps were conducted to prepare the data; the following actions were performed using Mage to get a clean dataset. This Medium article was used as a reference to understand the API and extract the data. A big shout-out to Robert Martin-Short for the tutorial.
- Extract the relevant pollutants and air quality index (AQI) from the API
- Create the columns with the selected cities, latitude, and longitude
- Remove rows with NaN values
- Remove duplicates
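As a rough sketch, the cleaning steps above could look like this in pandas. The column names and sample values here are hypothetical illustrations, not the API's actual schema:

```python
import pandas as pd

# Hypothetical sample of extracted API rows for a few Swiss cities.
raw = pd.DataFrame({
    "city": ["Zurich", "Zurich", "Bern", "Geneva"],
    "latitude": [47.3769, 47.3769, 46.9480, None],
    "longitude": [8.5417, 8.5417, 7.4474, 6.1432],
    "aqi": [42, 42, 55, 38],
    "pm2_5": [10.1, 10.1, 12.4, 9.8],
})

clean = (
    raw.dropna()           # remove rows with NaN values
       .drop_duplicates()  # remove duplicate rows
       .reset_index(drop=True)
)
print(clean)
```

Here the duplicated Zurich row and the Geneva row with a missing latitude are dropped, leaving one clean row per city.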
Afterward, the final clean data is ingested into a GCP bucket and BigQuery. Finally, transformations are performed using dbt (see the dbt folder) to get the production-ready data for dashboarding in Looker Studio.
Mage Data Ingestion
The following picture shows the two pipelines used to send the data to the Google Cloud bucket. The data can be sent either directly to the bucket or to a partitioned folder inside the bucket with a year/month/day structure. The latter approach is taken so that the file can be updated on a daily basis while the data from previous days is kept. Finally, the data is sent from the bucket to BigQuery.
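The year/month/day layout can be illustrated with a small helper that builds the object key for a given day. The prefix and filename below are illustrative, not the pipeline's actual names:

```python
from datetime import date

def partitioned_key(prefix: str, day: date, filename: str) -> str:
    # Build a year/month/day partitioned object path, so each daily run
    # writes to a new location while previous days' files are preserved.
    return f"{prefix}/{day.year}/{day.month:02d}/{day.day:02d}/{filename}"

print(partitioned_key("air_quality", date(2024, 3, 5), "data.parquet"))
# → air_quality/2024/03/05/data.parquet
```

Writing each daily export under its own dated prefix means a rerun only overwrites that day's file, never the history.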
dbt Data Transformation
Once the data is in BigQuery, a complete transformation step is performed using dbt to produce the final clean dataset, again in BigQuery. Four datasets are generated with dbt: two staging and two production, each pair containing the air quality data for all cities and for a single city (which can be changed in dbt accordingly). The dataset called prod_all_cities is the one used for the dashboard visualization in Looker Studio. For the deployment on GitHub, a CI/CD check was run in dbt with the command `dbt build --select state:modified+`.
Visualization
CI/CD
Finally, to streamline the development process, a fully automated Build and CI/CD pipeline was created using dbt:
dbt Build
dbt CI/CD
Project Set Up
The Python version used for this project is Python 3.10.
1. Clone the repo (or download it as a zip):

       git clone https://github.com/benitomartin/de-ch-weather.git

2. Create the virtual environment named main-env using Conda with Python 3.10:

       conda create -n main-env python=3.10
       conda activate main-env

3. Install the project dependencies from requirements.txt:

       pip install -r requirements.txt

4. Install Terraform:

       conda install -c conda-forge terraform
Each project folder contains a README.md file with instructions about how to run the code. I highly recommend creating a virtual environment for each one. Additionally, please note that a GCP Account, credentials, and proper IAM roles are necessary for the scripts to function correctly. The following IAM Roles have been used for this project:
- BigQuery Data Editor
- BigQuery Job User
- BigQuery User
- BigQuery Admin
- Storage Admin
- Compute Admin
To run the project, first go to the terraform folder to create the infrastructure, then to mage to extract the data and store it in the GCS bucket and BigQuery, and finally to dbt to transform the data.