benitomartin/de-ch-weather
Swiss Air Quality Data Engineering Pipeline
Swiss Air Quality Index
Below, you can find the project description to understand the content, along with setup instructions. The data used has been extracted from the Google Air Quality API.
Feel free to ⭐ and clone this repo 😉
Tech Stack
- Data Analysis & Exploration: SQL/Python
- Cloud: Google Cloud Platform
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Infrastructure as Code (IaC): Terraform
- Data ingestion (Batch/Workflow Orchestration): Mage
- Data Transformation: dbt
- Data Visualization: Looker Studio
- CI/CD: dbt
Project Structure
The project has been structured with the following folders and files:
- mage: Workflow orchestration pipeline
- dbt: Data transformation and CI/CD pipeline
- looker: Report from Looker Studio
- terraform: IaC stream-based pipeline infrastructure in GCP using Terraform
- requirements.txt: Project requirements
- images: Printouts of results
Project Description
The dataset was obtained from the Google Air Quality API and contains various columns with air quality data for a selected list of Swiss cities. Some preprocessing steps were conducted to prepare the data; the following actions were performed using Mage to get a clean dataset. This Medium article was used as a reference to understand the API and extract the data. A big shout-out to Robert Martin-Short for the tutorial.
- Extract the relevant pollutants and air quality index (AQI) from the API
- Create the columns with the selected cities, latitude, and longitude
- Remove rows with NaN values
- Remove duplicates
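As a rough sketch, the cleaning steps above could look like this in pandas. The column names and sample values here are hypothetical illustrations, not the API's actual schema:

```python
import pandas as pd

# Hypothetical sample of extracted API rows for a few Swiss cities.
raw = pd.DataFrame({
    "city": ["Zurich", "Zurich", "Bern", "Geneva"],
    "latitude": [47.3769, 47.3769, 46.9480, None],
    "longitude": [8.5417, 8.5417, 7.4474, 6.1432],
    "aqi": [42, 42, 55, 38],
    "pm2_5": [10.1, 10.1, 12.4, 9.8],
})

clean = (
    raw.dropna()           # remove rows with NaN values
       .drop_duplicates()  # remove duplicate rows
       .reset_index(drop=True)
)
print(clean)
```

Here the duplicated Zurich row and the Geneva row with a missing latitude are dropped, leaving one clean row per city.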
Afterward, the final clean data is ingested into a GCP bucket and BigQuery. Finally, transformations are performed using dbt (see the dbt folder) to get the production-ready data for dashboarding in Looker Studio.
Mage Data Ingestion
The following picture shows the two pipelines used to send the data to the Google Cloud bucket. The data can be sent either directly to the bucket or to a partitioned folder inside the bucket with a year/month/day structure. The latter approach is taken so that the file can be updated on a daily basis while the data from previous days is kept. Finally, the data is sent from the bucket to BigQuery.
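The year/month/day layout can be illustrated with a small helper that builds the object key for a given day. The prefix and filename below are illustrative, not the pipeline's actual names:

```python
from datetime import date

def partitioned_key(prefix: str, day: date, filename: str) -> str:
    # Build a year/month/day partitioned object path, so each daily run
    # writes to a new location while previous days' files are preserved.
    return f"{prefix}/{day.year}/{day.month:02d}/{day.day:02d}/{filename}"

print(partitioned_key("air_quality", date(2024, 3, 5), "data.parquet"))
# → air_quality/2024/03/05/data.parquet
```

Writing each daily export under its own dated prefix means a rerun only overwrites that day's file, never the history.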
dbt Data Transformation
Once the data is in BigQuery, a complete transformation step is performed using dbt to produce the final clean dataset, again in BigQuery. Four datasets are generated with dbt: two staging and two production, each pair containing the air quality data for all cities and for a single city (which can be changed in dbt accordingly). The dataset called prod_all_cities is the one used for the dashboard visualization in Looker Studio. For the deployment on GitHub, a CI/CD check was run in dbt with the command `dbt build --select state:modified+`.
Visualization
CI/CD
Finally, to streamline the development process, a fully automated Build and CI/CD pipeline was created using dbt:
dbt Build
dbt CI/CD
Project Set Up
The Python version used for this project is Python 3.10.
1. Clone the repo (or download it as a zip):

       git clone https://github.com/benitomartin/de-ch-weather.git

2. Create the virtual environment named main-env using Conda with Python 3.10:

       conda create -n main-env python=3.10
       conda activate main-env

3. Install the project dependencies from requirements.txt:

       pip install -r requirements.txt

4. Install Terraform:

       conda install -c conda-forge terraform
Each project folder contains a README.md file with instructions about how to run the code. I highly recommend creating a virtual environment for each one. Additionally, please note that a GCP Account, credentials, and proper IAM roles are necessary for the scripts to function correctly. The following IAM Roles have been used for this project:
- BigQuery Data Editor
- BigQuery Job User
- BigQuery User
- BigQuery Admin
- Storage Admin
- Compute Admin
To run the project, first go to the terraform folder to create the infrastructure, then to mage to extract the data and store it in the GCS bucket and BigQuery, and finally to dbt to transform the data.