# arturoburigo/spark-airflow-pipeline

Production ETL pipeline with Apache Airflow, Spark & Azure Data Lake

## ETL Project with Apache Spark & Azure Data Lake

Modern ETL Pipeline for Large-Scale Data Processing

Complete Documentation • Quick Start • Architecture
## Table of Contents

- About the Project
- Objectives
- Architecture
- Running
- Configuration
- Data Pipeline
- Contributing
- Team
- License
## About the Project

This project implements a modern, scalable ETL pipeline that extracts data from a SQL Server database, processes and transforms it with Apache Spark, and stores it in Azure Data Lake following the Medallion architecture (Bronze, Silver, Gold). The entire process is orchestrated by Apache Airflow and containerized with Docker.
### Business Context

The project simulates a logistics and transportation system with more than 200,000 records spread across multiple tables:

- Customers and drivers
- Vehicles and fleets
- Deliveries and pickups
- Routes and journeys
- Maintenance and fueling
- Fines and violations
## Objectives

- Extract data from SQL Server efficiently
- Store data in Azure Data Lake with organized layers
- Process data with Apache Spark using Delta Lake
- Transform data following data-quality best practices
- Automate the entire pipeline with Apache Airflow
- Monitor executions and performance
- Implement a dimensional model for analytics
## Architecture

### Data Layers (Medallion)

- Bronze: raw data in Delta format
- Silver: clean, standardized data
- Gold: dimensional model and KPIs
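In practice, each layer maps to its own ADLS Gen2 container. As a minimal sketch of how a layer/table pair could be turned into an `abfss://` URI — the helper name and the storage-account placeholder are assumptions for illustration, not the project's actual code:

```python
# Hypothetical helper: build an abfss:// URI for a table in a Medallion layer.
# Container names match the ones provisioned by Terraform; the storage
# account name below is a placeholder.
LAYERS = {"landing-zone", "bronze", "silver", "gold"}

def adls_path(layer: str, table: str, account: str = "your_storage_account") -> str:
    """Return the ADLS Gen2 URI for `table` inside the given layer container."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"abfss://{layer}@{account}.dfs.core.windows.net/{table}"

print(adls_path("bronze", "deliveries"))
# → abfss://bronze@your_storage_account.dfs.core.windows.net/deliveries
```

Keeping path construction in one place like this makes it easy to promote a table from one layer to the next without scattering container names through the code.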
## Running

### Prerequisites

Make sure you have the following installed:

- Python 3.10+
- Docker & Docker Compose
- Azure CLI
- Poetry
### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/arturoburigo/projeto_etl_spark
   cd projeto_etl_spark
   ```

2. Start the SQL Server with pre-built data:

   ```bash
   docker run --platform linux/amd64 -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=satc@2025" \
     -p 1433:1433 --name etl-deliveries-db -d arturoburigo/mssql-etl-deliveries-db:latest
   ```

3. Set up Azure resources:
   - Create a Microsoft/Azure account with access to paid resources.
   - In the Azure Portal, create a workspace following the Microsoft documentation.
   - During this process you will create a resource group. Save the resource group name; it is used in the next step.

4. Configure Azure:

   ```bash
   az login
   ```

   Then configure your credentials in the `.env` file.

5. Configure Terraform:
   - In `/iac/variables.tf`, set the resource group variable to the resource group you created previously.

6. Deploy the cloud environment:

   ```bash
   cd iac
   terraform init
   terraform apply
   ```

7. Verify Azure resources:
   - In the Azure Portal, check that the MS SQL Server, MS SQL Database, and the ADLS Gen2 containers `landing-zone`, `bronze`, `silver`, and `gold` were created in the previous step.

8. Generate a SAS token:
   - In the Azure Portal, generate a SAS token for the `landing-zone` container following the documentation.
   - Save this token securely; it is used in the next step.

9. Create environment files:
   - Create a `.env` file in the `astro` folder.

10. Configure environment variables:
    - Fill in the variables in both `.env` files with your Azure credentials and SAS token.

11. Set up the Python environment:

    ```bash
    poetry env activate
    poetry install
    ```

12. Start Airflow:

    ```bash
    cd astro
    astro dev start
    ```

13. Execute the pipeline:
    - In the Airflow UI, navigate to the DAG "Medallion Architecture - ETL" and click "Trigger DAG".
## Configuration

### Environment Variables

Create a `.env` file based on `.env.example`:

```env
# Azure Data Lake
ADLS_ACCOUNT_NAME=your_storage_account
ADLS_FILE_SYSTEM_NAME=landing
ADLS_BRONZE_CONTAINER_NAME=bronze
ADLS_SILVER_CONTAINER_NAME=silver
ADLS_GOLD_CONTAINER_NAME=gold
ADLS_SAS_TOKEN=your_sas_token

# SQL Server
SQL_SERVER=your_server.database.windows.net
SQL_DATABASE=your_database
SQL_SCHEMA=dbo
SQL_USERNAME=your_username
SQL_PASSWORD=your_password

# Spark Configuration
SPARK_DRIVER_MEMORY=4g
SPARK_EXECUTOR_MEMORY=4g
SPARK_EXECUTOR_CORES=2
```
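As an illustration, the Spark sizing variables above can be read into a config dict like this — a sketch assuming the variables have already been loaded into the process environment (e.g. via python-dotenv); the function name is invented for the example:

```python
import os

def spark_conf_from_env() -> dict:
    """Collect Spark sizing options, falling back to the defaults shown above."""
    return {
        "spark.driver.memory": os.environ.get("SPARK_DRIVER_MEMORY", "4g"),
        "spark.executor.memory": os.environ.get("SPARK_EXECUTOR_MEMORY", "4g"),
        "spark.executor.cores": os.environ.get("SPARK_EXECUTOR_CORES", "2"),
    }

print(spark_conf_from_env())
```

Centralizing the env lookups with explicit defaults keeps the Spark session builder free of scattered `os.environ` calls.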
## Data Pipeline

### Execution Flow

1. Landing Zone: extract data from SQL Server to CSV
2. Bronze Layer: ingest the CSVs in Delta format
3. Silver Layer: clean, standardize, and ensure data quality
4. Gold Layer: build the dimensional model and calculate KPIs
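The stages are strictly sequential: each layer only reads what the previous one wrote. A toy sketch of that hand-off, with plain Python functions standing in for the Airflow tasks — the stage logic and sample records are invented for illustration, not the project's actual transformations:

```python
# Toy Medallion hand-off: each stage consumes the previous stage's output.
def landing(rows):
    # Extract: pretend these rows came from SQL Server via CSV.
    return list(rows)

def bronze(rows):
    # Ingest raw records unchanged (a Delta write in the real pipeline).
    return rows

def silver(rows):
    # Clean: drop records missing a customer id.
    return [r for r in rows if r.get("customer_id") is not None]

def gold(rows):
    # KPI: count deliveries per customer.
    kpi = {}
    for r in rows:
        kpi[r["customer_id"]] = kpi.get(r["customer_id"], 0) + 1
    return kpi

sample = [{"customer_id": 1}, {"customer_id": 1}, {"customer_id": None}]
print(gold(silver(bronze(landing(sample)))))  # → {1: 2}
```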
## Contributing

Contributions are always welcome! Follow these steps:

1. Fork the project
2. Create a branch for your feature (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## Team

| Member | Focus |
| --- | --- |
| Arturo Burigo | Airflow, Terraform, ETL |
| Luiz Bezerra | Bronze, Gold, BI |
| Gabriel Morona | Silver, BI |
| Maria Laura | Gold, Docs |
| Amanda Dimas | Gold, SQL, Docs |
## License

This project is licensed under the MIT License; see the LICENSE file for details.






