
🚀 ETL Project with Apache Spark & Azure Data Lake

Python
Apache Spark
Apache Airflow
Azure
Docker
Delta Lake

A Modern ETL Pipeline for Large-Scale Data Processing

📖 Complete Documentation • 🚀 Quick Start • 🏗️ Architecture


📖 About the Project

This project implements a modern and scalable ETL pipeline that extracts data from a SQL Server database, processes and transforms the data using Apache Spark, and stores it in Azure Data Lake following the Medallion (Bronze, Silver, Gold) architecture. The entire process is orchestrated by Apache Airflow with Docker containerization.
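At its core, the pipeline applies an extract → transform → load cycle to each source table. The sketch below shows that flow with plain Python stand-ins (the function and field names are illustrative, not the project's actual module layout; in the real pipeline, extraction reads SQL Server, transformation runs in Spark, and loading writes Delta files to Azure Data Lake):

```python
# Minimal stand-in for the per-table extract -> transform -> load flow.

def extract(rows):
    """Pull raw rows from the source (stand-in for a SQL Server query)."""
    return list(rows)

def transform(rows):
    """Standardize fields (stand-in for Spark transformations)."""
    return [{**r, "customer": r["customer"].strip().title()} for r in rows]

def load(rows, layer):
    """Persist rows to a layer (stand-in for a Delta write to ADLS)."""
    return {"layer": layer, "count": len(rows)}

raw = extract([{"customer": "  ana silva "}, {"customer": "JOAO SOUZA"}])
clean = transform(raw)
result = load(clean, "silver")
print(result)    # {'layer': 'silver', 'count': 2}
print(clean[0])  # {'customer': 'Ana Silva'}
```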

🎯 Business Context

The project simulates a logistics and transportation system, with more than 200k records spread across multiple tables:

  • 👥 Customers and drivers
  • 🚛 Vehicles and fleets
  • 📦 Deliveries and pickups
  • 🛣️ Routes and journeys
  • 🔧 Maintenance and fueling
  • 🚨 Fines and violations

🎯 Objectives

  • ✅ Extract data from SQL Server efficiently
  • ✅ Store data in Azure Data Lake with organized layers
  • ✅ Process data with Apache Spark using Delta Lake
  • ✅ Transform data following best quality practices
  • ✅ Automate the entire pipeline with Apache Airflow
  • ✅ Monitor executions and performance
  • ✅ Implement a dimensional model for analytics

๐Ÿ—๏ธ Architecture

*(Architecture diagram)*

📊 Data Layers (Medallion)

  • 🥉 Bronze: Raw data in Delta format
  • 🥈 Silver: Clean and standardized data
  • 🥇 Gold: Dimensional model and KPIs
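Each layer maps to its own container in ADLS Gen2. A small helper (the storage account and table names below are placeholders) shows the `abfss://` URI convention Spark would use to read and write each layer:

```python
# Build an ADLS Gen2 URI for a table in a given medallion layer.
# Format: abfss://<container>@<account>.dfs.core.windows.net/<path>

def layer_path(account: str, container: str, table: str) -> str:
    return f"abfss://{container}@{account}.dfs.core.windows.net/{table}"

uri = layer_path("mystorageacct", "bronze", "deliveries")
print(uri)  # abfss://bronze@mystorageacct.dfs.core.windows.net/deliveries
```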

🚀 Running

📋 Prerequisites

Make sure you have the following installed (each is used in the steps below):

  • Git
  • Docker
  • Azure CLI (az)
  • Terraform
  • Poetry
  • Astro CLI

Installation

  1. Clone the repository

    git clone https://github.com/arturoburigo/projeto_etl_spark
    cd projeto_etl_spark
  2. Start the SQL Server with pre-built data:

    docker run --platform linux/amd64 -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=satc@2025" -p 1433:1433 --name etl-deliveries-db -d arturoburigo/mssql-etl-deliveries-db:latest
  3. Set up Azure resources:

    • Create a Microsoft/Azure account with access to paid resources

    • In the Azure Portal, create a workspace following the Microsoft documentation

    • During this process, you will create a resource group. Save the resource group name as it will be used in the next step

    • Configure Azure:

    az login
    # Configure your credentials in the .env file
  4. Configure Terraform:

    • In the file /iac/variables.tf, modify the following variable by adding the resource group you created previously:

    *(Screenshot of /iac/variables.tf showing the resource group variable)*

  5. Deploy the cloud environment:

    cd iac
    terraform init
    terraform apply
  6. Verify Azure resources:

    • Check the Azure Portal for the MS SQL Server, MS SQL Database, and ADLS Gen2 containing the containers landing-zone, bronze, silver, and gold that were created in the previous step
  7. Generate SAS Token:

    • In the Azure Portal, generate a SAS TOKEN for the landing-zone container following this documentation
    • Save this token securely as it will be used in the next step
  8. Create environment files:

    • Create a .env file in the project root (based on .env.example) and another in the astro folder
  9. Configure environment variables:

    • Fill in the variables in both .env files with your Azure credentials and SAS token
  10. Set up Python environment:

    poetry install
    poetry env activate
  11. Start Airflow:

    cd astro
    astro dev start
  12. Execute the pipeline:

    • Navigate to the DAG "Medallion Architecture - ETL"
    • Click "Trigger DAG"

🔧 Configuration

๐Ÿ” Environment Variables

Create a .env file based on .env.example:

# Azure Data Lake
ADLS_ACCOUNT_NAME=your_storage_account
ADLS_FILE_SYSTEM_NAME=landing
ADLS_BRONZE_CONTAINER_NAME=bronze
ADLS_SILVER_CONTAINER_NAME=silver
ADLS_GOLD_CONTAINER_NAME=gold
ADLS_SAS_TOKEN=your_sas_token

# SQL Server
SQL_SERVER=your_server.database.windows.net
SQL_DATABASE=your_database
SQL_SCHEMA=dbo
SQL_USERNAME=your_username
SQL_PASSWORD=your_password

# Spark Configuration
SPARK_DRIVER_MEMORY=4g
SPARK_EXECUTOR_MEMORY=4g
SPARK_EXECUTOR_CORES=2
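A stdlib-only sketch of how these variables could be parsed and mapped onto Spark configuration keys (the actual project may load them differently, e.g. with python-dotenv):

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def spark_conf(env):
    """Map .env entries onto the corresponding Spark config keys."""
    return {
        "spark.driver.memory": env["SPARK_DRIVER_MEMORY"],
        "spark.executor.memory": env["SPARK_EXECUTOR_MEMORY"],
        "spark.executor.cores": env["SPARK_EXECUTOR_CORES"],
    }

sample = (
    "# Spark Configuration\n"
    "SPARK_DRIVER_MEMORY=4g\n"
    "SPARK_EXECUTOR_MEMORY=4g\n"
    "SPARK_EXECUTOR_CORES=2\n"
)
conf = spark_conf(parse_env(sample))
print(conf["spark.driver.memory"])  # 4g
```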

📊 Data Pipeline

🔄 Execution Flow

  1. ๐Ÿ” Landing Zone: Extract data from SQL Server to CSV
  2. ๐Ÿฅ‰ Bronze Layer: Ingest CSVs in Delta format
  3. ๐Ÿฅˆ Silver Layer: Clean, standardize, and ensure data quality
  4. ๐Ÿฅ‡ Gold Layer: Create dimensional model and calculate KPIs
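To illustrate the Gold step, a KPI such as deliveries per driver reduces to a group-and-count. In the pipeline this would be Spark SQL over Delta tables, but the logic is the same (the `driver_id` values below are illustrative):

```python
from collections import Counter

def deliveries_per_driver(deliveries):
    """Aggregate Silver-layer delivery rows into a per-driver count KPI."""
    return Counter(d["driver_id"] for d in deliveries)

silver_rows = [
    {"driver_id": "D1"},
    {"driver_id": "D2"},
    {"driver_id": "D1"},
]
kpis = deliveries_per_driver(silver_rows)
print(dict(kpis))  # {'D1': 2, 'D2': 1}
```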

๐Ÿค Contributing

Contributions are always welcome! Follow these steps:

  1. Fork the project
  2. Create a branch for your feature (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

👥 Team

Arturo Burigo
Airflow | Terraform | ETL

Luiz Bezerra
Bronze | Gold | BI

Gabriel Morona
Silver | BI

Maria Laura
Gold | Docs

Amanda Dimas
Gold | SQL | Docs

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.