
🚀 ETL Project with Apache Spark & Azure Data Lake

Python
Apache Spark
Apache Airflow
Azure
Docker
Delta Lake

A Modern ETL Pipeline for Large-Scale Data Processing

📖 Complete Documentation • 🚀 Quick Start • 🏗️ Architecture


📖 About the Project

This project implements a modern and scalable ETL pipeline that extracts data from a SQL Server database, processes and transforms the data using Apache Spark, and stores it in Azure Data Lake following the Medallion (Bronze, Silver, Gold) architecture. The entire process is orchestrated by Apache Airflow with Docker containerization.
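At its core, the pipeline applies an extract → transform → load cycle to each source table. The sketch below shows that flow with plain Python stand-ins (the function and field names are illustrative, not the project's actual module layout; in the real pipeline, extraction reads SQL Server, transformation runs in Spark, and loading writes Delta files to Azure Data Lake):

```python
# Minimal stand-in for the per-table extract -> transform -> load flow.

def extract(rows):
    """Pull raw rows from the source (stand-in for a SQL Server query)."""
    return list(rows)

def transform(rows):
    """Standardize fields (stand-in for Spark transformations)."""
    return [{**r, "customer": r["customer"].strip().title()} for r in rows]

def load(rows, layer):
    """Persist rows to a layer (stand-in for a Delta write to ADLS)."""
    return {"layer": layer, "count": len(rows)}

raw = extract([{"customer": "  ana silva "}, {"customer": "JOAO SOUZA"}])
clean = transform(raw)
result = load(clean, "silver")
print(result)    # {'layer': 'silver', 'count': 2}
print(clean[0])  # {'customer': 'Ana Silva'}
```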

🎯 Business Context

The project simulates a logistics and transportation system, with more than 200k records spread across multiple tables:

  • 👥 Customers and drivers
  • 🚛 Vehicles and fleets
  • 📦 Deliveries and pickups
  • 🛣️ Routes and journeys
  • 🔧 Maintenance and fueling
  • 🚨 Fines and violations

🎯 Objectives

  • ✅ Extract data from SQL Server efficiently
  • ✅ Store data in Azure Data Lake with organized layers
  • ✅ Process data with Apache Spark using Delta Lake
  • ✅ Transform data following best quality practices
  • ✅ Automate the entire pipeline with Apache Airflow
  • ✅ Monitor executions and performance
  • ✅ Implement a dimensional model for analytics

๐Ÿ—๏ธ Architecture

*(Architecture diagram)*

📊 Data Layers (Medallion)

  • 🥉 Bronze: Raw data in Delta format
  • 🥈 Silver: Clean and standardized data
  • 🥇 Gold: Dimensional model and KPIs
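Each layer maps to its own container in ADLS Gen2. A small helper (the storage account and table names below are placeholders) shows the `abfss://` URI convention Spark would use to read and write each layer:

```python
# Build an ADLS Gen2 URI for a table in a given medallion layer.
# Format: abfss://<container>@<account>.dfs.core.windows.net/<path>

def layer_path(account: str, container: str, table: str) -> str:
    return f"abfss://{container}@{account}.dfs.core.windows.net/{table}"

uri = layer_path("mystorageacct", "bronze", "deliveries")
print(uri)  # abfss://bronze@mystorageacct.dfs.core.windows.net/deliveries
```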

🚀 Running

📋 Prerequisites

Make sure you have the following installed (each is used in the steps below):

  • Git
  • Docker
  • Azure CLI (az)
  • Terraform
  • Poetry
  • Astro CLI

Installation

  1. Clone the repository

    git clone https://github.com/arturoburigo/projeto_etl_spark
    cd projeto_etl_spark
  2. Start the SQL Server with pre-built data:

    docker run --platform linux/amd64 -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=satc@2025" -p 1433:1433 --name etl-deliveries-db -d arturoburigo/mssql-etl-deliveries-db:latest
  3. Set up Azure resources:

    • Create a Microsoft/Azure account with access to paid resources

    • In the Azure Portal, create a workspace following the Microsoft documentation

    • During this process, you will create a resource group. Save the resource group name as it will be used in the next step

    • Configure Azure:

    az login
    # Configure your credentials in the .env file
  4. Configure Terraform:

    • In the file /iac/variables.tf, modify the following variable by adding the resource group you created previously:

    *(Screenshot of /iac/variables.tf showing the resource group variable)*

  5. Deploy the cloud environment:

    cd iac
    terraform init
    terraform apply
  6. Verify Azure resources:

    • Check the Azure Portal for the MS SQL Server, MS SQL Database, and ADLS Gen2 containing the containers landing-zone, bronze, silver, and gold that were created in the previous step
  7. Generate SAS Token:

    • In the Azure Portal, generate a SAS TOKEN for the landing-zone container following this documentation
    • Save this token securely as it will be used in the next step
  8. Create environment files:

    • Create a .env file in the project root (based on .env.example) and another in the astro folder
  9. Configure environment variables:

    • Fill in the variables in both .env files with your Azure credentials and SAS token
  10. Set up Python environment:

    poetry install
    poetry env activate
  11. Start Airflow:

    cd astro
    astro dev start
  12. Execute the pipeline:

    • Navigate to the DAG "Medallion Architecture - ETL"
    • Click "Trigger DAG"

🔧 Configuration

๐Ÿ” Environment Variables

Create a .env file based on .env.example:

# Azure Data Lake
ADLS_ACCOUNT_NAME=your_storage_account
ADLS_FILE_SYSTEM_NAME=landing
ADLS_BRONZE_CONTAINER_NAME=bronze
ADLS_SILVER_CONTAINER_NAME=silver
ADLS_GOLD_CONTAINER_NAME=gold
ADLS_SAS_TOKEN=your_sas_token

# SQL Server
SQL_SERVER=your_server.database.windows.net
SQL_DATABASE=your_database
SQL_SCHEMA=dbo
SQL_USERNAME=your_username
SQL_PASSWORD=your_password

# Spark Configuration
SPARK_DRIVER_MEMORY=4g
SPARK_EXECUTOR_MEMORY=4g
SPARK_EXECUTOR_CORES=2
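A stdlib-only sketch of how these variables could be parsed and mapped onto Spark configuration keys (the actual project may load them differently, e.g. with python-dotenv):

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def spark_conf(env):
    """Map .env entries onto the corresponding Spark config keys."""
    return {
        "spark.driver.memory": env["SPARK_DRIVER_MEMORY"],
        "spark.executor.memory": env["SPARK_EXECUTOR_MEMORY"],
        "spark.executor.cores": env["SPARK_EXECUTOR_CORES"],
    }

sample = (
    "# Spark Configuration\n"
    "SPARK_DRIVER_MEMORY=4g\n"
    "SPARK_EXECUTOR_MEMORY=4g\n"
    "SPARK_EXECUTOR_CORES=2\n"
)
conf = spark_conf(parse_env(sample))
print(conf["spark.driver.memory"])  # 4g
```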

📊 Data Pipeline

🔄 Execution Flow

  1. ๐Ÿ” Landing Zone: Extract data from SQL Server to CSV
  2. ๐Ÿฅ‰ Bronze Layer: Ingest CSVs in Delta format
  3. ๐Ÿฅˆ Silver Layer: Clean, standardize, and ensure data quality
  4. ๐Ÿฅ‡ Gold Layer: Create dimensional model and calculate KPIs
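To illustrate the Gold step, a KPI such as deliveries per driver reduces to a group-and-count. In the pipeline this would be Spark SQL over Delta tables, but the logic is the same (the `driver_id` values below are illustrative):

```python
from collections import Counter

def deliveries_per_driver(deliveries):
    """Aggregate Silver-layer delivery rows into a per-driver count KPI."""
    return Counter(d["driver_id"] for d in deliveries)

silver_rows = [
    {"driver_id": "D1"},
    {"driver_id": "D2"},
    {"driver_id": "D1"},
]
kpis = deliveries_per_driver(silver_rows)
print(dict(kpis))  # {'D1': 2, 'D2': 1}
```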

๐Ÿค Contributing

Contributions are always welcome! Follow these steps:

  1. Fork the project
  2. Create a branch for your feature (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

👥 Team

Arturo Burigo
Airflow | Terraform | ETL

Luiz Bezerra
Bronze | Gold | BI

Gabriel Morona
Silver | BI

Maria Laura
Gold | Docs

Amanda Dimas
Gold | SQL | Docs

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.