# Azure Data Engineering End-to-End Project with CI/CD using Azure DevOps
This project demonstrates how to build a modern, scalable data pipeline in the cloud using Azure Data Factory, Azure DevOps, Delta Lake, and Databricks. The pipeline processes CSV datasets related to the Paris Olympics 2024, builds Silver and Gold layers with PySpark and Delta Live Tables, and implements continuous integration and delivery using Azure DevOps Pipelines.
## 🔧 Project Overview
Key features of the project:
- Ingests multiple CSV files from a GitHub repo using Azure Data Factory
- Implements dynamic Bronze → Silver → Gold layering in Databricks
- Utilizes Delta Live Tables (DLT) for streaming and quality enforcement
- Includes CI/CD with Azure DevOps Pipelines
- Stores curated data in Azure Synapse Warehouse for downstream analytics
## 🗺️ Architecture Overview
The architecture showcases an orchestrated data flow with CI/CD integration:
## 🔁 Azure Data Factory Pipeline
This pipeline handles automated ingestion of CSV files from GitHub into ADLS Gen2 Bronze zone.
- The `Lookup` (over a JSON file list) and `ForEach` logic dynamically controls file iteration.
- Ingested data is stored in a Bronze container for transformation.
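The iteration pattern above can be sketched in plain Python. The manifest contents, file names, and sink paths below are illustrative assumptions, not the project's actual ADF configuration; in the real pipeline, the Lookup activity reads such a list and the ForEach activity runs one Copy activity per entry:

```python
import json

# Illustrative file manifest, similar to what an ADF Lookup activity
# might return before a ForEach loop copies each file into Bronze.
MANIFEST = json.loads("""
[
  {"source": "athletes.csv", "sink": "bronze/athletes"},
  {"source": "coaches.csv",  "sink": "bronze/coaches"},
  {"source": "events.csv",   "sink": "bronze/events"},
  {"source": "nocs.csv",     "sink": "bronze/nocs"}
]
""")

def plan_copies(manifest):
    """Mimic the ForEach activity: one (source, sink) copy task per entry."""
    return [(item["source"], item["sink"]) for item in manifest]
```

Driving ingestion from a manifest like this is what lets the pipeline pick up new files without any change to the pipeline definition itself.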
## 🧪 Silver Layer: Dynamic Data Ingestion in Databricks
The Silver layer consists of Databricks jobs that dynamically process multiple JSON/CSV files stored in ADLS Bronze.
- Uses PySpark to parse JSON arrays
- Converts raw Bronze data to clean Silver tables
- Saves each transformed dataset to its respective Silver container
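A minimal PySpark sketch of one such Silver job is shown below. The storage URLs, dataset names, and the nested `events` column are illustrative assumptions rather than the project's exact code, and the pyspark imports are kept inside the function so the sketch reads standalone:

```python
# Sketch of a dynamic Bronze -> Silver job (paths and columns are illustrative).
BRONZE = "abfss://bronze@storageaccount.dfs.core.windows.net"
SILVER = "abfss://silver@storageaccount.dfs.core.windows.net"

def to_silver(spark, dataset: str) -> None:
    """Read one raw dataset from Bronze, flatten JSON arrays, write Delta to Silver."""
    from pyspark.sql.functions import col, explode  # available on any Databricks cluster

    df = spark.read.option("multiline", "true").json(f"{BRONZE}/{dataset}")
    if "events" in df.columns:
        # Explode the nested array into one row per element, e.g. one row per event.
        df = df.withColumn("event", explode(col("events"))).drop("events")
    df.write.format("delta").mode("overwrite").save(f"{SILVER}/{dataset}")
```

Parameterizing the job by dataset name is what makes the layer "dynamic": the same notebook logic handles every file landed in Bronze.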
## 🥇 Gold Layer: Delta Live Tables
The Gold layer is built using Delta Live Tables (DLT).
- Reads Silver tables via Spark Structured Streaming
- Applies transformations and merges into star-schema-ready tables
- DLT manages schema enforcement and handles incremental loads
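As an illustration of the DLT pattern described above (the table names, columns, and expectation are hypothetical, and `import dlt` resolves only inside a Databricks DLT pipeline, so this is a sketch rather than locally runnable code):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Gold-layer athletes table, built from the Silver layer")
@dlt.expect_or_drop("valid_name", "name IS NOT NULL")  # data quality rule
def gold_athletes():
    # Structured Streaming read from a Silver table; DLT manages
    # schema enforcement and incremental loads automatically.
    return (
        dlt.read_stream("silver_athletes")
           .select(col("name"), col("country_code"), col("discipline"))
    )
```

The `@dlt.expect_or_drop` decorator is one way DLT enforces quality: rows failing the expectation are dropped before they reach the Gold table.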
## 🗂️ Notebooks in This Project
olympics_project/
├── json_notebook.python            # Parses and explodes JSON data
├── Silver_Nocs.python              # Silver table for countries
├── Silver_Coaches & Events.python  # Silver table with nested JSON flattening
├── silver_Athletes.python          # Athlete-level silver table
└── Gold_Notebook.python            # DLT logic for the gold layer
## ⚙️ Technologies Used
- Azure DevOps – CI/CD pipeline and automation
- Azure Data Factory – Ingests raw CSV files from GitHub
- Azure Data Lake Storage Gen2 – Layered data storage (Bronze, Silver, Gold)
- Databricks – PySpark transformations, job orchestration
- Delta Lake – ACID-compliant tables
- Delta Live Tables (DLT) – Streaming transformations and data quality enforcement
- Azure Synapse Analytics (Warehouse) – Final data consumption layer


