
Azure-Data-Engineering-End-to-End-Project-with-CI-CD-using-Azure-DevOps

This project demonstrates how to build a modern, scalable data pipeline in the cloud using Azure Data Factory, Azure DevOps, Delta Lake, and Databricks. The pipeline processes CSV datasets related to the Paris Olympics 2024, builds Silver and Gold layers with PySpark and Delta Live Tables, and implements continuous integration and deployment (CI/CD) using Azure DevOps.


🧭 Project Overview

Key features of the project:

  • Ingests multiple CSV files from a GitHub repo using Azure Data Factory
  • Implements dynamic Bronze → Silver → Gold layering in Databricks
  • Utilizes Delta Live Tables (DLT) for streaming and quality enforcement
  • Includes CI/CD with Azure DevOps Pipelines
  • Stores curated data in Azure Synapse Warehouse for downstream analytics

πŸ—ΊοΈ Architecture Overview

The architecture showcases an orchestrated data flow with CI/CD integration:


πŸ”„ Azure Data Factory Pipeline

This pipeline handles automated ingestion of CSV files from GitHub into ADLS Gen2 Bronze zone.

  • The LookupJson and ForEach activities dynamically control file iteration.
  • Ingested data is stored in a Bronze container for transformation.
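The file list that drives the Lookup and ForEach activities is typically a small JSON array stored alongside the pipeline. A hypothetical example of what that configuration might look like (file names, URLs, and folder names are illustrative, not the project's actual values):

```json
[
  {
    "file_name": "athletes.csv",
    "source_url": "https://raw.githubusercontent.com/<user>/<repo>/main/data/athletes.csv",
    "sink_folder": "bronze/athletes"
  },
  {
    "file_name": "coaches.csv",
    "source_url": "https://raw.githubusercontent.com/<user>/<repo>/main/data/coaches.csv",
    "sink_folder": "bronze/coaches"
  }
]
```

Inside the ForEach, a Copy activity can then reference each entry with ADF expressions such as `@item().source_url` and `@item().sink_folder`.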

[Screenshot: ADF pipeline]


πŸ§ͺ Silver Layer: Dynamic Data Ingestion in Databricks

The Silver layer consists of Databricks jobs that dynamically process multiple JSON/CSV files stored in ADLS Bronze.

  • Uses PySpark to parse JSON arrays
  • Converts raw Bronze data to clean Silver tables
  • Saves each transformed dataset to its respective Silver container
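A minimal sketch of this dynamic pattern, assuming illustrative dataset names, mount paths, and cleaning logic (the repo's actual notebooks may differ):

```python
"""Sketch of dynamic Bronze -> Silver processing (illustrative paths and names)."""

# Hypothetical dataset names; the real notebooks handle the Olympics CSV files.
DATASETS = ["athletes", "coaches", "nocs", "events"]

def bronze_to_silver_paths(dataset, bronze_root="/mnt/bronze", silver_root="/mnt/silver"):
    """Derive the raw CSV source and the Delta destination for one dataset."""
    return f"{bronze_root}/{dataset}.csv", f"{silver_root}/{dataset}"

def build_silver_tables(spark):
    """Read each raw CSV, trim string columns, and write a Silver Delta table."""
    from pyspark.sql import functions as F  # imported lazily; requires a Spark runtime

    for ds in DATASETS:
        src, dst = bronze_to_silver_paths(ds)
        df = spark.read.option("header", True).csv(src)
        # All columns are strings (no schema inference), so trimming is safe.
        cleaned = df.select([F.trim(F.col(c)).alias(c) for c in df.columns])
        cleaned.write.format("delta").mode("overwrite").save(dst)
```

Keeping the path derivation in a plain function makes the loop easy to extend when new datasets are added.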

[Screenshot: dynamic data reading notebook]


πŸ₯‡ Gold Layer: Delta Live Tables

The Gold layer is built using Delta Live Tables (DLT).

  • Reads Silver tables via Spark Structured Streaming
  • Applies transformations and merges into star-schema-ready tables
  • DLT manages schema enforcement and handles incremental loads
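A hedged sketch of what such a DLT definition can look like. Table names, the expectation rule, and the aggregation are hypothetical; the `dlt` module is only importable inside a Databricks pipeline, so it is passed in as a parameter here:

```python
"""Sketch of a Gold-layer Delta Live Tables definition (illustrative names)."""

# Hypothetical data-quality expectation: drop rows without a country code.
QUALITY_RULES = {"valid_noc": "noc IS NOT NULL"}

def register_gold_tables(dlt):
    """Register a streaming Gold table; `dlt` is the Databricks-provided module."""
    from pyspark.sql import functions as F

    @dlt.table(name="gold_athlete_counts", comment="Athletes per country (illustrative)")
    @dlt.expect_all_or_drop(QUALITY_RULES)
    def gold_athlete_counts():
        # Incrementally read the Silver table as a stream and aggregate it.
        return (
            dlt.read_stream("silver_athletes")
               .groupBy("noc")
               .agg(F.count("*").alias("athlete_count"))
        )
```

`expect_all_or_drop` enforces the quality rules by discarding failing rows, while DLT itself tracks the streaming state for incremental loads.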

[Screenshot: DLT pipeline]


πŸ—‚οΈ Notebooks in This Project

olympics_project/
β”œβ”€β”€ json_notebook.python               # Parses and explodes JSON data
β”œβ”€β”€ Silver_Nocs.python                 # Silver table for countries
β”œβ”€β”€ Silver_Coaches & Events.python     # Silver table with nested JSON flattening
β”œβ”€β”€ silver_Athletes.python             # Athlete-level silver table
β”œβ”€β”€ Gold_Notebook.python               # DLT logic for the gold layer

βš™οΈ Technologies Used

  • Azure DevOps – CI/CD pipeline and automation
  • Azure Data Factory – Ingests raw CSV files from GitHub
  • Azure Data Lake Storage Gen2 – Layered data storage (Bronze, Silver, Gold)
  • Databricks – PySpark transformations, job orchestration
  • Delta Lake – ACID-compliant tables
  • Delta Live Tables (DLT) – Streaming transformations and data quality enforcement
  • Azure Synapse Analytics (Warehouse) – Final data consumption layer