# Adventure Works Data Engineering Project
This repository contains an end-to-end data engineering implementation from scratch using Azure services and Apache Spark.
The project ingests raw Adventure Works data, transforms it through a medallion-style architecture, serves it via Synapse, and delivers analytics to Power BI.
## 🎯 Project Goal
Build a production-style modern data platform that:
- Ingests source files from Git/HTTP using Azure Data Factory (ADF).
- Stores raw and curated datasets in Azure Data Lake Storage Gen2 (ADLS).
- Performs scalable transformations using Azure Databricks + Apache Spark.
- Exposes analytics-ready data with Azure Synapse Analytics (external tables/views).
- Delivers business reporting in Power BI.
## 🧱 Tech Stack
- Azure Data Factory (or Synapse pipelines) for orchestration and ingestion
- Azure Data Lake Storage Gen2 for raw/bronze/silver/gold storage
- Azure Databricks for Spark-based transformation
- Apache Spark for distributed data processing
- Azure Synapse Analytics for SQL serving layer
- Power BI for dashboards and business reporting
## 📁 Repository Structure
```
.
├── pipeline/            # ADF pipeline definitions
├── dataset/             # Dataset definitions for source/sink
├── linkedService/       # Linked service connections
├── sqlscript/           # Synapse SQL scripts (schema, tables, views)
├── Data/                # Sample Adventure Works CSV files
├── factory/             # Factory definition
├── integrationRuntime/  # Integration runtime configuration
└── credential/          # Workspace identity credentials config
```
## 🔄 End-to-End Workflow (Implemented)
- Source configuration
  - Store source metadata (file URL, sink folder, file name) in a control JSON.
- Ingestion with the ADF pipeline `GitToRawData`
  - `LookupGit` reads the metadata list.
  - `ForEachgit` iterates through each file entry.
  - `GitToRaw` copies CSV files from the HTTP/Git source into the ADLS raw zone.
- Data lake layering
  - The raw/bronze layer stores ingested files.
  - The silver layer stores cleaned and standardized data.
  - The gold layer stores analytics-ready curated data.
- Transformations (Databricks + Spark)
  - Clean, cast, join, and enrich domain datasets.
  - Apply business transformations for reporting consumption.
- Serving (Synapse SQL)
  - Create a schema and external tables over the curated data.
  - Build views for BI-friendly data access.
- Consumption (Power BI)
  - Connect Power BI to Synapse (or the curated data endpoint).
  - Build reports and dashboards from the gold views.
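The control file that drives the Lookup/ForEach ingestion can be as simple as a JSON array with one entry per source file. A minimal sketch in Python (the field names `p_rel_url`, `p_sink_folder`, and `p_file_name` are illustrative assumptions, not taken from this repository):

```python
import json

# Hypothetical control metadata consumed by the LookupGit activity.
# Each entry describes one file: where to fetch it and where to land it.
git_metadata = [
    {
        "p_rel_url": "AdventureWorks_Sales_2015.csv",    # relative URL on the Git/HTTP source (assumed)
        "p_sink_folder": "AdventureWorks_Sales_2015",    # target folder in the ADLS raw zone (assumed)
        "p_file_name": "AdventureWorks_Sales_2015.csv",  # file name to write in ADLS (assumed)
    },
]

# Serialize to the JSON shape the pipeline's Lookup activity would read.
print(json.dumps(git_metadata, indent=2))
```

The ForEach activity then loops over this array, passing each entry's fields as parameters to the copy activity.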
## 🛠️ How to Use This Repository
- Import these artifacts into your Azure Data Factory/Synapse workspace.
- Update linked service credentials and storage account references.
- Upload the `Data/` files (or configure the Git/HTTP source as needed).
- Trigger the `GitToRawData` pipeline.
- Run the Databricks notebooks for silver/gold transformations.
- Execute the SQL scripts in `sqlscript/` to create the schema, tables, and views.
- Connect Power BI and build visual reports.
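To illustrate the kind of clean-and-cast step the silver-layer notebooks perform, here is a pure-Python stand-in (the actual notebooks use PySpark DataFrames; the column names and `M/D/YYYY` date format are assumptions about the sample CSVs, not confirmed by the repository):

```python
import csv
import io
from datetime import datetime

# A tiny in-memory stand-in for a raw-zone sales CSV (contents assumed).
raw_csv = io.StringIO(
    "OrderDate,ProductKey,OrderQuantity\n"
    "1/1/2015,214,2\n"
    "1/2/2015,214,1\n"
)

# Silver-style cleanup: standardize column names and cast types,
# mirroring what a Spark `withColumn(..., to_date(...))` chain would do.
rows = []
for rec in csv.DictReader(raw_csv):
    rows.append({
        "order_date": datetime.strptime(rec["OrderDate"], "%m/%d/%Y").date(),
        "product_key": int(rec["ProductKey"]),
        "order_quantity": int(rec["OrderQuantity"]),
    })

print(rows[0])
```

In the real pipeline the same logic runs distributed on Databricks and writes the result to the silver container in ADLS.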