
chromahium/azure-databricks-stock-price-pipeline

This project ingests daily equity price data from the yfinance API and processes it through a medallion architecture using PySpark on Databricks. The pipeline is orchestrated with Databricks Jobs and stores all intermediate and final datasets in Azure Data Lake Storage (ADLS).

Intro

I built this project to push myself beyond isolated scripts and into something closer to a real-world data platform. I wanted a single project that would challenge me to think end-to-end: ingesting external data, processing it at scale, and making it usable for downstream analysis.

The focus was deliberately on building cloud and data engineering fundamentals — working with Azure Databricks, Apache Spark, and ADLS — while improving my PySpark skills and understanding how production-style pipelines are orchestrated using Databricks Jobs. Rather than over-engineer, I aimed to build something simple, understandable, and complete.

Pipeline Overview

I designed this as an end-to-end OHLCV data pipeline in Azure Databricks, following a medallion architecture. The pipeline is orchestrated using Databricks Jobs, with each layer implemented as a separate notebook.

The goal of the pipeline is to ingest raw market data, standardise it for analytics using PySpark, and derive consumer-ready metrics.


Data Source

  • Access Method: yfinance Python library
  • Data: Daily OHLCV (Open, High, Low, Close, Volume)
  • Focus: MAG7 equities (AAPL, MSFT, AMZN, GOOG, META, TSLA, NVDA)

Bronze Layer — Raw Ingestion

Notebook: bronze_ohlcv.ipynb

Responsibilities

  • Fetch raw OHLCV data from the external API
  • Apply minimal structural normalisation
  • Persist raw data without analytical assumptions

Key Steps

  • Download OHLCV data for all tickers in a single request
  • Flatten multi-index columns into snake_case names
  • Add a Unix timestamp derived from the source datetime index
  • Serialise the dataset to JSON (records orientation)
  • Write raw data to ADLS Bronze, versioned by execution date
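The Bronze steps above can be sketched in a self-contained way. The real notebook starts from `yf.download(tickers, ...)`, which returns a DataFrame with multi-index columns (field, ticker); here a small hand-built pandas frame with the same shape stands in for it, and the tickers and values are illustrative only:

```python
import json
import pandas as pd

# Stand-in for the yf.download output: multi-index columns
# (field, ticker) over a DatetimeIndex. Values are illustrative.
idx = pd.to_datetime(["2025-01-02", "2025-01-03"])
cols = pd.MultiIndex.from_product([["Close", "Volume"], ["AAPL", "MSFT"]])
raw = pd.DataFrame(
    [[242.1, 250.3, 1000, 2000],
     [243.5, 249.8, 1100, 2100]],
    index=idx, columns=cols,
)

# Flatten multi-index columns into snake_case: ('Close', 'AAPL') -> 'close_aapl'
raw.columns = ["_".join(level.lower() for level in col) for col in raw.columns]

# Unix timestamp (seconds) derived from the source datetime index
raw["timestamp"] = raw.index.astype("int64") // 10**9

# Serialise to JSON (records orientation) — this payload is what
# the notebook writes to the ADLS Bronze container.
payload = json.dumps(raw.to_dict(orient="records"))
```

The actual ADLS write path and authentication are omitted here; the point is the flatten → timestamp → records-JSON sequence.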

Output

  • Storage: ADLS (Bronze container)

  • Format: JSON

  • Characteristics:

    • Append-only
    • Minimal transformation
    • Preserves source fidelity

Silver Layer — Cleaning & Reshaping

Notebook: silver_ohlcv.ipynb

Responsibilities

  • Convert raw data into an analytics-friendly schema
  • Enforce basic structural consistency
  • Prepare data for downstream aggregations

Key Steps

  • Read raw JSON data from the Bronze layer
  • Convert Unix timestamps to calendar dates
  • Reshape data from wide format (one column per ticker) to long format (one row per ticker per day)
  • Standardise column names and data types
  • Write curated data to ADLS Silver in columnar format
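The wide-to-long reshape is the core of this layer. The notebook does it in PySpark; the sketch below shows the same transformation in plain pandas for a runnable illustration, with assumed column names (`close_<ticker>`, `timestamp`) based on the Bronze output described above:

```python
import pandas as pd

# Bronze-style wide frame: one close-price column per ticker,
# plus a unix timestamp column. Values are illustrative.
wide = pd.DataFrame({
    "timestamp": [1735776000, 1735862400],
    "close_aapl": [242.1, 243.5],
    "close_msft": [250.3, 249.8],
})

# Unix seconds -> calendar date
wide["date"] = pd.to_datetime(wide["timestamp"], unit="s").dt.date

# Wide -> long: one row per ticker per day
long_df = wide.melt(
    id_vars=["date"],
    value_vars=["close_aapl", "close_msft"],
    var_name="ticker", value_name="close",
)

# Standardise the ticker column: 'close_aapl' -> 'AAPL'
long_df["ticker"] = long_df["ticker"].str.removeprefix("close_").str.upper()
```

In the notebook the resulting long-format frame is then written to the ADLS Silver container as Parquet.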

Output

  • Storage: ADLS (Silver container)
  • Format: Parquet

Gold Layer — Enrichment & Metrics

Notebook: gold_ohlcv.ipynb

Responsibilities

  • Derive simple, domain-relevant metrics
  • Produce a consumer-ready dataset

Key Steps

  • Read curated Silver data

  • Derive price range (high - low)

  • Derive price direction indicator:

    • -1 → price down
    • 0 → no change
    • 1 → price up
  • Write enriched data to ADLS Gold, versioned by execution date
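The Gold derivations are simple column expressions. A pandas sketch (the notebook uses PySpark; the direction here is assumed to compare close against open, and the sample values are illustrative):

```python
import numpy as np
import pandas as pd

# Silver-style long frame; column names assumed from the Silver output
df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "date":   ["2025-01-02", "2025-01-03", "2025-01-02"],
    "open":   [240.0, 244.0, 251.0],
    "high":   [243.0, 245.0, 252.0],
    "low":    [239.5, 242.0, 249.0],
    "close":  [242.1, 243.5, 249.8],
})

# Price range: intraday high minus low
df["price_range"] = df["high"] - df["low"]

# Direction indicator: -1 down, 0 no change, 1 up
df["direction"] = np.sign(df["close"] - df["open"]).astype(int)
```

The enriched frame is what gets written to the ADLS Gold container, versioned by execution date.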

Output

  • Storage: ADLS (Gold container)

  • Format: Parquet

  • Use Cases:

    • Exploratory analysis
    • Feature generation
    • Downstream analytics

Orchestration & Scheduling

  • Implemented using Databricks Jobs

  • Notebook-based tasks with explicit dependencies:

    • Silver depends on Bronze
    • Gold depends on Silver
  • Scheduled as a daily batch pipeline
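The three-task dependency chain can be expressed as a Databricks Jobs API (2.1) payload. This is a hedged sketch: the job name, notebook paths, cron expression, and omitted cluster configuration are all assumptions, not the project's actual settings:

```json
{
  "name": "ohlcv-daily-pipeline",
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC"
  },
  "tasks": [
    {
      "task_key": "bronze",
      "notebook_task": { "notebook_path": "/pipeline/bronze_ohlcv" }
    },
    {
      "task_key": "silver",
      "depends_on": [{ "task_key": "bronze" }],
      "notebook_task": { "notebook_path": "/pipeline/silver_ohlcv" }
    },
    {
      "task_key": "gold",
      "depends_on": [{ "task_key": "silver" }],
      "notebook_task": { "notebook_path": "/pipeline/gold_ohlcv" }
    }
  ]
}
```

The `depends_on` entries encode exactly the Silver-after-Bronze and Gold-after-Silver ordering described above.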


Design Notes

  • The pipeline assumes a single successful run per day.
  • Deduplication and rerun handling (idempotent re-execution) are not implemented.
  • Schema enforcement and data quality checks are intentionally minimal.

Completed 06/01/2025