# chromahium/azure-databricks-stock-price-pipeline
This project ingests daily equity price data from the yfinance API and processes it through a medallion architecture using PySpark on Databricks. The pipeline is orchestrated with Databricks Jobs and stores all intermediate and final datasets in Azure Data Lake Storage (ADLS).
## Intro
I built this project to push myself beyond isolated scripts and into something closer to a real-world data platform. I wanted a single project that would challenge me to think end-to-end: ingesting external data, processing it at scale, and making it usable for downstream analysis.
The focus was deliberately on building cloud and data engineering fundamentals — working with Azure Databricks, Apache Spark, and ADLS — while improving my PySpark skills and understanding how production-style pipelines are orchestrated using Databricks Jobs. Rather than over-engineering, the goal was to build something simple, understandable, and complete.
## Pipeline Overview
I've designed an end-to-end OHLCV data pipeline implemented in Azure Databricks, following a medallion architecture. The pipeline is orchestrated using Databricks Jobs, with each layer implemented as a separate notebook.
The goal of the pipeline is to ingest raw market data, standardise it for analytics using PySpark, and derive consumer-ready metrics.
## Data Source
- Access Method: `yfinance` Python library
- Data: Daily OHLCV (Open, High, Low, Close, Volume)
- Focus: MAG7 equities (AAPL, MSFT, AMZN, GOOG, META, TSLA, NVDA)
## Bronze Layer — Raw Ingestion
Notebook: `bronze_ohlcv.ipynb`
### Responsibilities
- Fetch raw OHLCV data from the external API
- Apply minimal structural normalisation
- Persist raw data without analytical assumptions
### Key Steps
- Download OHLCV data for all tickers in a single request
- Flatten multi-index columns into snake_case names
- Add a Unix timestamp derived from the source datetime index
- Serialise the dataset to JSON (records orientation)
- Write raw data to ADLS Bronze, versioned by execution date
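The flattening and timestamping steps above can be sketched as follows. A small synthetic multi-index frame stands in for a live `yf.download(...)` result so the example is self-contained; the tickers and values are illustrative, not the project's actual data.

```python
import json
import pandas as pd

# Synthetic stand-in for yf.download(["AAPL", "MSFT"], period="1d"):
# columns are a (field, ticker) MultiIndex, index is a DatetimeIndex.
idx = pd.to_datetime(["2024-01-02"])
cols = pd.MultiIndex.from_product([["Open", "Close"], ["AAPL", "MSFT"]])
raw = pd.DataFrame([[185.0, 370.0, 186.0, 372.0]], index=idx, columns=cols)

# Flatten the multi-index columns into snake_case names, e.g. open_aapl.
raw.columns = [f"{field}_{ticker}".lower() for field, ticker in raw.columns]

# Add a Unix timestamp (seconds) derived from the source datetime index.
raw["unix_ts"] = raw.index.astype("int64") // 10**9

# Serialise to JSON (records orientation), ready to land in Bronze.
payload = raw.to_json(orient="records")
records = json.loads(payload)
```

Keeping the Bronze output this close to the raw API response means any later schema decision can be revisited without re-ingesting from the source.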
### Output
- Storage: ADLS (Bronze container)
- Format: JSON
- Characteristics:
  - Append-only
  - Minimal transformation
  - Preserves source fidelity
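Versioning by execution date amounts to embedding the run date in the write path. A hypothetical sketch — the storage account, container, and folder names here are placeholders, not the project's real configuration:

```python
from datetime import date

def bronze_path(run_date: date) -> str:
    # abfss://<container>@<account>.dfs.core.windows.net/<folder>/<date>/...
    return (
        "abfss://bronze@mystorageaccount.dfs.core.windows.net/"
        f"ohlcv/{run_date.isoformat()}/ohlcv.json"
    )

path = bronze_path(date(2024, 1, 2))
```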
## Silver Layer — Cleaning & Reshaping
Notebook: `silver_ohlcv.ipynb`
### Responsibilities
- Convert raw data into an analytics-friendly schema
- Enforce basic structural consistency
- Prepare data for downstream aggregations
### Key Steps
- Read raw JSON data from the Bronze layer
- Convert Unix timestamps to calendar dates
- Reshape data from wide format (one column per ticker) to long format (one row per ticker per day)
- Standardise column names and data types
- Write curated data to ADLS Silver in columnar format
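The timestamp conversion and wide-to-long reshape can be illustrated with a pandas stand-in (the notebook itself uses PySpark). The column names are assumptions based on the Bronze layout described above, and the frame is trimmed to a single field for brevity:

```python
import pandas as pd

# Minimal wide-format input: one column per ticker, one row per day.
wide = pd.DataFrame(
    {
        "unix_ts": [1704153600],
        "close_aapl": [186.0],
        "close_msft": [372.0],
    }
)

# Convert Unix seconds to a calendar date.
wide["date"] = pd.to_datetime(wide["unix_ts"], unit="s").dt.date

# Reshape wide -> long: one row per ticker per day.
long = wide.melt(
    id_vars=["date"],
    value_vars=["close_aapl", "close_msft"],
    var_name="field_ticker",
    value_name="close",
)
long["ticker"] = long["field_ticker"].str.split("_").str[1].str.upper()
long = long[["date", "ticker", "close"]]
```

The long format is what makes per-ticker aggregations in the Gold layer straightforward: `ticker` becomes an ordinary grouping column instead of being encoded in column names.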
### Output
- Storage: ADLS (Silver container)
- Format: Parquet
## Gold Layer — Enrichment & Metrics
Notebook: `gold_ohlcv.ipynb`
### Responsibilities
- Derive simple, domain-relevant metrics
- Produce a consumer-ready dataset
### Key Steps
- Read curated Silver data
- Derive price range (`high - low`)
- Derive price direction indicator:
  - `-1` → price down
  - `0` → no change
  - `1` → price up
- Write enriched data to ADLS Gold, versioned by execution date
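A minimal sketch of the metric derivations, again in pandas for a self-contained example (the pipeline itself runs PySpark). The direction rule here compares close to open, which is one plausible reading of "price up/down"; the actual notebook may define it differently:

```python
import numpy as np
import pandas as pd

# Illustrative Silver-shaped input: one row per ticker per day.
silver = pd.DataFrame(
    {
        "ticker": ["AAPL", "MSFT", "NVDA"],
        "open":   [186.0, 372.0, 500.0],
        "high":   [188.0, 375.0, 505.0],
        "low":    [184.0, 371.0, 498.0],
        "close":  [187.0, 371.5, 500.0],
    }
)

# Price range: intraday high minus low.
silver["price_range"] = silver["high"] - silver["low"]

# Direction indicator: -1 = down, 0 = no change, 1 = up.
silver["direction"] = np.sign(silver["close"] - silver["open"]).astype(int)
```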
### Output
- Storage: ADLS (Gold container)
- Format: Parquet
- Use Cases:
  - Exploratory analysis
  - Feature generation
  - Downstream analytics
## Orchestration & Scheduling
- Implemented using Databricks Jobs
- Notebook-based tasks with explicit dependencies:
  - Silver depends on Bronze
  - Gold depends on Silver
- Scheduled as a daily batch pipeline
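The dependency chain could be expressed in a Databricks Jobs definition along these lines. This is a hypothetical sketch: the job name, notebook paths, and schedule are placeholders, and cluster settings are omitted.

```json
{
  "name": "ohlcv-daily-pipeline",
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC"
  },
  "tasks": [
    {
      "task_key": "bronze",
      "notebook_task": { "notebook_path": "/pipeline/bronze_ohlcv" }
    },
    {
      "task_key": "silver",
      "depends_on": [{ "task_key": "bronze" }],
      "notebook_task": { "notebook_path": "/pipeline/silver_ohlcv" }
    },
    {
      "task_key": "gold",
      "depends_on": [{ "task_key": "silver" }],
      "notebook_task": { "notebook_path": "/pipeline/gold_ohlcv" }
    }
  ]
}
```

With `depends_on` set this way, a Silver failure blocks Gold but never re-runs Bronze, matching the layer ordering described above.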
## Design Notes
- The pipeline assumes a single successful run per day.
- Deduplication and rerun handling are not implemented.
- Schema enforcement and data quality checks are intentionally minimal.
