
chromahium/azure-databricks-stock-price-pipeline

This project ingests daily equity price data from the yfinance API and processes it through a medallion architecture using PySpark on Databricks. The pipeline is orchestrated with Databricks Jobs and stores all intermediate and final datasets in Azure Data Lake Storage (ADLS).

Intro

I built this project to push myself beyond isolated scripts and into something closer to a real-world data platform. I wanted a single project that would challenge me to think end-to-end: ingesting external data, processing it at scale, and making it usable for downstream analysis.

The focus was deliberately on building cloud and data engineering fundamentals — working with Azure Databricks, Apache Spark, and ADLS — while improving my PySpark skills and understanding how production-style pipelines are orchestrated using Databricks Jobs. Rather than over-engineer, I aimed to build something simple, understandable, and complete.

Pipeline Overview

I designed this as an end-to-end OHLCV data pipeline in Azure Databricks, following a medallion architecture. The pipeline is orchestrated using Databricks Jobs, with each layer implemented as a separate notebook.

The goal of the pipeline is to ingest raw market data, standardise it for analytics using PySpark, and derive consumer-ready metrics.


Data Source

  • Access Method: yfinance Python library
  • Data: Daily OHLCV (Open, High, Low, Close, Volume)
  • Focus: MAG7 equities (AAPL, MSFT, AMZN, GOOG, META, TSLA, NVDA)

Bronze Layer — Raw Ingestion

Notebook: bronze_ohlcv.ipynb

Responsibilities

  • Fetch raw OHLCV data from the external API
  • Apply minimal structural normalisation
  • Persist raw data without analytical assumptions

Key Steps

  • Download OHLCV data for all tickers in a single request
  • Flatten multi-index columns into snake_case names
  • Add a Unix timestamp derived from the source datetime index
  • Serialise the dataset to JSON (records orientation)
  • Write raw data to ADLS Bronze, versioned by execution date
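The Bronze steps above can be sketched in a self-contained way. The real notebook starts from `yf.download(tickers, ...)`, which returns a DataFrame with multi-index columns (field, ticker); here a small hand-built pandas frame with the same shape stands in for it, and the tickers and values are illustrative only:

```python
import json
import pandas as pd

# Stand-in for the yf.download output: multi-index columns
# (field, ticker) over a DatetimeIndex. Values are illustrative.
idx = pd.to_datetime(["2025-01-02", "2025-01-03"])
cols = pd.MultiIndex.from_product([["Close", "Volume"], ["AAPL", "MSFT"]])
raw = pd.DataFrame(
    [[242.1, 250.3, 1000, 2000],
     [243.5, 249.8, 1100, 2100]],
    index=idx, columns=cols,
)

# Flatten multi-index columns into snake_case: ('Close', 'AAPL') -> 'close_aapl'
raw.columns = ["_".join(level.lower() for level in col) for col in raw.columns]

# Unix timestamp (seconds) derived from the source datetime index
raw["timestamp"] = raw.index.astype("int64") // 10**9

# Serialise to JSON (records orientation) — this payload is what
# the notebook writes to the ADLS Bronze container.
payload = json.dumps(raw.to_dict(orient="records"))
```

The actual ADLS write path and authentication are omitted here; the point is the flatten → timestamp → records-JSON sequence.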

Output

  • Storage: ADLS (Bronze container)

  • Format: JSON

  • Characteristics:

    • Append-only
    • Minimal transformation
    • Preserves source fidelity

Silver Layer — Cleaning & Reshaping

Notebook: silver_ohlcv.ipynb

Responsibilities

  • Convert raw data into an analytics-friendly schema
  • Enforce basic structural consistency
  • Prepare data for downstream aggregations

Key Steps

  • Read raw JSON data from the Bronze layer
  • Convert Unix timestamps to calendar dates
  • Reshape data from wide format (one column per ticker) to long format (one row per ticker per day)
  • Standardise column names and data types
  • Write curated data to ADLS Silver in columnar format
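The wide-to-long reshape is the core of this layer. The notebook does it in PySpark; the sketch below shows the same transformation in plain pandas for a runnable illustration, with assumed column names (`close_<ticker>`, `timestamp`) based on the Bronze output described above:

```python
import pandas as pd

# Bronze-style wide frame: one close-price column per ticker,
# plus a unix timestamp column. Values are illustrative.
wide = pd.DataFrame({
    "timestamp": [1735776000, 1735862400],
    "close_aapl": [242.1, 243.5],
    "close_msft": [250.3, 249.8],
})

# Unix seconds -> calendar date
wide["date"] = pd.to_datetime(wide["timestamp"], unit="s").dt.date

# Wide -> long: one row per ticker per day
long_df = wide.melt(
    id_vars=["date"],
    value_vars=["close_aapl", "close_msft"],
    var_name="ticker", value_name="close",
)

# Standardise the ticker column: 'close_aapl' -> 'AAPL'
long_df["ticker"] = long_df["ticker"].str.removeprefix("close_").str.upper()
```

In the notebook the resulting long-format frame is then written to the ADLS Silver container as Parquet.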

Output

  • Storage: ADLS (Silver container)
  • Format: Parquet

Gold Layer — Enrichment & Metrics

Notebook: gold_ohlcv.ipynb

Responsibilities

  • Derive simple, domain-relevant metrics
  • Produce a consumer-ready dataset

Key Steps

  • Read curated Silver data

  • Derive price range (high - low)

  • Derive price direction indicator:

    • -1 → price down
    • 0 → no change
    • 1 → price up
  • Write enriched data to ADLS Gold, versioned by execution date
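The Gold derivations are simple column expressions. A pandas sketch (the notebook uses PySpark; the direction here is assumed to compare close against open, and the sample values are illustrative):

```python
import numpy as np
import pandas as pd

# Silver-style long frame; column names assumed from the Silver output
df = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "date":   ["2025-01-02", "2025-01-03", "2025-01-02"],
    "open":   [240.0, 244.0, 251.0],
    "high":   [243.0, 245.0, 252.0],
    "low":    [239.5, 242.0, 249.0],
    "close":  [242.1, 243.5, 249.8],
})

# Price range: intraday high minus low
df["price_range"] = df["high"] - df["low"]

# Direction indicator: -1 down, 0 no change, 1 up
df["direction"] = np.sign(df["close"] - df["open"]).astype(int)
```

The enriched frame is what gets written to the ADLS Gold container, versioned by execution date.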

Output

  • Storage: ADLS (Gold container)

  • Format: Parquet

  • Use Cases:

    • Exploratory analysis
    • Feature generation
    • Downstream analytics

Orchestration & Scheduling

  • Implemented using Databricks Jobs

  • Notebook-based tasks with explicit dependencies:

    • Silver depends on Bronze
    • Gold depends on Silver
  • Scheduled as a daily batch pipeline
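The three-task dependency chain can be expressed as a Databricks Jobs API (2.1) payload. This is a hedged sketch: the job name, notebook paths, cron expression, and omitted cluster configuration are all assumptions, not the project's actual settings:

```json
{
  "name": "ohlcv-daily-pipeline",
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "UTC"
  },
  "tasks": [
    {
      "task_key": "bronze",
      "notebook_task": { "notebook_path": "/pipeline/bronze_ohlcv" }
    },
    {
      "task_key": "silver",
      "depends_on": [{ "task_key": "bronze" }],
      "notebook_task": { "notebook_path": "/pipeline/silver_ohlcv" }
    },
    {
      "task_key": "gold",
      "depends_on": [{ "task_key": "silver" }],
      "notebook_task": { "notebook_path": "/pipeline/gold_ohlcv" }
    }
  ]
}
```

The `depends_on` entries encode exactly the Silver-after-Bronze and Gold-after-Silver ordering described above.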


Design Notes

  • The pipeline assumes a single successful run per day.
  • Deduplication and rerun handling (idempotent re-execution) are not implemented.
  • Schema enforcement and data quality checks are intentionally minimal.

Completed 06/01/2025