NFL Movement Prediction - Medallion Architecture Pipeline

A Databricks-based medallion architecture pipeline for the NFL Big Data Bowl 2026 prediction competition using PySpark, SQL, SparkML, and Azure ADLS Gen2. This project implements a complete data engineering and machine learning pipeline orchestrated through Databricks Workflows.

๐Ÿ—๏ธ Architecture

The pipeline follows the Medallion Architecture pattern, organizing data into three quality layers plus an ML layer:

  • 🥉 Bronze Layer (01-Bronze.py): Raw data ingestion from Kaggle, preserving original data with audit columns
  • 🥈 Silver Layer (02-Silver.py, 02-Silver-EDA.py): Data cleaning, quality checks, and exploratory data analysis
  • 🥇 Gold Layer (03-Gold.py): Feature engineering, aggregations, and ML-ready datasets
  • 🤖 ML Layer (04.1-ML.py): Model training, evaluation, and prediction using SparkML

Pipeline Flow

Kaggle Dataset → Bronze (Raw) → Silver (Cleaned) → Gold (Features) → ML (Models)

📋 Prerequisites

  • Databricks Workspace (with appropriate cluster configuration)
  • Azure Data Lake Storage Gen2 account with container created
  • Kaggle API credentials (username and API key)
  • Python 3.x with PySpark support
  • Databricks Runtime (recommended: 13.3 LTS or later)

🚀 Setup Instructions

1. Configure Azure ADLS

  1. Copy 00-Config-example.py to 00-Config.py:

    cp 00-Config-example.py 00-Config.py
  2. Edit 00-Config.py and fill in your Azure Storage Account details:

    • BLOB_ACCOUNT_NAME: Your Azure Storage Account name
    • BLOB_CONTAINER_NAME: Container name (default: nfl-container)
    • BLOB_SAS_TOKEN: Your SAS token with read/write permissions

    Note: The 00-Config.py file is excluded from Git via .gitignore to protect your credentials.
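For reference, here is a minimal sketch of what 00-Config.py might look like, assuming SAS-token authentication via the standard ABFS Hadoop options. The variable names follow this README; BASE_PATH and the exact contents of the real template may differ, and spark is the SparkSession a Databricks notebook provides.

```python
# Sketch of 00-Config.py -- fill in your own values; never commit them.
BLOB_ACCOUNT_NAME = "<your-storage-account>"
BLOB_CONTAINER_NAME = "nfl-container"
BLOB_SAS_TOKEN = "<your-sas-token>"  # least-privilege, short expiration

# Standard ABFS options for fixed-SAS-token access to abfss:// paths.
spark.conf.set(
    f"fs.azure.account.auth.type.{BLOB_ACCOUNT_NAME}.dfs.core.windows.net", "SAS")
spark.conf.set(
    f"fs.azure.sas.token.provider.type.{BLOB_ACCOUNT_NAME}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set(
    f"fs.azure.sas.fixed.token.{BLOB_ACCOUNT_NAME}.dfs.core.windows.net",
    BLOB_SAS_TOKEN)

# Base path shared by the Bronze/Silver/Gold notebooks (illustrative name).
BASE_PATH = f"abfss://{BLOB_CONTAINER_NAME}@{BLOB_ACCOUNT_NAME}.dfs.core.windows.net"
```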

2. Configure Kaggle API

Option A: Databricks Secrets (Recommended)

  1. In Databricks, navigate to Workspace → Users → Your User → Secrets (or use the Databricks CLI)
  2. Create a secret scope named kaggle (legacy CLI syntax shown; newer CLI versions use databricks secrets create-scope kaggle and databricks secrets put-secret):
    databricks secrets create-scope --scope kaggle
  3. Add two secrets:
    databricks secrets put --scope kaggle --key username
    databricks secrets put --scope kaggle --key api_key
  4. Update 01-Bronze.py to use secrets:
    kaggle_token = {
        "username": dbutils.secrets.get(scope="kaggle", key="username"),
        "key": dbutils.secrets.get(scope="kaggle", key="api_key")
    }

Option B: Environment Variables

Alternatively, you can modify 01-Bronze.py to read from environment variables or a secure config file.
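As a sketch of the environment-variable option: KAGGLE_USERNAME and KAGGLE_KEY are the variables the official kaggle package itself recognizes, so exporting them on the cluster may be enough. If 01-Bronze.py instead writes a kaggle.json, keep it owner-readable only (adapt to what the notebook actually does):

```python
import json
import os

# Read credentials from environment variables set on the cluster.
# Empty-string defaults keep this sketch runnable; real runs need both set.
kaggle_token = {
    "username": os.environ.get("KAGGLE_USERNAME", ""),
    "key": os.environ.get("KAGGLE_KEY", ""),
}

# The Kaggle client also reads ~/.kaggle/kaggle.json; if you write it,
# restrict permissions to the owner (the client warns on looser modes).
kaggle_dir = os.path.expanduser("~/.kaggle")
os.makedirs(kaggle_dir, exist_ok=True)
cred_path = os.path.join(kaggle_dir, "kaggle.json")
with open(cred_path, "w") as f:
    json.dump(kaggle_token, f)
os.chmod(cred_path, 0o600)
```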

3. Upload Notebooks to Databricks

  1. Upload all notebooks to your Databricks workspace

  2. Update the %run paths in each notebook to match your workspace structure:

    • Current path: /Workspace/Users/pablo.peralta@upb.edu.co/medallion-automated-nfl/00-Config
    • Update to: /Workspace/Users/your.email@domain.com/medallion-automated-nfl/00-Config
    • Or use relative paths: ./00-Config (if notebooks are in the same directory)
  3. Ensure all notebooks are in the same directory structure

4. Configure Databricks Workflow

  1. In Databricks, go to Workflows → Create Workflow

  2. Add tasks for each notebook in order:

    • Task 1: 00-Config (Configuration)
    • Task 2: 01-Bronze (Data Ingestion) - depends on Task 1
    • Task 3: 02-Silver (Data Cleaning) - depends on Task 2
    • Task 4: 03-Gold (Feature Engineering) - depends on Task 3
    • Task 5: 04.1-ML (Model Training) - depends on Task 4
  3. Configure cluster settings:

    • Cluster Mode: Standard or High Concurrency
    • Databricks Runtime: 13.3 LTS or later
    • Node Type: Choose based on data volume (e.g., Standard_DS3_v2 or larger)
    • Autoscaling: Enabled (recommended)
  4. Set up schedule (optional):

    • Configure trigger for periodic runs if needed
    • Or run manually for one-time execution

5. Verify Setup

  1. Run 00-Config.py interactively to verify Azure ADLS connection
  2. Check that paths are correctly configured
  3. Verify that the container exists in your Azure Storage Account

📁 Project Structure

medallion-automated-nfl/
├── 00-Config.py              # Configuration (create from 00-Config-example.py)
├── 00-Config-example.py      # Template configuration file
├── 01-Bronze.py              # Bronze layer: Data ingestion from Kaggle
├── 02-Silver-EDA.py          # Silver layer: Exploratory data analysis
├── 02-Silver.py              # Silver layer: Data cleaning and transformation
├── 03-Gold.py                # Gold layer: Feature engineering and aggregation
├── 04.1-ML.py                # ML layer: Model training and prediction
├── image.png                 # Pipeline workflow diagram
├── .gitignore                # Git ignore rules (excludes secrets)
└── README.md                 # This file

🔧 Technologies Used

  • PySpark: Distributed data processing and transformations
  • Delta Lake: ACID transactions, time travel, and schema evolution
  • SparkML: Machine learning library for distributed ML
  • Azure ADLS Gen2: Scalable data lake storage with hierarchical namespace
  • Databricks Workflows: Pipeline orchestration and scheduling
  • Kaggle API: Automated dataset download
  • SQL: Data queries and aggregations

📊 Data Flow

Bronze Layer

  • Downloads NFL Big Data Bowl 2026 dataset from Kaggle
  • Ingests raw CSV files into Delta format
  • Adds audit columns: ingestion_timestamp, source_file, row_hash
  • Partitions by season and week
  • Stores in: abfss://.../bronze/

Silver Layer

  • Cleans and validates data quality
  • Removes duplicates using row hashes
  • Handles missing values
  • Performs exploratory data analysis
  • Stores cleaned data in: abfss://.../silver/

Gold Layer

  • Creates engineered features:
    • Physical projections (velocity decomposition, kinematics)
    • Distance and proximity features
    • Aggregated statistics per play
  • Prepares ML-ready datasets
  • Stores in: abfss://.../gold/

ML Layer

  • Trains multiple models (GBT, Random Forest, Decision Tree)
  • Uses cross-validation for hyperparameter tuning
  • Tracks experiments with MLflow
  • Generates predictions for test set

🔐 Security Best Practices

  1. Never commit secrets: The 00-Config.py file is excluded via .gitignore
  2. Use Databricks Secrets: Store sensitive credentials in Databricks secret scopes
  3. Rotate credentials: If credentials are exposed, immediately rotate them
  4. SAS Token permissions: Use least-privilege SAS tokens with appropriate expiration dates
  5. Review access: Regularly audit who has access to your Databricks workspace and Azure resources

📝 Notes

  • The pipeline is designed to run in Databricks Workflows but can also run interactively
  • All data is stored in Delta format for better performance, reliability, and time travel capabilities
  • The medallion architecture ensures data quality, traceability, and reproducibility
  • Notebooks use %run magic command to share configuration across the pipeline
  • Task values are used to pass paths between workflow tasks

๐Ÿ› Troubleshooting

Common Issues

  1. Azure ADLS Connection Failed

    • Verify SAS token is valid and has correct permissions
    • Check that storage account name is correct
    • Ensure container exists
  2. Kaggle API Authentication Error

    • Verify credentials are correct
    • Check that Kaggle API key has proper permissions
    • Ensure /root/.kaggle/kaggle.json has correct permissions (600)
  3. Path Not Found Errors

    • Verify workspace paths in %run commands
    • Ensure all notebooks are uploaded to Databricks
    • Check that paths match your workspace structure
  4. Cluster Configuration Issues

    • Ensure cluster has sufficient memory for data volume
    • Check that Databricks Runtime version supports required libraries
    • Verify autoscaling is configured appropriately

📄 License

This project is for educational purposes as part of the NFL Big Data Bowl 2026 competition.

👤 Author

Pablo Peralta - NFL Big Data Bowl 2026 Competition Entry


⚠️ Important: Before pushing to GitHub, ensure that:

  • 00-Config.py is in .gitignore (already configured)
  • All secrets are removed from committed files
  • You've rotated any exposed credentials
  • The 00-Config-example.py template is available for others to use