NFL Movement Prediction - Medallion Architecture Pipeline

A Databricks-based medallion architecture pipeline for the NFL Big Data Bowl 2026 prediction competition using PySpark, SQL, SparkML, and Azure ADLS Gen2. This project implements a complete data engineering and machine learning pipeline orchestrated through Databricks Workflows.

๐Ÿ—๏ธ Architecture

The pipeline follows the Medallion Architecture pattern, organizing data into three quality layers plus an ML layer:

  • 🥉 Bronze Layer (01-Bronze.py): Raw data ingestion from Kaggle, preserving original data with audit columns
  • 🥈 Silver Layer (02-Silver.py, 02-Silver-EDA.py): Data cleaning, quality checks, and exploratory data analysis
  • 🥇 Gold Layer (03-Gold.py): Feature engineering, aggregations, and ML-ready datasets
  • 🤖 ML Layer (04.1-ML.py): Model training, evaluation, and prediction using SparkML

Pipeline Flow

Kaggle Dataset → Bronze (Raw) → Silver (Cleaned) → Gold (Features) → ML (Models)

📋 Prerequisites

  • Databricks Workspace (with appropriate cluster configuration)
  • Azure Data Lake Storage Gen2 account with container created
  • Kaggle API credentials (username and API key)
  • Python 3.x with PySpark support
  • Databricks Runtime (recommended: 13.3 LTS or later)

🚀 Setup Instructions

1. Configure Azure ADLS

  1. Copy 00-Config-example.py to 00-Config.py:

    cp 00-Config-example.py 00-Config.py
  2. Edit 00-Config.py and fill in your Azure Storage Account details:

    • BLOB_ACCOUNT_NAME: Your Azure Storage Account name
    • BLOB_CONTAINER_NAME: Container name (default: nfl-container)
    • BLOB_SAS_TOKEN: Your SAS token with read/write permissions

    Note: The 00-Config.py file is excluded from Git via .gitignore to protect your credentials.
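For reference, here is a minimal sketch of what 00-Config.py might look like, assuming SAS-token authentication via the standard ABFS Hadoop options. The variable names follow this README; BASE_PATH and the exact contents of the real template may differ, and spark is the SparkSession a Databricks notebook provides.

```python
# Sketch of 00-Config.py -- fill in your own values; never commit them.
BLOB_ACCOUNT_NAME = "<your-storage-account>"
BLOB_CONTAINER_NAME = "nfl-container"
BLOB_SAS_TOKEN = "<your-sas-token>"  # least-privilege, short expiration

# Standard ABFS options for fixed-SAS-token access to abfss:// paths.
spark.conf.set(
    f"fs.azure.account.auth.type.{BLOB_ACCOUNT_NAME}.dfs.core.windows.net", "SAS")
spark.conf.set(
    f"fs.azure.sas.token.provider.type.{BLOB_ACCOUNT_NAME}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set(
    f"fs.azure.sas.fixed.token.{BLOB_ACCOUNT_NAME}.dfs.core.windows.net",
    BLOB_SAS_TOKEN)

# Base path shared by the Bronze/Silver/Gold notebooks (illustrative name).
BASE_PATH = f"abfss://{BLOB_CONTAINER_NAME}@{BLOB_ACCOUNT_NAME}.dfs.core.windows.net"
```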

2. Configure Kaggle API

Option A: Databricks Secrets (Recommended)

  1. In Databricks, navigate to Workspace → Users → Your User → Secrets (or use the Databricks CLI)
  2. Create a secret scope named kaggle (legacy CLI syntax shown; newer CLI versions use databricks secrets create-scope kaggle and databricks secrets put-secret):
    databricks secrets create-scope --scope kaggle
  3. Add two secrets:
    databricks secrets put --scope kaggle --key username
    databricks secrets put --scope kaggle --key api_key
  4. Update 01-Bronze.py to use secrets:
    kaggle_token = {
        "username": dbutils.secrets.get(scope="kaggle", key="username"),
        "key": dbutils.secrets.get(scope="kaggle", key="api_key")
    }

Option B: Environment Variables

Alternatively, you can modify 01-Bronze.py to read from environment variables or a secure config file.
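As a sketch of the environment-variable option: KAGGLE_USERNAME and KAGGLE_KEY are the variables the official kaggle package itself recognizes, so exporting them on the cluster may be enough. If 01-Bronze.py instead writes a kaggle.json, keep it owner-readable only (adapt to what the notebook actually does):

```python
import json
import os

# Read credentials from environment variables set on the cluster.
# Empty-string defaults keep this sketch runnable; real runs need both set.
kaggle_token = {
    "username": os.environ.get("KAGGLE_USERNAME", ""),
    "key": os.environ.get("KAGGLE_KEY", ""),
}

# The Kaggle client also reads ~/.kaggle/kaggle.json; if you write it,
# restrict permissions to the owner (the client warns on looser modes).
kaggle_dir = os.path.expanduser("~/.kaggle")
os.makedirs(kaggle_dir, exist_ok=True)
cred_path = os.path.join(kaggle_dir, "kaggle.json")
with open(cred_path, "w") as f:
    json.dump(kaggle_token, f)
os.chmod(cred_path, 0o600)
```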

3. Upload Notebooks to Databricks

  1. Upload all notebooks to your Databricks workspace

  2. Update the %run paths in each notebook to match your workspace structure:

    • Current path: /Workspace/Users/pablo.peralta@upb.edu.co/medallion-automated-nfl/00-Config
    • Update to: /Workspace/Users/your.email@domain.com/medallion-automated-nfl/00-Config
    • Or use relative paths: ./00-Config (if notebooks are in the same directory)
  3. Ensure all notebooks are in the same directory structure

4. Configure Databricks Workflow

  1. In Databricks, go to Workflows → Create Workflow

  2. Add tasks for each notebook in order:

    • Task 1: 00-Config (Configuration)
    • Task 2: 01-Bronze (Data Ingestion) - depends on Task 1
    • Task 3: 02-Silver (Data Cleaning) - depends on Task 2
    • Task 4: 03-Gold (Feature Engineering) - depends on Task 3
    • Task 5: 04.1-ML (Model Training) - depends on Task 4
  3. Configure cluster settings:

    • Cluster Mode: Standard or High Concurrency
    • Databricks Runtime: 13.3 LTS or later
    • Node Type: Choose based on data volume (e.g., Standard_DS3_v2 or larger)
    • Autoscaling: Enabled (recommended)
  4. Set up schedule (optional):

    • Configure trigger for periodic runs if needed
    • Or run manually for one-time execution

5. Verify Setup

  1. Run 00-Config.py interactively to verify Azure ADLS connection
  2. Check that paths are correctly configured
  3. Verify that the container exists in your Azure Storage Account

📁 Project Structure

medallion-automated-nfl/
├── 00-Config.py              # Configuration (create from 00-Config-example.py)
├── 00-Config-example.py      # Template configuration file
├── 01-Bronze.py              # Bronze layer: Data ingestion from Kaggle
├── 02-Silver-EDA.py          # Silver layer: Exploratory data analysis
├── 02-Silver.py              # Silver layer: Data cleaning and transformation
├── 03-Gold.py                # Gold layer: Feature engineering and aggregation
├── 04.1-ML.py                # ML layer: Model training and prediction
├── image.png                 # Pipeline workflow diagram
├── .gitignore                # Git ignore rules (excludes secrets)
└── README.md                 # This file

🔧 Technologies Used

  • PySpark: Distributed data processing and transformations
  • Delta Lake: ACID transactions, time travel, and schema evolution
  • SparkML: Machine learning library for distributed ML
  • Azure ADLS Gen2: Scalable data lake storage with hierarchical namespace
  • Databricks Workflows: Pipeline orchestration and scheduling
  • Kaggle API: Automated dataset download
  • SQL: Data queries and aggregations

📊 Data Flow

Bronze Layer

  • Downloads NFL Big Data Bowl 2026 dataset from Kaggle
  • Ingests raw CSV files into Delta format
  • Adds audit columns: ingestion_timestamp, source_file, row_hash
  • Partitions by season and week
  • Stores in: abfss://.../bronze/

Silver Layer

  • Cleans and validates data quality
  • Removes duplicates using row hashes
  • Handles missing values
  • Performs exploratory data analysis
  • Stores cleaned data in: abfss://.../silver/

Gold Layer

  • Creates engineered features:
    • Physical projections (velocity decomposition, kinematics)
    • Distance and proximity features
    • Aggregated statistics per play
  • Prepares ML-ready datasets
  • Stores in: abfss://.../gold/

ML Layer

  • Trains multiple models (GBT, Random Forest, Decision Tree)
  • Uses cross-validation for hyperparameter tuning
  • Tracks experiments with MLflow
  • Generates predictions for test set

🔐 Security Best Practices

  1. Never commit secrets: The 00-Config.py file is excluded via .gitignore
  2. Use Databricks Secrets: Store sensitive credentials in Databricks secret scopes
  3. Rotate credentials: If credentials are exposed, immediately rotate them
  4. SAS Token permissions: Use least-privilege SAS tokens with appropriate expiration dates
  5. Review access: Regularly audit who has access to your Databricks workspace and Azure resources

📝 Notes

  • The pipeline is designed to run in Databricks Workflows but can also run interactively
  • All data is stored in Delta format for better performance, reliability, and time travel capabilities
  • The medallion architecture ensures data quality, traceability, and reproducibility
  • Notebooks use %run magic command to share configuration across the pipeline
  • Task values are used to pass paths between workflow tasks

๐Ÿ› Troubleshooting

Common Issues

  1. Azure ADLS Connection Failed

    • Verify SAS token is valid and has correct permissions
    • Check that storage account name is correct
    • Ensure container exists
  2. Kaggle API Authentication Error

    • Verify credentials are correct
    • Check that Kaggle API key has proper permissions
    • Ensure /root/.kaggle/kaggle.json has correct permissions (600)
  3. Path Not Found Errors

    • Verify workspace paths in %run commands
    • Ensure all notebooks are uploaded to Databricks
    • Check that paths match your workspace structure
  4. Cluster Configuration Issues

    • Ensure cluster has sufficient memory for data volume
    • Check that Databricks Runtime version supports required libraries
    • Verify autoscaling is configured appropriately

📄 License

This project is for educational purposes as part of the NFL Big Data Bowl 2026 competition.

👤 Author

Pablo Peralta - NFL Big Data Bowl 2026 Competition Entry


⚠️ Important: Before pushing to GitHub, ensure that:

  • 00-Config.py is in .gitignore (already configured)
  • All secrets are removed from committed files
  • You've rotated any exposed credentials
  • The 00-Config-example.py template is available for others to use