# NFL Movement Prediction - Medallion Architecture Pipeline
A Databricks-based medallion architecture pipeline for the NFL Big Data Bowl 2026 prediction competition using PySpark, SQL, SparkML, and Azure ADLS Gen2. This project implements a complete data engineering and machine learning pipeline orchestrated through Databricks Workflows.
## Architecture
The pipeline follows the Medallion Architecture pattern, organizing data into three quality layers:
- **Bronze Layer** (`01-Bronze.py`): Raw data ingestion from Kaggle, preserving original data with audit columns
- **Silver Layer** (`02-Silver.py`, `02-Silver-EDA.py`): Data cleaning, quality checks, and exploratory data analysis
- **Gold Layer** (`03-Gold.py`): Feature engineering, aggregations, and ML-ready datasets
- **ML Layer** (`04.1-ML.py`): Model training, evaluation, and prediction using SparkML
### Pipeline Flow

Kaggle Dataset → Bronze (Raw) → Silver (Cleaned) → Gold (Features) → ML (Models)
## Prerequisites
- Databricks Workspace (with appropriate cluster configuration)
- Azure Data Lake Storage Gen2 account with container created
- Kaggle API credentials (username and API key)
- Python 3.x with PySpark support
- Databricks Runtime (recommended: 13.3 LTS or later)
## Setup Instructions

### 1. Configure Azure ADLS
1. Copy `00-Config-example.py` to `00-Config.py`:

   ```bash
   cp 00-Config-example.py 00-Config.py
   ```

2. Edit `00-Config.py` and fill in your Azure Storage Account details:
   - `BLOB_ACCOUNT_NAME`: Your Azure Storage Account name
   - `BLOB_CONTAINER_NAME`: Container name (default: `nfl-container`)
   - `BLOB_SAS_TOKEN`: Your SAS token with read/write permissions

> **Note:** The `00-Config.py` file is excluded from Git via `.gitignore` to protect your credentials.
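For orientation, `00-Config.py` typically wires these values into the Spark session. The following is a minimal sketch, assuming fixed-SAS-token authentication against ADLS Gen2 via `abfss://` (variable names match the config keys above; the account name and token are placeholders, and `spark` is the ambient Databricks session):

```python
# 00-Config.py (sketch) -- values below are placeholders, not real credentials
BLOB_ACCOUNT_NAME = "mystorageaccount"   # your Azure Storage Account name
BLOB_CONTAINER_NAME = "nfl-container"    # container holding the lake
BLOB_SAS_TOKEN = "<sas-token>"           # least-privilege, short-lived SAS

# Configure Spark to authenticate to ADLS Gen2 with a fixed SAS token
spark.conf.set(
    f"fs.azure.account.auth.type.{BLOB_ACCOUNT_NAME}.dfs.core.windows.net", "SAS"
)
spark.conf.set(
    f"fs.azure.sas.token.provider.type.{BLOB_ACCOUNT_NAME}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
)
spark.conf.set(
    f"fs.azure.sas.fixed.token.{BLOB_ACCOUNT_NAME}.dfs.core.windows.net",
    BLOB_SAS_TOKEN,
)

# Base path shared by the Bronze/Silver/Gold notebooks
BASE_PATH = f"abfss://{BLOB_CONTAINER_NAME}@{BLOB_ACCOUNT_NAME}.dfs.core.windows.net"
```

The exact configuration keys depend on your Databricks Runtime; consult the Databricks ADLS Gen2 access documentation if authentication fails.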
### 2. Configure Kaggle API

#### Option A: Using Databricks Secrets (Recommended)
1. In Databricks, navigate to Workspace → Users → Your User → Secrets (or use the Databricks CLI)
2. Create a secret scope named `kaggle`:

   ```bash
   databricks secrets create-scope --scope kaggle
   ```

3. Add two secrets:

   ```bash
   databricks secrets put --scope kaggle --key username
   databricks secrets put --scope kaggle --key api_key
   ```

4. Update `01-Bronze.py` to use secrets:

   ```python
   kaggle_token = {
       "username": dbutils.secrets.get(scope="kaggle", key="username"),
       "key": dbutils.secrets.get(scope="kaggle", key="api_key"),
   }
   ```
#### Option B: Environment Variables

Alternatively, you can modify `01-Bronze.py` to read from environment variables or a secure config file.
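A minimal sketch of that approach, assuming the standard `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables (the helper name and destination directory are illustrative):

```python
import json
import os
import stat
import tempfile
from pathlib import Path

def write_kaggle_credentials(dest_dir: str) -> Path:
    """Write kaggle.json from environment variables with 600 permissions."""
    creds = {
        "username": os.environ.get("KAGGLE_USERNAME", ""),
        "key": os.environ.get("KAGGLE_KEY", ""),
    }
    dest = Path(dest_dir) / "kaggle.json"
    dest.write_text(json.dumps(creds))
    dest.chmod(stat.S_IRUSR | stat.S_IWUSR)  # 600: the Kaggle CLI rejects looser modes
    return dest

# In the notebook you would target ~/.kaggle; a temp dir keeps this sketch self-contained
cred_file = write_kaggle_credentials(tempfile.mkdtemp())
```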
### 3. Upload Notebooks to Databricks
1. Upload all notebooks to your Databricks workspace
2. Update the `%run` paths in each notebook to match your workspace structure:
   - Current path: `/Workspace/Users/pablo.peralta@upb.edu.co/medallion-automated-nfl/00-Config`
   - Update to: `/Workspace/Users/your.email@domain.com/medallion-automated-nfl/00-Config`
   - Or use relative paths: `./00-Config` (if notebooks are in the same directory)
3. Ensure all notebooks are in the same directory structure
### 4. Configure Databricks Workflow
1. In Databricks, go to Workflows → Create Workflow
2. Add tasks for each notebook in order:
   - Task 1: `00-Config` (Configuration)
   - Task 2: `01-Bronze` (Data Ingestion) - depends on Task 1
   - Task 3: `02-Silver` (Data Cleaning) - depends on Task 2
   - Task 4: `03-Gold` (Feature Engineering) - depends on Task 3
   - Task 5: `04.1-ML` (Model Training) - depends on Task 4
3. Configure cluster settings:
   - Cluster Mode: Standard or High Concurrency
   - Databricks Runtime: 13.3 LTS or later
   - Node Type: Choose based on data volume (e.g., `Standard_DS3_v2` or larger)
   - Autoscaling: Enabled (recommended)
4. Set up a schedule (optional):
   - Configure a trigger for periodic runs if needed
   - Or run manually for one-time execution
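The same linear task chain can also be defined programmatically, for example for the Databricks Jobs API. A hedged sketch (field names follow the Jobs API 2.1 task schema; the job name and notebook paths are illustrative, so adjust them to your workspace):

```python
# Sketch of a Databricks job spec with a linear task chain.
NOTEBOOKS = ["00-Config", "01-Bronze", "02-Silver", "03-Gold", "04.1-ML"]
BASE = "/Workspace/Users/your.email@domain.com/medallion-automated-nfl"

def build_job_spec() -> dict:
    tasks = []
    for i, nb in enumerate(NOTEBOOKS):
        task = {
            "task_key": nb.replace(".", "_"),  # task keys must be simple identifiers
            "notebook_task": {"notebook_path": f"{BASE}/{nb}"},
        }
        if i > 0:  # every task after 00-Config waits on the previous layer
            task["depends_on"] = [{"task_key": NOTEBOOKS[i - 1].replace(".", "_")}]
        tasks.append(task)
    return {"name": "medallion-automated-nfl", "tasks": tasks}

spec = build_job_spec()
```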
### 5. Verify Setup
- Run `00-Config.py` interactively to verify the Azure ADLS connection
- Check that paths are correctly configured
- Verify that the container exists in your Azure Storage Account
## Project Structure

```
medallion-automated-nfl/
├── 00-Config.py            # Configuration (create from 00-Config-example.py)
├── 00-Config-example.py    # Template configuration file
├── 01-Bronze.py            # Bronze layer: Data ingestion from Kaggle
├── 02-Silver-EDA.py        # Silver layer: Exploratory data analysis
├── 02-Silver.py            # Silver layer: Data cleaning and transformation
├── 03-Gold.py              # Gold layer: Feature engineering and aggregation
├── 04.1-ML.py              # ML layer: Model training and prediction
├── image.png               # Pipeline workflow diagram
├── .gitignore              # Git ignore rules (excludes secrets)
└── README.md               # This file
```
## Technologies Used
- **PySpark**: Distributed data processing and transformations
- **Delta Lake**: ACID transactions, time travel, and schema evolution
- **SparkML**: Machine learning library for distributed ML
- **Azure ADLS Gen2**: Scalable data lake storage with hierarchical namespace
- **Databricks Workflows**: Pipeline orchestration and scheduling
- **Kaggle API**: Automated dataset download
- **SQL**: Data queries and aggregations
## Data Flow

### Bronze Layer
- Downloads the NFL Big Data Bowl 2026 dataset from Kaggle
- Ingests raw CSV files into Delta format
- Adds audit columns: `ingestion_timestamp`, `source_file`, `row_hash`
- Partitions by `season` and `week`
- Stores in: `abfss://.../bronze/`
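The `row_hash` audit column can be thought of as a SHA-256 fingerprint over the concatenated source fields. A plain-Python sketch of the idea (in the notebook itself this would be a PySpark expression along the lines of `F.sha2(F.concat_ws("||", *df.columns), 256)`; the separator and column ordering here are assumptions):

```python
import hashlib

def row_hash(row: dict, sep: str = "||") -> str:
    """Deterministic SHA-256 fingerprint of a row, used for dedup in Silver."""
    payload = sep.join(str(row[k]) for k in sorted(row))  # fixed column order
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

a = {"game_id": 1, "play_id": 10, "x": 23.4}
b = {"play_id": 10, "x": 23.4, "game_id": 1}  # same row, different key order
```

Sorting the keys makes the hash independent of column order, so re-ingested rows produce identical fingerprints.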
### Silver Layer
- Cleans and validates data quality
- Removes duplicates using row hashes
- Handles missing values
- Performs exploratory data analysis
- Stores cleaned data in: `abfss://.../silver/`
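Hash-based deduplication keeps the first record seen per `row_hash`. A plain-Python sketch of the semantics (in PySpark this is typically a single `df.dropDuplicates(["row_hash"])` call):

```python
def dedup_by_hash(rows: list[dict]) -> list[dict]:
    """Keep the first occurrence of each row_hash, preserving input order."""
    seen: set = set()
    out = []
    for row in rows:
        if row["row_hash"] not in seen:
            seen.add(row["row_hash"])
            out.append(row)
    return out

rows = [
    {"row_hash": "aaa", "x": 1.0},
    {"row_hash": "bbb", "x": 2.0},
    {"row_hash": "aaa", "x": 1.0},  # same record ingested twice
]
clean = dedup_by_hash(rows)
```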
### Gold Layer
- Creates engineered features:
  - Physical projections (velocity decomposition, kinematics)
  - Distance and proximity features
  - Aggregated statistics per play
- Prepares ML-ready datasets
- Stores in: `abfss://.../gold/`
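The velocity decomposition can be sketched from the tracking fields: speed `s` (yards/s) and direction `dir` (assumed here to be degrees clockwise from the +y axis, the usual Big Data Bowl tracking convention; verify against the competition's data dictionary):

```python
import math

def velocity_components(s: float, direction_deg: float) -> tuple:
    """Decompose speed into field-coordinate components.

    Assumes the NFL tracking convention: `dir` is measured in degrees
    clockwise from the +y axis, so vx uses sin and vy uses cos.
    """
    theta = math.radians(direction_deg)
    return s * math.sin(theta), s * math.cos(theta)

# In PySpark this maps to column expressions such as
#   F.col("s") * F.sin(F.radians(F.col("dir")))
vx, vy = velocity_components(5.0, 90.0)  # dir=90 means moving straight toward +x
```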
### ML Layer
- Trains multiple models (GBT, Random Forest, Decision Tree)
- Uses cross-validation for hyperparameter tuning
- Tracks experiments with MLflow
- Generates predictions for the test set
## Security Best Practices
- **Never commit secrets**: The `00-Config.py` file is excluded via `.gitignore`
- **Use Databricks Secrets**: Store sensitive credentials in Databricks secret scopes
- **Rotate credentials**: If credentials are exposed, rotate them immediately
- **SAS token permissions**: Use least-privilege SAS tokens with appropriate expiration dates
- **Review access**: Regularly audit who has access to your Databricks workspace and Azure resources
## Notes
- The pipeline is designed to run in Databricks Workflows but can also run interactively
- All data is stored in Delta format for better performance, reliability, and time travel capabilities
- The medallion architecture ensures data quality, traceability, and reproducibility
- Notebooks use the `%run` magic command to share configuration across the pipeline
- Task values are used to pass paths between workflow tasks
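Passing paths between tasks uses the Databricks `dbutils.jobs.taskValues` API. A sketch of how this pipeline might use it (the key names, task key, and path are illustrative assumptions; this only runs inside Databricks):

```python
# In 01-Bronze (upstream task): publish the Bronze output path
dbutils.jobs.taskValues.set(key="bronze_path", value="abfss://<container>@<account>.dfs.core.windows.net/bronze/")

# In 02-Silver (downstream task): read it back, with a fallback for interactive runs
bronze_path = dbutils.jobs.taskValues.get(
    taskKey="01-Bronze",
    key="bronze_path",
    default="abfss://<container>@<account>.dfs.core.windows.net/bronze/",
    debugValue="/tmp/bronze/",  # used when running the notebook outside a job
)
```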
## Troubleshooting

### Common Issues
- **Azure ADLS Connection Failed**
  - Verify the SAS token is valid and has the correct permissions
  - Check that the storage account name is correct
  - Ensure the container exists
- **Kaggle API Authentication Error**
  - Verify credentials are correct
  - Check that the Kaggle API key has proper permissions
  - Ensure `/root/.kaggle/kaggle.json` has correct permissions (600)
- **Path Not Found Errors**
  - Verify workspace paths in `%run` commands
  - Ensure all notebooks are uploaded to Databricks
  - Check that paths match your workspace structure
- **Cluster Configuration Issues**
  - Ensure the cluster has sufficient memory for the data volume
  - Check that the Databricks Runtime version supports required libraries
  - Verify autoscaling is configured appropriately
## Additional Resources
- Databricks Medallion Architecture
- Delta Lake Documentation
- SparkML Documentation
- Azure ADLS Gen2 Documentation
- Kaggle API Documentation
## License
This project is for educational purposes as part of the NFL Big Data Bowl 2026 competition.
## Author
Pablo Peralta - NFL Big Data Bowl 2026 Competition Entry
### Security Checklist
Before publishing, verify that:
- `00-Config.py` is in `.gitignore` (already configured)
- All secrets are removed from committed files
- You've rotated any exposed credentials
- The `00-Config-example.py` template is available for others to use