# Bike Demand Prediction for Smart Cities

Level 2 MLOps production-grade bike demand forecasting system with automated data pipelines, experiment tracking, model registry, and comprehensive monitoring.
## Overview

A production-ready machine learning system that predicts bike rental demand using real-world public APIs and historical data. Built with a Docker-first architecture for seamless deployment, the system processes NYC Citi Bike data, enriches it with weather signals, engineers 22 time-series features, and trains ensemble models (XGBoost + LightGBM), achieving an RMSE of 0.51 bikes.
## Key Features
- ✅ 100% Docker-Based: All services containerized for production deployment
- 📊 Real Data: 50K+ historical bike station records + weather data
- 🤖 Automated ML Pipeline: Feature engineering → Training → Model registry
- 📈 Experiment Tracking: MLflow with model versioning and metrics
- 🔄 Airflow Orchestration: 4 DAGs for data/training automation
- 🎯 Production Models: XGBoost (RMSE 0.56) & LightGBM (RMSE 0.51, R² 0.511)
- 📊 Interactive Dashboard: Streamlit UI for forecasts and monitoring
- 🔍 Comprehensive Monitoring: System health, model performance, data quality
- 🚀 CI/CD Ready: GitHub Actions for automated testing and deployment
## Dashboard Screenshots
## 📚 Complete Tutorial
Want to build this system from scratch? Check out our comprehensive tutorial:
👉 Complete Step-by-Step Tutorial 👈
Perfect for:
- 🎓 Students & Beginners - Learn MLOps end-to-end (20-30 hours)
- 💼 Interview Prep - Understand architecture & design decisions (5 hours)
- 🚀 ML Engineers - Build production ML systems (15 hours)
The tutorial covers:
- Setting up the environment
- Database design & Docker orchestration
- Data collection & feature engineering
- Model training with MLflow
- API development with FastAPI
- Dashboard with Streamlit
- Airflow orchestration
- CI/CD with GitHub Actions
- Design decisions & troubleshooting
## Quick Start (15 Minutes)

### Prerequisites
- Docker Desktop (Download) - Required
- Git
- 8GB+ RAM (for Docker containers)
- OpenWeatherMap API Key (optional, for live weather data): Free signup
### Step 1: Clone Repository

```bash
git clone https://github.com/yourusername/Bike-Demand-Prediction-for-Smart-Cities.git
cd Bike-Demand-Prediction-for-Smart-Cities
```

### Step 2: Start All Services with Docker Compose

This single command starts everything:

```bash
cd infrastructure
docker-compose up -d
```

This starts:
- ✅ PostgreSQL (port 5432) - Database for all data
- ✅ MLflow (port 5000) - Experiment tracking & model registry
- ✅ Airflow (port 8080) - Workflow orchestration
- ✅ FastAPI (port 8000) - Prediction API
- ✅ Streamlit Dashboard (port 8501) - Interactive UI
Wait 2-3 minutes for all services to become healthy.
### Step 3: Verify Services

```bash
# Check all containers are running
docker ps

# You should see all services as "healthy" or "Up"
docker ps --format "table {{.Names}}\t{{.Status}}"
```

### Step 4: Access Web Interfaces
Open in your browser:
- 🎯 Dashboard: http://localhost:8501 (Main UI - Start here!)
- 📊 MLflow: http://localhost:5000 (Experiment tracking)
- 🔄 Airflow: http://localhost:8080 (Username: admin, Password: admin)
- 🚀 API Docs: http://localhost:8000/docs (FastAPI Swagger UI)
### Step 5: Load Sample Data & Train Model
The system includes pre-loaded sample data and a trained model. To verify everything works:
```bash
# 1. Check database has data
docker exec bike_demand_postgres psql -U postgres -d bike_demand_db -c "
SELECT
  (SELECT COUNT(*) FROM bike_stations) as stations,
  (SELECT COUNT(*) FROM bike_station_status) as status_records,
  (SELECT COUNT(*) FROM weather_data) as weather_records,
  (SELECT COUNT(*) FROM features) as feature_records;
"

# Expected output:
# stations | status_records | weather_records | feature_records
# ---------|----------------|-----------------|----------------
#   2000   |     50000+     |      1000+      |     10000+
```

### Step 6: Make Your First Prediction!
- Open Dashboard: http://localhost:8501
- You should see: ✅ Dashboard Version: 2025-12-27-v2 | API URL: http://api:8000 | Status: 🟢 Connected
- Click "🔮 Demand Forecast" in sidebar
- Select a station from dropdown (e.g., "1 Ave & E 110 St (25 bikes)")
- Choose forecast horizon (7 hours recommended)
- Click "🔮 Generate Forecast"
- View the interactive chart with predictions!
That's it! You now have a fully functional bike demand prediction system! 🎉
## Dashboard Features

### Main Pages

#### 1. 🔮 Demand Forecast
- Select Station: Choose from 100+ real NYC bike stations
- Forecast Horizon: 1-168 hours ahead
- Interactive Charts: View predicted demand with confidence intervals
- Weather Integration: Predictions use latest weather data
How to Use:
1. Go to http://localhost:8501
2. Click "🔮 Demand Forecast" in sidebar
3. Select station: "Central Park S & 6 Ave (capacity: 59 bikes)"
4. Choose hours: 24 hours
5. Click "Generate Forecast"
6. See prediction chart!
#### 2. 📊 Model Performance
- Current Production Model: View active model metrics
- Model Comparison: Compare all trained models
- Performance Charts: RMSE/MAE/R² trends over time
- Feature Importance: Top features driving predictions
#### 3. ✅ Data Quality
- Data Completeness: Missing values analysis
- Feature Distribution: Histograms and statistics
- Data Drift Detection: Alert when feature distributions change
#### 4. 💓 System Health
- Component Status: API, Model, Database, MLflow health
- Service Metrics: Response times, error rates
- System Gauges: Overall health score
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                 Docker Compose Infrastructure                   │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  PostgreSQL  │      │    MLflow    │      │   Airflow    │
│   Database   │      │    Server    │      │  Webserver   │
│              │      │              │      │              │
│ • Stations   │      │ • Experiments│      │ • 4 DAGs     │
│ • Status     │      │ • Models     │      │ • Automation │
│ • Weather    │      │ • Registry   │      │              │
│ • Features   │      └──────────────┘      └──────────────┘
└──────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────────────┐
│          Prediction Pipeline (FastAPI + Streamlit)           │
│                                                              │
│  1. User selects station in Streamlit                        │
│  2. Dashboard calls FastAPI /predict endpoint                │
│  3. API loads Production model from MLflow                   │
│  4. Generate 22 features (temporal + lag + weather)          │
│  5. Model predicts demand                                    │
│  6. Return prediction with confidence interval               │
│  7. Dashboard displays interactive forecast chart            │
└──────────────────────────────────────────────────────────────┘
```
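Steps 3–6 of the pipeline above can be sketched as a single function. The callables passed in stand in for the project's `src/serving` internals (`model_loader.py`, `predictor.py`); their signatures here are illustrative assumptions, not the project's real API.

```python
# Sketch of prediction-pipeline steps 3-6 above. The injected callables
# stand in for the real MLflow loader and feature generator; their
# signatures are assumptions for illustration only.
from typing import Callable, Mapping

def forecast_one(
    station_id: str,
    timestamp: str,
    load_model: Callable[[], Callable[[Mapping[str, float]], float]],
    generate_features: Callable[[str, str], Mapping[str, float]],
) -> dict:
    model = load_model()                                 # 3. Production model from MLflow
    features = generate_features(station_id, timestamp)  # 4. 22 features (temporal + lag + weather)
    demand = model(features)                             # 5. model predicts demand
    return {                                             # 6. prediction for the dashboard
        "station_id": station_id,
        "timestamp": timestamp,
        "predicted_demand": demand,
    }
```

Injecting the loader and feature generator like this also makes the flow trivial to unit-test with stubs.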
## Machine Learning Pipeline

### Features (22 Total)

Temporal Features (8):
`hour_of_day`, `day_of_week`, `day_of_month`, `month`, `is_weekend`, `is_business_hours`, `is_morning_rush`, `is_evening_rush`

Lag Features (6):
`bikes_lag_1h`, `bikes_lag_6h`, `bikes_lag_24h`, `docks_lag_1h`, `docks_lag_6h`, `docks_lag_24h`

Rolling Statistics (4):
`bikes_rolling_mean_3h`, `bikes_rolling_mean_6h`, `bikes_rolling_std_3h`, `bikes_rolling_std_6h`

Weather Features (4):
`temperature`, `humidity`, `wind_speed`, `precipitation`
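The temporal, lag, and rolling features above can be generated with a few lines of pandas. A minimal sketch, assuming an hourly per-station series; the function name and DataFrame layout are illustrative, not the project's actual `src/features` code:

```python
# Sketch of the temporal/lag/rolling feature engineering described above,
# for ONE station's hourly series. Column names follow this README's
# feature list; the DataFrame layout is an assumption.
import pandas as pd

def add_time_series_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: hourly rows with a DatetimeIndex and a 'bikes_available'
    column, sorted ascending by timestamp."""
    out = df.copy()
    # Temporal features derived from the timestamp index
    out["hour_of_day"] = out.index.hour
    out["day_of_week"] = out.index.dayofweek
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    # Lag features (1h/6h/24h), filled with 0 when history is sparse
    for lag in (1, 6, 24):
        out[f"bikes_lag_{lag}h"] = out["bikes_available"].shift(lag).fillna(0.0)
    # Rolling statistics over 3h/6h windows
    for win in (3, 6):
        roll = out["bikes_available"].rolling(win, min_periods=1)
        out[f"bikes_rolling_mean_{win}h"] = roll.mean()
        out[f"bikes_rolling_std_{win}h"] = roll.std().fillna(0.0)
    return out
```

The `fillna(0.0)` mirrors the graceful-degradation behaviour described in the troubleshooting section: with too little history, lag features default to 0 instead of failing.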
### Production Models
| Model | Test RMSE | Test R² | Test MAPE | Status |
|---|---|---|---|---|
| LightGBM v8 | 0.51 | 0.511 | 4.5% | ✅ Production |
| XGBoost v7 | 0.56 | 0.411 | 5.2% | Staged |
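Both models are evaluated on a chronological 70/15/15 split with no shuffling (see the training results below). A minimal sketch of that kind of time-series-safe split; the function is illustrative, not the pipeline's actual code:

```python
# Sketch of a chronological 70/15/15 train/val/test split (no shuffling),
# as used for time-series training in this README. Illustrative only;
# the real pipeline lives in src/training/train_pipeline.py.
import pandas as pd

def time_series_split(df: pd.DataFrame, train_frac=0.70, val_frac=0.15):
    """Split rows (already sorted by timestamp) into train/val/test.

    Shuffling would leak future information into training, so the oldest
    rows train the model and the most recent rows form the test set.
    """
    n = len(df)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = df.iloc[:n_train]
    val = df.iloc[n_train:n_train + n_val]
    test = df.iloc[n_train + n_val:]
    return train, val, test
```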
Training Results:

```
✓ Data loaded: 10,000 samples
✓ Train/Val/Test: 70%/15%/15% (no shuffling, time-series split)
✓ XGBoost trained: Test RMSE = 0.56, R² = 0.411
✓ LightGBM trained: Test RMSE = 0.51, R² = 0.511
✅ LightGBM v8 promoted to Production (best model)
```

### Train New Model

```bash
# Run training pipeline (inside Docker)
docker exec bike_demand_training python -m src.training.train_pipeline

# Or rebuild and run training container
docker-compose up training

# View results in MLflow
open http://localhost:5000
```

## Data Pipeline
### Database Schema

```sql
-- Station metadata (2000+ stations)
bike_stations (
    station_id VARCHAR PRIMARY KEY,
    name VARCHAR,
    latitude FLOAT,
    longitude FLOAT,
    capacity INTEGER
);

-- Historical status (50K+ records)
bike_station_status (
    id SERIAL PRIMARY KEY,
    station_id VARCHAR,
    timestamp TIMESTAMP,
    bikes_available INTEGER,
    docks_available INTEGER
);

-- Weather data (1K+ records)
weather_data (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP,
    temperature FLOAT,
    humidity FLOAT,
    wind_speed FLOAT,
    precipitation FLOAT
);

-- Engineered features (10K+ records)
features (
    id SERIAL PRIMARY KEY,
    station_id VARCHAR,
    timestamp TIMESTAMP,
    feature_json JSONB,
    feature_version VARCHAR
);
```

### Sample Data Stats
```bash
# Check data in database
docker exec bike_demand_postgres psql -U postgres -d bike_demand_db -c "
SELECT 'Stations' as table_name, COUNT(*) FROM bike_stations
UNION ALL
SELECT 'Status Records', COUNT(*) FROM bike_station_status
UNION ALL
SELECT 'Weather Records', COUNT(*) FROM weather_data
UNION ALL
SELECT 'Features', COUNT(*) FROM features;
"
```

## API Endpoints
The FastAPI server provides production prediction endpoints:
```bash
# Health check
curl http://localhost:8000/health

# Get current production model info
curl http://localhost:8000/monitoring/models/current

# Make single prediction
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "station_id": "66db237e-0aca-11e7-82f6-3863bb44ef7c",
    "timestamp": "2025-12-27T18:00:00"
  }'

# Generate 24-hour forecast
curl "http://localhost:8000/predict/station/66db237e-0aca-11e7-82f6-3863bb44ef7c/forecast?hours_ahead=24"

# Batch predictions
curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "predictions": [
      {"station_id": "station_1", "timestamp": "2025-12-27T18:00:00"},
      {"station_id": "station_2", "timestamp": "2025-12-27T18:00:00"}
    ]
  }'
```

API Documentation: http://localhost:8000/docs
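The same single-prediction call can be made from Python using only the standard library. A minimal client sketch, assuming the docker-compose stack is running locally; `build_predict_payload` and `predict` are illustrative helpers, not part of the project's codebase:

```python
# Minimal Python client sketch for the /predict endpoint shown above.
# Assumes the stack from this README is running on localhost:8000;
# the helper names are illustrative, not the project's actual code.
import json
from urllib import request as urlrequest

API_URL = "http://localhost:8000"

def build_predict_payload(station_id: str, timestamp: str) -> dict:
    """Body expected by POST /predict (see the curl examples above)."""
    return {"station_id": station_id, "timestamp": timestamp}

def predict(station_id: str, timestamp: str) -> dict:
    """POST a single prediction request and return the decoded JSON."""
    body = json.dumps(build_predict_payload(station_id, timestamp)).encode()
    req = urlrequest.Request(
        f"{API_URL}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlrequest.urlopen(req) as resp:
        return json.loads(resp.read())
```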
## Configuration

### Environment Variables
All services read from environment variables set in `docker-compose.yml`:

```yaml
# Database
DB_HOST: postgres
DB_PORT: 5432
DB_USER: postgres
DB_PASSWORD: postgres
DB_DATABASE: bike_demand_db

# MLflow
MLFLOW_TRACKING_URI: http://mlflow:5000

# API (for Dashboard)
API_URL: http://api:8000

# Weather (optional)
WEATHER_API_KEY: your_key_here
```

Important: The dashboard uses `API_URL=http://api:8000` (the Docker service name) when running in containers. This is configured via environment variables, NOT in `.streamlit/secrets.toml`.
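On the Python side, reading these variables with the same defaults as `docker-compose.yml` can be sketched as below. This `Settings` class is a hypothetical illustration; the project's real configuration lives under `src/config/`.

```python
# Illustrative settings loader mirroring the environment variables above,
# with the same defaults as docker-compose.yml. A sketch only; the
# project's real config code lives under src/config/.
import os
from dataclasses import dataclass, field

def _env(name: str, default: str) -> str:
    return os.getenv(name, default)

@dataclass
class Settings:
    db_host: str = field(default_factory=lambda: _env("DB_HOST", "postgres"))
    db_port: int = field(default_factory=lambda: int(_env("DB_PORT", "5432")))
    db_user: str = field(default_factory=lambda: _env("DB_USER", "postgres"))
    db_password: str = field(default_factory=lambda: _env("DB_PASSWORD", "postgres"))
    db_name: str = field(default_factory=lambda: _env("DB_DATABASE", "bike_demand_db"))
    mlflow_uri: str = field(default_factory=lambda: _env("MLFLOW_TRACKING_URI", "http://mlflow:5000"))
    api_url: str = field(default_factory=lambda: _env("API_URL", "http://api:8000"))

    @property
    def db_url(self) -> str:
        # SQLAlchemy-style connection URL built from the pieces above
        return (
            f"postgresql://{self.db_user}:{self.db_password}"
            f"@{self.db_host}:{self.db_port}/{self.db_name}"
        )
```

Reading everything from the environment is what lets the same image run unchanged inside Docker (service-name hosts) and locally (localhost overrides).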
### Dashboard Configuration

Located in `dashboard/.streamlit/config.toml`:
```toml
[server]
port = 8501
enableCORS = false
enableXsrfProtection = true
maxUploadSize = 200
runOnSave = true
fileWatcherType = "auto"

[browser]
gatherUsageStats = false
serverAddress = "0.0.0.0"

[theme]
primaryColor = "#1E88E5"
backgroundColor = "#FFFFFF"
secondaryBackgroundColor = "#F0F2F6"
textColor = "#262730"
font = "sans serif"
```

## CI/CD Pipeline
### GitHub Actions Workflows
The project includes 3 automated GitHub Actions workflows for continuous integration and deployment:
#### 1. CI - Docker Build & Validate

Triggers: Push/PR to `main` or `develop` branches

Jobs:

- **Code Quality Checks**
  - Black (code formatting)
  - Flake8 (linting)
  - isort (import sorting)
- **Build Docker Images**
  - Builds all 4 Docker images (API, Dashboard, Airflow, Training)
  - Uses GitHub Actions cache for faster builds
  - Validates images build successfully
- **Validate Project Structure**
  - Checks all required files exist
  - Verifies directory structure is correct
  - Ensures production scripts are present
- **Security Scan**
  - Safety (dependency vulnerability check)
  - Bandit (security linter for Python code)
#### 2. CD - Build and Deploy

Triggers: Push to `main` branch or version tags (`v*`)

Jobs:

- Builds production Docker images for:
  - FastAPI server
  - Streamlit dashboard
  - Airflow workers
  - Training pipeline
- Pushes images to GitHub Container Registry (ghcr.io)
- Tags with SHA and `latest`
- Deploys to staging/production environments
#### 3. Model Training

Triggers:

- Manual workflow dispatch only

Options:

- `use_test_data`: Use synthetic test data for CI validation (default: `true`)
  - `true`: Generates 30 days of synthetic data for 3 test stations
  - `false`: Uses production data (requires data to be loaded first)
Jobs:
- Builds training Docker container
- Optionally generates synthetic test data (3 stations, 720 hourly records)
- Runs model training pipeline
- Trains XGBoost and LightGBM models
- Evaluates performance on test set
- Promotes best model to Production in MLflow
- Archives previous production model
### Viewing Workflow Runs

```bash
# View all workflow runs
open https://github.com/shima-maleki/Bike-Demand-Prediction-for-Smart-Cities/actions

# Trigger model training manually with test data (for CI validation)
# Go to Actions tab → Model Training → Run workflow → use_test_data: true

# Trigger model training with production data (after data is loaded)
# Go to Actions tab → Model Training → Run workflow → use_test_data: false
```

### Local CI/CD Testing
Before pushing, test locally:
```bash
# Run linting
black --check src/ dashboard/ scripts/
flake8 src/ dashboard/ scripts/ --max-line-length=120
isort --check-only src/ dashboard/ scripts/

# Build all Docker images
docker build -t bike-demand-api:test -f docker/api/Dockerfile .
docker build -t bike-demand-dashboard:test -f docker/dashboard/Dockerfile .
docker build -t bike-demand-airflow:test -f docker/airflow/Dockerfile .
docker build -t bike-demand-training:test -f docker/training/Dockerfile .

# Run security checks
pip install safety bandit
safety check
bandit -r src/ -ll
```

## Troubleshooting
### Dashboard Shows "API Offline"

Solution:

```bash
# 1. Check API container is running
docker ps | grep bike_demand_api

# 2. Check API logs
docker logs bike_demand_api --tail 50

# 3. Verify API responds
docker exec bike_demand_dashboard curl http://api:8000/health

# 4. Restart dashboard (clears browser cache issues)
docker restart bike_demand_dashboard

# 5. Open dashboard in incognito window
# Navigate to: http://localhost:8501
```

Root Cause: The browser cached an old API URL. Hard refresh with Cmd+Shift+R (Mac) or Ctrl+Shift+F5 (Windows).
### Forecast Fails with "Missing Features"

Error: `Required features not available: ['bikes_lag_1h', ...]`

Solution:

```bash
# The system needs historical data for lag features
# If you only have 1 snapshot per station, lag features default to 0

# Check data availability
docker exec bike_demand_postgres psql -U postgres -d bike_demand_db -c "
SELECT station_id, COUNT(*) as records, MAX(timestamp) as latest
FROM bike_station_status
GROUP BY station_id
ORDER BY records DESC
LIMIT 5;
"

# If records < 24 per station, lag features will be sparse
# This is OK - the model will still predict using default values
```

The predictor automatically fills missing lag/rolling features with 0.0 if insufficient historical data exists.
### Forecast Shows TypeError: "object dtype"

Error: `pandas dtypes must be int, float or bool. Fields with bad pandas dtypes: temperature: object`

Solution: Already fixed in `src/serving/predictor.py:367-383`. The predictor now:

- Explicitly converts weather columns to float
- Uses default weather values if the database has no data
- Converts any remaining object dtypes to numeric

If you see this error, restart the API:

```bash
docker restart bike_demand_api
```

### "No stations available" in Dashboard
Solution:

```bash
# Verify stations exist in database
docker exec bike_demand_postgres psql -U postgres -d bike_demand_db -c "
SELECT COUNT(*) FROM bike_stations;
"

# If count is 0, you need to load station data
# Run data backfill scripts (see Production Scripts section)
```

### MLflow Shows "Connection Refused"
Solution:

```bash
# 1. Check MLflow container is running
docker ps | grep mlflow

# 2. Test MLflow endpoint
curl http://localhost:5000/api/2.0/mlflow/experiments/list

# 3. Check logs for errors
docker logs bike_demand_mlflow --tail 50

# 4. Restart MLflow
docker restart bike_demand_mlflow
```

### Airflow DAGs Not Showing
Solution:

```bash
# 1. Check Airflow scheduler is running
docker logs bike_demand_airflow_scheduler --tail 50

# 2. Check DAG files exist
ls -la airflow/dags/

# 3. Verify DAGs in Airflow CLI
docker exec bike_demand_airflow_webserver airflow dags list

# 4. Unpause DAGs in UI
# Go to http://localhost:8080 and toggle DAGs to "On"
```

### Database Connection Errors
Solution:

```bash
# 1. Check PostgreSQL is healthy
docker exec bike_demand_postgres pg_isready

# 2. Test connection from another container
docker exec bike_demand_api python -c "
from src.config.database import get_db_context
from sqlalchemy import text
with get_db_context() as db:
    result = db.execute(text('SELECT 1')).scalar()
    print(f'Database OK: {result}')
"

# 3. Check PostgreSQL logs
docker logs bike_demand_postgres --tail 100
```

### Slow Dashboard Performance
Tips:

- Use shorter forecast horizons (7 hours instead of 168)
- Limit stations shown (top 20 by activity)
- Add database indexes:

```sql
CREATE INDEX idx_status_station_time ON bike_station_status(station_id, timestamp);
CREATE INDEX idx_features_station_time ON features(station_id, timestamp);
```

### Clear All Data and Restart
```bash
# Stop all services
cd infrastructure
docker-compose down -v  # ⚠️ Deletes all data!

# Restart fresh
docker-compose up -d

# Wait for services to start (2-3 minutes)
docker ps
```

## Production Scripts
Essential scripts for data loading and model training:
```bash
# 1. Load sample bike station data
docker exec bike_demand_training python scripts/load_sample_data.py

# 2. Generate features from raw data
docker exec bike_demand_training python scripts/generate_features.py

# 3. Train production model
docker exec bike_demand_training python -m src.training.train_pipeline

# 4. Backfill historical weather (optional)
docker exec bike_demand_training python scripts/backfill_weather.py
```

## Tech Stack
| Component | Technology | Version |
|---|---|---|
| Language | Python | 3.11 |
| Database | PostgreSQL | 16 (Alpine) |
| ML Models | XGBoost, LightGBM | 2.0+, 4.1+ |
| Experiment Tracking | MLflow | 2.9+ |
| Orchestration | Apache Airflow | 2.8+ |
| API Framework | FastAPI | 0.108+ |
| Dashboard | Streamlit | 1.29+ |
| Containerization | Docker, Docker Compose | Latest |
## Project Structure

```
bike-demand-prediction/
├── src/                          # Source code
│   ├── config/                   # Database & settings
│   ├── data/                     # Data collectors
│   ├── features/                 # Feature engineering (22 features)
│   │   ├── temporal_features.py
│   │   ├── lag_features.py
│   │   ├── rolling_features.py
│   │   └── weather_features.py
│   ├── training/                 # Training pipeline
│   │   └── train_pipeline.py     # XGBoost + LightGBM training
│   ├── serving/                  # Prediction API
│   │   ├── api/main.py           # FastAPI app
│   │   ├── predictor.py          # Prediction logic
│   │   └── model_loader.py       # Load from MLflow
│   └── utils/
│       └── feature_store.py      # Feature retrieval
├── dashboard/                    # Streamlit UI
│   ├── app.py                    # Main dashboard
│   └── pages/                    # 4 interactive pages
│       ├── 1_🔮_Demand_Forecast.py
│       ├── 2_📊_Model_Performance.py
│       ├── 3_✅_Data_Quality.py
│       └── 4_💓_System_Health.py
├── infrastructure/
│   └── docker-compose.yml        # All services orchestration
├── docker/                       # Dockerfiles
│   ├── training/Dockerfile
│   ├── api/Dockerfile
│   ├── dashboard/Dockerfile
│   └── airflow/Dockerfile
├── scripts/                      # Production scripts
├── config/                       # ML configs
└── pyproject.toml                # Python dependencies
```
## Key Improvements & Fixes

This system includes production-ready fixes for common MLOps challenges:

### 1. Dashboard-API Connection

- ✅ Uses environment variables (`API_URL`) instead of hardcoded URLs
- ✅ Supports Docker service names (`http://api:8000`)
- ✅ Clear version indicator shows connection status
### 2. Feature Generation
- ✅ Generates ALL 22 required features matching model training
- ✅ Handles missing lag features (fills with 0 for sparse data)
- ✅ Explicit dtype conversion (weather columns → float)
- ✅ Default values when no weather data available
### 3. Station Selection
- ✅ Loads real stations from database (not hardcoded)
- ✅ Shows station names + capacity in dropdown
- ✅ Only shows stations with historical data
### 4. Robust Prediction
- ✅ Works with limited historical data (graceful degradation)
- ✅ Automatic weather fallback (defaults if missing)
- ✅ Comprehensive error handling and logging
## Interview Talking Points
"I built a production-grade Level 2 MLOps system for bike demand forecasting:"
1. **End-to-End Docker Pipeline**: "Fully containerized system with 5 services (PostgreSQL, MLflow, Airflow, FastAPI, Streamlit) orchestrated with Docker Compose - zero manual setup, can deploy anywhere in 2 minutes"
2. **Production ML Models**: "Trained ensemble models (XGBoost + LightGBM) on 10K samples with 22 engineered features including temporal patterns, lag features (1h/6h/24h), rolling statistics, and weather data - achieved RMSE 0.51 bikes with LightGBM"
3. **Automated CI/CD Pipeline**: "Implemented 3 GitHub Actions workflows - CI for linting/security/Docker builds, CD for automated deployment to GitHub Container Registry, and on-demand model retraining with performance validation and MLflow promotion"
4. **Automated Feature Engineering**: "Built modular feature generators for temporal, lag, rolling, and weather features - generates all 22 features on-demand during inference, handles missing data gracefully with default values"
5. **MLflow Integration**: "Implemented full MLflow tracking - logs all experiments, registers models to a central registry, automatically promotes the best model to the Production stage based on test RMSE"
6. **Production-Ready API**: "FastAPI server with health checks, batch predictions, multi-hour forecasts, and comprehensive error handling - includes automatic dtype conversion and default-value fallback for robustness"
7. **Interactive Dashboard**: "Built a Streamlit dashboard with 4 pages (forecast, performance, quality, health) - loads real stations from the database, generates predictions via the API, displays interactive Plotly charts"
8. **Real-World Data**: "Used actual NYC Citi Bike historical data (50K+ records) and a weather API - no synthetic data, fully reproducible"
## License
Apache License 2.0
## Acknowledgments
- NYC Citi Bike for open bike-sharing data
- Open-Meteo for free historical weather API
- MLflow, Airflow, FastAPI, and Streamlit communities
🐳 Built with Docker | 🚀 Production Ready | 💼 Interview Ready | 📊 Real Data
Questions? Open an issue or check the troubleshooting section above.





