# Bike Demand Prediction for Smart Cities

Level 2 MLOps production-grade bike demand forecasting system with automated data pipelines, experiment tracking, model registry, and comprehensive monitoring.
## Overview

A production-ready machine learning system that predicts bike rental demand using real-world public APIs and historical data. Built with a Docker-first architecture for seamless deployment, the system processes NYC Citi Bike data, enriches it with weather signals, engineers 22 time-series features, and trains ensemble models (XGBoost + LightGBM), achieving an RMSE of 0.51 bikes.
## Key Features
- ✅ 100% Docker-Based: All services containerized for production deployment
- 📊 Real Data: 50K+ historical bike station records + weather data
- 🤖 Automated ML Pipeline: Feature engineering → Training → Model registry
- 📈 Experiment Tracking: MLflow with model versioning and metrics
- 🔄 Airflow Orchestration: 4 DAGs for data/training automation
- 🎯 Production Models: XGBoost (RMSE 0.56) & LightGBM (RMSE 0.51, R² 0.511)
- 📊 Interactive Dashboard: Streamlit UI for forecasts and monitoring
- 🔍 Comprehensive Monitoring: System health, model performance, data quality
- 🚀 CI/CD Ready: GitHub Actions for automated testing and deployment
## Dashboard Screenshots
## 📚 Complete Tutorial
Want to build this system from scratch? Check out our comprehensive tutorial:
👉 Complete Step-by-Step Tutorial 👈
Perfect for:
- 🎓 Students & Beginners - Learn MLOps end-to-end (20-30 hours)
- 💼 Interview Prep - Understand architecture & design decisions (5 hours)
- 🚀 ML Engineers - Build production ML systems (15 hours)
The tutorial covers:
- Setting up the environment
- Database design & Docker orchestration
- Data collection & feature engineering
- Model training with MLflow
- API development with FastAPI
- Dashboard with Streamlit
- Airflow orchestration
- CI/CD with GitHub Actions
- Design decisions & troubleshooting
## Quick Start (15 Minutes)

### Prerequisites
- Docker Desktop (Download) - Required
- Git
- 8GB+ RAM (for Docker containers)
- OpenWeatherMap API Key (optional, for live weather data): Free signup
### Step 1: Clone Repository

```bash
git clone https://github.com/yourusername/Bike-Demand-Prediction-for-Smart-Cities.git
cd Bike-Demand-Prediction-for-Smart-Cities
```

### Step 2: Start All Services with Docker Compose

This single command starts everything:

```bash
cd infrastructure
docker-compose up -d
```

This starts:
- ✅ PostgreSQL (port 5432) - Database for all data
- ✅ MLflow (port 5000) - Experiment tracking & model registry
- ✅ Airflow (port 8080) - Workflow orchestration
- ✅ FastAPI (port 8000) - Prediction API
- ✅ Streamlit Dashboard (port 8501) - Interactive UI
Wait 2-3 minutes for all services to become healthy.
### Step 3: Verify Services

```bash
# Check all containers are running
docker ps

# You should see all services as "healthy" or "Up"
docker ps --format "table {{.Names}}\t{{.Status}}"
```

### Step 4: Access Web Interfaces
Open in your browser:
- 🎯 Dashboard: http://localhost:8501 (Main UI - Start here!)
- 📊 MLflow: http://localhost:5000 (Experiment tracking)
- 🔄 Airflow: http://localhost:8080 (Username: admin, Password: admin)
- 🚀 API Docs: http://localhost:8000/docs (FastAPI Swagger UI)
### Step 5: Load Sample Data & Train Model
The system includes pre-loaded sample data and a trained model. To verify everything works:
```bash
# 1. Check database has data
docker exec bike_demand_postgres psql -U postgres -d bike_demand_db -c "
SELECT
  (SELECT COUNT(*) FROM bike_stations) as stations,
  (SELECT COUNT(*) FROM bike_station_status) as status_records,
  (SELECT COUNT(*) FROM weather_data) as weather_records,
  (SELECT COUNT(*) FROM features) as feature_records;
"

# Expected output:
# stations | status_records | weather_records | feature_records
# ---------|----------------|-----------------|----------------
#   2000   |     50000+     |      1000+      |     10000+
```

### Step 6: Make Your First Prediction!
- Open Dashboard: http://localhost:8501
- You should see: ✅ Dashboard Version: 2025-12-27-v2 | API URL: http://api:8000 | Status: 🟢 Connected
- Click "🔮 Demand Forecast" in sidebar
- Select a station from dropdown (e.g., "1 Ave & E 110 St (25 bikes)")
- Choose forecast horizon (7 hours recommended)
- Click "🔮 Generate Forecast"
- View the interactive chart with predictions!
That's it! You now have a fully functional bike demand prediction system! 🎉
## Dashboard Features

### Main Pages

#### 1. 🔮 Demand Forecast
- Select Station: Choose from 100+ real NYC bike stations
- Forecast Horizon: 1-168 hours ahead
- Interactive Charts: View predicted demand with confidence intervals
- Weather Integration: Predictions use latest weather data
How to Use:
1. Go to http://localhost:8501
2. Click "🔮 Demand Forecast" in sidebar
3. Select station: "Central Park S & 6 Ave (capacity: 59 bikes)"
4. Choose hours: 24 hours
5. Click "Generate Forecast"
6. See prediction chart!
#### 2. 📊 Model Performance
- Current Production Model: View active model metrics
- Model Comparison: Compare all trained models
- Performance Charts: RMSE/MAE/R² trends over time
- Feature Importance: Top features driving predictions
#### 3. ✅ Data Quality
- Data Completeness: Missing values analysis
- Feature Distribution: Histograms and statistics
- Data Drift Detection: Alert when feature distributions change
#### 4. 💓 System Health
- Component Status: API, Model, Database, MLflow health
- Service Metrics: Response times, error rates
- System Gauges: Overall health score
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                 Docker Compose Infrastructure                   │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  PostgreSQL  │      │    MLflow    │      │   Airflow    │
│   Database   │      │    Server    │      │  Webserver   │
│              │      │              │      │              │
│ • Stations   │      │ • Experiments│      │ • 4 DAGs     │
│ • Status     │      │ • Models     │      │ • Automation │
│ • Weather    │      │ • Registry   │      │              │
│ • Features   │      └──────────────┘      └──────────────┘
└──────────────┘
        │
        ▼
┌──────────────────────────────────────────────────────────────┐
│          Prediction Pipeline (FastAPI + Streamlit)           │
│                                                              │
│  1. User selects station in Streamlit                        │
│  2. Dashboard calls FastAPI /predict endpoint                │
│  3. API loads Production model from MLflow                   │
│  4. Generate 22 features (temporal + lag + weather)          │
│  5. Model predicts demand                                    │
│  6. Return prediction with confidence interval               │
│  7. Dashboard displays interactive forecast chart            │
└──────────────────────────────────────────────────────────────┘
```
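Steps 3–6 of the pipeline above can be sketched as a single function. The callables passed in stand in for the project's `src/serving` internals (`model_loader.py`, `predictor.py`); their signatures here are illustrative assumptions, not the project's real API.

```python
# Sketch of prediction-pipeline steps 3-6 above. The injected callables
# stand in for the real MLflow loader and feature generator; their
# signatures are assumptions for illustration only.
from typing import Callable, Mapping

def forecast_one(
    station_id: str,
    timestamp: str,
    load_model: Callable[[], Callable[[Mapping[str, float]], float]],
    generate_features: Callable[[str, str], Mapping[str, float]],
) -> dict:
    model = load_model()                                 # 3. Production model from MLflow
    features = generate_features(station_id, timestamp)  # 4. 22 features (temporal + lag + weather)
    demand = model(features)                             # 5. model predicts demand
    return {                                             # 6. prediction for the dashboard
        "station_id": station_id,
        "timestamp": timestamp,
        "predicted_demand": demand,
    }
```

Injecting the loader and feature generator like this also makes the flow trivial to unit-test with stubs.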
## Machine Learning Pipeline

### Features (22 Total)

Temporal Features (8):
`hour_of_day`, `day_of_week`, `day_of_month`, `month`, `is_weekend`, `is_business_hours`, `is_morning_rush`, `is_evening_rush`

Lag Features (6):
`bikes_lag_1h`, `bikes_lag_6h`, `bikes_lag_24h`, `docks_lag_1h`, `docks_lag_6h`, `docks_lag_24h`

Rolling Statistics (4):
`bikes_rolling_mean_3h`, `bikes_rolling_mean_6h`, `bikes_rolling_std_3h`, `bikes_rolling_std_6h`

Weather Features (4):
`temperature`, `humidity`, `wind_speed`, `precipitation`
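The temporal, lag, and rolling features above can be generated with a few lines of pandas. A minimal sketch, assuming an hourly per-station series; the function name and DataFrame layout are illustrative, not the project's actual `src/features` code:

```python
# Sketch of the temporal/lag/rolling feature engineering described above,
# for ONE station's hourly series. Column names follow this README's
# feature list; the DataFrame layout is an assumption.
import pandas as pd

def add_time_series_features(df: pd.DataFrame) -> pd.DataFrame:
    """df: hourly rows with a DatetimeIndex and a 'bikes_available'
    column, sorted ascending by timestamp."""
    out = df.copy()
    # Temporal features derived from the timestamp index
    out["hour_of_day"] = out.index.hour
    out["day_of_week"] = out.index.dayofweek
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    # Lag features (1h/6h/24h), filled with 0 when history is sparse
    for lag in (1, 6, 24):
        out[f"bikes_lag_{lag}h"] = out["bikes_available"].shift(lag).fillna(0.0)
    # Rolling statistics over 3h/6h windows
    for win in (3, 6):
        roll = out["bikes_available"].rolling(win, min_periods=1)
        out[f"bikes_rolling_mean_{win}h"] = roll.mean()
        out[f"bikes_rolling_std_{win}h"] = roll.std().fillna(0.0)
    return out
```

The `fillna(0.0)` mirrors the graceful-degradation behaviour described in the troubleshooting section: with too little history, lag features default to 0 instead of failing.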
### Production Models
| Model | Test RMSE | Test R² | Test MAPE | Status |
|---|---|---|---|---|
| LightGBM v8 | 0.51 | 0.511 | 4.5% | ✅ Production |
| XGBoost v7 | 0.56 | 0.411 | 5.2% | Staged |
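Both models are evaluated on a chronological 70/15/15 split with no shuffling (see the training results below). A minimal sketch of that kind of time-series-safe split; the function is illustrative, not the pipeline's actual code:

```python
# Sketch of a chronological 70/15/15 train/val/test split (no shuffling),
# as used for time-series training in this README. Illustrative only;
# the real pipeline lives in src/training/train_pipeline.py.
import pandas as pd

def time_series_split(df: pd.DataFrame, train_frac=0.70, val_frac=0.15):
    """Split rows (already sorted by timestamp) into train/val/test.

    Shuffling would leak future information into training, so the oldest
    rows train the model and the most recent rows form the test set.
    """
    n = len(df)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = df.iloc[:n_train]
    val = df.iloc[n_train:n_train + n_val]
    test = df.iloc[n_train + n_val:]
    return train, val, test
```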
Training Results:

```
✓ Data loaded: 10,000 samples
✓ Train/Val/Test: 70%/15%/15% (no shuffling, time-series split)
✓ XGBoost trained: Test RMSE = 0.56, R² = 0.411
✓ LightGBM trained: Test RMSE = 0.51, R² = 0.511
✅ LightGBM v8 promoted to Production (best model)
```

### Train New Model

```bash
# Run training pipeline (inside Docker)
docker exec bike_demand_training python -m src.training.train_pipeline

# Or rebuild and run training container
docker-compose up training

# View results in MLflow
open http://localhost:5000
```

## Data Pipeline
### Database Schema

```sql
-- Station metadata (2000+ stations)
bike_stations (
    station_id VARCHAR PRIMARY KEY,
    name VARCHAR,
    latitude FLOAT,
    longitude FLOAT,
    capacity INTEGER
);

-- Historical status (50K+ records)
bike_station_status (
    id SERIAL PRIMARY KEY,
    station_id VARCHAR,
    timestamp TIMESTAMP,
    bikes_available INTEGER,
    docks_available INTEGER
);

-- Weather data (1K+ records)
weather_data (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP,
    temperature FLOAT,
    humidity FLOAT,
    wind_speed FLOAT,
    precipitation FLOAT
);

-- Engineered features (10K+ records)
features (
    id SERIAL PRIMARY KEY,
    station_id VARCHAR,
    timestamp TIMESTAMP,
    feature_json JSONB,
    feature_version VARCHAR
);
```

### Sample Data Stats
```bash
# Check data in database
docker exec bike_demand_postgres psql -U postgres -d bike_demand_db -c "
SELECT 'Stations' as table_name, COUNT(*) FROM bike_stations
UNION ALL
SELECT 'Status Records', COUNT(*) FROM bike_station_status
UNION ALL
SELECT 'Weather Records', COUNT(*) FROM weather_data
UNION ALL
SELECT 'Features', COUNT(*) FROM features;
"
```

## API Endpoints
The FastAPI server provides production prediction endpoints:
```bash
# Health check
curl http://localhost:8000/health

# Get current production model info
curl http://localhost:8000/monitoring/models/current

# Make single prediction
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "station_id": "66db237e-0aca-11e7-82f6-3863bb44ef7c",
    "timestamp": "2025-12-27T18:00:00"
  }'

# Generate 24-hour forecast
curl "http://localhost:8000/predict/station/66db237e-0aca-11e7-82f6-3863bb44ef7c/forecast?hours_ahead=24"

# Batch predictions
curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "predictions": [
      {"station_id": "station_1", "timestamp": "2025-12-27T18:00:00"},
      {"station_id": "station_2", "timestamp": "2025-12-27T18:00:00"}
    ]
  }'
```

API Documentation: http://localhost:8000/docs
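The same single-prediction call can be made from Python using only the standard library. A minimal client sketch, assuming the docker-compose stack is running locally; `build_predict_payload` and `predict` are illustrative helpers, not part of the project's codebase:

```python
# Minimal Python client sketch for the /predict endpoint shown above.
# Assumes the stack from this README is running on localhost:8000;
# the helper names are illustrative, not the project's actual code.
import json
from urllib import request as urlrequest

API_URL = "http://localhost:8000"

def build_predict_payload(station_id: str, timestamp: str) -> dict:
    """Body expected by POST /predict (see the curl examples above)."""
    return {"station_id": station_id, "timestamp": timestamp}

def predict(station_id: str, timestamp: str) -> dict:
    """POST a single prediction request and return the decoded JSON."""
    body = json.dumps(build_predict_payload(station_id, timestamp)).encode()
    req = urlrequest.Request(
        f"{API_URL}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlrequest.urlopen(req) as resp:
        return json.loads(resp.read())
```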
## Configuration

### Environment Variables
All services read from environment variables set in `docker-compose.yml`:

```yaml
# Database
DB_HOST: postgres
DB_PORT: 5432
DB_USER: postgres
DB_PASSWORD: postgres
DB_DATABASE: bike_demand_db

# MLflow
MLFLOW_TRACKING_URI: http://mlflow:5000

# API (for Dashboard)
API_URL: http://api:8000

# Weather (optional)
WEATHER_API_KEY: your_key_here
```

Important: The dashboard uses `API_URL=http://api:8000` (the Docker service name) when running in containers. This is configured via environment variables, NOT in `.streamlit/secrets.toml`.
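On the Python side, reading these variables with the same defaults as `docker-compose.yml` can be sketched as below. This `Settings` class is a hypothetical illustration; the project's real configuration lives under `src/config/`.

```python
# Illustrative settings loader mirroring the environment variables above,
# with the same defaults as docker-compose.yml. A sketch only; the
# project's real config code lives under src/config/.
import os
from dataclasses import dataclass, field

def _env(name: str, default: str) -> str:
    return os.getenv(name, default)

@dataclass
class Settings:
    db_host: str = field(default_factory=lambda: _env("DB_HOST", "postgres"))
    db_port: int = field(default_factory=lambda: int(_env("DB_PORT", "5432")))
    db_user: str = field(default_factory=lambda: _env("DB_USER", "postgres"))
    db_password: str = field(default_factory=lambda: _env("DB_PASSWORD", "postgres"))
    db_name: str = field(default_factory=lambda: _env("DB_DATABASE", "bike_demand_db"))
    mlflow_uri: str = field(default_factory=lambda: _env("MLFLOW_TRACKING_URI", "http://mlflow:5000"))
    api_url: str = field(default_factory=lambda: _env("API_URL", "http://api:8000"))

    @property
    def db_url(self) -> str:
        # SQLAlchemy-style connection URL built from the pieces above
        return (
            f"postgresql://{self.db_user}:{self.db_password}"
            f"@{self.db_host}:{self.db_port}/{self.db_name}"
        )
```

Reading everything from the environment is what lets the same image run unchanged inside Docker (service-name hosts) and locally (localhost overrides).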
### Dashboard Configuration

Located in `dashboard/.streamlit/config.toml`:
```toml
[server]
port = 8501
enableCORS = false
enableXsrfProtection = true
maxUploadSize = 200
runOnSave = true
fileWatcherType = "auto"

[browser]
gatherUsageStats = false
serverAddress = "0.0.0.0"

[theme]
primaryColor = "#1E88E5"
backgroundColor = "#FFFFFF"
secondaryBackgroundColor = "#F0F2F6"
textColor = "#262730"
font = "sans serif"
```

## CI/CD Pipeline
### GitHub Actions Workflows
The project includes 3 automated GitHub Actions workflows for continuous integration and deployment:
#### 1. CI - Docker Build & Validate

Triggers: Push/PR to `main` or `develop` branches

Jobs:

- **Code Quality Checks**
  - Black (code formatting)
  - Flake8 (linting)
  - isort (import sorting)
- **Build Docker Images**
  - Builds all 4 Docker images (API, Dashboard, Airflow, Training)
  - Uses GitHub Actions cache for faster builds
  - Validates images build successfully
- **Validate Project Structure**
  - Checks all required files exist
  - Verifies directory structure is correct
  - Ensures production scripts are present
- **Security Scan**
  - Safety (dependency vulnerability check)
  - Bandit (security linter for Python code)
#### 2. CD - Build and Deploy

Triggers: Push to `main` branch or version tags (`v*`)

Jobs:

- Builds production Docker images for:
  - FastAPI server
  - Streamlit dashboard
  - Airflow workers
  - Training pipeline
- Pushes images to GitHub Container Registry (ghcr.io)
- Tags with SHA and `latest`
- Deploys to staging/production environments
#### 3. Model Training

Triggers:

- Manual workflow dispatch only

Options:

- `use_test_data`: Use synthetic test data for CI validation (default: `true`)
  - `true`: Generates 30 days of synthetic data for 3 test stations
  - `false`: Uses production data (requires data to be loaded first)
Jobs:
- Builds training Docker container
- Optionally generates synthetic test data (3 stations, 720 hourly records)
- Runs model training pipeline
- Trains XGBoost and LightGBM models
- Evaluates performance on test set
- Promotes best model to Production in MLflow
- Archives previous production model
### Viewing Workflow Runs

```bash
# View all workflow runs
open https://github.com/shima-maleki/Bike-Demand-Prediction-for-Smart-Cities/actions

# Trigger model training manually with test data (for CI validation)
# Go to Actions tab → Model Training → Run workflow → use_test_data: true

# Trigger model training with production data (after data is loaded)
# Go to Actions tab → Model Training → Run workflow → use_test_data: false
```

### Local CI/CD Testing
Before pushing, test locally:
```bash
# Run linting
black --check src/ dashboard/ scripts/
flake8 src/ dashboard/ scripts/ --max-line-length=120
isort --check-only src/ dashboard/ scripts/

# Build all Docker images
docker build -t bike-demand-api:test -f docker/api/Dockerfile .
docker build -t bike-demand-dashboard:test -f docker/dashboard/Dockerfile .
docker build -t bike-demand-airflow:test -f docker/airflow/Dockerfile .
docker build -t bike-demand-training:test -f docker/training/Dockerfile .

# Run security checks
pip install safety bandit
safety check
bandit -r src/ -ll
```

## Troubleshooting
### Dashboard Shows "API Offline"

Solution:

```bash
# 1. Check API container is running
docker ps | grep bike_demand_api

# 2. Check API logs
docker logs bike_demand_api --tail 50

# 3. Verify API responds
docker exec bike_demand_dashboard curl http://api:8000/health

# 4. Restart dashboard (clears browser cache issues)
docker restart bike_demand_dashboard

# 5. Open dashboard in incognito window
# Navigate to: http://localhost:8501
```

Root Cause: The browser cached an old API URL. Hard refresh with Cmd+Shift+R (Mac) or Ctrl+Shift+F5 (Windows).
### Forecast Fails with "Missing Features"

Error: `Required features not available: ['bikes_lag_1h', ...]`

Solution:

```bash
# The system needs historical data for lag features
# If you only have 1 snapshot per station, lag features default to 0

# Check data availability
docker exec bike_demand_postgres psql -U postgres -d bike_demand_db -c "
SELECT station_id, COUNT(*) as records, MAX(timestamp) as latest
FROM bike_station_status
GROUP BY station_id
ORDER BY records DESC
LIMIT 5;
"

# If records < 24 per station, lag features will be sparse
# This is OK - the model will still predict using default values
```

The predictor automatically fills missing lag/rolling features with 0.0 if insufficient historical data exists.
### Forecast Shows TypeError: "object dtype"

Error: `pandas dtypes must be int, float or bool. Fields with bad pandas dtypes: temperature: object`

Solution: Already fixed in `src/serving/predictor.py:367-383`. The predictor now:

- Explicitly converts weather columns to float
- Uses default weather values if the database has no data
- Converts any remaining object dtypes to numeric

If you see this error, restart the API:

```bash
docker restart bike_demand_api
```

### "No stations available" in Dashboard
Solution:

```bash
# Verify stations exist in database
docker exec bike_demand_postgres psql -U postgres -d bike_demand_db -c "
SELECT COUNT(*) FROM bike_stations;
"

# If count is 0, you need to load station data
# Run data backfill scripts (see Production Scripts section)
```

### MLflow Shows "Connection Refused"
Solution:

```bash
# 1. Check MLflow container is running
docker ps | grep mlflow

# 2. Test MLflow endpoint
curl http://localhost:5000/api/2.0/mlflow/experiments/list

# 3. Check logs for errors
docker logs bike_demand_mlflow --tail 50

# 4. Restart MLflow
docker restart bike_demand_mlflow
```

### Airflow DAGs Not Showing
Solution:

```bash
# 1. Check Airflow scheduler is running
docker logs bike_demand_airflow_scheduler --tail 50

# 2. Check DAG files exist
ls -la airflow/dags/

# 3. Verify DAGs in Airflow CLI
docker exec bike_demand_airflow_webserver airflow dags list

# 4. Unpause DAGs in UI
# Go to http://localhost:8080 and toggle DAGs to "On"
```

### Database Connection Errors
Solution:

```bash
# 1. Check PostgreSQL is healthy
docker exec bike_demand_postgres pg_isready

# 2. Test connection from another container
docker exec bike_demand_api python -c "
from src.config.database import get_db_context
from sqlalchemy import text
with get_db_context() as db:
    result = db.execute(text('SELECT 1')).scalar()
    print(f'Database OK: {result}')
"

# 3. Check PostgreSQL logs
docker logs bike_demand_postgres --tail 100
```

### Slow Dashboard Performance
Tips:

- Use shorter forecast horizons (7 hours instead of 168)
- Limit stations shown (top 20 by activity)
- Add database indexes:

```sql
CREATE INDEX idx_status_station_time ON bike_station_status(station_id, timestamp);
CREATE INDEX idx_features_station_time ON features(station_id, timestamp);
```

### Clear All Data and Restart
```bash
# Stop all services
cd infrastructure
docker-compose down -v  # ⚠️ Deletes all data!

# Restart fresh
docker-compose up -d

# Wait for services to start (2-3 minutes)
docker ps
```

## Production Scripts
Essential scripts for data loading and model training:
```bash
# 1. Load sample bike station data
docker exec bike_demand_training python scripts/load_sample_data.py

# 2. Generate features from raw data
docker exec bike_demand_training python scripts/generate_features.py

# 3. Train production model
docker exec bike_demand_training python -m src.training.train_pipeline

# 4. Backfill historical weather (optional)
docker exec bike_demand_training python scripts/backfill_weather.py
```

## Tech Stack
| Component | Technology | Version |
|---|---|---|
| Language | Python | 3.11 |
| Database | PostgreSQL | 16 (Alpine) |
| ML Models | XGBoost, LightGBM | 2.0+, 4.1+ |
| Experiment Tracking | MLflow | 2.9+ |
| Orchestration | Apache Airflow | 2.8+ |
| API Framework | FastAPI | 0.108+ |
| Dashboard | Streamlit | 1.29+ |
| Containerization | Docker, Docker Compose | Latest |
## Project Structure

```
bike-demand-prediction/
├── src/                          # Source code
│   ├── config/                   # Database & settings
│   ├── data/                     # Data collectors
│   ├── features/                 # Feature engineering (22 features)
│   │   ├── temporal_features.py
│   │   ├── lag_features.py
│   │   ├── rolling_features.py
│   │   └── weather_features.py
│   ├── training/                 # Training pipeline
│   │   └── train_pipeline.py     # XGBoost + LightGBM training
│   ├── serving/                  # Prediction API
│   │   ├── api/main.py           # FastAPI app
│   │   ├── predictor.py          # Prediction logic
│   │   └── model_loader.py       # Load from MLflow
│   └── utils/
│       └── feature_store.py      # Feature retrieval
├── dashboard/                    # Streamlit UI
│   ├── app.py                    # Main dashboard
│   └── pages/                    # 4 interactive pages
│       ├── 1_🔮_Demand_Forecast.py
│       ├── 2_📊_Model_Performance.py
│       ├── 3_✅_Data_Quality.py
│       └── 4_💓_System_Health.py
├── infrastructure/
│   └── docker-compose.yml        # All services orchestration
├── docker/                       # Dockerfiles
│   ├── training/Dockerfile
│   ├── api/Dockerfile
│   ├── dashboard/Dockerfile
│   └── airflow/Dockerfile
├── scripts/                      # Production scripts
├── config/                       # ML configs
└── pyproject.toml                # Python dependencies
```
## Key Improvements & Fixes

This system includes production-ready fixes for common MLOps challenges:

### 1. Dashboard-API Connection

- ✅ Uses environment variables (`API_URL`) instead of hardcoded URLs
- ✅ Supports Docker service names (`http://api:8000`)
- ✅ Clear version indicator shows connection status
### 2. Feature Generation
- ✅ Generates ALL 22 required features matching model training
- ✅ Handles missing lag features (fills with 0 for sparse data)
- ✅ Explicit dtype conversion (weather columns → float)
- ✅ Default values when no weather data available
### 3. Station Selection
- ✅ Loads real stations from database (not hardcoded)
- ✅ Shows station names + capacity in dropdown
- ✅ Only shows stations with historical data
### 4. Robust Prediction
- ✅ Works with limited historical data (graceful degradation)
- ✅ Automatic weather fallback (defaults if missing)
- ✅ Comprehensive error handling and logging
## Interview Talking Points
"I built a production-grade Level 2 MLOps system for bike demand forecasting:"
1. **End-to-End Docker Pipeline**: "Fully containerized system with 5 services (PostgreSQL, MLflow, Airflow, FastAPI, Streamlit) orchestrated with Docker Compose - zero manual setup, can deploy anywhere in 2 minutes"
2. **Production ML Models**: "Trained ensemble models (XGBoost + LightGBM) on 10K samples with 22 engineered features including temporal patterns, lag features (1h/6h/24h), rolling statistics, and weather data - achieved RMSE 0.51 bikes with LightGBM"
3. **Automated CI/CD Pipeline**: "Implemented 3 GitHub Actions workflows - CI for linting/security/Docker builds, CD for automated deployment to GitHub Container Registry, and on-demand model retraining with performance validation and MLflow promotion"
4. **Automated Feature Engineering**: "Built modular feature generators for temporal, lag, rolling, and weather features - generates all 22 features on-demand during inference, handles missing data gracefully with default values"
5. **MLflow Integration**: "Implemented full MLflow tracking - logs all experiments, registers models to a central registry, automatically promotes the best model to the Production stage based on test RMSE"
6. **Production-Ready API**: "FastAPI server with health checks, batch predictions, multi-hour forecasts, and comprehensive error handling - includes automatic dtype conversion and default-value fallback for robustness"
7. **Interactive Dashboard**: "Built a Streamlit dashboard with 4 pages (forecast, performance, quality, health) - loads real stations from the database, generates predictions via the API, displays interactive Plotly charts"
8. **Real-World Data**: "Used actual NYC Citi Bike historical data (50K+ records) and a weather API - no synthetic data, fully reproducible"
## License
Apache License 2.0
## Acknowledgments
- NYC Citi Bike for open bike-sharing data
- Open-Meteo for free historical weather API
- MLflow, Airflow, FastAPI, and Streamlit communities
🐳 Built with Docker | 🚀 Production Ready | 💼 Interview Ready | 📊 Real Data
Questions? Open an issue or check the troubleshooting section above.





