🚀 FleetOps – Production Incident Response Lab

This repository is a working production-style lab, not just a case study.

What this project demonstrates

A real running backend service (Python + Flask)
Live metrics exposure using Prometheus
Latency SLO monitoring and alerting
Incident response using documented runbooks
Post-incident analysis with written postmortems

Architecture

Service: Python Flask API
Monitoring: Prometheus
Alerts: Burn-rate-style latency alert
Infra: Docker + Docker Compose
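
As a rough illustration of the burn-rate-style latency alert listed above, here is a minimal Python sketch of the underlying arithmetic. It is not the alert rule shipped in this repo (that lives in the Prometheus configuration). The names `WindowCounts`, `burn_rate`, and `should_page` are hypothetical, and the 99% SLO target, the 1 h / 5 min window pair, and the 14.4 threshold are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts observed in one lookback window (hypothetical type)."""
    total: int  # all requests seen in the window
    slow: int   # requests slower than the latency threshold


def burn_rate(window: WindowCounts, slo_target: float) -> float:
    """Observed slow-request fraction divided by the fraction the SLO allows.

    1.0 means the error budget is being spent exactly fast enough to run
    out at the end of the SLO period; higher values exhaust it sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total == 0:
        return 0.0  # no traffic, so no budget was burned
    allowed_slow_fraction = 1.0 - slo_target
    observed_slow_fraction = window.slow / window.total
    return observed_slow_fraction / allowed_slow_fraction


def should_page(long_window: WindowCounts, short_window: WindowCounts,
                slo_target: float = 0.99, threshold: float = 14.4) -> bool:
    """Page only when both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    The default 14.4 is an assumed value: the rate that spends 2% of a
    30-day budget in one hour (0.02 * 720 hours).
    """
    return (burn_rate(long_window, slo_target) > threshold
            and burn_rate(short_window, slo_target) > threshold)


if __name__ == "__main__":
    one_hour = WindowCounts(total=36_000, slow=7_200)  # 20% of requests slow
    five_min = WindowCounts(total=3_000, slow=660)     # 22% of requests slow
    print("page on-call:", should_page(one_hour, five_min))
```

Requiring both windows to exceed the threshold is one common way to keep pages tied to ongoing budget burn rather than to a spike that has already ended.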

📚 Documentation Structure

Monitoring: monitoring/README.md
Prometheus configuration, alert rules, and SLO-related setup.
Runbooks: runbooks/
Incident response procedures (latency, mitigation, verification).
Incidents: incidents/
Real postmortems documenting detection → mitigation → recovery.
Operations Docs: docs/
On-call playbooks, change management, risk register, and operational metrics.

FleetOps — PE Incident Lab

Production Engineering case study: SLOs • burn-rate alerts • runbooks • capacity planning • postmortems • change safety

What this repo shows: how I would operate and harden a high-traffic service in production, from SLO definition through incident response and change safety.

Note: This is a docs-first PE portfolio artifact covering SLOs, runbooks, postmortems, and change safety.

Why this maps strongly to Meta Production Engineering

Meta PEs debug live production issues, define SLOs, enforce paging discipline, write runbooks to reduce MTTR, and plan capacity for peak traffic.

This repo demonstrates:

Reliability & performance: SLOs, error budgets, burn-rate alerts
Incident response: triage → mitigation → validation → postmortem
Capacity planning: headroom, scaling triggers, burst modeling
Change safety: canary, rollback, risk controls
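
To make the capacity planning terms above concrete (headroom, scaling triggers, burst modeling), here is a small Python sketch. The function names and all numbers are hypothetical and serve only to show the arithmetic; the actual planning assumptions are in docs/04_capacity_planning.md.

```python
import math


def headroom(peak_rps: float, capacity_rps: float) -> float:
    """Fraction of total capacity still unused at peak load."""
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    return 1.0 - peak_rps / capacity_rps


def instances_needed(peak_rps: float, burst_multiplier: float,
                     per_instance_rps: float,
                     target_utilization: float) -> int:
    """Instances required so a modeled burst stays at or below target utilization.

    burst_multiplier scales normal peak traffic to the burst being modeled,
    for example 1.5 for a 50% spike above peak.
    """
    if per_instance_rps <= 0 or not 0.0 < target_utilization <= 1.0:
        raise ValueError("invalid capacity inputs")
    burst_rps = peak_rps * burst_multiplier
    usable_rps_per_instance = per_instance_rps * target_utilization
    return math.ceil(burst_rps / usable_rps_per_instance)


def should_scale_out(current_rps: float, instances: int,
                     per_instance_rps: float,
                     trigger_utilization: float = 0.7) -> bool:
    """Scaling trigger: True once fleet utilization reaches the trigger level."""
    if instances <= 0 or per_instance_rps <= 0:
        raise ValueError("instances and per_instance_rps must be positive")
    utilization = current_rps / (instances * per_instance_rps)
    return utilization >= trigger_utilization


if __name__ == "__main__":
    # Assumed example fleet: 4,000 rps peak, 500 rps per instance.
    print("headroom:", headroom(peak_rps=4_000, capacity_rps=10_000))
    print("instances for 1.5x burst:",
          instances_needed(4_000, 1.5, 500, target_utilization=0.5))
    print("scale out now:", should_scale_out(3_600, 10, 500))
```

Sizing against a target utilization below 100% is what leaves headroom for bursts; the trigger threshold then decides how early scaling starts before that headroom is consumed.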

System modeled

Clients
↓
API Gateway (routing, rate limiting, canary controls)
↓
User Service —— Cache
↓
Feed Service —— Cache —— DB
↓
Observability (metrics + logs + traces)
↓
Alerts → On-call → Runbook → Mitigation → Postmortem → Prevention
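
The canary controls at the API Gateway in the model above could work in several ways. One common approach, sketched here in Python under stated assumptions, is to hash a stable request attribute into a bucket and send a fixed percentage of buckets to the canary. The names `canary_bucket` and `route`, and the choice of user ID as the hash key, are hypothetical and not taken from this repo.

```python
import hashlib


def canary_bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) using a hash.

    Hashing (rather than random choice) keeps each user on the same
    version for the whole rollout, which makes regressions easier to trace.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(user_id: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent of users, else "stable"."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if canary_bucket(user_id) < canary_percent else "stable"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1_000)]
    for percent in (0, 5, 25, 100):
        on_canary = sum(route(u, percent) == "canary" for u in users)
        print(f"{percent:>3}% target: {on_canary} of {len(users)} users on canary")
```

With this scheme, raising the percentage only adds users to the canary and never moves existing canary users back, and setting it to 0 acts as an immediate rollback.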

Quick start (recruiter path)

docs/02_slos_and_alerts.md
docs/03_runbooks.md
docs/06_postmortems/
docs/08_change_management.md

Tip: Start with SLOs → Runbooks → Postmortems to see incident handling end-to-end.

Key artifacts

SLOs & alerts: docs/02_slos_and_alerts.md
Runbooks: docs/03_runbooks.md
Postmortems: docs/06_postmortems/
Change safety: docs/08_change_management.md

Docs index

docs/00_overview.md — scope + how to read
docs/01_architecture.md — system + failure domains
docs/02_slos_and_alerts.md — SLOs, error budgets, burn-rate alerts
docs/03_runbooks.md — triage → mitigate → validate
docs/04_capacity_planning.md — headroom, scaling triggers
docs/05_incident_simulations.md — game days + drills
docs/06_postmortems/ — incident writeups + prevention
docs/07_oncall_playbook.md — oncall habits + escalation
docs/08_change_management.md — canary/rollback + guardrails
docs/09_risk_register.md — risks + mitigations
docs/10_operational_metrics.md — golden signals + dashboards

prakhardewangan2005-hash/FleetOps-PE-Incident-Lab