prakhardewangan2005-hash/FleetOps-PE-Incident-Lab
Production Engineering incident-response lab: SLOs, burn-rate alerts, runbooks, capacity planning, postmortems, change safety
FleetOps: Production Incident Response Lab
This repository is a working production-style lab, not just a case study.
What this project demonstrates
- A real running backend service (Python + Flask)
- Live metrics exposure using Prometheus
- Latency SLO monitoring and alerting
- Incident response using documented runbooks
- Post-incident analysis with written postmortems
Architecture
- Service: Python Flask API
- Monitoring: Prometheus
- Alerts: Burn-rate style latency alert
- Infra: Docker + Docker Compose
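The burn-rate style latency alert listed above can be illustrated with a short sketch. The repository's actual rules live in its Prometheus configuration. The 99% target, the long/short window pairing, and the 14.4 threshold below are illustrative assumptions (14.4 is the commonly cited fast-burn threshold for a 30-day SLO), and `Window`, `burn_rate`, and `should_page` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one evaluation window (hypothetical type)."""
    total: int  # all requests seen in the window
    slow: int   # requests slower than the latency threshold


def burn_rate(window: Window, slo_target: float) -> float:
    """Bad-request fraction divided by the error budget (1 - SLO target).

    A burn rate of 1.0 spends the budget exactly over the SLO period;
    higher values spend it proportionally faster.
    """
    if window.total == 0:
        return 0.0  # no traffic, nothing burned
    bad_fraction = window.slow / window.total
    error_budget = 1.0 - slo_target
    return bad_fraction / error_budget


def should_page(long_w: Window, short_w: Window,
                slo_target: float = 0.99, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window burn fast.

    The short window lets the alert clear quickly once mitigation works.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)


if __name__ == "__main__":
    one_hour = Window(total=10_000, slow=2_000)
    five_min = Window(total=1_000, slow=200)
    print(should_page(one_hour, five_min))
```

Requiring both windows to exceed the threshold is a design choice: it avoids paging on a brief spike that has already ended.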
Documentation Structure
- Monitoring: monitoring/README.md
  Prometheus configuration, alert rules, and SLO-related setup.
- Runbooks: runbooks/
  Incident response procedures (latency, mitigation, verification).
- Incidents: incidents/
  Real postmortems documenting detection → mitigation → recovery.
- Operations Docs: docs/
  On-call playbooks, change management, risk register, and operational metrics.
FleetOps: PE Incident Lab
Production Engineering case study: SLOs • burn-rate alerts • runbooks • capacity planning • postmortems • change safety
What this repo shows: how I would operate and harden a high-traffic service in production, with operational rigor.
Notes: Docs-first PE portfolio artifact: SLOs, runbooks, postmortems, change safety.
Why this maps strongly to Meta Production Engineering
Meta PEs debug live production issues, define SLOs, enforce paging discipline, write runbooks to reduce MTTR, and plan capacity for peak traffic.
This repo demonstrates:
- Reliability & performance: SLOs, error budgets, burn-rate alerts
- Incident response: triage → mitigation → validation → postmortem
- Capacity planning: headroom, scaling triggers, burst modeling
- Change safety: canary, rollback, risk controls
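As a rough illustration of the capacity-planning item above (headroom, scaling triggers, burst modeling), the sketch below computes headroom at peak and the instance count needed to absorb a burst. All numbers, the 60% target utilization, and the function names are assumptions for illustration, not values taken from this repo's docs.

```python
import math


def headroom(peak_rps: float, capacity_rps: float) -> float:
    """Fraction of total capacity still unused at peak traffic."""
    return 1.0 - peak_rps / capacity_rps


def should_scale(peak_rps: float, capacity_rps: float,
                 min_headroom: float = 0.3) -> bool:
    """Scaling trigger: fire when headroom drops below the required minimum."""
    return headroom(peak_rps, capacity_rps) < min_headroom


def instances_needed(peak_rps: float, burst_multiplier: float,
                     per_instance_rps: float,
                     target_utilization: float = 0.6) -> int:
    """Instances required to serve a burst while staying under target utilization."""
    burst_rps = peak_rps * burst_multiplier
    usable_rps_per_instance = per_instance_rps * target_utilization
    return math.ceil(burst_rps / usable_rps_per_instance)


if __name__ == "__main__":
    # Hypothetical service: 1000 rps peak, 1.5x burst, 200 rps per instance.
    print(instances_needed(1000, 1.5, 200))
    print(should_scale(peak_rps=800, capacity_rps=1000))
```

Sizing against a target utilization below 100% is what creates the headroom; the burst multiplier models short spikes above the normal peak.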
System modeled
Clients
  ↓
API Gateway (routing, rate limiting, canary controls)
  ↓
User Service ── Cache
  ↓
Feed Service ── Cache ── DB
  ↓
Observability (metrics + logs + traces)
  ↓
Alerts → On-call → Runbook → Mitigation → Postmortem → Prevention
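The canary controls at the gateway in the model above could be driven by a gate like the following. This is a minimal sketch under stated assumptions: the comparison rule (canary error rate against a multiple of the baseline rate), the thresholds, and the name `canary_decision` are hypothetical and not taken from docs/08_change_management.md.

```python
def canary_decision(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    max_ratio: float = 2.0,
                    min_requests: int = 100,
                    floor_rate: float = 0.001) -> str:
    """Return "wait", "rollback", or "promote" for a canary release.

    - "wait": the canary has not served enough traffic to judge.
    - "rollback": the canary error rate exceeds the allowed limit.
    - "promote": the canary is within the limit.
    """
    if canary_total < min_requests:
        return "wait"

    canary_rate = canary_errors / canary_total
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0

    # The floor keeps a near-zero baseline from making any single error fatal.
    limit = max(baseline_rate * max_ratio, floor_rate)
    return "rollback" if canary_rate > limit else "promote"


if __name__ == "__main__":
    print(canary_decision(10, 10_000, 50, 1_000))
```

Returning an explicit "wait" state keeps the gate from promoting or rolling back on too little data.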
Quick start (recruiter path)
- docs/02_slos_and_alerts.md
- docs/03_runbooks.md
- docs/06_postmortems/
- docs/08_change_management.md
Tip: Start with SLOs → Runbooks → Postmortems to see incident handling end-to-end.
Key artifacts
- SLOs & alerts: docs/02_slos_and_alerts.md
- Runbooks: docs/03_runbooks.md
- Postmortems: docs/06_postmortems/
- Change safety: docs/08_change_management.md
Docs index
- docs/00_overview.md: scope + how to read
- docs/01_architecture.md: system + failure domains
- docs/02_slos_and_alerts.md: SLOs, error budgets, burn-rate alerts
- docs/03_runbooks.md: triage → mitigate → validate
- docs/04_capacity_planning.md: headroom, scaling triggers
- docs/05_incident_simulations.md: game days + drills
- docs/06_postmortems/: incident writeups + prevention
- docs/07_oncall_playbook.md: on-call habits + escalation
- docs/08_change_management.md: canary/rollback + guardrails
- docs/09_risk_register.md: risks + mitigations
- docs/10_operational_metrics.md: golden signals + dashboards