prakhardewangan2005-hash/FleetOps-PE-Incident-Lab

Production Engineering incident-response lab: SLOs, burn-rate alerts, runbooks, capacity planning, postmortems, change safety

🚀 FleetOps – Production Incident Response Lab

This repository is a working production-style lab, not just a case study.

What this project demonstrates

  • A real running backend service (Python + Flask)
  • Live metrics exposure using Prometheus
  • Latency SLO monitoring and alerting
  • Incident response using documented runbooks
  • Post-incident analysis with written postmortems
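To make "live metrics exposure" concrete, here is a minimal, stdlib-only sketch of a latency histogram rendered in the Prometheus text exposition format. It is a stand-in for illustration: the actual service would more likely use a Prometheus client library, and the metric name `http_request_duration_seconds`, the bucket bounds, and the `LatencyHistogram` class are all assumptions, not taken from the repository.

```python
import bisect

# Hypothetical bucket upper bounds in seconds; the repository's real buckets are not shown.
BUCKETS = (0.1, 0.25, 0.5, 1.0)


class LatencyHistogram:
    """Minimal cumulative histogram, similar in spirit to what a client library tracks."""

    def __init__(self, name, buckets=BUCKETS):
        self.name = name
        self.buckets = tuple(sorted(buckets))
        # One counter per finite bucket, plus a final slot for +Inf.
        self.counts = [0] * (len(self.buckets) + 1)
        self.total = 0.0

    def observe(self, seconds):
        # bisect_left returns the first bucket whose upper bound is >= seconds.
        idx = bisect.bisect_left(self.buckets, seconds)
        self.counts[idx] += 1
        self.total += seconds

    def render(self):
        """Return the histogram in Prometheus text exposition format."""
        lines = [f"# TYPE {self.name} histogram"]
        cumulative = 0
        for bound, count in zip(self.buckets, self.counts):
            cumulative += count  # bucket values are cumulative
            lines.append(f'{self.name}_bucket{{le="{bound}"}} {cumulative}')
        cumulative += self.counts[-1]
        lines.append(f'{self.name}_bucket{{le="+Inf"}} {cumulative}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {cumulative}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    hist = LatencyHistogram("http_request_duration_seconds")
    for latency in (0.05, 0.3, 2.0):
        hist.observe(latency)
    print(hist.render())
```

A Flask route at `/metrics` could return the output of `render()` as `text/plain` for Prometheus to scrape. That wiring is left out here to keep the sketch dependency-free.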

Architecture

  • Service: Python Flask API
  • Monitoring: Prometheus
  • Alerts: Burn-rate style latency alert
  • Infra: Docker + Docker Compose
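The arithmetic behind a burn-rate style latency alert can be sketched as follows. This is not the repository's alert rule, only an illustration of the idea: a request counts as "bad" when it exceeds a latency threshold, and the burn rate is the observed bad-request ratio divided by the error budget (1 minus the SLO target). The 0.5 s threshold, the 99% target, and the 14.4 paging threshold (a commonly cited value equal to spending 2% of a 30-day budget in one hour) are assumptions for the example.

```python
def count_slow(latencies, threshold_s):
    """Return (bad, total): requests slower than threshold_s, and all requests."""
    bad = sum(1 for latency in latencies if latency > threshold_s)
    return bad, len(latencies)


def burn_rate(bad_events, total_events, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Multi-window check: page only if both windows burn faster than the threshold.

    long_window and short_window are (bad, total) tuples. The short window keeps
    the alert from firing long after the problem has already stopped.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    slo = 0.99  # assumed target: 99% of requests faster than the threshold
    last_hour = count_slow([0.2] * 800 + [0.9] * 200, threshold_s=0.5)
    last_5_min = count_slow([0.2] * 70 + [0.9] * 30, threshold_s=0.5)
    print("page on-call:", should_page(last_hour, last_5_min, slo))
```

Requiring both windows to exceed the threshold is a design choice that trades a little detection delay for fewer pages on brief spikes.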

📚 Documentation Structure

  • Monitoring: monitoring/README.md
    Prometheus configuration, alert rules, and SLO-related setup.

  • Runbooks: runbooks/
    Incident response procedures (latency, mitigation, verification).

  • Incidents: incidents/
    Real postmortems documenting detection → mitigation → recovery.

  • Operations Docs: docs/
    On-call playbooks, change management, risk register, and operational metrics.

FleetOps: PE Incident Lab

Production Engineering case study: SLOs • burn-rate alerts • runbooks • capacity planning • postmortems • change safety

What this repo shows: how I would operate and harden a high-traffic production service with operational rigor.

Notes: Docs-first PE portfolio artifact: SLOs, runbooks, postmortems, change safety.


Why this maps strongly to Meta Production Engineering

Meta PEs debug live production issues, define SLOs, enforce paging discipline, write runbooks to reduce MTTR, and plan capacity for peak traffic.

This repo demonstrates:

  • Reliability & performance: SLOs, error budgets, burn-rate alerts
  • Incident response: triage → mitigation → validation → postmortem
  • Capacity planning: headroom, scaling triggers, burst modeling
  • Change safety: canary, rollback, risk controls
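As an illustration of the capacity-planning terms above (headroom, scaling triggers, burst modeling), here is a small arithmetic sketch. The function names and every number (target utilization, burst factor, trigger level) are hypothetical; the repository's own capacity model lives in its docs and is not reproduced here.

```python
import math


def headroom(peak_rps, capacity_rps):
    """Fraction of capacity still unused at peak load (0.4 means 40% spare)."""
    return 1.0 - peak_rps / capacity_rps


def instances_needed(peak_rps, per_instance_rps, target_utilization=0.6, burst_factor=1.5):
    """Instances required so a burst above peak still stays at the target utilization.

    burst_factor models short spikes over the observed peak; target_utilization
    keeps each instance below its measured limit.
    """
    burst_rps = peak_rps * burst_factor
    usable_rps_per_instance = per_instance_rps * target_utilization
    return math.ceil(burst_rps / usable_rps_per_instance)


def should_scale_out(current_rps, capacity_rps, trigger_utilization=0.7):
    """Scaling trigger: fire once utilization reaches the trigger level."""
    return current_rps / capacity_rps >= trigger_utilization


if __name__ == "__main__":
    print("headroom at peak:", headroom(600, 1000))
    print("instances needed:", instances_needed(1000, 100, 0.5, 2.0))
    print("scale out now:", should_scale_out(750, 1000))
```

Sizing against the burst rate rather than the average is what provides the headroom. The scaling trigger sits below full utilization so that new capacity can arrive before the service saturates.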

System modeled

Clients
↓
API Gateway (routing, rate limiting, canary controls)
↓
User Service -- Cache
↓
Feed Service -- Cache -- DB
↓
Observability (metrics + logs + traces)
↓
Alerts → On-call → Runbook → Mitigation → Postmortem → Prevention

Quick start (recruiter path)

  1. docs/02_slos_and_alerts.md
  2. docs/03_runbooks.md
  3. docs/06_postmortems/
  4. docs/08_change_management.md

Tip: Start with SLOs → Runbooks → Postmortems to see incident handling end-to-end.

Key artifacts

  • SLOs & alerts: docs/02_slos_and_alerts.md
  • Runbooks: docs/03_runbooks.md
  • Postmortems: docs/06_postmortems/
  • Change safety: docs/08_change_management.md

Docs index

  • docs/00_overview.md: scope + how to read
  • docs/01_architecture.md: system + failure domains
  • docs/02_slos_and_alerts.md: SLOs, error budgets, burn-rate alerts
  • docs/03_runbooks.md: triage → mitigate → validate
  • docs/04_capacity_planning.md: headroom, scaling triggers
  • docs/05_incident_simulations.md: game days + drills
  • docs/06_postmortems/: incident writeups + prevention
  • docs/07_oncall_playbook.md: on-call habits + escalation
  • docs/08_change_management.md: canary/rollback + guardrails
  • docs/09_risk_register.md: risks + mitigations
  • docs/10_operational_metrics.md: golden signals + dashboards
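The canary/rollback guardrails mentioned in the index can be illustrated with a simple promotion gate. This is a sketch under stated assumptions rather than the procedure in docs/08_change_management.md: the function name, the 2x error-rate ratio, the 0.1% absolute floor, and the minimum sample size are all hypothetical.

```python
def canary_verdict(baseline_errors, baseline_total, canary_errors, canary_total,
                   max_ratio=2.0, min_requests=100, floor=0.001):
    """Return 'wait', 'rollback', or 'promote' for a canary deployment.

    The canary is rolled back when its error rate exceeds max_ratio times the
    baseline rate. The absolute floor keeps a near-zero baseline from turning
    a single canary error into an automatic rollback.
    """
    if canary_total < min_requests:
        return "wait"  # not enough traffic to judge yet
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    allowed = max(baseline_rate * max_ratio, floor)
    return "rollback" if canary_rate > allowed else "promote"


if __name__ == "__main__":
    print(canary_verdict(10, 10_000, 1, 50))     # too few canary requests
    print(canary_verdict(10, 10_000, 5, 1_000))  # canary clearly worse
    print(canary_verdict(10, 10_000, 1, 1_000))  # canary within tolerance
```

A production gate would usually also compare latency and saturation, not only error rate. The three-way verdict keeps "not enough data" separate from a real pass.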