138 results for “topic:site-reliability-engineering”
A curated list of Site Reliability and Production Engineering resources.
A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
A Chaos Engineering Platform for Kubernetes.
A curated list of Chaos Engineering resources.
An easy-to-use and powerful chaos engineering experiment toolkit. (An easy-to-use, powerful chaos experiment injection tool open-sourced by Alibaba.)
Litmus helps SREs and developers practice chaos engineering in a Cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
Chaos testing, network emulation, and stress testing tool for containers
SRE Agent - CNCF Sandbox Project
Web UI for Jaeger
A curated list of Site Reliability and Production Engineering Tools
A collection of postmortem templates
This repository includes resources that are more than sufficient to prepare for a Google interview, whether you are applying for a software engineer or a site reliability engineer position
What to Read to Learn More About DevOps
Curated list of good SRE interview questions.
DevOps Happiness: 1-click or 1-prompt MCP. Deploy apps + infra + CI/CD on your cloud. Happy humans + reliable agents. 🚀
Open-source AI copilot that lets you chat with your observability data and code 🧙‍♂️
A chaos engineering platform for supporting the complete fault drill lifecycle.
A role-playing game for incident management training
Google Site Reliability Engineering book converted to audio
OpenShift Guide. Learn about the Red Hat OpenShift Container Platform, Data Science, Code Ready Containers, Podman, Buildah, and Kubernetes.
[WWW'25][ASE'24] RCAEval: A Benchmark for Root Cause Analysis.
The Skinny Distributed Lock Service
Welcome To The World of DevOps. An ongoing & curated collection of awesome software, libraries, learning tutorials, tools and resources and cool stuff about DevOps.
My opinionated list of products and tools used for high-scalability projects
Calculate how much downtime should be permitted in your Service Level Agreement or Objective
A collection of SRE tools
📚 Index for my study topics
This repository helps performance testers and engineers who want to dive into the DevOps and SRE world.
[FSE'24 - 🏆 Best Artifact Award] BARO: Robust Root Cause Analysis for Time Series Data.
A collection of templates ported from the SRE Workbook