26 results for “topic:site-reliability”
A curated list of Site Reliability and Production Engineering resources.
A collection of postmortem templates
A role-playing game for incident management training
Calculate how much downtime should be permitted in your Service Level Agreement or Objective
A collection of templates ported from the SRE Workbook
A list of common Disaster Recovery (DR) scenarios for software companies
Making service expectation declaration easier.
An ongoing, curated collection of awesome SRE software and tools, libraries and frameworks, engineering books and blogs, philosophical principles, technical guidelines, and practical tools for the field of Site Reliability Engineering (SRE)
SRE Exporter service enables exporting Edge Orchestrator's Service Level Indicators (SLIs) and its key runtime metrics to external systems.
Overall map of topics to cover for my “Engineering for Site Reliability” blog series.
Enterprise-grade Cloud Deployment Showcase featuring production-ready patterns. Demonstrates container orchestration with Docker & Nginx, Infrastructure as Code (IaC), multi-cloud strategies, and advanced observability using Jaeger tracing and structured logging. Fully automated via GitHub Actions CI/CD.
Smartshield Infrastructure Guide
Auto-detect Cloudflare network outages and toggle DNS proxy status to bypass failing infrastructure. Monitors domain health and switches between proxied/direct DNS modes.
Kubernetes-native health checker that automatically finds and verifies your latest pods are ready before considering deployments successful - perfect for preview environments
🤖 AI SRE - Intelligent Site Reliability Engineering Toolbox. Lean containerized CLI toolbox for AI-powered Kubernetes cluster remediation via MCP Server API. Perfect for N8N workflow integration.
A guide to installing and configuring Prometheus Blackbox Exporter. Includes configuration files and example usage scenarios for testing service reachability over HTTP, HTTPS, DNS, TCP, and ICMP.
Fully automated AWS cloud deployment using Terraform and Docker with CI-ready DevOps practices.
Terraform module for Cloudflare maintenance pages with IP allowlisting and scheduling
A .NET Standard library for working with the Uptime Robot API.
The ultimate checklist for Core Web Vitals, LCP, and CLS optimization. Masterclass tutorial: 👇
🌐 Discover top resources for Site Reliability Engineering, focusing on open-source tools and accessible knowledge to build scalable, reliable systems.
AI-powered security scanning platform with static analysis and autonomous red-teaming agents. Live at aedify.ai.
Gerd by Onyx is a lightweight chaos monkey implementation for k8s (Kubernetes)
Ghost CMS files for my personal domain
End-to-end predictive reliability platform with anomaly detection, auto-remediation, and comprehensive observability for microservices
🌩️ Auto-detect Cloudflare outages and toggle DNS proxy to ensure your domains remain accessible during service disruptions.