KubeAI-Ops
A production-ready, AI-powered Kubernetes incident response platform that works with any tech stack.
KubeAI-Ops automatically detects issues in your Kubernetes cluster, analyzes root causes using Claude AI, and takes remediation actions - all while you sleep.
Why KubeAI-Ops?
Traditional monitoring tells you something is wrong. KubeAI-Ops tells you why it's wrong and fixes it automatically.
Traditional Alerting:

Alert: "Pod CrashLoopBackOff"
  -> Page on-call engineer
  -> SSH into cluster
  -> Dig through logs
  -> Maybe find the issue
  -> Manual fix

KubeAI-Ops:

Alert received
  -> AI analyzes metrics + logs
  -> Root cause: "Memory leak in user-service causing OOM kills. Heap grew 300% in 2 hours."
  -> Auto-remediation: pod restarted, deployment scaled, team notified
  -> Team reviews summary, not firefighting
Works With Any Tech Stack
KubeAI-Ops doesn't care what language your services are written in. As long as they expose:
| Requirement | Purpose | Example |
|---|---|---|
| /metrics endpoint | Prometheus scraping | prom-client (Node), prometheus_client (Python), micrometer (Java) |
| /health endpoint | Liveness probe | Return {"status": "ok"} |
| JSON logs to stdout | Log aggregation | winston (Node), structlog (Python), logback (Java) |
That's it. Three things, and your service is fully integrated with AI-powered incident response.
Language Examples
Node.js / Express
// 1. Add metrics
const promClient = require('prom-client');
promClient.collectDefaultMetrics();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.send(await promClient.register.metrics());
});

// 2. Add health endpoint
app.get('/health', (req, res) => res.json({ status: 'ok' }));

// 3. Use JSON logging
const winston = require('winston');
const logger = winston.createLogger({
  format: winston.format.json(),
  transports: [new winston.transports.Console()]
});

Python / FastAPI
# 1. Add metrics
from prometheus_client import make_asgi_app
app.mount("/metrics", make_asgi_app())

# 2. Add health endpoint
@app.get("/health")
def health():
    return {"status": "ok"}

# 3. Use JSON logging
import structlog
structlog.configure(
    processors=[structlog.processors.JSONRenderer()]
)

Go
// 1. Add metrics
import "github.com/prometheus/client_golang/prometheus/promhttp"
http.Handle("/metrics", promhttp.Handler())

// 2. Add health endpoint
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
})

// 3. Use JSON logging
import "go.uber.org/zap"
logger, _ := zap.NewProduction()

Java / Spring Boot
# application.yml - that's literally it for Spring Boot
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
  endpoint:
    health:
      show-details: always

# Add to pom.xml:
# spring-boot-starter-actuator
# micrometer-registry-prometheus

Rust
// 1. Add metrics (using actix-web-prom)
use actix_web_prom::PrometheusMetrics;
let prometheus = PrometheusMetrics::new("api", Some("/metrics"), None);
App::new().wrap(prometheus)

// 2. Add health endpoint
#[get("/health")]
async fn health() -> impl Responder {
    HttpResponse::Ok().json(json!({"status": "ok"}))
}

// 3. Use JSON logging (tracing + tracing-subscriber)
tracing_subscriber::fmt().json().init();

Quick Start
Option 1: Local Development (5 minutes)
# Clone the repo
git clone https://github.com/sharankumarreddyk/kubeai-ops.git
cd kubeai-ops
# Start everything locally
./local-dev/setup.sh
# That's it! Access:
# - Your apps: http://localhost:8080
# - Grafana: http://localhost:3000 (admin/admin)
# - Dashboard: http://localhost:5173

Option 2: Deploy to AWS EKS
# 1. Configure AWS credentials
aws configure
# 2. Deploy infrastructure
cd terraform/environments/dev
terragrunt apply
# 3. Deploy platform
kubectl apply -k kubernetes/overlays/dev
kubectl apply -k argocd/install

Option 3: Add to Existing Cluster
# Install just the AI agent and observability stack
helm install kubeai-ops ./kubernetes/helm-charts/app-chart \
--set aiAgent.enabled=true \
--set observability.enabled=true \
--set sampleApps.enabled=false

Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ YOUR APPLICATIONS │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Node.js │ │ Python │ │ Go │ │ Java │ ... │
│ │ Service │ │ Service │ │ Service │ │ Service │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ /metrics │ /metrics │ /metrics │ /metrics │
│ │ /health │ /health │ /health │ /health │
│ │ JSON logs │ JSON logs │ JSON logs │ JSON logs │
└─────────┼─────────────┼─────────────┼─────────────┼────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ KUBEAI-OPS PLATFORM │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ PROMETHEUS │ │ LOKI │ │ ALERTMANAGER │ │
│ │ (metrics) │ │ (logs) │ │ (alerts) │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └──────────────────────┼──────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ AI INCIDENT AGENT │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ AI ENGINES: Claude │ OpenAI │ Ollama │ Bedrock │ Mock │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ 1. Receive alert (webhook) 5. Learn from resolution │ │
│ │ 2. Correlate metrics + logs 6. Notify via ChatOps │ │
│ │ 3. AI root cause analysis 7. Create tickets (Jira/GH) │ │
│ │ 4. Execute remediation 8. Escalate (PD/Opsgenie) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────────┐ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐│
│ │REMEDIATE │ │DASHBOARD │ │ CHATOPS │ │ TICKETS │ │ESCALATION││
│ │ Restart │ │ Timeline │ │ Slack │ │ Jira │ │ PagerDuty││
│ │ Scale │ │ Analytics│ │ Discord │ │ GitHub │ │ Opsgenie ││
│ │ Rollback │ │ Metrics │ │ Teams │ │ Issues │ │ ││
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘│
│ │
│ ┌────────────────────────────────────────────────────────────────┐│
│ │ CLI: kubeai status │ diagnose │ incidents │ runbooks ││
│ └────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────┘
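The agent's intake (step 1 in the diagram) receives standard Alertmanager webhook payloads. As a hedged sketch of that step — the field mapping below is illustrative, not the repo's actual code — parsing one payload into incident records looks roughly like this:

```python
import json

def parse_alertmanager_webhook(body: str) -> list[dict]:
    """Extract the fields an incident agent needs from an Alertmanager webhook payload."""
    payload = json.loads(body)
    incidents = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        incidents.append({
            "alertname": labels.get("alertname", "unknown"),
            "namespace": labels.get("namespace", "default"),
            "pod": labels.get("pod"),
            "severity": labels.get("severity", "warning"),
            "summary": alert.get("annotations", {}).get("summary", ""),
            "firing": alert.get("status") == "firing",
        })
    return incidents

# Example payload in Alertmanager's webhook format (version 4)
body = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "PodCrashLooping", "namespace": "prod",
                   "pod": "user-service-abc", "severity": "critical"},
        "annotations": {"summary": "Pod is crash looping"},
    }],
})
print(parse_alertmanager_webhook(body))
```

From here the agent can correlate the named pod's metrics and logs (step 2) before handing context to an AI engine.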
What's Included
Core Platform
| Component | Description |
|---|---|
| AI Incident Agent | Multi-engine AI (Claude, OpenAI, Ollama, Bedrock) for root cause analysis and auto-remediation |
| Incident Dashboard | Real-time SvelteKit UI with timeline, analytics, and remediation controls |
| CLI Tool | Full-featured CLI with diagnosis, runbooks, incident management, and shell mode |
| Observability Stack | Pre-configured Prometheus, Grafana dashboards, Loki, AlertManager |
AI Engine Support
| Engine | Description |
|---|---|
| Claude (Anthropic) | Production-ready with Claude 3.5 Sonnet/Opus support |
| OpenAI | GPT-4, GPT-4 Turbo, GPT-3.5 Turbo support |
| Ollama | Local/self-hosted LLMs (Llama 3, Mistral, CodeLlama) |
| AWS Bedrock | Managed AI with Claude, Titan, Llama models |
| Mock Engine | Testing without API costs, with configurable scenarios |
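These engines are interchangeable behind one interface (the project structure below shows an abstract base in agent/ai_engine/base.py). A minimal sketch of how such an abstraction could look — class and method names here are illustrative assumptions, not the repo's actual API:

```python
from abc import ABC, abstractmethod

class AIEngine(ABC):
    """Illustrative base: every engine turns alert context into an analysis dict."""
    @abstractmethod
    def analyze(self, alert: dict, logs: list[str]) -> dict: ...

class MockEngine(AIEngine):
    """Canned responses keyed on trigger strings, so tests never call a paid API."""
    def __init__(self, scenarios: dict[str, dict]):
        self.scenarios = scenarios

    def analyze(self, alert: dict, logs: list[str]) -> dict:
        # Match a scenario against the alert name or any log line.
        for trigger, response in self.scenarios.items():
            if trigger in alert.get("alertname", "") or any(trigger in line for line in logs):
                return response
        return {"root_cause": "unknown", "action": "escalate"}

engine = MockEngine({"OOMKilled": {"root_cause": "Memory leak in application",
                                   "action": "restart_pod"}})
print(engine.analyze({"alertname": "PodOOMKilled"}, []))
```

Swapping `backend: "mock"` for `"claude"` in the config would then only change which subclass gets instantiated.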
Integrations
| Category | Integrations |
|---|---|
| ChatOps | Slack, Discord, Microsoft Teams - interactive incident response |
| Alerting | PagerDuty, Opsgenie - escalation and on-call management |
| Ticketing | Jira, GitHub Issues - automatic ticket creation |
| Monitoring | Datadog - metrics forwarding and enrichment |
Security & Authentication
| Component | Description |
|---|---|
| RBAC | Role-based access control (Admin, Operator, Viewer, Service) |
| OIDC/SSO | Enterprise SSO with any OIDC provider |
| API Keys | Service-to-service authentication |
| Privacy Controls | PII redaction, data retention policies, audit logging |
| OPA/Gatekeeper | Policy-as-code for Kubernetes admission control |
| Network Policies | Default-deny with service-specific rules |
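As one concrete example, the PII redaction listed above typically works by regex substitution before incident text ever reaches an AI engine or a ticket. A minimal sketch, assuming simple email/IP patterns — these are illustrative, not the repo's actual pii_redactor.py:

```python
import re

# Illustrative patterns; a production redactor would cover more PII classes
# (phone numbers, tokens, credit cards) and be configurable.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text: str) -> str:
    """Replace PII matches with typed placeholders before text leaves the cluster."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{name}]", text)
    return text

print(redact("user alice@example.com hit 10.0.3.7 with a 500"))
```

Typed placeholders (rather than blanking) keep redacted text useful for AI analysis — the model still sees that an email or IP was involved.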
Machine Learning
| Feature | Description |
|---|---|
| Incident Learning | Learns from past incidents to improve recommendations |
| Pattern Recognition | Identifies recurring issues and suggests preventive actions |
| Similarity Matching | Finds related past incidents for faster resolution |
| Feedback Loop | Operator feedback improves AI accuracy over time |
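Similarity matching can be as simple as fuzzy-matching a new incident's summary against history. A hedged sketch using stdlib difflib — the repo's incident_learner.py may well use something more sophisticated (embeddings, TF-IDF):

```python
from difflib import SequenceMatcher

def most_similar(summary: str, history: list[dict], threshold: float = 0.4):
    """Return the past incident whose summary best matches, if above threshold."""
    best, best_score = None, threshold
    for incident in history:
        score = SequenceMatcher(None, summary.lower(),
                                incident["summary"].lower()).ratio()
        if score > best_score:
            best, best_score = incident, score
    return best

history = [
    {"summary": "OOM kill in user-service after heap growth", "fix": "raise memory limit"},
    {"summary": "DNS resolution failures in kube-dns", "fix": "restart coredns"},
]
match = most_similar("user-service OOM killed, heap grew steadily", history)
print(match["fix"])
```

Surfacing the matched incident's fix alongside the AI's analysis is what makes "faster resolution" concrete: the operator sees what worked last time.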
Infrastructure (Optional)
| Component | Description |
|---|---|
| Terraform Modules | Production-ready AWS infrastructure (VPC, EKS, RDS, S3) |
| Kubernetes Manifests | Kustomize-based deployments for local/dev/staging/prod |
| ArgoCD Setup | GitOps-ready ApplicationSets and projects |
| CI/CD Pipelines | GitHub Actions for testing, building, deploying |
Sample Applications
| Service | Purpose |
|---|---|
| API Gateway | FastAPI service demonstrating auth, rate limiting, circuit breaker |
| Order Service | CRUD service with PostgreSQL, events, proper error handling |
| Notification Service | Multi-channel notifications (email, SMS, webhook) |
Configuration
AI Agent Configuration
# ai-incident-agent/config/agent-config.yaml
ai_engine:
  backend: "claude"  # claude, openai, ollama, bedrock, mock
  claude:
    model: "claude-sonnet-4-20250514"
    api_key: "${ANTHROPIC_API_KEY}"
  openai:
    model: "gpt-4-turbo"
    api_key: "${OPENAI_API_KEY}"
  ollama:
    model: "llama3:8b"
    base_url: "http://ollama:11434"
  bedrock:
    model_id: "anthropic.claude-3-sonnet-20240229-v1:0"
    region: "us-east-1"
  mock:  # For testing without API costs
    response_delay_ms: 500
    mock_scenarios:
      - trigger: "OOMKilled"
        root_cause: "Memory leak in application"
        action: "restart_pod"

remediation:
  enabled: true
  auto_approve:
    - restart_pod      # Low risk - auto-approve
    - scale_replicas
  require_approval:
    - rollback_deployment  # Higher risk - require human approval
    - delete_pvc

# ChatOps integrations
chatops:
  slack:
    enabled: true
    webhook_url: "${SLACK_WEBHOOK_URL}"
    bot_token: "${SLACK_BOT_TOKEN}"
    channel: "#incidents"
    interactive: true  # Enable slash commands and buttons
  discord:
    enabled: false
    webhook_url: "${DISCORD_WEBHOOK_URL}"
    bot_token: "${DISCORD_BOT_TOKEN}"
  teams:
    enabled: false
    webhook_url: "${TEAMS_WEBHOOK_URL}"

# External integrations
integrations:
  pagerduty:
    enabled: true
    api_key: "${PAGERDUTY_API_KEY}"
    service_id: "${PAGERDUTY_SERVICE_ID}"
  opsgenie:
    enabled: false
    api_key: "${OPSGENIE_API_KEY}"
  jira:
    enabled: true
    url: "https://your-org.atlassian.net"
    email: "${JIRA_EMAIL}"
    api_token: "${JIRA_API_TOKEN}"
    project_key: "OPS"
  github:
    enabled: false
    token: "${GITHUB_TOKEN}"
    repo: "your-org/incidents"
  datadog:
    enabled: false
    api_key: "${DATADOG_API_KEY}"

# Security & Authentication
auth:
  enabled: true
  provider: "oidc"  # oidc, api_key, both
  oidc:
    issuer_url: "https://your-idp.com"
    client_id: "${OIDC_CLIENT_ID}"
    client_secret: "${OIDC_CLIENT_SECRET}"

# Privacy & Compliance
privacy:
  pii_redaction: true
  retention_days: 90
  audit_logging: true

Adding Your Services
- Add Prometheus annotations to your deployment:
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

- Create alert rules (optional):
# observability/prometheus/alerting-rules/my-service-alerts.yaml
groups:
  - name: my-service
    rules:
      - alert: MyServiceHighErrorRate
        expr: rate(http_requests_total{app="my-service", status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical

- Deploy and watch the magic happen
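The http_requests_total counter that the example rule queries follows standard Prometheus naming. In a Python service it could be emitted like this — note the label names here (app, status) are an assumption, and in practice the app label often comes from Kubernetes relabeling rather than the service itself:

```python
from prometheus_client import Counter, generate_latest

# prometheus_client appends the "_total" suffix for counters, so this is
# scraped as http_requests_total{app=...,status=...}.
REQUESTS = Counter("http_requests", "HTTP requests served", ["app", "status"])

def handle_request(status_code: int) -> None:
    # Increment once per request; the alert rule filters on status=~"5..".
    REQUESTS.labels(app="my-service", status=str(status_code)).inc()

handle_request(200)
handle_request(500)
print(b"http_requests_total" in generate_latest())
```

Whatever metric names and labels your services emit, just keep the alert `expr` consistent with them.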
Project Structure
kubeai-ops/
├── ai-incident-agent/ # The AI-powered incident response agent
│ ├── agent/ # Core agent logic
│ │ ├── ai_engine/ # Multi-engine AI support
│ │ │ ├── base.py # Abstract base class
│ │ │ ├── claude_engine.py # Anthropic Claude
│ │ │ ├── openai_engine.py # OpenAI GPT models
│ │ │ ├── ollama_engine.py # Local Ollama models
│ │ │ ├── bedrock_engine.py # AWS Bedrock
│ │ │ └── mock_engine.py # Testing engine
│ │ ├── remediation/ # Auto-remediation system
│ │ ├── chatops/ # ChatOps integrations
│ │ │ ├── slack.py # Slack with interactive commands
│ │ │ ├── discord.py # Discord integration
│ │ │ └── teams.py # Microsoft Teams
│ │ ├── integrations/ # External integrations
│ │ │ ├── pagerduty.py # PagerDuty escalation
│ │ │ ├── opsgenie.py # Opsgenie alerts
│ │ │ ├── jira.py # Jira ticket creation
│ │ │ ├── github.py # GitHub Issues
│ │ │ └── datadog.py # Datadog forwarding
│ │ ├── auth/ # Authentication & authorization
│ │ │ ├── rbac.py # Role-based access control
│ │ │ └── oidc.py # OIDC/SSO integration
│ │ ├── learning/ # ML incident learning
│ │ │ └── incident_learner.py
│ │ ├── privacy/ # Privacy & compliance
│ │ │ └── pii_redactor.py
│ │ └── database/ # SQLAlchemy models
│ ├── config/ # Agent configuration
│ └── tests/ # 428+ passing tests
│
├── incident-dashboard/ # SvelteKit real-time dashboard
│ ├── src/
│ │ ├── routes/ # Pages and API routes
│ │ ├── lib/
│ │ │ ├── components/ # Reusable UI components
│ │ │ ├── stores/ # Svelte stores
│ │ │ └── api/ # API client
│ │ └── types/ # TypeScript definitions
│ └── tests/ # 287+ passing tests
│
├── cli/ # KubeAI CLI tool
│ ├── kubeai/
│ │ ├── commands/ # CLI commands
│ │ │ ├── status.py # Service status
│ │ │ ├── diagnose.py # AI diagnosis
│ │ │ ├── incidents.py # Incident management
│ │ │ ├── runbooks.py # Runbook execution
│ │ │ └── config.py # Configuration
│ │ └── api.py # API client
│ └── tests/ # 261+ passing tests
│
├── observability/ # Pre-configured monitoring stack
│ ├── prometheus/ # Metrics + alert rules
│ ├── grafana/ # Dashboards
│ └── loki/ # Log aggregation
│
├── kubernetes/ # Kubernetes manifests
│ ├── base/ # Base resources
│ └── overlays/ # Environment configs
│ ├── local/
│ ├── dev/
│ ├── staging/
│ └── prod/
│
├── terraform/ # AWS infrastructure (optional)
│ ├── modules/ # Reusable modules
│ └── environments/ # Per-environment configs
│
├── docker-compose-demo/ # Local demo environment
│ ├── docker-compose.yml # Full stack with all features
│ ├── demo-scenarios.sh # Interactive demo script
│ └── grafana/ # Pre-configured dashboards
│
├── argocd/ # GitOps setup
├── ci-cd/ # GitHub Actions + policies
├── security/ # Network policies, RBAC
├── local-dev/ # One-command local setup
├── services/ # Sample applications
│
└── docs/ # Documentation
├── architecture.md # System architecture
├── getting-started.md # Setup guide
├── adr/ # Architecture Decision Records
└── runbooks/ # Operational runbooks
Roadmap
Completed
- Core AI incident agent
- Multi-engine AI support (Claude, OpenAI, Ollama, Bedrock)
- Prometheus/Grafana/Loki integration
- ChatOps (Slack, Discord, Microsoft Teams)
- Auto-remediation (restart, scale, rollback)
- Incident dashboard (SvelteKit)
- CLI tool with diagnosis and runbooks
- Mock AI mode for testing
- PagerDuty integration
- Opsgenie integration
- Jira/GitHub issue creation
- Datadog integration
- RBAC & OIDC authentication
- Machine learning incident learner
- Privacy controls & PII redaction
- Comprehensive test suite (976 tests)
Planned
- Multi-cluster support
- Cost analysis integration
- Runbook auto-generation
- Custom remediation plugins
- Mobile app
- Advanced anomaly detection
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Development Setup
# Clone your fork
git clone https://github.com/sharankumarreddyk/kubeai-ops.git
cd kubeai-ops
# Start local environment
./local-dev/setup.sh
# Run tests
cd ai-incident-agent && pytest
cd services/api-gateway && pytest
cd incident-dashboard && npm test
# Make changes, then submit a PR

Areas We Need Help
- Multi-cluster federation support
- Additional AI engine integrations
- Language-specific integration guides (Ruby, .NET)
- Mobile app development
- Performance optimizations for large-scale deployments
- Documentation translations
Community
- GitHub Issues: Bug reports and feature requests
- Discussions: Questions and ideas
- Discord: Join our server (coming soon)
License
MIT License - see LICENSE for details.
Use it, modify it, sell it, whatever. Just don't blame us if your AI agent becomes sentient and refuses to restart pods on Fridays.
Acknowledgments
- Anthropic for Claude AI
- Prometheus community
- ArgoCD project
- All our contributors
Reduce alert fatigue with AI-powered incident response.