KubeAI-Ops
A production-ready, AI-powered Kubernetes incident response platform that works with any tech stack.
KubeAI-Ops automatically detects issues in your Kubernetes cluster, analyzes root causes using Claude AI, and takes remediation actions - all while you sleep.
Why KubeAI-Ops?
Traditional monitoring tells you something is wrong. KubeAI-Ops tells you why it's wrong and fixes it automatically.
Traditional Alerting:

Alert: "Pod CrashLoopBackOff"
  -> Page on-call engineer
  -> SSH into cluster
  -> Dig through logs
  -> Maybe find the issue
  -> Manual fix

KubeAI-Ops:

Alert received
  -> AI analyzes metrics + logs
  -> Root cause: "Memory leak in user-service causing OOM kills. Heap grew 300% in 2 hours."
  -> Auto-remediation: pod restarted, deployment scaled, team notified
  -> Team reviews summary, not firefighting
Works With Any Tech Stack
KubeAI-Ops doesn't care what language your services are written in. As long as they expose:
| Requirement | Purpose | Example |
|---|---|---|
| /metrics endpoint | Prometheus scraping | prom-client (Node), prometheus_client (Python), micrometer (Java) |
| /health endpoint | Liveness probe | Return {"status": "ok"} |
| JSON logs to stdout | Log aggregation | winston (Node), structlog (Python), logback (Java) |
That's it. Three things, and your service is fully integrated with AI-powered incident response.
Language Examples
Node.js / Express
// 1. Add metrics
const promClient = require('prom-client');
promClient.collectDefaultMetrics();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.send(await promClient.register.metrics());
});

// 2. Add health endpoint
app.get('/health', (req, res) => res.json({ status: 'ok' }));

// 3. Use JSON logging
const winston = require('winston');
const logger = winston.createLogger({
  format: winston.format.json(),
  transports: [new winston.transports.Console()]
});

Python / FastAPI
# 1. Add metrics
from prometheus_client import make_asgi_app
app.mount("/metrics", make_asgi_app())

# 2. Add health endpoint
@app.get("/health")
def health():
    return {"status": "ok"}

# 3. Use JSON logging
import structlog
structlog.configure(
    processors=[structlog.processors.JSONRenderer()]
)

Go
// 1. Add metrics
import "github.com/prometheus/client_golang/prometheus/promhttp"
http.Handle("/metrics", promhttp.Handler())

// 2. Add health endpoint
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
})

// 3. Use JSON logging
import "go.uber.org/zap"
logger, _ := zap.NewProduction()

Java / Spring Boot
# application.yml - that's literally it for Spring Boot
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
  endpoint:
    health:
      show-details: always

# Add to pom.xml:
# spring-boot-starter-actuator
# micrometer-registry-prometheus

Rust
// 1. Add metrics (using actix-web-prom)
use actix_web_prom::PrometheusMetrics;
let prometheus = PrometheusMetrics::new("api", Some("/metrics"), None);
App::new().wrap(prometheus)

// 2. Add health endpoint
#[get("/health")]
async fn health() -> impl Responder {
    HttpResponse::Ok().json(json!({"status": "ok"}))
}

// 3. Use JSON logging (tracing + tracing-subscriber)
tracing_subscriber::fmt().json().init();

Quick Start
Option 1: Local Development (5 minutes)
# Clone the repo
git clone https://github.com/sharankumarreddyk/kubeai-ops.git
cd kubeai-ops
# Start everything locally
./local-dev/setup.sh
# That's it! Access:
# - Your apps: http://localhost:8080
# - Grafana: http://localhost:3000 (admin/admin)
# - Dashboard: http://localhost:5173

Option 2: Deploy to AWS EKS
# 1. Configure AWS credentials
aws configure
# 2. Deploy infrastructure
cd terraform/environments/dev
terragrunt apply
# 3. Deploy platform
kubectl apply -k kubernetes/overlays/dev
kubectl apply -k argocd/install

Option 3: Add to Existing Cluster
# Install just the AI agent and observability stack
helm install kubeai-ops ./kubernetes/helm-charts/app-chart \
--set aiAgent.enabled=true \
--set observability.enabled=true \
--set sampleApps.enabled=false

Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ YOUR APPLICATIONS │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Node.js │ │ Python │ │ Go │ │ Java │ ... │
│ │ Service │ │ Service │ │ Service │ │ Service │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ /metrics │ /metrics │ /metrics │ /metrics │
│ │ /health │ /health │ /health │ /health │
│ │ JSON logs │ JSON logs │ JSON logs │ JSON logs │
└─────────┼─────────────┼─────────────┼─────────────┼────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ KUBEAI-OPS PLATFORM │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ PROMETHEUS │ │ LOKI │ │ ALERTMANAGER │ │
│ │ (metrics) │ │ (logs) │ │ (alerts) │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └──────────────────────┼──────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ AI INCIDENT AGENT │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ AI ENGINES: Claude │ OpenAI │ Ollama │ Bedrock │ Mock │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ 1. Receive alert (webhook) 5. Learn from resolution │ │
│ │ 2. Correlate metrics + logs 6. Notify via ChatOps │ │
│ │ 3. AI root cause analysis 7. Create tickets (Jira/GH) │ │
│ │ 4. Execute remediation 8. Escalate (PD/Opsgenie) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────┼────────────────────────────────┐ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐│
│ │REMEDIATE │ │DASHBOARD │ │ CHATOPS │ │ TICKETS │ │ESCALATION││
│ │ Restart │ │ Timeline │ │ Slack │ │ Jira │ │ PagerDuty││
│ │ Scale │ │ Analytics│ │ Discord │ │ GitHub │ │ Opsgenie ││
│ │ Rollback │ │ Metrics │ │ Teams │ │ Issues │ │ ││
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘│
│ │
│ ┌────────────────────────────────────────────────────────────────┐│
│ │ CLI: kubeai status │ diagnose │ incidents │ runbooks ││
│ └────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────┘
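The agent's intake (step 1 in the diagram) receives standard Alertmanager webhook payloads. As a hedged sketch of that step — the field mapping below is illustrative, not the repo's actual code — parsing one payload into incident records looks roughly like this:

```python
import json

def parse_alertmanager_webhook(body: str) -> list[dict]:
    """Extract the fields an incident agent needs from an Alertmanager webhook payload."""
    payload = json.loads(body)
    incidents = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        incidents.append({
            "alertname": labels.get("alertname", "unknown"),
            "namespace": labels.get("namespace", "default"),
            "pod": labels.get("pod"),
            "severity": labels.get("severity", "warning"),
            "summary": alert.get("annotations", {}).get("summary", ""),
            "firing": alert.get("status") == "firing",
        })
    return incidents

# Example payload in Alertmanager's webhook format (version 4)
body = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "PodCrashLooping", "namespace": "prod",
                   "pod": "user-service-abc", "severity": "critical"},
        "annotations": {"summary": "Pod is crash looping"},
    }],
})
print(parse_alertmanager_webhook(body))
```

From here the agent can correlate the named pod's metrics and logs (step 2) before handing context to an AI engine.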
What's Included
Core Platform
| Component | Description |
|---|---|
| AI Incident Agent | Multi-engine AI (Claude, OpenAI, Ollama, Bedrock) for root cause analysis and auto-remediation |
| Incident Dashboard | Real-time SvelteKit UI with timeline, analytics, and remediation controls |
| CLI Tool | Full-featured CLI with diagnosis, runbooks, incident management, and shell mode |
| Observability Stack | Pre-configured Prometheus, Grafana dashboards, Loki, AlertManager |
AI Engine Support
| Engine | Description |
|---|---|
| Claude (Anthropic) | Production-ready with Claude 3.5 Sonnet/Opus support |
| OpenAI | GPT-4, GPT-4 Turbo, GPT-3.5 Turbo support |
| Ollama | Local/self-hosted LLMs (Llama 3, Mistral, CodeLlama) |
| AWS Bedrock | Managed AI with Claude, Titan, Llama models |
| Mock Engine | Testing without API costs, with configurable scenarios |
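These engines are interchangeable behind one interface (the project structure below shows an abstract base in agent/ai_engine/base.py). A minimal sketch of how such an abstraction could look — class and method names here are illustrative assumptions, not the repo's actual API:

```python
from abc import ABC, abstractmethod

class AIEngine(ABC):
    """Illustrative base: every engine turns alert context into an analysis dict."""
    @abstractmethod
    def analyze(self, alert: dict, logs: list[str]) -> dict: ...

class MockEngine(AIEngine):
    """Canned responses keyed on trigger strings, so tests never call a paid API."""
    def __init__(self, scenarios: dict[str, dict]):
        self.scenarios = scenarios

    def analyze(self, alert: dict, logs: list[str]) -> dict:
        # Match a scenario against the alert name or any log line.
        for trigger, response in self.scenarios.items():
            if trigger in alert.get("alertname", "") or any(trigger in line for line in logs):
                return response
        return {"root_cause": "unknown", "action": "escalate"}

engine = MockEngine({"OOMKilled": {"root_cause": "Memory leak in application",
                                   "action": "restart_pod"}})
print(engine.analyze({"alertname": "PodOOMKilled"}, []))
```

Swapping `backend: "mock"` for `"claude"` in the config would then only change which subclass gets instantiated.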
Integrations
| Category | Integrations |
|---|---|
| ChatOps | Slack, Discord, Microsoft Teams - interactive incident response |
| Alerting | PagerDuty, Opsgenie - escalation and on-call management |
| Ticketing | Jira, GitHub Issues - automatic ticket creation |
| Monitoring | Datadog - metrics forwarding and enrichment |
Security & Authentication
| Component | Description |
|---|---|
| RBAC | Role-based access control (Admin, Operator, Viewer, Service) |
| OIDC/SSO | Enterprise SSO with any OIDC provider |
| API Keys | Service-to-service authentication |
| Privacy Controls | PII redaction, data retention policies, audit logging |
| OPA/Gatekeeper | Policy-as-code for Kubernetes admission control |
| Network Policies | Default-deny with service-specific rules |
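As one concrete example, the PII redaction listed above typically works by regex substitution before incident text ever reaches an AI engine or a ticket. A minimal sketch, assuming simple email/IP patterns — these are illustrative, not the repo's actual pii_redactor.py:

```python
import re

# Illustrative patterns; a production redactor would cover more PII classes
# (phone numbers, tokens, credit cards) and be configurable.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text: str) -> str:
    """Replace PII matches with typed placeholders before text leaves the cluster."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{name}]", text)
    return text

print(redact("user alice@example.com hit 10.0.3.7 with a 500"))
```

Typed placeholders (rather than blanking) keep redacted text useful for AI analysis — the model still sees that an email or IP was involved.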
Machine Learning
| Feature | Description |
|---|---|
| Incident Learning | Learns from past incidents to improve recommendations |
| Pattern Recognition | Identifies recurring issues and suggests preventive actions |
| Similarity Matching | Finds related past incidents for faster resolution |
| Feedback Loop | Operator feedback improves AI accuracy over time |
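Similarity matching can be as simple as fuzzy-matching a new incident's summary against history. A hedged sketch using stdlib difflib — the repo's incident_learner.py may well use something more sophisticated (embeddings, TF-IDF):

```python
from difflib import SequenceMatcher

def most_similar(summary: str, history: list[dict], threshold: float = 0.4):
    """Return the past incident whose summary best matches, if above threshold."""
    best, best_score = None, threshold
    for incident in history:
        score = SequenceMatcher(None, summary.lower(),
                                incident["summary"].lower()).ratio()
        if score > best_score:
            best, best_score = incident, score
    return best

history = [
    {"summary": "OOM kill in user-service after heap growth", "fix": "raise memory limit"},
    {"summary": "DNS resolution failures in kube-dns", "fix": "restart coredns"},
]
match = most_similar("user-service OOM killed, heap grew steadily", history)
print(match["fix"])
```

Surfacing the matched incident's fix alongside the AI's analysis is what makes "faster resolution" concrete: the operator sees what worked last time.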
Infrastructure (Optional)
| Component | Description |
|---|---|
| Terraform Modules | Production-ready AWS infrastructure (VPC, EKS, RDS, S3) |
| Kubernetes Manifests | Kustomize-based deployments for local/dev/staging/prod |
| ArgoCD Setup | GitOps-ready ApplicationSets and projects |
| CI/CD Pipelines | GitHub Actions for testing, building, deploying |
Sample Applications
| Service | Purpose |
|---|---|
| API Gateway | FastAPI service demonstrating auth, rate limiting, circuit breaker |
| Order Service | CRUD service with PostgreSQL, events, proper error handling |
| Notification Service | Multi-channel notifications (email, SMS, webhook) |
Configuration
AI Agent Configuration
# ai-incident-agent/config/agent-config.yaml
ai_engine:
  backend: "claude"  # claude, openai, ollama, bedrock, mock
  claude:
    model: "claude-sonnet-4-20250514"
    api_key: "${ANTHROPIC_API_KEY}"
  openai:
    model: "gpt-4-turbo"
    api_key: "${OPENAI_API_KEY}"
  ollama:
    model: "llama3:8b"
    base_url: "http://ollama:11434"
  bedrock:
    model_id: "anthropic.claude-3-sonnet-20240229-v1:0"
    region: "us-east-1"
  mock:  # For testing without API costs
    response_delay_ms: 500
    mock_scenarios:
      - trigger: "OOMKilled"
        root_cause: "Memory leak in application"
        action: "restart_pod"

remediation:
  enabled: true
  auto_approve:
    - restart_pod      # Low risk - auto-approve
    - scale_replicas
  require_approval:
    - rollback_deployment  # Higher risk - require human approval
    - delete_pvc

# ChatOps integrations
chatops:
  slack:
    enabled: true
    webhook_url: "${SLACK_WEBHOOK_URL}"
    bot_token: "${SLACK_BOT_TOKEN}"
    channel: "#incidents"
    interactive: true  # Enable slash commands and buttons
  discord:
    enabled: false
    webhook_url: "${DISCORD_WEBHOOK_URL}"
    bot_token: "${DISCORD_BOT_TOKEN}"
  teams:
    enabled: false
    webhook_url: "${TEAMS_WEBHOOK_URL}"

# External integrations
integrations:
  pagerduty:
    enabled: true
    api_key: "${PAGERDUTY_API_KEY}"
    service_id: "${PAGERDUTY_SERVICE_ID}"
  opsgenie:
    enabled: false
    api_key: "${OPSGENIE_API_KEY}"
  jira:
    enabled: true
    url: "https://your-org.atlassian.net"
    email: "${JIRA_EMAIL}"
    api_token: "${JIRA_API_TOKEN}"
    project_key: "OPS"
  github:
    enabled: false
    token: "${GITHUB_TOKEN}"
    repo: "your-org/incidents"
  datadog:
    enabled: false
    api_key: "${DATADOG_API_KEY}"

# Security & Authentication
auth:
  enabled: true
  provider: "oidc"  # oidc, api_key, both
  oidc:
    issuer_url: "https://your-idp.com"
    client_id: "${OIDC_CLIENT_ID}"
    client_secret: "${OIDC_CLIENT_SECRET}"

# Privacy & Compliance
privacy:
  pii_redaction: true
  retention_days: 90
  audit_logging: true

Adding Your Services
- Add Prometheus annotations to your deployment:
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"

- Create alert rules (optional):
# observability/prometheus/alerting-rules/my-service-alerts.yaml
groups:
  - name: my-service
    rules:
      - alert: MyServiceHighErrorRate
        expr: rate(http_requests_total{app="my-service", status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical

- Deploy and watch the magic happen
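The http_requests_total counter that the example rule queries follows standard Prometheus naming. In a Python service it could be emitted like this — note the label names here (app, status) are an assumption, and in practice the app label often comes from Kubernetes relabeling rather than the service itself:

```python
from prometheus_client import Counter, generate_latest

# prometheus_client appends the "_total" suffix for counters, so this is
# scraped as http_requests_total{app=...,status=...}.
REQUESTS = Counter("http_requests", "HTTP requests served", ["app", "status"])

def handle_request(status_code: int) -> None:
    # Increment once per request; the alert rule filters on status=~"5..".
    REQUESTS.labels(app="my-service", status=str(status_code)).inc()

handle_request(200)
handle_request(500)
print(b"http_requests_total" in generate_latest())
```

Whatever metric names and labels your services emit, just keep the alert `expr` consistent with them.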
Project Structure
kubeai-ops/
├── ai-incident-agent/ # The AI-powered incident response agent
│ ├── agent/ # Core agent logic
│ │ ├── ai_engine/ # Multi-engine AI support
│ │ │ ├── base.py # Abstract base class
│ │ │ ├── claude_engine.py # Anthropic Claude
│ │ │ ├── openai_engine.py # OpenAI GPT models
│ │ │ ├── ollama_engine.py # Local Ollama models
│ │ │ ├── bedrock_engine.py # AWS Bedrock
│ │ │ └── mock_engine.py # Testing engine
│ │ ├── remediation/ # Auto-remediation system
│ │ ├── chatops/ # ChatOps integrations
│ │ │ ├── slack.py # Slack with interactive commands
│ │ │ ├── discord.py # Discord integration
│ │ │ └── teams.py # Microsoft Teams
│ │ ├── integrations/ # External integrations
│ │ │ ├── pagerduty.py # PagerDuty escalation
│ │ │ ├── opsgenie.py # Opsgenie alerts
│ │ │ ├── jira.py # Jira ticket creation
│ │ │ ├── github.py # GitHub Issues
│ │ │ └── datadog.py # Datadog forwarding
│ │ ├── auth/ # Authentication & authorization
│ │ │ ├── rbac.py # Role-based access control
│ │ │ └── oidc.py # OIDC/SSO integration
│ │ ├── learning/ # ML incident learning
│ │ │ └── incident_learner.py
│ │ ├── privacy/ # Privacy & compliance
│ │ │ └── pii_redactor.py
│ │ └── database/ # SQLAlchemy models
│ ├── config/ # Agent configuration
│ └── tests/ # 428+ passing tests
│
├── incident-dashboard/ # SvelteKit real-time dashboard
│ ├── src/
│ │ ├── routes/ # Pages and API routes
│ │ ├── lib/
│ │ │ ├── components/ # Reusable UI components
│ │ │ ├── stores/ # Svelte stores
│ │ │ └── api/ # API client
│ │ └── types/ # TypeScript definitions
│ └── tests/ # 287+ passing tests
│
├── cli/ # KubeAI CLI tool
│ ├── kubeai/
│ │ ├── commands/ # CLI commands
│ │ │ ├── status.py # Service status
│ │ │ ├── diagnose.py # AI diagnosis
│ │ │ ├── incidents.py # Incident management
│ │ │ ├── runbooks.py # Runbook execution
│ │ │ └── config.py # Configuration
│ │ └── api.py # API client
│ └── tests/ # 261+ passing tests
│
├── observability/ # Pre-configured monitoring stack
│ ├── prometheus/ # Metrics + alert rules
│ ├── grafana/ # Dashboards
│ └── loki/ # Log aggregation
│
├── kubernetes/ # Kubernetes manifests
│ ├── base/ # Base resources
│ └── overlays/ # Environment configs
│ ├── local/
│ ├── dev/
│ ├── staging/
│ └── prod/
│
├── terraform/ # AWS infrastructure (optional)
│ ├── modules/ # Reusable modules
│ └── environments/ # Per-environment configs
│
├── docker-compose-demo/ # Local demo environment
│ ├── docker-compose.yml # Full stack with all features
│ ├── demo-scenarios.sh # Interactive demo script
│ └── grafana/ # Pre-configured dashboards
│
├── argocd/ # GitOps setup
├── ci-cd/ # GitHub Actions + policies
├── security/ # Network policies, RBAC
├── local-dev/ # One-command local setup
├── services/ # Sample applications
│
└── docs/ # Documentation
├── architecture.md # System architecture
├── getting-started.md # Setup guide
├── adr/ # Architecture Decision Records
└── runbooks/ # Operational runbooks
Roadmap
Completed
- Core AI incident agent
- Multi-engine AI support (Claude, OpenAI, Ollama, Bedrock)
- Prometheus/Grafana/Loki integration
- ChatOps (Slack, Discord, Microsoft Teams)
- Auto-remediation (restart, scale, rollback)
- Incident dashboard (SvelteKit)
- CLI tool with diagnosis and runbooks
- Mock AI mode for testing
- PagerDuty integration
- Opsgenie integration
- Jira/GitHub issue creation
- Datadog integration
- RBAC & OIDC authentication
- Machine learning incident learner
- Privacy controls & PII redaction
- Comprehensive test suite (976 tests)
Planned
- Multi-cluster support
- Cost analysis integration
- Runbook auto-generation
- Custom remediation plugins
- Mobile app
- Advanced anomaly detection
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Development Setup
# Clone your fork
git clone https://github.com/sharankumarreddyk/kubeai-ops.git
cd kubeai-ops
# Start local environment
./local-dev/setup.sh
# Run tests
cd ai-incident-agent && pytest
cd services/api-gateway && pytest
cd incident-dashboard && npm test
# Make changes, then submit a PR

Areas We Need Help
- Multi-cluster federation support
- Additional AI engine integrations
- Language-specific integration guides (Ruby, .NET)
- Mobile app development
- Performance optimizations for large-scale deployments
- Documentation translations
Community
- GitHub Issues: Bug reports and feature requests
- Discussions: Questions and ideas
- Discord: Join our server (coming soon)
License
MIT License - see LICENSE for details.
Use it, modify it, sell it, whatever. Just don't blame us if your AI agent becomes sentient and refuses to restart pods on Fridays.
Acknowledgments
- Anthropic for Claude AI
- Prometheus community
- ArgoCD project
- All our contributors
Reduce alert fatigue with AI-powered incident response.