
Oded-Ben-Yair/claude-code-orchestra

Production AI development environment: 24 skills, 26 hooks, 11 agents, 7 MCP servers — orchestrating Claude Code for complex software engineering workflows

Claude Code Orchestra

An orchestration layer for Claude Code that routes tasks to the right LLM, enforces safety rules via hooks, and keeps context across sessions. Built for my own workflow managing seven production services -- sharing it because the patterns are reusable.

What This Does (30-Second Version)

I open an issue like "add retry logic to the payment API." The system:

  1. Plans -- an architect agent scopes the work and picks the right tools
  2. Implements -- a code agent writes the changes using GPT-5.2 Codex
  3. Validates -- hooks block the commit until schema is verified, tests pass, and imports trace back to entry points
  4. Reviews -- a judge agent runs a hostile review looking for dead code, security issues, and missed edge cases

I can stop or override at each step. Nothing ships without my approval.

What's Inside

| Component | What It Does |
|---|---|
| Skills (~24) | Workflow templates: deploy, debug, design-to-code, pipeline repair, anti-perfectionism gate |
| Hooks (~26) | Pre/post validation: blocks SQL without schema query, blocks commits with orphan files, requires proof before "done" |
| Rules (6) | Domain modules: DB safety, code quality, visual validation, Azure deploy, voice agent tuning |
| Agents (11) | Specialized roles: architect, coder, judge, researcher, reasoning specialist |
| MCP Servers (7) | LLM integrations: Gemini, GPT-5, Grok, Perplexity, DeepSeek, Playwright, LunarCrush |
| Capabilities (68) | Task-to-tool mappings with cost and latency metadata |

Architecture

```
User Request
    |
    v
[Intent Detection] --> Capability Registry (68 entries)
    |
    v
[Agent Selection] --> architect-planner | code-worker | code-judge
    |                  research-specialist | reasoning-specialist
    |                  realtime-specialist | gemini-specialist
    v
[MCP Tool Dispatch] --> Gemini 3 Pro | GPT-5.2 | Grok 4 | Perplexity | DeepSeek
    |
    v
[Hook Validation] --> Pre-tool gates | Post-tool verification | Session lifecycle
    |
    v
[Output with Proof] --> Tests passed | Screenshots taken | API responses verified
```

Quick Start

```shell
# Clone and copy into your Claude Code config
git clone https://github.com/Oded-Ben-Yair/claude-code-orchestra.git
cp -r claude-code-orchestra/.claude/* ~/.claude/

# Or cherry-pick what you need:
cp claude-code-orchestra/.claude/rules/code-quality.md ~/.claude/rules/
cp claude-code-orchestra/.claude/hooks/schema-verify.sh ~/.claude/hooks/
```

The 12 Hard Rules

Every rule exists because something broke in production. Rules with hooks are actively enforced: violations are blocked at the tool boundary, not just documented:

| # | Rule | Hook | What Happens |
|---|---|---|---|
| 1 | NO mock/fake data | -- | Show real errors or "NOT CONNECTED" |
| 2 | NO claiming "done" without proof | stop-verify.sh | Must show tests, screenshots, API responses |
| 3 | NO SQL against assumed schema | schema-verify.sh | Must query information_schema first |
| 4 | NO orphan files in commits | dead-code-check.sh | Must trace import path from entry point |
| 5 | NO bypassing debug | debug-first.sh | Must read logs before rewriting |
| 6 | NO verifying before pipeline completes | deploy-gate.sh | Must wait for CI/CD |
| 7 | NO cross-project DB access | -- | Check pwd first |
| 8 | NO pushing without CI | -- | Pipeline must pass |
| 9 | NO hardcoded credentials | -- | Key Vault/env vars only |
| 10 | NO destructive queries without WHERE | -- | Safety gate |
| 11 | Understand before changing | -- | Read status, search patterns, map deps |
| 12 | Generate options; human decides | -- | Present 2-3 approaches |

Skills Library (24)

Skills are autonomous workflow templates that combine agents, MCPs, and validation gates.

Core Workflow Skills

| Skill | Lines | What It Does |
|---|---|---|
| /multi-model-debate | 227 | 6-model council (GPT-5, Gemini, Grok, DeepSeek, Claude, Perplexity) with 5 rounds of cross-critique |
| /orchestrator | 571 | Planner -> Implementer -> Verifier flow with iteration loops |
| /enforce-capabilities | 242 | Auto-enriches plans with proper agent/skill/MCP usage |
| /smart-router | 285 | Intent-aware routing to optimal agent based on task context |

Development Skills

| Skill | Lines | What It Does |
|---|---|---|
| /frontend | 3,869 | Design-to-code from screenshots, Figma, or specs |
| /fix-pipeline | 199 | Auto-diagnose and repair CI/CD failures |
| /scrap-reimplement | 140 | Destructive recovery after 3+ failed fix attempts |
| /pre-mortem | 176 | Risk assessment before risky tasks |
| /ship-it | 199 | Anti-perfectionism gate -- declare "good enough" |

Operations Skills

| Skill | Lines | What It Does |
|---|---|---|
| /end-of-session | 651 | Handover docs, git sync, session persistence |
| /learning-loop | 393 | Extract patterns, update decisions across sessions |
| /morning-update | 198 | Daily status rollup, blockers, session scan |

Hook System (26)

Hooks enforce rules at tool boundaries. They run as shell scripts before/after every tool call.

Pre-Tool Hooks (Prevent Bad States)

```
schema-verify.sh     -> Blocks SQL until schema is queried
dead-code-check.sh   -> Blocks commits with orphan files
debug-first.sh       -> Blocks rewrites until logs are read
deploy-gate.sh       -> Blocks verification until pipeline completes
capability-enforcer  -> Validates tool availability before use
```
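
A pre-tool gate of this kind fits in a few lines of shell. The sketch below is a hypothetical simplification in the spirit of schema-verify.sh, not the shipped script: it assumes the hook receives the pending SQL, signals "block" with a non-zero exit, and records a prior schema lookup via a marker file (the function name and marker path are illustrative).

```shell
# Hypothetical simplification of a schema-verify-style pre-tool gate.
# check_sql <sql> <marker-file>: returns 2 (block) when the statement
# writes data but no schema lookup has been recorded in the marker file.
check_sql() {
  local sql="$1" marker="$2"
  if echo "$sql" | grep -qiE 'insert|update|delete|alter' \
     && [ ! -f "$marker" ]; then
    echo "BLOCKED: query information_schema before writing SQL" >&2
    return 2
  fi
  return 0
}

# Demo: with no schema lookup recorded, writes are blocked, reads pass.
rm -f /tmp/cco-demo-marker
check_sql "DELETE FROM orders" /tmp/cco-demo-marker || echo "write blocked"
check_sql "SELECT 1"           /tmp/cco-demo-marker && echo "read allowed"
```

In the real system the hook's exit code, rather than a return value, is what stops the tool call.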

Post-Tool Hooks (Validate and Learn)

```
stop-verify.sh       -> Requires proof before session end
quality-validation   -> Auto-lint, type-check after code generation
test-result-tracker  -> Tracks test pass/fail ratio over time
visual-verify.sh     -> Screenshot validation for UI changes
```

Session Lifecycle Hooks

```
session-start-enhanced.sh -> Loads git context, project state, memory
pre-compact-save.sh       -> Saves critical state before context compaction
session-end-learning.sh   -> Extracts learnings, updates patterns
```

Agent System (11 Consolidated)

Originally 26 agents, consolidated to 11 after finding that a smaller set of specialists, each with a clear MCP focus, outperforms a sprawl of narrow single-purpose agents:

| Agent | Role | MCP Focus |
|---|---|---|
| architect-planner | Design plans, first-principles reasoning | Gemini (thinking=high) |
| code-worker | Execute plans, multi-file coding | GPT-5.2 Codex |
| code-judge | Hostile review, dead code audit | Azure + Grok |
| research-specialist | Academic, SEC, geo-based research | Perplexity |
| reasoning-specialist | Math, algorithms, complex logic | DeepSeek V3.2 |
| realtime-specialist | X/Twitter, trending, social intel | Grok |
| gemini-specialist | Multimodal: vision, PDFs, video | Gemini 3 Pro |
| azure-devops-specialist | CI/CD pipelines, infrastructure | Azure CLI |
| worktree-specialist | Git branching, parallel dev | Git |
| cleanup-specialist | Archive, refactor, tech debt | Code analysis |
| voice-specialist | Voice AI, TTS tuning, SSML | ElevenLabs |

MCP Integrations (7)

| MCP Server | Key Tools | Use Case |
|---|---|---|
| Gemini 3 Pro | vision, image gen, deep research, search | Multimodal analysis, document parsing |
| Azure AI Foundry | GPT-5.2, GPT-5 Pro, DeepSeek V3.2 | Code generation, brainstorming, reasoning |
| Grok 4 | chat, code, X/Twitter search, social pulse | Real-time data, social intelligence |
| Perplexity | research, reason, search | Evidence-based research with citations |
| Playwright | browser automation, screenshots | Visual testing, authenticated browsing |
| ElevenLabs | TTS, STT, conversational AI | Voice agent development |
| LunarCrush | crypto social metrics | Sentiment analysis, social dominance |

Capability Registry

The capabilities-registry.json is the brain of the system -- 68 entries mapping task patterns to optimal tools:

```json
{
  "id": "codex-builder",
  "name": "Codex Builder (GPT-5.2)",
  "triggers": ["build feature", "implement", "refactor"],
  "mcp": "azure-ai-foundry",
  "model": "gpt-5.2-codex",
  "cost_tier": "high",
  "latency": "medium"
}
```

Plans are automatically enriched with capability annotations before execution:

```
3. Implement OAuth2 flow
   -> Agent: codex-builder
   -> Skills: azure-unified
   -> MCP: azure-ai-foundry, memory
   -> Confidence: 0.85
```
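
The trigger matching at the heart of this can be sketched as a first-match lookup. This is an illustrative stand-in for the real registry logic (which reads the JSON file); the capability and agent names are borrowed from the sample entry and tables above, and the second trigger set is invented for contrast.

```shell
# Illustrative trigger matching against registry-style entries.
# The real registry is JSON; this case statement stands in for the lookup.
route() {
  case "$1" in
    *"build feature"*|*implement*|*refactor*) echo "codex-builder" ;;
    *research*|*citation*)                    echo "research-specialist" ;;
    *)                                        echo "architect-planner" ;;
  esac
}

route "implement OAuth2 flow"   # prints: codex-builder
route "research SEC filings"    # prints: research-specialist
```

A real implementation would also weigh the cost_tier and latency fields when several entries match.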

Rules System (6 Domain Modules)

Rules load on-demand based on task context:

| Rule Module | Trigger | Key Enforcements |
|---|---|---|
| code-quality.md | new file, refactor, commit | Schema-first SQL, serialization tests, bias detection |
| db-safety.md | SQL, migration, database | Cross-DB isolation, pre-query verification, safe migrations |
| visual-validation.md | screenshot, UI, design | Playwright + Gemini validation, B2B SaaS standards |
| azure-deploy.md | deploy, pipeline, Azure | Post-push verification, rollback procedures |
| voice-agent-tuning.md | voice, ElevenLabs, TTS | Arabic/Hebrew TTS, 3-lever humanization, SSML rules |
| project-config.md | project setup, workspace | Session lifecycle, context management, FPF-Lite reasoning |
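
On-demand loading amounts to a keyword scan over the task text. A minimal sketch, assuming keyword triggers work as in the table above (the function name is hypothetical, and only three of the six modules are shown):

```shell
# Illustrative sketch of on-demand rule loading: choose which rule
# modules apply based on keyword triggers in the task description.
rules_for() {
  local task="$1" out=""
  case "$task" in *SQL*|*migration*|*database*) out="$out db-safety.md" ;; esac
  case "$task" in *deploy*|*pipeline*|*Azure*)  out="$out azure-deploy.md" ;; esac
  case "$task" in *screenshot*|*UI*|*design*)   out="$out visual-validation.md" ;; esac
  echo "${out# }"
}

rules_for "deploy the database migration"
# prints: db-safety.md azure-deploy.md
```

Because each case is checked independently, one task can pull in several modules at once.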

Where It's Used

In daily use across seven services I maintain -- Azure Function Apps for a trading platform (174K+ monthly executions), voice AI transcription, compliance tools, and internal utilities. The hooks and rules evolved from real incidents: a bulk file deletion that took down a function app, a SQL query against the wrong database, a deploy that passed CI but had zero working functions.

Numbers from actual usage:

  • 15 Azure Function Apps managed through this system
  • 174,000+ function executions monitored monthly
  • 300+ automated tests maintained across projects
  • 148+ development sessions with cross-session memory

Design Notes (From an Ops Brain)

The architecture choices reflect an operations background more than a CS background:

  1. Fail closed, not open: Hooks block bad actions by default. A commit with orphan files is stopped, not warned about.
  2. Enforce, don't document: Every rule has a corresponding hook. "Don't hardcode credentials" is a nice guideline; a pre-commit hook that greps for API keys is an actual gate.
  3. Human-in-the-loop: The system generates 2-3 options; I pick. No autonomous decision-making on architectural choices (Rule #12).
  4. Explicit over magical: Capability routing uses a JSON registry with cost/latency metadata, not hidden heuristics.
  5. Hostile review by default: code-judge runs adversarial audits looking for dead code, schema drift, and security gaps.
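
The "enforce, don't document" point can be made concrete. The sketch below is a hypothetical credential gate in the spirit of rule #9, not the actual hook; the function name, file paths, and the deliberately crude secret pattern are all illustrative.

```shell
# Hypothetical credential gate for rule #9: fail when a file contains
# an obvious hardcoded secret. The pattern is deliberately crude; real
# scanners use entropy checks and provider-specific key formats.
scan_for_secrets() {
  # grep exits 0 on a match, so invert: a hit means the commit fails.
  if grep -nEi '(api_key|apikey|secret|password) *= *.' "$@"; then
    return 1
  fi
  return 0
}

# Demo: a staged file with a hardcoded key trips the gate.
printf 'API_KEY = "sk-live-123"\n' > /tmp/cco-demo.cfg
scan_for_secrets /tmp/cco-demo.cfg || echo "commit blocked"
rm -f /tmp/cco-demo.cfg
```

Wired into a pre-commit hook over `git diff --cached --name-only`, this turns the guideline into a gate.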

Repository Structure

```
.claude/
  CLAUDE.md                          # Root config: identity, 12 rules, routing
  capabilities-registry.json         # 68 capability entries with triggers + metadata
  rules/
    code-quality.md                  # Language standards, schema-first, test gates
    db-safety.md                     # Cross-DB isolation, migration safety
    visual-validation.md             # Playwright + Gemini screenshot validation
    azure-deploy.md                  # CI/CD safety, rollback procedures
    voice-agent-tuning.md            # ElevenLabs, Arabic TTS, SSML
    project-config.md                # Session lifecycle, context management
  hooks/
    schema-verify.sh                 # Pre-tool: block SQL without schema query
    dead-code-check.sh               # Pre-tool: block commits with orphan files
    stop-verify.sh                   # Post-tool: require proof before "done"
    deploy-gate.sh                   # Pre-tool: block premature verification
    session-start-enhanced.sh        # Session: load git context + project state
    README.md                        # Hook system documentation
  skills/
    multi-model-debate/
      instructions.md                # 6-model council protocol
    enforce-capabilities/
      instructions.md                # Plan enrichment with capability annotations
    ship-it/
      instructions.md                # Anti-perfectionism gate
    pre-mortem/
      instructions.md                # Risk assessment protocol
    README.md                        # How to create skills
  agents/
    architect-planner.md             # Design/planning agent
    code-worker.md                   # Implementation agent
    code-judge.md                    # Hostile review agent
    README.md                        # Agent system overview
docs/
  architecture.md                    # Detailed architecture with diagrams
```

What You Can Learn From This

  • How to structure a multi-agent Claude Code setup -- not just one CLAUDE.md, but a full system
  • How to enforce rules with hooks -- shell scripts that run at tool boundaries
  • How to route tasks to optimal models -- capability registry with cost/latency metadata
  • How to persist context across sessions -- status.json, decisions.log, Memory MCP
  • Real production lessons -- every "Origin:" comment is a real incident that shaped a rule

What This Isn't

  • Not a framework or library -- it's a configuration system for Claude Code (skills, hooks, rules, agents)
  • Not generalized for arbitrary setups -- it reflects my specific stack (Azure, PostgreSQL, Python) and my specific paranoia (schema verification, dead code detection)
  • Not a replacement for proper CI/CD -- the hooks add local safety gates, but production deploys still go through Azure DevOps pipelines
  • The agent/skill counts evolve and may not match the README exactly at any given commit

Why Not LangGraph / CrewAI / AutoGen?

Those are application frameworks for building multi-agent systems. This is a development environment -- it orchestrates how I write and ship code, not how end-users interact with AI. The closest analogy is a heavily customized IDE config, not a product architecture.

Contributing

PRs welcome. This is a living system that evolves with new models and tools.

If you add a new rule, include an "Origin:" comment explaining what production incident created it. Rules without origin stories are just opinions.

Motivation & Lessons Learned

This started after I accidentally deployed a function app with an orphan file that imported a module I'd already deleted. The deploy succeeded, the health check passed, and the first real request crashed. I wrote a pre-commit hook that night to trace imports from entry points. That was hook #1. The rest grew from similar incidents.

The hardest lesson was about schema drift. I wrote a SQL migration that assumed a column existed because the plan said it would. It didn't. The query silently returned empty results for two days before anyone noticed. Now every SQL operation queries information_schema first, no exceptions, even in dev. That rule alone has prevented more bugs than any other.
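
The schema-first discipline reduces to one probe query before anything else touches a table. A minimal sketch (the helper name and the `orders` table are examples, not code from the repo):

```shell
# Minimal sketch of the schema-first probe: emit the information_schema
# query that must run (e.g. via psql) before touching the named table.
schema_probe() {
  printf "SELECT column_name, data_type FROM information_schema.columns WHERE table_name = '%s';" "$1"
}

schema_probe orders
# prints: SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'orders';
```

If the probe returns no rows, the table or column the plan assumed simply is not there, and the write never happens.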

What surprised me: the multi-model debate pattern (routing the same question to 6 different LLMs and synthesizing disagreements) consistently caught architectural issues that no single model flagged. The models disagree in useful ways: one catches security issues, another catches performance implications, a third spots edge cases. The disagreements are the signal.


Skills Demonstrated

Multi-agent AI orchestration, production safety hooks, schema-first SQL development, multi-model consensus patterns, session persistence, CI/CD integration (Azure DevOps), MCP Protocol integration, cost/latency-aware model routing, defensive coding practices.


License

MIT