Claude Code Orchestra
An orchestration layer for Claude Code that routes tasks to the right LLM, enforces safety rules via hooks, and keeps context across sessions. Built for my own workflow managing seven production services -- sharing it because the patterns are reusable.
What This Does (30-Second Version)
I open an issue like "add retry logic to the payment API." The system:
- Plans -- an architect agent scopes the work and picks the right tools
- Implements -- a code agent writes the changes using GPT-5.2 Codex
- Validates -- hooks block the commit until schema is verified, tests pass, and imports trace back to entry points
- Reviews -- a judge agent runs a hostile review looking for dead code, security issues, and missed edge cases
I can stop or override at each step. Nothing ships without my approval.
What's Inside
| Component | What It Does |
|---|---|
| Skills (~24) | Workflow templates: deploy, debug, design-to-code, pipeline repair, anti-perfectionism gate |
| Hooks (~26) | Pre/post validation: blocks SQL without schema query, blocks commits with orphan files, requires proof before "done" |
| Rules (6) | Domain modules: DB safety, code quality, visual validation, Azure deploy, voice agent tuning |
| Agents (11) | Specialized roles: architect, coder, judge, researcher, reasoning specialist |
| MCP Servers (7) | LLM integrations: Gemini, GPT-5, Grok, Perplexity, DeepSeek, Playwright, LunarCrush |
| Capabilities (68) | Task-to-tool mappings with cost and latency metadata |
Architecture
```
User Request
     |
     v
[Intent Detection] --> Capability Registry (68 entries)
     |
     v
[Agent Selection] --> architect-planner | code-worker | code-judge
                      research-specialist | reasoning-specialist
                      realtime-specialist | gemini-specialist
     |
     v
[MCP Tool Dispatch] --> Gemini 3 Pro | GPT-5.2 | Grok 4 | Perplexity | DeepSeek
     |
     v
[Hook Validation] --> Pre-tool gates | Post-tool verification | Session lifecycle
     |
     v
[Output with Proof] --> Tests passed | Screenshots taken | API responses verified
```
Quick Start
```bash
# Clone and copy into your Claude Code config
git clone https://github.com/YOUR_USERNAME/claude-code-orchestra.git
cp -r claude-code-orchestra/.claude/* ~/.claude/

# Or cherry-pick what you need:
cp claude-code-orchestra/.claude/rules/code-quality.md ~/.claude/rules/
cp claude-code-orchestra/.claude/hooks/schema-verify.sh ~/.claude/hooks/
```
The 12 Hard Rules
Every rule exists because something broke in production. Rules with hooks are actively enforced: violations are blocked, not merely documented.
| # | Rule | Hook | What Happens |
|---|---|---|---|
| 1 | NO mock/fake data | -- | Show real errors or "NOT CONNECTED" |
| 2 | NO claiming "done" without proof | stop-verify.sh | Must show tests, screenshots, API responses |
| 3 | NO SQL against assumed schema | schema-verify.sh | Must query information_schema first |
| 4 | NO orphan files in commits | dead-code-check.sh | Must trace import path from entry point |
| 5 | NO bypassing debug | debug-first.sh | Must read logs before rewriting |
| 6 | NO verifying before pipeline completes | deploy-gate.sh | Must wait for CI/CD |
| 7 | NO cross-project DB access | -- | Check pwd first |
| 8 | NO pushing without CI | -- | Pipeline must pass |
| 9 | NO hardcoded credentials | -- | Key Vault/env vars only |
| 10 | NO destructive queries without WHERE | -- | Safety gate |
| 11 | Understand before changing | -- | Read status, search patterns, map deps |
| 12 | Generate options; human decides | -- | Present 2-3 approaches |
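Rule 10's safety gate, for example, reduces to a few lines of shell. This is an illustrative sketch, not the repository's actual hook; the `destructive_gate` name and exact matching logic are invented:

```shell
# destructive_gate SQL: hypothetical sketch of Rule 10.
# Blocks DELETE/UPDATE statements that carry no WHERE clause.
destructive_gate() {
  upper=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$upper" in
    DELETE*|UPDATE*)
      case "$upper" in
        *WHERE*) return 0 ;;                                 # scoped: allow
        *) echo "BLOCKED: no WHERE clause" >&2; return 2 ;;  # fail closed
      esac ;;
    *) return 0 ;;                                           # not destructive: allow
  esac
}

destructive_gate "DELETE FROM users"            && echo allow || echo block  # → block
destructive_gate "DELETE FROM users WHERE id=1" && echo allow || echo block  # → allow
```

The point is the fail-closed default: the gate returns non-zero unless the statement proves it is scoped.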
Skills Library (24)
Skills are autonomous workflow templates that combine agents, MCPs, and validation gates.
Core Workflow Skills
| Skill | Lines | What It Does |
|---|---|---|
| /multi-model-debate | 227 | 6-model council (GPT-5, Gemini, Grok, DeepSeek, Claude, Perplexity) with 5 rounds of cross-critique |
| /orchestrator | 571 | Planner -> Implementer -> Verifier flow with iteration loops |
| /enforce-capabilities | 242 | Auto-enriches plans with proper agent/skill/MCP usage |
| /smart-router | 285 | Intent-aware routing to the optimal agent based on task context |
Development Skills
| Skill | Lines | What It Does |
|---|---|---|
| /frontend | 3,869 | Design-to-code from screenshots, Figma, or specs |
| /fix-pipeline | 199 | Auto-diagnoses and repairs CI/CD failures |
| /scrap-reimplement | 140 | Destructive recovery after 3+ failed fix attempts |
| /pre-mortem | 176 | Risk assessment before high-stakes tasks |
| /ship-it | 199 | Anti-perfectionism gate -- declare "good enough" |
Operations Skills
| Skill | Lines | What It Does |
|---|---|---|
| /end-of-session | 651 | Handover docs, git sync, session persistence |
| /learning-loop | 393 | Extract patterns, update decisions across sessions |
| /morning-update | 198 | Daily status rollup, blockers, session scan |
Hook System (26)
Hooks enforce rules at tool boundaries. They run as shell scripts before/after every tool call.
Pre-Tool Hooks (Prevent Bad States)
schema-verify.sh -> Blocks SQL until schema is queried
dead-code-check.sh -> Blocks commits with orphan files
debug-first.sh -> Blocks rewrites until logs are read
deploy-gate.sh -> Blocks verification until pipeline completes
capability-enforcer -> Validates tool availability before use
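To make the pre-tool pattern concrete, here is a minimal sketch in the shape Claude Code hooks take: the hook receives a JSON description of the pending tool call, and a non-zero exit blocks it. The repository's actual schema-verify.sh is not reproduced here; the `sql_gate` function, the marker-file handshake, and the raw-string matching are simplifications for illustration (a real hook would parse the JSON payload properly, e.g. with jq):

```shell
# sql_gate PAYLOAD [MARKER]: hypothetical sketch of a pre-tool gate.
# Blocks SQL-looking commands until a marker file records that the
# schema was queried first.
sql_gate() {
  payload="$1"
  marker="${2:-/tmp/.schema-verified}"
  case "$payload" in
    *SELECT*|*"INSERT INTO"*|*"UPDATE "*|*"DELETE FROM"*)
      [ -f "$marker" ] && return 0        # schema already checked: allow
      echo "Query information_schema before running SQL" >&2
      return 2 ;;                         # block the tool call
  esac
  return 0                                # not SQL: allow
}

marker=$(mktemp -u)                       # marker absent: gate is closed
sql_gate '{"tool_input":{"command":"SELECT * FROM users"}}' "$marker" \
  && echo allow || echo block             # → block
touch "$marker"                           # schema verified: gate opens
sql_gate '{"tool_input":{"command":"SELECT * FROM users"}}' "$marker" \
  && echo allow || echo block             # → allow
rm -f "$marker"
```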
Post-Tool Hooks (Validate and Learn)
stop-verify.sh -> Requires proof before session end
quality-validation -> Auto-lint, type-check after code generation
test-result-tracker -> Tracks test pass/fail ratio over time
visual-verify.sh -> Screenshot validation for UI changes
Session Lifecycle Hooks
session-start-enhanced.sh -> Loads git context, project state, memory
pre-compact-save.sh -> Saves critical state before context compaction
session-end-learning.sh -> Extracts learnings, updates patterns
Agent System (11 Consolidated)
Originally 26 agents, consolidated to 11 after finding that a smaller set of specialists, each with a clear MCP focus, outperforms many narrow single-purpose agents:
| Agent | Role | MCP Focus |
|---|---|---|
| architect-planner | Design plans, first-principles reasoning | Gemini (thinking=high) |
| code-worker | Execute plans, multi-file coding | GPT-5.2 Codex |
| code-judge | Hostile review, dead code audit | Azure + Grok |
| research-specialist | Academic, SEC, geo-based research | Perplexity |
| reasoning-specialist | Math, algorithms, complex logic | DeepSeek V3.2 |
| realtime-specialist | X/Twitter, trending, social intel | Grok |
| gemini-specialist | Multimodal: vision, PDFs, video | Gemini 3 Pro |
| azure-devops-specialist | CI/CD pipelines, infrastructure | Azure CLI |
| worktree-specialist | Git branching, parallel dev | Git |
| cleanup-specialist | Archive, refactor, tech debt | Code analysis |
| voice-specialist | Voice AI, TTS tuning, SSML | ElevenLabs |
MCP Integrations (7)
| MCP Server | Key Tools | Use Case |
|---|---|---|
| Gemini 3 Pro | vision, image gen, deep research, search | Multimodal analysis, document parsing |
| Azure AI Foundry | GPT-5.2, GPT-5 Pro, DeepSeek V3.2 | Code generation, brainstorming, reasoning |
| Grok 4 | chat, code, X/Twitter search, social pulse | Real-time data, social intelligence |
| Perplexity | research, reason, search | Evidence-based research with citations |
| Playwright | browser automation, screenshots | Visual testing, authenticated browsing |
| ElevenLabs | TTS, STT, conversational AI | Voice agent development |
| LunarCrush | crypto social metrics | Sentiment analysis, social dominance |
Capability Registry
The capabilities-registry.json is the brain of the system -- 68 entries mapping task patterns to optimal tools:
```json
{
  "id": "codex-builder",
  "name": "Codex Builder (GPT-5.2)",
  "triggers": ["build feature", "implement", "refactor"],
  "mcp": "azure-ai-foundry",
  "model": "gpt-5.2-codex",
  "cost_tier": "high",
  "latency": "medium"
}
```
Plans are automatically enriched with capability annotations before execution:
```
3. Implement OAuth2 flow
   -> Agent: codex-builder
   -> Skills: azure-unified
   -> MCP: azure-ai-foundry, memory
   -> Confidence: 0.85
```
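Trigger matching itself can be sketched as a simple dispatch. This is illustrative only: `route`, the research triggers, and the architect-planner fallback are invented here; the real lookup reads capabilities-registry.json with its full metadata.

```shell
# route TASK: illustrative trigger dispatch. The codex-builder triggers
# mirror the registry entry above; everything else is invented.
route() {
  task=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$task" in
    *"build feature"*|*implement*|*refactor*) echo "codex-builder" ;;
    *research*)                               echo "research-specialist" ;;
    *)                                        echo "architect-planner" ;;
  esac
}

route "Implement OAuth2 flow"        # → codex-builder
route "Research competitor pricing"  # → research-specialist
```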
Rules System (6 Domain Modules)
Rules load on-demand based on task context:
| Rule Module | Trigger | Key Enforcements |
|---|---|---|
| code-quality.md | new file, refactor, commit | Schema-first SQL, serialization tests, bias detection |
| db-safety.md | SQL, migration, database | Cross-DB isolation, pre-query verification, safe migrations |
| visual-validation.md | screenshot, UI, design | Playwright + Gemini validation, B2B SaaS standards |
| azure-deploy.md | deploy, pipeline, Azure | Post-push verification, rollback procedures |
| voice-agent-tuning.md | voice, ElevenLabs, TTS | Arabic/Hebrew TTS, 3-lever humanization, SSML rules |
| project-config.md | project setup, workspace | Session lifecycle, context management, FPF-Lite reasoning |
Where It's Used
In daily use across seven services I maintain -- Azure Function Apps for a trading platform (174K+ monthly executions), voice AI transcription, compliance tools, and internal utilities. The hooks and rules evolved from real incidents: a bulk file deletion that took down a function app, a SQL query against the wrong database, a deploy that passed CI but had zero working functions.
Numbers from actual usage:
- 15 Azure Function Apps managed through this system
- 174,000+ function executions monitored monthly
- 300+ automated tests maintained across projects
- 148+ development sessions with cross-session memory
Design Notes (From an Ops Brain)
The architecture choices reflect an operations background more than a CS background:
- Fail closed, not open: Hooks block bad actions by default. A commit with orphan files is stopped, not warned about.
- Enforce, don't document: Every rule has a corresponding hook. "Don't hardcode credentials" is a nice guideline; a pre-commit hook that greps for API keys is an actual gate.
- Human-in-the-loop: The system generates 2-3 options; I pick. No autonomous decision-making on architectural choices (Rule #12).
- Explicit over magical: Capability routing uses a JSON registry with cost/latency metadata, not hidden heuristics.
- Hostile review by default: code-judge runs adversarial audits looking for dead code, schema drift, and security gaps.
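The credential gate mentioned above can be sketched like this. The function name and regex are hypothetical, not the repository's actual hook; real scanners use much broader pattern sets.

```shell
# secret_scan TEXT: hypothetical sketch of a credential gate.
# Flags assignments of a quoted literal to a credential-shaped name.
secret_scan() {
  if printf '%s\n' "$1" | grep -Eiq \
      '(api[_-]?key|secret|password)[[:space:]]*=[[:space:]]*"'; then
    echo "BLOCKED: possible hardcoded credential" >&2
    return 1
  fi
  return 0
}

secret_scan 'API_KEY = "sk-live-123"'     && echo clean || echo blocked  # → blocked
secret_scan 'key = os.environ["API_KEY"]' && echo clean || echo blocked  # → clean
```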
Repository Structure
```
.claude/
  CLAUDE.md                     # Root config: identity, 12 rules, routing
  capabilities-registry.json    # 68 capability entries with triggers + metadata
  rules/
    code-quality.md             # Language standards, schema-first, test gates
    db-safety.md                # Cross-DB isolation, migration safety
    visual-validation.md        # Playwright + Gemini screenshot validation
    azure-deploy.md             # CI/CD safety, rollback procedures
    voice-agent-tuning.md       # ElevenLabs, Arabic TTS, SSML
    project-config.md           # Session lifecycle, context management
  hooks/
    schema-verify.sh            # Pre-tool: block SQL without schema query
    dead-code-check.sh          # Pre-tool: block commits with orphan files
    stop-verify.sh              # Post-tool: require proof before "done"
    deploy-gate.sh              # Pre-tool: block premature verification
    session-start-enhanced.sh   # Session: load git context + project state
    README.md                   # Hook system documentation
  skills/
    multi-model-debate/
      instructions.md           # 6-model council protocol
    enforce-capabilities/
      instructions.md           # Plan enrichment with capability annotations
    ship-it/
      instructions.md           # Anti-perfectionism gate
    pre-mortem/
      instructions.md           # Risk assessment protocol
    README.md                   # How to create skills
  agents/
    architect-planner.md        # Design/planning agent
    code-worker.md              # Implementation agent
    code-judge.md               # Hostile review agent
    README.md                   # Agent system overview
  docs/
    architecture.md             # Detailed architecture with diagrams
```
What You Can Learn From This
- How to structure a multi-agent Claude Code setup -- not just one CLAUDE.md, but a full system
- How to enforce rules with hooks -- shell scripts that run at tool boundaries
- How to route tasks to optimal models -- capability registry with cost/latency metadata
- How to persist context across sessions -- status.json, decisions.log, Memory MCP
- Real production lessons -- every "Origin:" comment is a real incident that shaped a rule
What This Isn't
- Not a framework or library -- it's a configuration system for Claude Code (skills, hooks, rules, agents)
- Not generalized for arbitrary setups -- it reflects my specific stack (Azure, PostgreSQL, Python) and my specific paranoia (schema verification, dead code detection)
- Not a replacement for proper CI/CD -- the hooks add local safety gates, but production deploys still go through Azure DevOps pipelines
- The agent/skill counts evolve and may not match the README exactly at any given commit
Why Not LangGraph / CrewAI / AutoGen?
Those are application frameworks for building multi-agent systems. This is a development environment -- it orchestrates how I write and ship code, not how end-users interact with AI. The closest analogy is a heavily customized IDE config, not a product architecture.
Contributing
PRs welcome. This is a living system that evolves with new models and tools.
If you add a new rule, include an "Origin:" comment explaining what production incident created it. Rules without origin stories are just opinions.
Motivation & Lessons Learned
This started after I accidentally deployed a function app with an orphan file that imported a module I'd already deleted. The deploy succeeded, the health check passed, and the first real request crashed. I wrote a pre-commit hook that night to trace imports from entry points. That was hook #1. The rest grew from similar incidents.
The hardest lesson was about schema drift. I wrote a SQL migration that assumed a column existed because the plan said it would. It didn't. The query silently returned empty results for two days before anyone noticed. Now every SQL operation queries information_schema first, no exceptions, even in dev. That rule alone has prevented more bugs than any other.
What surprised me: the multi-model debate pattern (routing the same question to 6 different LLMs and synthesizing disagreements) consistently caught architectural issues that no single model flagged. The models disagree in useful ways: one catches security issues, another catches performance implications, a third spots edge cases. The disagreements are the signal.
Skills Demonstrated
Multi-agent AI orchestration, production safety hooks, schema-first SQL development, multi-model consensus patterns, session persistence, CI/CD integration (Azure DevOps), MCP Protocol integration, cost/latency-aware model routing, defensive coding practices.
License
MIT