Sibyl-Research-Team/sibyl-research-system
Fully Autonomous AI Research System with Self-Evolution, built natively on Claude Code
Sibyl Research System
Fully Autonomous AI Scientist · From Idea to Paper, Zero Human Intervention
Multi-Agent Scientific Discovery · GPU Experiment Execution · Self-Evolving Research Pipeline
Inspired by the pioneering work of The AI Scientist, FARS, and AutoResearch, Sibyl takes the vision further by building natively on Claude Code to fully leverage its agent ecosystem — skills, plugins, MCP servers, and multi-agent teams.
Sibyl is a fully autonomous AI scientist that drives end-to-end ML research — from literature survey and hypothesis generation to GPU experiment execution and conference-ready paper writing. It operates as an autonomous research organization: 20+ specialized AI agents debate ideas, design and run GPU experiments, write papers, and critically review their own work — all without human intervention.
Key capabilities: automated literature review, multi-agent idea debate, experiment planning & GPU-parallel execution, multi-agent paper writing & peer review, autonomous iteration with quality gates, and cross-project self-evolution. Supports NeurIPS/ICML/ICLR-level output with LaTeX compilation.
What truly sets Sibyl apart is its dual-loop architecture:
- Inner Loop — Research Iteration: Each project automatically iterates across every dimension — refining hypotheses based on experiment results, re-planning experiments, rewriting papers, pivoting to alternative ideas when needed — until quality meets publication standards.
- Outer Loop — System Self-Evolution: Sibyl learns from the research process itself. After every iteration, it classifies issues across 8 categories, accumulates reusable lessons, and automatically updates its own agent prompts, scheduling strategies, and architectural patterns. The system that runs your research is itself getting better at running research.
What Makes Sibyl Different?
- Autonomous Multi-Dimensional Iteration — Not just "run experiments and write a paper." Every aspect of the research improves automatically across iterations: ideas sharpen through multi-agent debate, experiments expand with better baselines and ablations, writing tightens under 6-agent cross-review, and resource utilization optimizes through GPU scheduling feedback. The quality gate decides when to stop or pivot — no human in the loop.
- Self-Evolving System — Most AI research tools are static — they run the same way every time. Sibyl evolves. It extracts lessons from every research iteration (issues, success patterns, efficiency metrics), keeps them time-weighted and context-filtered, and injects the relevant ones back into agent prompts. Across projects, the system accumulates institutional knowledge — each project makes every future project better.
- Claude Code Native — Not a wrapper around API calls. Built directly on Claude Code's architecture (fork skills, agent teams, MCP tools), inheriting its full ecosystem: SSH remote execution, multi-model collaboration (Claude + GPT-5.4 cross-review), Feishu/Lark cloud sync, and more.
Use Cases
- Automated ML Research — Give Sibyl a research topic or spec, and it autonomously explores ideas, runs experiments on your GPU cluster, and produces a conference-quality paper
- Research Augmentation — Use individual pipeline stages (literature search, idea debate, experiment planning) as standalone research tools
- Vibe Research — Describe what you want to explore in natural language; Sibyl handles the rest (what is vibe research?)
Get Started
Recommended: Let Claude Configure Everything
The fastest way to set up Sibyl is to let Claude Code do it for you. Clone the repo, open it in Claude Code, and ask:
git clone https://github.com/Sibyl-Research-Team/sibyl-research-system.git
cd sibyl-research-system
tmux new -s sibyl # recommended: persistent session
claude --plugin-dir ./plugin --dangerously-skip-permissions
⚠️ --dangerously-skip-permissionsgrants Claude Code unrestricted execution (shell commands, file I/O, MCP calls) without confirmation. It is strongly recommended for Sibyl's autonomous multi-agent workflow (hundreds of tool calls per iteration), but should only be used on dedicated research machines. See Manual Setup for full details and mitigation advice.
Then tell Claude:
"Help me set up Sibyl Research System. Read docs/setup-guide.md and configure everything."
Claude will automatically check your environment, install dependencies, configure MCP servers, create config files, and ask you only for what it can't detect (GPU server IP, username, etc.). The setup guide is a step-by-step checklist designed for Claude to follow.
Once setup is complete, run the init command inside Claude Code to verify the installation and prepare your first workspace:
/sibyl-research:init
Manual Setup
Click to expand manual setup instructions
Prerequisites
- Python 3.12+, Node.js 18+
- Claude Code CLI
- GPU server with SSH access
ANTHROPIC_API_KEYenvironment variableCLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1environment variable- tmux (strongly recommended) — enables persistent sessions and automatic recovery via Sentinel watchdog. Install:
brew install tmux(macOS) /apt install tmux(Linux)
1. Install
git clone https://github.com/Sibyl-Research-Team/sibyl-research-system.git
cd sibyl-research-system
chmod +x setup.sh && ./setup.sh # Interactive: creates venv, installs deps, configures MCPsetup.sh also adds or updates export SIBYL_ROOT="..." in your shell rc file so workspace-root Claude sessions can still resolve the repo plugin and tools.
2. Configure MCP Servers
Two MCP servers are required. setup.sh configures them interactively, but for manual setup the preferred path is claude mcp add --scope local ... so the configuration stays repo-scoped:
claude mcp add --scope local ssh-mcp-server -- npx -y @fangjunjie/ssh-mcp-server \
--host YOUR_GPU_IP --port 22 --username YOUR_USER --privateKey ~/.ssh/id_ed25519
claude mcp add --scope local arxiv-mcp-server -- /ABSOLUTE/PATH/TO/sibyl-research-system/.venv/bin/python3 -m arxiv_mcp_serverIf you already manage Claude Code MCP servers through JSON, update the existing MCP config instead of creating a second source of truth:
{
"mcpServers": {
"ssh-mcp-server": {
"command": "npx",
"args": ["-y", "@fangjunjie/ssh-mcp-server",
"--host", "YOUR_GPU_IP", "--port", "22",
"--username", "YOUR_USER",
"--privateKey", "~/.ssh/id_ed25519"]
},
"arxiv-mcp-server": {
"command": "/ABSOLUTE/PATH/TO/sibyl-research-system/.venv/bin/python3",
"args": ["-m", "arxiv_mcp_server"]
}
}
}Server names must be exact:
"ssh-mcp-server"and"arxiv-mcp-server".
3. Configure GPU Server
Create config.yaml at project root (git-ignored):
ssh_server: "default"
remote_base: "/home/user/sibyl_system"
max_gpus: 4
language: zh
codex_enabled: falseUse ssh_server: "default" when ssh-mcp-server was registered with explicit --host/--username arguments. If your MCP setup resolves a named SSH host alias instead, use that alias.
4. Run
# `setup.sh` normally writes this for you; set it manually only if you skipped setup.sh
export SIBYL_ROOT=/path/to/sibyl-system
# Repo root: setup, init, status, migrate, evolve
cd "$SIBYL_ROOT"
tmux new -s sibyl-admin
claude --plugin-dir "$SIBYL_ROOT/plugin" --dangerously-skip-permissions
# Workspace root: actual project execution (recommended)
cd "$SIBYL_ROOT/workspaces/my-project"
tmux new -s sibyl-my-project
claude --plugin-dir "$SIBYL_ROOT/plugin" --dangerously-skip-permissions
# Inside Claude Code (repo root) — run once after installation:
/sibyl-research:init # Verify installation and prepare first workspace
# Inside Claude Code launched from workspaces/my-project:
/sibyl-research:start spec.md # New project from this workspace's spec
/sibyl-research:continue . # Resume the current workspaceWhy tmux? Sibyl experiments can run for hours. Running inside tmux ensures the session persists through terminal disconnections. The Sentinel watchdog (auto-launched by
/sibyl-research:start) runs in a sibling tmux pane and automatically restarts Claude Code if it crashes or goes idle — enabling truly unattended autonomous research.
Which directory should Claude start in? Use the repo root only for setup and global maintenance (
/sibyl-research:init,:status,:migrate,:evolve). For an actual research run, start Claude from the target workspace root (workspaces/<project>/), not from the repo root and not fromworkspaces/<project>/current. This makes Claude load the workspace-specificCLAUDE.md,.claude/links, Ralph prompt, and project memory directly.
Parallel projects: run one Claude session/pane per workspace root. Example: pane A starts in
workspaces/ttt-dlm/, pane B starts inworkspaces/dlm-improve/. Do not reuse the same Claude pane/session across multiple projects; Sibyl now treats pane/session ownership as project-scoped.
Why
--dangerously-skip-permissions? Sibyl orchestrates 20+ agents across 19 pipeline stages, each involving dozens of tool calls (file I/O, SSH commands, MCP server calls, sub-agent spawning). Without this flag, Claude Code will prompt for permission on nearly every operation, making autonomous research impossible — you'd need to approve hundreds of prompts per iteration. The flag skips all permission confirmations, enabling true end-to-end automation.
⚠️ Risks: This flag allows Claude Code to execute any shell command, read/write any file, and make any MCP call without confirmation. Only use it in environments where you trust the system and have reviewed the codebase. Do not use it on machines with sensitive data outside the project directory. Consider running in a container or VM for additional isolation.
Docs: Full Setup Guide · Configuration (35+ options) · MCP Servers · SSH & GPU · All 12 Commands
Recent Updates
2026-03-17
- refactor: Pipeline optimization — merge
writing_critiqueintowriting_integrate, speculative parallelexperiment_decision+writing_outline, Lark sync whitelist, informativequality_gateoutput (3c73b52,4f988dc) - refactor: GPU scheduler optimization — critical path priority, batched SSH stuck detection, auto-retry for recoverable failures,
experiment_state.jsonas authoritative source (5624332,befe35a,b95fe35) - refactor: Reflection + evolution optimization — async
_post_reflection_hook, effectiveness tracking, cross-language synonym normalization for issue dedup (ba1f398) - feat: Codex-guided idea refinement, pluggable compute backend (local + SSH), CLI subcommands, plugin hooks, novelty-checker (
08bbf62,4a32775,809476f) - feat: Writing quality gates, deterministic LaTeX, self-heal daemon, experiment digest (
7638a40,a64d321,d186e69) - fix: GPU scheduling — file locking, lease TTL cleanup, failed task exclusion, stuck task detection, topo sort warnings (
51d0a68,b4333e4,7152866,cce8e3a,86f62d3,fc0b9d6,be67959,ba25925,0c4692f,e082d42,eff0da2)
2026-03-12 – 03-13
- feat: Iteration directories as default, experiment supervisor recovery, orchestra skills for experiment agents (
5ce9df4,8642cda,e960540) - docs: Quick start guide updates, macOS
.gitignore(83ba348,c2ccaa5)
2026-03-11
- refactor: Extract
orchestrate.pyintosibyl/orchestration/package, skills userender_skill_prompt(1fcd1ef,8e23f32) - feat: Evolution locking, global GPU leases, workspace session isolation, sentinel improvements, orchestra external skills (
fb491b4,5733615,f800e93,6428ad8)
2026-03-10
- release: v0.6.1 (
546cf07)
2026-03-08
- feat: 6-agent idea/result debate, evolution engine overhaul (7 categories, time decay, agent-routed overlays), GPU-aware scheduling, parallel review stage, Task UI integration (
d89dd7d,a3be563,572f085,1187d2c,5686a83,fc305c7) - test: 134 pytest tests for orchestrator, workspace, and support modules (
7a52b85) - fix: 12 bug fixes across orchestrator, workspace, and plugin commands (
a82d18c,5e3c5af,2e8b3a3,b29bb23,ffdf1fd,314972e,b0f0199,6a4d438,142642b,21bbfa6)
2026-03-07
- feat: Rewrite to Claude Code native architecture (v0.5.0) — LaTeX pipeline, Feishu upload, server migration (
918ae17,537d26c,51c18d8)
System Overview
Sibyl orchestrates 20+ AI agents through a 19-stage state-machine pipeline, automatically completing literature survey, idea generation, experiment design & execution, result analysis, paper writing, and peer review. The system supports multi-round iterative optimization with built-in cross-project learning that continuously improves research quality.
Core Features
- 19-Stage Research Pipeline: End-to-end automation from literature search to camera-ready paper
- Multi-Agent Collaboration: 6-agent debate for idea generation, 6-agent result analysis, 6-agent parallel writing
- GPU-Parallel Scheduling: Topological sort + dynamic dispatch, maximizing GPU utilization with automatic task dependency management
- Autonomous Iterative Optimization: Quality gate auto-decides whether to continue iterating, pivot to new ideas, or terminate — every dimension of research improves across iterations
- Self-Evolving System: Automatically extracts lessons across 8 categories, tracks effectiveness, prunes what doesn't work, and updates agent prompts — the system improves itself with every project
- Self-Healing System: Background agent continuously monitors for runtime errors, auto-fixes them using skill pipelines, adds regression tests, and commits fixes — all without human intervention
- Multi-Model Collaboration: Claude Opus/Sonnet + GPT-5.4 (Codex) independent cross-review
Pipeline
+== Research Iteration =============+ +== Paper Writing ====================+
| | | |
| Literature Search (arXiv + Web) | | Outline |
| | | | | |
| v | | v |
| Idea Debate (6 Agents) | | Section Writing (seq/para/Codex) |
| | | | | |
| v | | v |
| Experiment Planning | | Cross Review (6 Agents) |
| | | | | |
| v | | v |
| Pilot Experiments | | Integration & Editing |
| | | | | |
| v | | v |
| Full Experiments (GPU parallel) | | Final Review (NeurIPS level) |
| | | | | fail --> back to edit (x2) |
| v | | v |
| Result Debate (6 Agents) | | LaTeX --> compile PDF |
| | | | | |
| v | +-------|---------+-------------------+
| Decision | |
| | PIVOT --> back to Idea | |
| | PROCEED | v
+-------|-----------+----------------+ +== Review & Reflection ==============+
| | |
+----------> Outline | Review (Critic+Supervisor+Codex) |
| | |
| v |
| Reflection (lessons learned) |
| | |
| v |
| Lark Sync (cloud docs) |
| | |
| v |
| Quality Gate |
| | >= 8.0 & >= 2 iter --> DONE |
| | else --> next iteration |
| |
+-------------------------------------+
Stage Details
| Stage | Description | Agent Mode |
|---|---|---|
literature_search |
Dual-source survey via arXiv + Web | Single Agent |
idea_debate |
6-perspective idea debate (Innovator / Pragmatist / Theorist / Contrarian / Interdisciplinary / Empiricist) | 6-Agent Team |
planning |
Design experiments, generate task_plan.json with dependencies | Single Agent |
pilot_experiments |
Small-scale feasibility validation | Single Agent |
experiment_cycle |
GPU-parallel full experiments, topologically sorted batch scheduling | Single Agent + GPU Scheduler |
result_debate |
6-perspective result analysis (Optimist / Skeptic / Strategist / Methodologist / Comparativist / Revisionist) | 6-Agent Team |
experiment_decision |
Supervisor decision: PIVOT (change direction) or PROCEED | Single Agent |
writing_outline |
Generate paper outline | Single Agent |
writing_sections |
Write by section (sequential / parallel / Codex modes) | Configurable |
writing_critique |
6-agent cross-review of each section | 6-Agent Parallel |
writing_integrate |
Editor integrates into complete paper | Single Agent |
writing_final_review |
NeurIPS/ICML-level final review (can loop for revision) | Single Agent |
writing_latex |
Convert to NeurIPS LaTeX format and compile PDF | Single Agent |
review |
Critic + Supervisor + Codex parallel review | Parallel Skills |
reflection |
Classify issues, generate improvement plan, record lessons | Single Agent |
lark_sync |
Sync research data to Feishu/Lark cloud docs | Single Agent |
quality_gate |
Evaluate completion (≥8.0 score and ≥2 iterations) | Automatic |
Agent Roles
Idea Generation Team
| Agent | Perspective | Responsibility |
|---|---|---|
| Innovator | Cross-domain innovation | Bold methodology transfer and novel combinations |
| Pragmatist | Engineering feasibility | Ensure ideas are implementable |
| Theorist | Mathematical foundations | Focus on theoretical guarantees and proofs |
| Contrarian | Challenge assumptions | Find counter-evidence and blind spots |
| Interdisciplinary | Analogical inspiration | Import methods from cognitive science, physics, biology |
| Empiricist | Experiment-first | Focus on reproducibility and data quality |
Result Analysis Team
| Agent | Perspective | Responsibility |
|---|---|---|
| Optimist | Positive findings | Discover positive results and extension directions |
| Skeptic | Statistical rigor | Question statistical significance and confounders |
| Strategist | Next steps | Suggest resource allocation and research direction |
| Methodologist | Method review | Evaluate internal and external validity |
| Comparativist | SOTA benchmarking | Compare and position against existing best methods |
| Revisionist | Hypothesis revision | Reflect on and adjust hypotheses based on results |
Model Tiers
| Tier | Model | Usage |
|---|---|---|
| Heavy | Opus 4.6 | Synthesis, supervision, editing, criticism, reflection |
| Standard | Opus 4.6 | Literature survey, planning, experiments, writing |
| Light | Sonnet 4.6 | Result debate, cross-review, section critique |
| Codex | GPT-5.4 High | Independent third-party review, optional writing mode |
Self-Evolution System
Sibyl doesn't just run research — it learns how to run research better. After every iteration, the system analyzes what worked, what failed, and what was inefficient, then automatically updates itself:
Research Iteration completes
|
v
Reflection Agent ──> Analyze outcomes across 8 dimensions
| ├── Experiment design quality
| ├── Writing clarity & structure
| ├── Resource efficiency (GPU utilization, scheduling)
| ├── Idea novelty & contribution
| └── System reliability, analysis depth, planning, pipeline
v
Evolution Engine ──> Track & rank lessons
| ├── Time-weighted frequency analysis (30-day half-life)
| ├── Schema-normalized issue clustering (stable issue keys)
| └── Success pattern extraction (what to keep doing)
v
Auto-Update ──> Inject relevant improvements into agent prompts
| ├── Context-aware overlay filtering per agent/stage
| ├── Stale overlays removed automatically
| └── Efficiency insights: scheduling & resource optimization
v
Self-Check ──> Detect systemic anomalies
├── Declining quality trend across iterations
├── Recurring errors that lessons haven't fixed
└── Ineffective lesson accumulation
Why Self-Evolution Actually Works
Most AI systems that claim to "learn" are stateful processes — they accumulate context within a single session, but lose everything when the process restarts. Sibyl takes a fundamentally different approach: stateless architecture with persistent artifacts.
- Every prompt is loaded from disk at call time. There is no in-memory cache, no long-running daemon. Each agent reads its prompt file (
sibyl/prompts/*.md) fresh every time it is invoked. If the evolution engine rewrites a prompt, the very next agent call picks up the change — zero restart, zero redeployment. - Every agent runs as an independent subprocess. Skills execute via
python3 -c "..."in a fresh process, so Python modules are re-imported every time. Code changes insibyl/*.pytake effect immediately on the next stage. - Config is re-parsed per orchestrator call.
cli_next()instantiates a newOrchestratoreach time, re-readingconfig.yamlfrom disk. Parameter tuning by the evolution engine is picked up on the next tick. - Lesson overlays are plain files. Experience extracted from past projects is written to
.sibyl/evolution/lessons/{agent}.md(or the runtime path overridden bySIBYL_STATE_DIR/SIBYL_EVOLUTION_DIR). Theload_prompt()function appends the overlay content on every call — new lessons are injected into the next agent invocation automatically.
This means evolution is not a "batch update" that requires a maintenance window. It is a continuous, incremental process: the system that runs iteration N+1 is already different from the one that ran iteration N, because the reflection after iteration N has already modified prompts, overlays, and potentially code on disk. The entire system is designed so that every file is the source of truth, and every file is read fresh — making self-evolution a natural consequence of the architecture rather than a bolted-on feature.
Safety: All system file modifications are gated by mandatory tests (.venv/bin/python3 -m pytest tests/) and tracked via git commits, ensuring every evolution step is reversible and auditable.
8 Issue Categories: SYSTEM, EXPERIMENT, WRITING, ANALYSIS, PLANNING, PIPELINE, IDEATION, EFFICIENCY — each automatically routed to the relevant agents. The planner learns to design better experiments, the experimenter learns to use GPUs more efficiently, the writer learns to avoid recurring style issues — all without manual intervention.
Self-Healing System
While the self-evolution system learns from completed iterations, the self-healing system operates in real time — continuously monitoring for runtime errors and fixing them autonomously as the research pipeline runs.
Runtime Error Occurs
|
v
Error Collector ──> Structured capture to logs/errors.jsonl
| ├── Exception type, traceback, file, line
| ├── Pipeline stage & project context
| └── Automatic categorization (7 types)
v
Error Router ──> Intelligent triage
| ├── Deduplication (hash-based)
| ├── Priority sorting (import > build > type > test > ...)
| ├── Skill routing (error type → repair skill pipeline)
| └── Circuit breaker (3 failures → escalate to human)
v
Self-Healer Agent ──> Autonomous repair
| ├── Invoke mapped skills (systematic-debugging, tdd-workflow, ...)
| ├── Apply fix with scope limits (max 5 files, protected file rules)
| ├── Generate regression test to prevent recurrence
| └── Verify: full test suite must pass
v
Git Commit ──> fix(self-heal): <description> [auto]
└── All fixes tracked on dev branch, periodically synced to main
How It Works
The self-healing system is a three-layer architecture:
-
Error Collector (
sibyl/error_collector.py) — Captures runtime exceptions with full context (traceback, stage, project) into structured JSONL records. A@wrap_clidecorator automatically catches errors from all orchestrator CLI functions. Errors are categorized into 7 types:import,test,type,state,config,build,prompt. -
Error Router (
sibyl/self_heal.py) — Deduplicates errors by content hash, sorts by priority (import errors before config errors), and maps each error category to a repair skill pipeline via the skill route table. A circuit breaker prevents infinite fix loops: after 3 failed attempts on the same error, it is logged and skipped, then retried in the next iteration. -
Self-Healer Agent (
sibyl-self-healerskill) — A fork skill running on the standard tier (Opus) that receives repair tasks and autonomously:- Invokes the appropriate skills (e.g.,
systematic-debugging→tdd-workflow) - Applies the fix within scope limits (max 5 files per fix, surgical changes to protected files)
- Writes a regression test covering the exact failure condition
- Runs the full test suite to verify the fix
- Commits with
fix(self-heal): ... [auto]format for full traceability
- Invokes the appropriate skills (e.g.,
Safety Mechanisms
| Mechanism | Purpose |
|---|---|
| Circuit breaker | Same error failing 3 times → stops and flags for human review |
| File scope limit | Max 5 files modified per fix — prevents over-reaching changes |
| Protected files | Core files like orchestrate.py only allow minimal, surgical edits |
| Test gate | Full test suite must pass before any fix is committed |
| Git tracking | Every fix is a separate commit on dev — fully reversible |
Configuration
self_heal_enabled: true # Enable self-healing (default: true)
self_heal_interval_sec: 300 # Background scan interval (default: 5 min)
self_heal_max_attempts: 3 # Circuit breaker threshold (default: 3)Project Structure
sibyl-system/
├── sibyl/ # Core Python modules
│ ├── orchestrate.py # State-machine orchestrator (19-stage pipeline)
│ ├── config.py # Configuration (models/GPU/modes)
│ ├── workspace.py # Workspace file & Git management
│ ├── gpu_scheduler.py # GPU topological sort & parallel scheduling
│ ├── evolution.py # Cross-project evolution engine
│ ├── reflection.py # Iteration logging
│ ├── error_collector.py # Structured error capture for self-healing
│ ├── self_heal.py # Error routing, circuit breaker, repair orchestration
│ └── prompts/ # 40 agent prompt templates
├── .claude/
│ ├── agents/ # Agent tier definitions (heavy/standard/light)
│ └── skills/sibyl-*/ # 36 Fork Skills (isolated context execution)
├── plugin/commands/ # Claude Code plugin commands
├── tools/ # Repo-level helper tools (not research projects)
├── workspaces/ # Sibyl research project workspaces only
├── tests/ # Unit tests (~800 tests)
└── requirements.txt # Dependencies (PyYAML, rich, Flask, gunicorn)
Workspace Structure
Each research project has an independent filesystem under workspaces/<project>/:
workspaces/<project>/
├── CLAUDE.md # Effective runtime prompt (system + project memory)
├── status.json # Orchestrator state (stage/iteration/score)
├── config.yaml # Project-level config overrides
├── topic.txt / spec.md # Research topic & requirements spec
├── .sibyl/project/ # Project memory and prompt overlays
├── context/literature.md # Literature review
├── idea/ # Proposals, alternatives, debate records
├── plan/ # Experiment plan, task_plan.json
├── exp/ # Code, results, logs, GPU progress
├── writing/ # Outline, sections, reviews, full paper, LaTeX
├── logs/ # Iteration archives, research diary
└── lark_sync/ # Feishu/Lark sync registry
Repo-level helper utilities belong under tools/, not workspaces/. If a directory under workspaces/ does not have Sibyl project state such as status.json, it should be moved out.
Documentation
| Document | Description |
|---|---|
| Setup Guide | Claude-readable setup checklist (recommended) |
| Getting Started | Full installation and first-run guide |
| Configuration | All 35+ config options reference |
| MCP Servers | Third-party MCP dependencies & setup |
| SSH & GPU Setup | GPU server configuration |
| Plugin Commands | All 12 plugin commands reference |
| Codex Integration | GPT-5.4 cross-review setup |
| Feishu/Lark Setup | Cloud document sync |
| Architecture | System internals for contributors |
Third-Party Dependencies
MCP Servers
| Server | Required | Purpose | Source |
|---|---|---|---|
| SSH MCP | Yes | Remote GPU execution | npx @fangjunjie/ssh-mcp-server |
| arXiv MCP | Yes | Paper search | pip install arxiv-mcp-server |
| Google Scholar MCP | Recommended | Citation search | GitHub clone |
| Codex MCP | Optional | GPT-5.4 review | npm install -g @openai/codex |
| Lark MCP | Optional | Feishu Bitable/IM | npm install -g @larksuiteoapi/lark-mcp |
| Feishu MCP | Optional | Feishu documents | npm install -g feishu-mcp |
| bioRxiv MCP | Optional | Biology preprints | pip install biorxiv-mcp-server |
| Playwright MCP | Optional | Web browsing | npm install -g @playwright/mcp |
See MCP Servers Guide for installation and MCP registration details.
Python Dependencies
- PyYAML >= 6.0 — Config file parsing
- rich >= 13.0 — Terminal formatted output
- Flask >= 3.0 — Dashboard web server
- gunicorn >= 22.0 — Production WSGI server for dashboard
Optional Tools
- OpenAI Codex CLI — Independent cross-review (opt in with
codex_enabled: true) - Ralph Loop — Autonomous iteration loop (Claude Code plugin)
- AI Research Skills — 85 expert skills covering fine-tuning, inference, evaluation, paper writing, and more. When installed, Sibyl agents automatically discover relevant skills and invoke them on demand for best-practice guidance. See setup guide for installation.
Key Mechanisms
GPU Parallel Scheduling
The experiment stage reads task_plan.json, topologically sorts tasks by dependencies, then greedily assigns parallel execution based on available GPUs:
{
"tasks": [
{"id": "train_baseline", "depends_on": [], "gpu_count": 2, "estimated_minutes": 60},
{"id": "train_model_a", "depends_on": ["train_baseline"], "gpu_count": 1, "estimated_minutes": 90},
{"id": "train_model_b", "depends_on": ["train_baseline"], "gpu_count": 1, "estimated_minutes": 90},
{"id": "ablation", "depends_on": ["train_model_a", "train_model_b"], "gpu_count": 1, "estimated_minutes": 30}
]
}Cross-Project Self-Evolution
Lessons learned in one project automatically improve all future projects:
- Record: Classify issues (8 categories) and success patterns after each iteration
- Analyze: Aggregate with time-decay weighting (30-day half-life) — recent lessons matter more
- Normalize: Canonicalize action plans (category/severity/status/trajectory) and assign stable
issue_keys - Apply: Generate per-agent prompt overlays — each agent receives only the lessons relevant to its role and current context
- Prune: Remove stale overlays automatically when no longer supported by current insights
- Self-Check: Detect quality decline and recurring unresolved errors
PIVOT Mechanism
When experiment results are unsatisfactory, the supervisor decision agent can trigger PIVOT:
- Analyze whether results support the original hypothesis
- Evaluate whether continued investment is worthwhile
- If PIVOT: roll back to idea debate stage with alternative proposals
- Maximum 6 PIVOT cycles (configurable)
Comparison
| Feature | Sibyl Research System | AI Scientist | AutoResearch | AgentLaboratory |
|---|---|---|---|---|
| Architecture | Claude Code native (skills, teams, MCP) | API wrapper | Single-file script | API wrapper |
| Agent count | 20+ specialized agents | Single LLM | Single agent | 3 agents |
| Idea generation | 6-agent multi-perspective debate | LLM brainstorming | N/A | LLM brainstorming |
| Experiment execution | GPU-parallel with topo-sort scheduling | Template-based | Single-GPU loop | Sandbox execution |
| Paper writing | Multi-agent write + review + revise | LLM generation | N/A | LLM generation |
| Self-evolution | Cross-project lesson learning | None | None | None |
| Self-healing | Auto-detect & fix runtime errors | None | None | None |
| Quality control | Multi-round review + quality gate | Automated review | Metric-based | Human-in-the-loop |
| Human intervention | Fully autonomous | Minimal | Minimal | Optional |
Related Projects
- The AI Scientist — Pioneering automated scientific discovery by SakanaAI
- AutoResearch — Karpathy's elegant single-GPU autoresearch loop
- AgentLaboratory — End-to-end autonomous research workflow
- GPT-Researcher — Autonomous deep web research agent
- OpenScholar — Scientific literature synthesis with retrieval-augmented LMs
- EvoScientist — Self-evolving AI scientists
License
MIT License
Keywords: AI scientist, autonomous research, automated scientific discovery, research automation, multi-agent system, LLM agents, paper generation, experiment execution, GPU scheduling, self-evolving AI, self-healing, deep research, AI for science, Claude Code, MCP, autoresearch, vibe research, agentic workflow, literature review automation, hypothesis generation, automated peer review, conference paper writing
