CoEvoLve
A living benchmark that evolves faster than any model can game it.
Gemini 3 Paris Hackathon 2026 · Built with Gemini 3 Pro + Flash · Google Antigravity
The problem with benchmarks
Static benchmarks leak. Once a model trains on the internet, it trains on the benchmark too.
The solution isn't a bigger benchmark — it's a benchmark that evolves.
What CoEvoLve does
Two populations of Gemini agents co-evolve in an open-ended arms race:
- Examiners generate causal reasoning tasks, evolving toward maximum discriminative power — tasks that separate strong solvers from weak ones
- Solvers attempt those tasks, evolving better reasoning strategies
Neither population has a fixed target. The examiner's fitness depends on the current solver population. The solver's fitness depends on the current examiner population. This is the Red Queen dynamic applied to AI benchmarking.
A human-in-the-loop layer allows optional steering: rate tasks on the Pareto frontier to bias the evolutionary search toward scientifically interesting regions.
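The Red Queen coupling can be made concrete with a small sketch. The function names and the use of 4p(1−p) as a discrimination score are illustrative assumptions, not the project's actual fitness functions:

```python
from statistics import mean

def examiner_fitness(task_results):
    # task_results: pass/fail bools, one per solver in the *current* population.
    # A task that splits the current solvers 50/50 is maximally discriminative.
    p = mean(task_results)
    return 4 * p * (1 - p)

def solver_fitness(solver_results):
    # solver_results: pass/fail bools, one per task from the *current* examiners.
    return mean(solver_results)
```

Note that neither score is stable on its own: regenerate either population and the other's fitness landscape shifts under it.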
Why causal reasoning
LLMs have a well-documented, measurable failure mode on causal reasoning:
```
Graph: U → X, U → Y, X → Y   (U is a hidden confounder)

P(Y=1 | X=1)     ≈ 0.72   ← what baseline LLMs report
P(Y=1 | do(X=1)) ≈ 0.55   ← correct causal answer
```

Systematic error: +0.17, consistent across model families.
Ground truth is computed by exact do-calculus (pgmpy). No LLM judge.
No hallucination risk in the evaluation pipeline.
Performance arc
| Level | Task structure | Baseline LLM | Human | Evolved solver |
|---|---|---|---|---|
| Baseline | 3-node chain, no confounders | ~85% | ~95% | ~90% |
| Human threshold | 4-5 nodes, 1 confounder, counterfactual | ~45% | ~75% | ~80% |
| Beyond human | 6+ nodes, nested counterfactual | ~25% | ~40% | ~65% |
The examiner evolves toward the boundary between human threshold and beyond human.
The evolved solver develops systematic strategies that exceed human performance on the hardest tasks.
Architecture
```
Examiner Population (Gemini 3 Pro mutations)
        ↕ arms race
Solver Population (Gemini 3 Flash evaluation)
        ↓
Deterministic Oracle (pgmpy do-calculus)
        ↓
Live Dashboard (React)
        ↑
Human Ratings (optional steering)
```
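One generation of the arms race, reduced to a sketch. Names and structure are illustrative assumptions; the real loop.py batches tasks and mutates both archives:

```python
import random

def run_generation(examiners, solvers, oracle, rng=random):
    # examiners: callables that each generate one task (any value here)
    # solvers:   dict name -> callable(task) -> answer
    # oracle:    callable(task) -> ground truth via exact do-calculus (no LLM judge)
    task = rng.choice(examiners)()
    truth = oracle(task)
    results = {name: solve(task) == truth for name, solve in solvers.items()}
    pass_rate = sum(results.values()) / len(results)
    power = 4 * pass_rate * (1 - pass_rate)  # highest when solvers split evenly
    return results, power
```

A task every solver passes (or every solver fails) scores zero power, which is what pushes the examiner archive toward the boundary of current solver ability.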
Three solver variants
| Solver | Strategy | Represents |
|---|---|---|
| `zero_shot` | Plain question to Gemini Flash | Commodity LLM baseline |
| `cot` | Step-by-step causal reasoning | Better prompting |
| `evolved` | Co-evolved prompt strategy | System's discovered approach |
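The three variants differ only in how the prompt is built. These templates are illustrative, not the repository's actual prompts; in particular, the evolved strategy is discovered by the loop, not hand-written:

```python
def zero_shot_prompt(task: str) -> str:
    # Commodity baseline: the raw question, nothing else.
    return task

def cot_prompt(task: str) -> str:
    # Better prompting: force explicit causal steps before answering.
    return (f"{task}\n\nThink step by step: draw the causal graph, "
            "identify confounders, apply the backdoor adjustment, "
            "then state the final probability.")

def evolved_prompt(task: str, strategy: str) -> str:
    # The strategy text is a genome mutated by the coevolution loop.
    return f"{strategy}\n\n{task}"
```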
Quick start
```shell
git clone https://github.com/yourteam/coevolve
cd coevolve

# Environment
cp .env.example .env
# Add your GEMINI_API_KEY

# Install
pip install -r requirements.txt
cd dashboard && npm install && cd ..

# Run everything
python loop.py &              # coevolution loop
uvicorn api:app &             # backend
cd dashboard && npm run dev   # frontend
```

Open http://localhost:5173 to see the live dashboard.
Dashboard
Three panels updated in real time via SSE:
- Arena — current task as a causal DAG, three solver responses side by side, ground truth revealed after evaluation, human rating buttons
- Arms Race — examiner discriminative power vs solver pass rate over generations, Pareto scatter plot of the examiner population
- Genealogy — ShinkaEvolve-style evolution tree of the examiner archive, colored by discriminative power, hover for difficulty hypothesis
Human-in-the-loop
Click any point on the Pareto scatter to inspect that task.
Rate it 1 (boring) / 2 (interesting) / 3 (fascinating).
The rating biases parent sampling — fascinating tasks get explored more deeply.
This is optional: unrated tasks run with neutral weight.
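The biasing can be as simple as weighted parent sampling. The weight multipliers below are illustrative assumptions, not the tuned values:

```python
import random

RATING_WEIGHT = {1: 0.5, 2: 1.5, 3: 3.0}  # boring / interesting / fascinating

def sample_parent(archive, rng=random):
    # archive: list of dicts with "power" (float) and an optional "rating" (1-3).
    # Unrated tasks keep a neutral weight multiplier of 1.0.
    weights = [t["power"] * RATING_WEIGHT.get(t.get("rating"), 1.0)
               for t in archive]
    return rng.choices(archive, weights=weights, k=1)[0]
```

A rating never hard-filters a task; it only tilts how often that region of the archive is chosen as a mutation parent.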
Cost
Runs at ~$0.004 per generation (Gemini 3 Pro for the examiner, Flash for the solvers).
$20 in API credits → ~5000 generations.
Overnight run of 3000 generations costs ~$12.
What the system discovers
The examiner population independently discovers that:
- Hidden confounders with effect size > 0.3 are maximally discriminative
- Tasks requiring tracking 3+ mediator nodes exceed solver working memory
- Simultaneous interventions on correlated nodes create irreducible difficulty
These are findings, not design decisions. The system discovers the structure of
LLM causal reasoning failure automatically.
Built with
- Gemini 3 Pro — examiner mutation and reasoning
- Gemini 3 Flash — solver evaluation at scale
- Google Antigravity — development and agent orchestration
- pgmpy — deterministic causal oracle
- ShinkaEvolve (Sakana AI, Apache 2.0) — evolutionary framework, forked for dual-archive coevolution
- FastAPI — backend API + SSE streaming
- React + Recharts + React Flow — live dashboard
Team
Built at Gemini 3 Paris Hackathon 2026 in 7 hours.
License
MIT