CoEvoLve
A living benchmark that evolves faster than any model can game it.
Gemini 3 Paris Hackathon 2026 · Built with Gemini 3 Pro + Flash · Google Antigravity
The problem with benchmarks
Static benchmarks leak. Once a model trains on the internet, it trains on the benchmark too.
The solution isn't a bigger benchmark — it's a benchmark that evolves.
What CoEvoLve does
Two populations of Gemini agents co-evolve in an open-ended arms race:
- Examiners generate causal reasoning tasks, evolving toward maximum discriminative power — tasks that separate strong solvers from weak ones
- Solvers attempt those tasks, evolving better reasoning strategies
Neither population has a fixed target. The examiner's fitness depends on the current solver population. The solver's fitness depends on the current examiner population. This is the Red Queen dynamic applied to AI benchmarking.
A human-in-the-loop layer allows optional steering: rate tasks on the Pareto frontier to bias the evolutionary search toward scientifically interesting regions.
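The Red Queen coupling can be made concrete with a small sketch. The function names and the use of 4p(1−p) as a discrimination score are illustrative assumptions, not the project's actual fitness functions:

```python
from statistics import mean

def examiner_fitness(task_results):
    # task_results: pass/fail bools, one per solver in the *current* population.
    # A task that splits the current solvers 50/50 is maximally discriminative.
    p = mean(task_results)
    return 4 * p * (1 - p)

def solver_fitness(solver_results):
    # solver_results: pass/fail bools, one per task from the *current* examiners.
    return mean(solver_results)
```

Note that neither score is stable on its own: regenerate either population and the other's fitness landscape shifts under it.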
Why causal reasoning
LLMs have a well-documented, measurable failure mode on causal reasoning:
```
Graph: U → X, U → Y, X → Y   (U is a hidden confounder)

P(Y=1 | X=1)     ≈ 0.72   ← what baseline LLMs report
P(Y=1 | do(X=1)) ≈ 0.55   ← correct causal answer
```

Systematic error: +0.17, consistent across model families.
Ground truth is computed by exact do-calculus (pgmpy). No LLM judge.
No hallucination risk in the evaluation pipeline.
Performance arc
| Level | Task structure | Baseline LLM | Human | Evolved solver |
|---|---|---|---|---|
| Baseline | 3-node chain, no confounders | ~85% | ~95% | ~90% |
| Human threshold | 4-5 nodes, 1 confounder, counterfactual | ~45% | ~75% | ~80% |
| Beyond human | 6+ nodes, nested counterfactual | ~25% | ~40% | ~65% |
The examiner evolves toward the boundary between human threshold and beyond human.
The evolved solver develops systematic strategies that exceed human performance on the hardest tasks.
Architecture
```
Examiner Population (Gemini 3 Pro mutations)
        ↕ arms race
Solver Population (Gemini 3 Flash evaluation)
        ↓
Deterministic Oracle (pgmpy do-calculus)
        ↓
Live Dashboard (React)
        ↑
Human Ratings (optional steering)
```
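One generation of the arms race, reduced to a sketch. Names and structure are illustrative assumptions; the real loop.py batches tasks and mutates both archives:

```python
import random

def run_generation(examiners, solvers, oracle, rng=random):
    # examiners: callables that each generate one task (any value here)
    # solvers:   dict name -> callable(task) -> answer
    # oracle:    callable(task) -> ground truth via exact do-calculus (no LLM judge)
    task = rng.choice(examiners)()
    truth = oracle(task)
    results = {name: solve(task) == truth for name, solve in solvers.items()}
    pass_rate = sum(results.values()) / len(results)
    power = 4 * pass_rate * (1 - pass_rate)  # highest when solvers split evenly
    return results, power
```

A task every solver passes (or every solver fails) scores zero power, which is what pushes the examiner archive toward the boundary of current solver ability.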
Three solver variants
| Solver | Strategy | Represents |
|---|---|---|
| `zero_shot` | Plain question to Gemini Flash | Commodity LLM baseline |
| `cot` | Step-by-step causal reasoning | Better prompting |
| `evolved` | Co-evolved prompt strategy | System's discovered approach |
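The three variants differ only in how the prompt is built. These templates are illustrative, not the repository's actual prompts; in particular, the evolved strategy is discovered by the loop, not hand-written:

```python
def zero_shot_prompt(task: str) -> str:
    # Commodity baseline: the raw question, nothing else.
    return task

def cot_prompt(task: str) -> str:
    # Better prompting: force explicit causal steps before answering.
    return (f"{task}\n\nThink step by step: draw the causal graph, "
            "identify confounders, apply the backdoor adjustment, "
            "then state the final probability.")

def evolved_prompt(task: str, strategy: str) -> str:
    # The strategy text is a genome mutated by the coevolution loop.
    return f"{strategy}\n\n{task}"
```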
Quick start
```shell
git clone https://github.com/yourteam/coevolve
cd coevolve

# Environment
cp .env.example .env
# Add your GEMINI_API_KEY

# Install
pip install -r requirements.txt
cd dashboard && npm install && cd ..

# Run everything
python loop.py &              # coevolution loop
uvicorn api:app &             # backend
cd dashboard && npm run dev   # frontend
```

Open http://localhost:5173 to see the live dashboard.
Dashboard
Three panels updated in real time via SSE:
- Arena — current task as a causal DAG, three solver responses side by side, ground truth revealed after evaluation, human rating buttons
- Arms Race — examiner discriminative power vs solver pass rate over generations, Pareto scatter plot of the examiner population
- Genealogy — ShinkaEvolve-style evolution tree of the examiner archive, colored by discriminative power, hover for difficulty hypothesis
Human-in-the-loop
Click any point on the Pareto scatter to inspect that task.
Rate it 1 (boring) / 2 (interesting) / 3 (fascinating).
The rating biases parent sampling — fascinating tasks get explored more deeply.
This is optional: unrated tasks run with neutral weight.
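The biasing can be as simple as weighted parent sampling. The weight multipliers below are illustrative assumptions, not the tuned values:

```python
import random

RATING_WEIGHT = {1: 0.5, 2: 1.5, 3: 3.0}  # boring / interesting / fascinating

def sample_parent(archive, rng=random):
    # archive: list of dicts with "power" (float) and an optional "rating" (1-3).
    # Unrated tasks keep a neutral weight multiplier of 1.0.
    weights = [t["power"] * RATING_WEIGHT.get(t.get("rating"), 1.0)
               for t in archive]
    return rng.choices(archive, weights=weights, k=1)[0]
```

A rating never hard-filters a task; it only tilts how often that region of the archive is chosen as a mutation parent.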
Cost
Runs at ~$0.004 per generation (Gemini 3 Pro for the examiner, Flash for the solvers).
$20 in API credits → ~5000 generations.
Overnight run of 3000 generations costs ~$12.
What the system discovers
The examiner population independently discovers that:
- Hidden confounders with effect size > 0.3 are maximally discriminative
- Tasks requiring tracking 3+ mediator nodes exceed solver working memory
- Simultaneous interventions on correlated nodes create irreducible difficulty
These are findings, not design decisions. The system discovers the structure of
LLM causal reasoning failure automatically.
Built with
- Gemini 3 Pro — examiner mutation and reasoning
- Gemini 3 Flash — solver evaluation at scale
- Google Antigravity — development and agent orchestration
- pgmpy — deterministic causal oracle
- ShinkaEvolve (Sakana AI, Apache 2.0) — evolutionary framework, forked for dual-archive coevolution
- FastAPI — backend API + SSE streaming
- React + Recharts + React Flow — live dashboard
Team
Built at Gemini 3 Paris Hackathon 2026 in 7 hours.
License
MIT