
Clemspace/coevolve

CoEvoLve

A living benchmark that evolves faster than any model can game it.

Gemini 3 Paris Hackathon 2026 · Built with Gemini 3 Pro + Flash · Google Antigravity


The problem with benchmarks

Static benchmarks leak. Once a model trains on the internet, it trains on the benchmark too.
The solution isn't a bigger benchmark — it's a benchmark that evolves.

What CoEvoLve does

Two populations of Gemini agents co-evolve in an open-ended arms race:

  • Examiners generate causal reasoning tasks, evolving toward maximum discriminative power — tasks that separate strong solvers from weak ones
  • Solvers attempt those tasks, evolving better reasoning strategies

Neither population has a fixed target. The examiner's fitness depends on the current solver population. The solver's fitness depends on the current examiner population. This is the Red Queen dynamic applied to AI benchmarking.
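The coupled fitness can be sketched in a few lines. The definitions below are assumptions (the text doesn't pin them down): solver fitness as pass rate, examiner fitness as the variance of per-solver pass rates, so a task scores highest when it splits the solver population.

```python
from statistics import mean, pvariance

def solver_fitness(results: list[int]) -> float:
    """Fraction of tasks a solver answered correctly (its pass rate)."""
    return mean(results)

def examiner_fitness(pass_rates: list[float]) -> float:
    """Discriminative power of a task: how widely it separates solvers.
    Here, population variance of per-solver pass rates (a stand-in metric)."""
    return pvariance(pass_rates)

# A task every solver passes discriminates nothing; one that splits
# the population scores highest -- and both scores shift whenever the
# other population evolves.
easy = examiner_fitness([1.0, 1.0, 1.0])          # no separation
splitting = examiner_fitness([0.9, 0.5, 0.1])     # strong separation
```

Note that neither function has a fixed optimum: `examiner_fitness` is recomputed against whatever solvers currently exist, which is exactly the Red Queen coupling.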

A human-in-the-loop layer allows optional steering: rate tasks on the Pareto frontier to bias the evolutionary search toward scientifically interesting regions.

Why causal reasoning

LLMs have a well-documented, measurable failure mode on causal reasoning:

Graph: U → X, U → Y, X → Y  (U is hidden confounder)

P(Y=1 | X=1)       ≈ 0.72  ← what baseline LLMs report
P(Y=1 | do(X=1))   ≈ 0.55  ← correct causal answer

Error: +0.17, consistently, across model families

Ground truth is computed by exact do-calculus (pgmpy). No LLM judge.
No hallucination risk in the evaluation pipeline.

Performance arc

| Level | Task structure | Baseline LLM | Human | Evolved solver |
|---|---|---|---|---|
| Baseline | 3-node chain, no confounders | ~85% | ~95% | ~90% |
| Human threshold | 4-5 nodes, 1 confounder, counterfactual | ~45% | ~75% | ~80% |
| Beyond human | 6+ nodes, nested counterfactual | ~25% | ~40% | ~65% |

The examiner evolves toward the boundary between human threshold and beyond human.
The evolved solver develops systematic strategies that exceed human performance on the hardest tasks.

Architecture

Examiner Population (Gemini 3 Pro mutations)
    ↕  arms race
Solver Population  (Gemini 3 Flash evaluation)
    ↓
Deterministic Oracle (pgmpy do-calculus)
    ↓
Live Dashboard (React)
    ↑
Human Ratings (optional steering)

Three solver variants

| Solver | Strategy | Represents |
|---|---|---|
| zero_shot | Plain question to Gemini Flash | Commodity LLM baseline |
| cot | Step-by-step causal reasoning | Better prompting |
| evolved | Co-evolved prompt strategy | System's discovered approach |
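The three variants can be thought of as prompt templates wrapped around the same task text. The wording below is hypothetical (the real `evolved` prompt is discovered by the loop, not hand-written):

```python
# Hypothetical prompt templates for the three solver variants.
PROMPTS = {
    "zero_shot": "Answer with a single probability.\n\n{task}",
    "cot": (
        "Think step by step: identify the causal graph, list the "
        "back-door paths, decide what to adjust for, then answer.\n\n{task}"
    ),
    "evolved": (
        # Placeholder: in the running system this string is a genome,
        # mutated and selected over generations.
        "First distinguish P(Y|X) from P(Y|do(X)). Enumerate confounders "
        "before computing anything.\n\n{task}"
    ),
}

def build_prompt(variant: str, task: str) -> str:
    """Wrap a task in the chosen variant's template before sending to Flash."""
    return PROMPTS[variant].format(task=task)
```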

Quick start

git clone https://github.com/Clemspace/coevolve
cd coevolve

# Environment
cp .env.example .env
# Add your GEMINI_API_KEY

# Install
pip install -r requirements.txt
cd dashboard && npm install && cd ..

# Run everything
python loop.py &          # coevolution loop
uvicorn api:app &         # backend
cd dashboard && npm run dev  # frontend

Open http://localhost:5173 to see the live dashboard.

Dashboard

Three panels, updated in real time via SSE (server-sent events):

Arena — current task as a causal DAG, three solver responses side by side,
ground truth revealed after evaluation, human rating buttons

Arms Race — examiner discriminative power vs solver pass rate over generations,
Pareto scatter plot of the examiner population

Genealogy — ShinkaEvolve-style evolution tree of the examiner archive,
colored by discriminative power, hover for difficulty hypothesis

Human-in-the-loop

Click any point on the Pareto scatter to inspect that task.
Rate it 1 (boring) / 2 (interesting) / 3 (fascinating).
The rating biases parent sampling — fascinating tasks get explored more deeply.
This is optional: unrated tasks run with neutral weight.
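One simple way to realize "ratings bias parent sampling" (the actual weighting scheme isn't specified here): multiply each archive entry's fitness by a rating factor, with unrated tasks kept at neutral weight 1.0.

```python
import random

# Hypothetical rating weights; None (unrated) stays neutral.
RATING_WEIGHT = {None: 1.0, 1: 0.5, 2: 1.5, 3: 3.0}

def sample_parent(archive: list[dict], rng=random) -> dict:
    """Pick a parent task for mutation, biased by fitness x human rating."""
    weights = [t["fitness"] * RATING_WEIGHT[t.get("rating")] for t in archive]
    return rng.choices(archive, weights=weights, k=1)[0]

archive = [
    {"id": "t1", "fitness": 0.4, "rating": 3},  # fascinating -> explored more
    {"id": "t2", "fitness": 0.4, "rating": 1},  # boring -> explored less
    {"id": "t3", "fitness": 0.4},               # unrated -> neutral weight
]
```

With equal fitness, the fascinating task is sampled roughly six times as often as the boring one under these assumed weights.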

Cost

Runs on ~$0.004/generation (Gemini 3 Pro for examiner, Flash for solvers).
$20 in API credits → ~5000 generations.
Overnight run of 3000 generations costs ~$12.

What the system discovers

The examiner population independently discovers that:

  • Hidden confounders with effect size > 0.3 are maximally discriminative
  • Tasks requiring tracking 3+ mediator nodes exceed solver working memory
  • Simultaneous interventions on correlated nodes create irreducible difficulty

These are findings, not design decisions. The system discovers the structure of
LLM causal reasoning failure automatically.

Built with

  • Gemini 3 Pro — examiner mutation and reasoning
  • Gemini 3 Flash — solver evaluation at scale
  • Google Antigravity — development and agent orchestration
  • pgmpy — deterministic causal oracle
  • ShinkaEvolve (Sakana AI, Apache 2.0) — evolutionary framework, forked for dual-archive coevolution
  • FastAPI — backend API + SSE streaming
  • React + Recharts + React Flow — live dashboard

Team

Built at Gemini 3 Paris Hackathon 2026 in 7 hours.

License

MIT
