# Letting Claude do Autonomous Research on Improving SAEs
This repository contains the code accompanying the post "Letting Claude do Autonomous Research on Improving SAEs".
We pointed Claude at the SynthSAEBench-16k synthetic SAE benchmark, told it to improve SAE performance (see TASK.md), and left it running in a Ralph Wiggum loop. By the next morning it had boosted F1 score from 0.88 to 0.95, and within another day — with occasional hints from us — it matched the logistic regression probe ceiling of 0.97 F1.
The SAE architecture that resulted is in the `autoresearch` directory along with tests written by Claude, and one version of the original task specification is in TASK.md. We ran Claude Code with the simple prompt "Follow the instructions in TASK.md." There's a sample sprint report that Claude generated, from a sprint where it explored decreasing K during training, in `sample_sprint_report.pdf`.
## Results
The LISTA-Matryoshka SAE that resulted from this Claude-driven autoresearch substantially outperforms all published baselines (BatchTopK, JumpReLU, Standard, Matryoshka, MatryoshkaTopK) on both F1 score and MCC across L0 values on SynthSAEBench-16k.
It's still not clear whether this approach will transfer to LLM SAEs, but it did an excellent job on the task we set for it!
## Architecture
The final architecture is a LISTA-Matryoshka SAE with Decreasing K, combining several improvements Claude discovered or refined:
| Improvement | Description |
|---|---|
| LISTA encoder | A single iteration of LISTA (neural approximation to sparse coding) as the SAE encoder. Claude found this paper autonomously. |
| Linearly decrease K | Anneal K from a higher initial value down to the target during training. |
| Detach inner Matryoshka levels | Detach gradients between Matryoshka levels except the outermost. |
| TERM loss | Tilted ERM to up-weight high-loss samples (tilt ~2e-3). Also found by Claude. |
| Dynamic Matryoshka by frequency | Sort latents by firing frequency before applying Matryoshka losses. |
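To make three of these components concrete, here is a minimal NumPy sketch of a single LISTA refinement step, a linear K-annealing schedule, and the tilted (TERM) loss. This is an illustrative reconstruction, not the repo's code — the exact forms, shapes, and hyperparameters live in `autoresearch/sae.py`, and the function names here are hypothetical.

```python
import numpy as np

def soft_threshold(x, theta):
    # Shrinkage operator used by (L)ISTA: shrinks values toward zero by theta.
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def lista_encode(x, W, S, theta):
    # One LISTA iteration (assumed form): an initial shrinkage of the
    # linear projection W @ x, followed by one refinement through a
    # learned lateral matrix S.
    b = W @ x
    z = soft_threshold(b, theta)
    return soft_threshold(b + S @ z, theta)

def k_schedule(step, total_steps, k_init, k_target):
    # Linearly anneal K from k_init down to k_target over training.
    frac = min(step / total_steps, 1.0)
    return round(k_init + (k_target - k_init) * frac)

def term_loss(per_sample_losses, tilt=2e-3):
    # Tilted ERM: (1/t) * log(mean(exp(t * l_i))).
    # For t > 0 this up-weights high-loss samples relative to the mean.
    losses = np.asarray(per_sample_losses, dtype=np.float64)
    return float(np.log(np.mean(np.exp(tilt * losses))) / tilt)
```

With equal per-sample losses the tilted loss reduces to the ordinary mean; as the tilt grows it interpolates toward the max, which is the mechanism for up-weighting hard samples.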
## Repository structure

```
├── autoresearch/
│   ├── sae.py            # LISTA-Matryoshka SAE implementation (training + inference)
│   └── train.py          # Training script with predefined experiments and ablations
├── tests/
│   └── test_sae.py       # Test suite (~40 tests)
├── assets/               # Result plots
├── TASK.md               # Task specification used to guide Claude's research sprints
├── main.py               # Entry point
└── pyproject.toml        # Project config (requires sae-lens>=6.37.6)
```
## Usage

Install dependencies:

```shell
uv sync
```

Run the default LISTA-Matryoshka recipe:

```shell
python autoresearch/train.py default
```

Run ablations (removes one component at a time):

```shell
python autoresearch/train.py ablations
```

Run an L0 sweep:

```shell
python autoresearch/train.py l0_sweep
```

Run tests:

```shell
uv run pytest tests/
```