14 results for “topic:transformerlens”
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching"
Lightweight representation engineering dataflow operations for agent developers.
Investigating whether language models encode anticipated social consequences in their activations. Uses a 2×2 factorial design crossing truth × social valence to show that models are more sensitive to expected approval/disapproval than to truth itself.
Implementation and analysis of Sparse Autoencoders for neural network interpretability research. Features interactive visualization dashboard and W&B integration.
Does Quantization Kill Interpretability? Scaling study across 5 models (124M–2.8B): RTN destroys induction heads in small models, GPTQ preserves them at all scales.
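The induction heads mentioned in that study can be illustrated with a toy example. The sketch below (synthetic data, not any repository's actual code) builds an idealized "perfect induction" attention pattern over a repeated token sequence and computes the standard induction score: average attention from each position back to the token immediately after the previous occurrence of the current token.

```python
import numpy as np

# An induction head at position i attends back to the token *after* the
# previous occurrence of the token at i-1, letting it copy continuations.
seq = np.array([5, 2, 7, 3, 5, 2, 7, 3])  # a random segment, repeated once
T, period = len(seq), 4

# Idealized attention pattern for a perfect induction head.
attn = np.zeros((T, T))
for i in range(1, T):
    matches = [j + 1 for j in range(i - 1) if seq[j] == seq[i - 1]]
    for j in matches:
        attn[i, j] = 1.0 / len(matches)

# Induction score: attention from second-half position i to i - period,
# i.e. one past the earlier copy of the preceding token.
score = np.mean([attn[i, i - period] for i in range(period + 1, T)])
print(f"induction score: {score:.2f}")  # 1.00 for this idealized pattern
```

Quantization schemes like RTN can be evaluated by recomputing this score from the quantized model's real attention patterns; a collapse toward zero indicates the head no longer performs induction.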
Training and exploration of linear probes into Othello-GPT by Li et al. (2022)
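The linear-probing technique behind that entry can be sketched in a few lines. This is a minimal stand-in with synthetic activations and a hypothetical linearly encoded board feature (a real probe would be trained on cached Othello-GPT residual-stream activations):

```python
import numpy as np

# Synthetic stand-in: pretend each activation vector encodes a binary
# board feature along a fixed direction w_true (hypothetical).
rng = np.random.default_rng(0)
d, n = 16, 400
w_true = rng.normal(size=d)
acts = rng.normal(size=(n, d))              # fake activation vectors
labels = (acts @ w_true > 0).astype(float)  # linearly encoded feature

# Train a logistic-regression probe with plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    w -= lr * (acts.T @ (p - labels)) / n
    b -= lr * (p - labels).mean()

acc = ((acts @ w + b > 0) == labels.astype(bool)).mean()
print(f"probe train accuracy: {acc:.2f}")
```

High probe accuracy is evidence that the feature is linearly readable from the activations, which is the core claim such Othello-GPT probing work tests.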
Testing role-based pathways on small LLMs
Evaluating how a model's ability to 'know what it knows' changes from its base to its instruct-tuned variant
EU AI Act Annex IV compliance audit platform + mechanistic interpretability toolkit. White-box circuit analysis, black-box audit for any model via API. Open source. MIT.
A Flax-based library for examining transformers, based on TransformerLens.
Automated detection, visualization, and suppression of hallucination-associated neurons in open-source LLMs; a mechanistic interpretability research tool
Reverse engineering the circuit responsible for the "greater than" capability in a language model
(a1) Mechanistic interpretability using TransformerLens; (a2) PEFT