14 results for “topic:transformerlens”
Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching"
Lightweight representation engineering dataflow operations for agent developers.
Investigating whether language models encode anticipated social consequences in their activations. Uses a 2×2 factorial design crossing truth × social valence to show that models are more sensitive to expected approval/disapproval than to truth itself.
Implementation and analysis of Sparse Autoencoders for neural network interpretability research. Features interactive visualization dashboard and W&B integration.
Does Quantization Kill Interpretability? Scaling study across 5 models (124M–2.8B): RTN destroys induction heads in small models, GPTQ preserves them at all scales.
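The induction heads mentioned in that study can be illustrated with a toy example. The sketch below (synthetic data, not any repository's actual code) builds an idealized "perfect induction" attention pattern over a repeated token sequence and computes the standard induction score: average attention from each position back to the token immediately after the previous occurrence of the current token.

```python
import numpy as np

# An induction head at position i attends back to the token *after* the
# previous occurrence of the token at i-1, letting it copy continuations.
seq = np.array([5, 2, 7, 3, 5, 2, 7, 3])  # a random segment, repeated once
T, period = len(seq), 4

# Idealized attention pattern for a perfect induction head.
attn = np.zeros((T, T))
for i in range(1, T):
    matches = [j + 1 for j in range(i - 1) if seq[j] == seq[i - 1]]
    for j in matches:
        attn[i, j] = 1.0 / len(matches)

# Induction score: attention from second-half position i to i - period,
# i.e. one past the earlier copy of the preceding token.
score = np.mean([attn[i, i - period] for i in range(period + 1, T)])
print(f"induction score: {score:.2f}")  # 1.00 for this idealized pattern
```

Quantization schemes like RTN can be evaluated by recomputing this score from the quantized model's real attention patterns; a collapse toward zero indicates the head no longer performs induction.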
Training and exploration of linear probes into Othello-GPT by Li et al. (2022)
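The linear-probing technique behind that entry can be sketched in a few lines. This is a minimal stand-in with synthetic activations and a hypothetical linearly encoded board feature (a real probe would be trained on cached Othello-GPT residual-stream activations):

```python
import numpy as np

# Synthetic stand-in: pretend each activation vector encodes a binary
# board feature along a fixed direction w_true (hypothetical).
rng = np.random.default_rng(0)
d, n = 16, 400
w_true = rng.normal(size=d)
acts = rng.normal(size=(n, d))              # fake activation vectors
labels = (acts @ w_true > 0).astype(float)  # linearly encoded feature

# Train a logistic-regression probe with plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    w -= lr * (acts.T @ (p - labels)) / n
    b -= lr * (p - labels).mean()

acc = ((acts @ w + b > 0) == labels.astype(bool)).mean()
print(f"probe train accuracy: {acc:.2f}")
```

High probe accuracy is evidence that the feature is linearly readable from the activations, which is the core claim such Othello-GPT probing work tests.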
Testing role-based pathways on small LLMs
Evaluating how a model's ability to 'know what it knows' changes from its base to its instruct-tuned variant
EU AI Act Annex IV compliance audit platform + mechanistic interpretability toolkit. White-box circuit analysis, black-box audit for any model via API. Open source. MIT.
A Flax-based library for examining transformers, based on TransformerLens.
Automated detection, visualization, and suppression of hallucination-associated neurons in open-source LLMs; a mechanistic interpretability research tool
Reverse engineering the circuit responsible for the "greater than" capability in a language model
(a1) Mechanistic interpretability using TransformerLens; (a2) PEFT