Inside Tiny Aya: Cross-Lingual Concept Representations in Multilingual Variants
Mechanistic interpretability analysis of how regional fine-tuning affects cross-lingual concept representations in Tiny Aya (3.35B) model variants.
Key Finding
Tiny Aya builds shared cross-lingual concepts mid-network (layers 18-20), then destroys them at the output layers. All variants follow this rise-peak-collapse trajectory, but regional fine-tuning (Fire, Earth) determines how much alignment the model builds: +15% for Hindi, +40% for Amharic. For most Latin-script languages (French, Spanish, Swahili), alignment is near-maximal at the embedding layer — a tokenizer artifact, not a learned capability. Yoruba is a notable exception, suggesting script alone does not guarantee embedding-level alignment.
Cross-lingual alignment curves for three languages. Swahili (Latin script) shows 1.0 from layer 0 (tokenizer artifact). Hindi shows Fire/Earth building ~15% more alignment than Base. Amharic shows the largest gains (+40%), with Earth slightly ahead of Fire.
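The curves above are mean cosine similarities between English and target-language residual-stream activations at each layer. A minimal sketch of that computation (the array shapes and concept-averaging are assumptions; the actual logic lives in concept_alignment.py):

```python
import numpy as np

def alignment_curve(en_acts: np.ndarray, xx_acts: np.ndarray) -> np.ndarray:
    """Mean cosine similarity between English and target-language
    activations at each layer.

    en_acts, xx_acts: (num_layers, num_concepts, hidden_dim) arrays of
    residual-stream activations for the same concept set.
    Returns a (num_layers,) alignment curve.
    """
    # Unit-normalize along the hidden dimension
    en = en_acts / np.linalg.norm(en_acts, axis=-1, keepdims=True)
    xx = xx_acts / np.linalg.norm(xx_acts, axis=-1, keepdims=True)
    # Cosine similarity per (layer, concept), then average over concepts
    return (en * xx).sum(axis=-1).mean(axis=-1)
```

A curve of 1.0 at layer 0 (as for Swahili) means English and target-language tokens already map to near-identical embeddings, which is why the report flags it as a tokenizer artifact rather than learned alignment.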
Quick Start
# 1. Install dependencies
uv sync
# 2. Build stimulus manifest (combines concept probes + FLORES-200)
uv run build_stimuli.py
# 3. Run activation extraction (requires GPU — use Modal for A10G)
modal run batch_runner.py
# OR for local CPU/MPS execution (slower):
uv run batch_runner.py --local
# 4. Compute cross-lingual alignment curves and commitment matrix
uv run concept_alignment.py
# 5. Classify failure cases
uv run failure_classifier.py
# 6. Generate visualizations
uv run viz/heatmap.py
uv run viz/alignment_curves.py
# 7. Launch interactive demo (demo mode — no GPU required)
uv run app.py --demo
Project Structure
aya-cross-lingual-probes/
├── data/
│ ├── concept_probes.json # 20 medical concepts x 10 languages (hand-verified)
│ └── stimulus_manifest.json # Combined stimulus set (generated by build_stimuli.py)
├── activations/ # Residual stream activations (float16, not in git)
│ ├── base/
│ ├── fire/
│ └── earth/
├── results/
│ ├── alignment_curves.json # Cosine similarity by layer (primary output)
│ ├── commitment_matrix.csv # Commitment layer per (language x model)
│ └── failure_cases.csv # Failure taxonomy instances
├── assets/
│ ├── tweet_three_panel.png # Primary visualization (three-panel line chart)
│ ├── alignment_curve_avg_hi.png
│ ├── alignment_curve_avg_am.png
│ ├── annotated_rise_peak_collapse.png
│ └── concept_alignment_heatmap.png
├── viz/
│ ├── heatmap.py # Commitment heatmap visualization
│ ├── alignment_curves.py # Alignment curve line plots
│ └── annotated_curve.py # Annotated rise-peak-collapse diagram
├── docs/
│ ├── adr/
│ │ ├── 001-framework.md # HF Transformers activation extraction
│ │ ├── 002-model-loading.md # Sequential over parallel loading
│ │ ├── 003-stimuli.md # FLORES-200 over machine translation
│ │ ├── 004-commitment-def.md # Commitment layer definition
│ │ └── 005-storage.md # float16 over float32
│ └── failure_taxonomy.md # 5 failure categories documented
├── model_loader.py # HF Transformers model loading + activation extraction
├── build_stimuli.py # Combines concept probes + FLORES into manifest
├── batch_runner.py # Sequential activation extraction (Modal/local)
├── concept_alignment.py # Primary analysis: alignment curves + commitment
├── failure_classifier.py # Failure taxonomy classification
├── modal_config.py # Modal compute configuration
├── app.py # Gradio interactive demo
├── p1_edge_cases.json # 32 edge cases for language routing test suite
├── REPORT.md # Research report (arXiv-style)
├── PRODUCTION_DELTA.md # Research-to-production gap analysis
├── pyproject.toml # Dependencies (managed by uv)
└── README.md
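model_loader.py's internals are not shown here, but the standard way to capture the residual stream with HF Transformers is a forward hook on each transformer block, downcast to float16 before storage (per ADR 005). A self-contained sketch of that pattern — the `capture_residual_stream` helper is hypothetical, and for Tiny Aya the block list would come from the loaded HF model:

```python
import torch
from torch import nn

def capture_residual_stream(model_layers, hidden: torch.Tensor):
    """Forward `hidden` through a stack of transformer blocks, saving each
    block's output (the residual stream) as float16 on CPU via hooks."""
    captured = []

    def hook(_module, _inputs, output):
        # HF decoder blocks return tuples; plain modules return tensors
        out = output[0] if isinstance(output, tuple) else output
        captured.append(out.detach().to(torch.float16).cpu())

    handles = [layer.register_forward_hook(hook) for layer in model_layers]
    try:
        for layer in model_layers:
            hidden = layer(hidden)
    finally:
        # Always remove hooks so repeated extractions don't stack
        for h in handles:
            h.remove()
    return captured
```

Hooks rather than `output_hidden_states=True` keep peak memory low: each layer's activations are moved to CPU as float16 immediately, which matters when three 3.35B variants are extracted sequentially on 24GB of RAM.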
Research Question
Primary: Does Tiny Aya develop language-agnostic concept representations mid-network, or does it just translate at the final layers? Does regional fine-tuning (Fire, Earth) increase or decrease cross-lingual concept sharing compared to Base?
Secondary: Where does Base outperform Fire/Earth and why? This produces a failure taxonomy for language routing edge cases.
Models
| Variant | Region | Languages Emphasized |
|---|---|---|
| Base | Global | Broad multilingual coverage |
| Fire | South Asia | Hindi, Bengali, Tamil |
| Earth | Sub-Saharan Africa | Swahili, Amharic, Yoruba |
Languages
English (en), Hindi (hi), Bengali (bn), Swahili (sw), Amharic (am), French (fr), Spanish (es), Arabic (ar), Yoruba (yo), Tamil (ta)
Outputs
| Artifact | Description |
|---|---|
| REPORT.md | Full research report with methodology, results, and limitations |
| PRODUCTION_DELTA.md | Research-to-production gap analysis |
| docs/failure_taxonomy.md | 5 failure categories for routing edge cases |
| p1_edge_cases.json | 32 edge cases formatted for the language router test suite |
| results/alignment_curves.json | Cross-lingual alignment data (3 variants x 20 concepts x 9 languages x 37 layers) |
| results/commitment_matrix.csv | Commitment layer per (concept, language, variant) |
| assets/tweet_three_panel.png | Primary visualization: three-panel alignment curve line chart |
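As a consumer-side example, once results/alignment_curves.json is parsed and averaged over concepts, finding each language's peak-alignment layer is a one-liner. The per-language list-of-floats schema assumed here is an illustration, not the file's documented format:

```python
def peak_layer(curves: dict[str, list[float]]) -> dict[str, int]:
    """Return the layer index of maximum alignment per language.

    `curves` maps a language code to per-layer alignment scores
    (one float per layer, already averaged over concepts).
    """
    return {lang: max(range(len(vals)), key=vals.__getitem__)
            for lang, vals in curves.items()}
```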
How to Reproduce
Requirements
- Python 3.11+
- uv package manager
- 24GB RAM minimum (for sequential model loading)
- GPU recommended for batch extraction (Modal A10G or local CUDA/MPS)
Full Pipeline
# Clone and setup
git clone <repo-url>
cd aya-cross-lingual-probes
uv sync
# Run batch extraction (20 min on Modal A10G, 4-6 hrs on CPU)
modal run batch_runner.py
# Run analysis
uv run concept_alignment.py
uv run failure_classifier.py
# Generate visualizations
uv run viz/heatmap.py
uv run viz/alignment_curves.py
# Launch demo
uv run app.py --demo
Compute Requirements
| Stage | Hardware | Time |
|---|---|---|
| Data prep | Any CPU | < 5 min |
| Activation extraction | Modal A10G GPU | ~20 min |
| Activation extraction | Apple Silicon / CPU | 4-6 hrs |
| Analysis | Any CPU | < 10 min |
| Visualization | Any CPU | < 5 min |
Connection to Language Routing
This analysis feeds directly into a language routing system. Specifically:
- p1_edge_cases.json provides test cases for the router's edge case handling
- The failure taxonomy informs routing rules (when to fall back to Base)
- Alignment curve analysis shows that routing adds value only for non-Latin-script languages
- Final-layer collapse means routing decisions should not rely on output embeddings
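Sketched as a routing rule (the language sets and variant names below are illustrative assumptions drawn from the findings above, not the router's actual configuration):

```python
# Hypothetical routing rule: regional variants add value mainly for
# non-Latin-script languages, so Latin-script inputs fall back to Base.
FIRE_LANGS = {"hi", "bn", "ta"}   # South Asia: non-Latin scripts
EARTH_LANGS = {"am", "yo"}        # Amharic (+40%); Yoruba, the Latin-script
                                  # exception noted in the key finding

def route(lang: str) -> str:
    """Pick a model variant from an ISO 639-1 language code."""
    if lang in FIRE_LANGS:
        return "fire"
    if lang in EARTH_LANGS:
        return "earth"
    return "base"  # Latin-script alignment is a tokenizer artifact
```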
License
MIT
Author
Saumil Srivastava
