
s4um1l/aya-cross-lingual-probe

Mechanistic interpretability of cross-lingual concept representations in Tiny Aya — rise, peak, collapse.

Inside Tiny Aya: Cross-Lingual Concept Representations in Multilingual Variants

Mechanistic interpretability analysis of how regional fine-tuning affects cross-lingual concept representations in Tiny Aya (3.35B) model variants.

Key Finding

Tiny Aya builds shared cross-lingual concept representations mid-network (layers 18-20), then collapses them at the output layers. All variants follow this rise-peak-collapse trajectory, but regional fine-tuning (Fire, Earth) determines how much alignment the model builds: +15% for Hindi, +40% for Amharic. For most Latin-script languages (French, Spanish, Swahili), alignment is near-maximal already at the embedding layer: a tokenizer artifact, not a learned capability. Yoruba is a notable exception, suggesting that a shared script alone does not guarantee embedding-level alignment.

[Figure: three-panel alignment curves for Swahili (Latin-script overlap), Hindi (Fire/Earth +15%), and Amharic (Earth > Fire)]

Cross-lingual alignment curves for three languages. Swahili (Latin script) shows 1.0 from layer 0 (tokenizer artifact). Hindi shows Fire/Earth building ~15% more alignment than Base. Amharic shows the largest gains (+40%), with Earth slightly ahead of Fire.
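The curves above are per-layer cosine-similarity profiles over the residual stream. As a rough sketch of the computation (this is not the project's concept_alignment.py, whose pooling and pairing details may differ; shapes and the mean-pooling choice here are assumptions), alignment between English and a target language can be computed per layer like this:

```python
import numpy as np

def alignment_curve(acts_en, acts_xx):
    """Per-layer cross-lingual alignment: mean cosine similarity between
    English and target-language activations for matched stimuli.
    Both inputs are assumed to have shape (n_layers, n_stimuli, d_model)."""
    # Normalize along the hidden dimension, then take the dot product of
    # paired vectors, which yields cosine similarity per stimulus.
    a = acts_en / np.linalg.norm(acts_en, axis=-1, keepdims=True)
    b = acts_xx / np.linalg.norm(acts_xx, axis=-1, keepdims=True)
    # Average over stimuli, leaving one alignment value per layer.
    return (a * b).sum(axis=-1).mean(axis=-1)  # shape: (n_layers,)
```

A curve of 1.0 at layer 0 (as for Swahili) then simply means the two languages' embedding-layer activations are already nearly identical.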

Quick Start

# 1. Install dependencies
uv sync

# 2. Build stimulus manifest (combines concept probes + FLORES-200)
uv run build_stimuli.py

# 3. Run activation extraction (requires GPU — use Modal for A10G)
modal run batch_runner.py
# OR for local CPU/MPS execution (slower):
uv run batch_runner.py --local

# 4. Compute cross-lingual alignment curves and commitment matrix
uv run concept_alignment.py

# 5. Classify failure cases
uv run failure_classifier.py

# 6. Generate visualizations
uv run viz/heatmap.py
uv run viz/alignment_curves.py

# 7. Launch interactive demo (demo mode — no GPU required)
uv run app.py --demo

Project Structure

aya-cross-lingual-probes/
├── data/
│   ├── concept_probes.json      # 20 medical concepts x 10 languages (hand-verified)
│   └── stimulus_manifest.json   # Combined stimulus set (generated by build_stimuli.py)
├── activations/                 # Residual stream activations (float16, not in git)
│   ├── base/
│   ├── fire/
│   └── earth/
├── results/
│   ├── alignment_curves.json    # Cosine similarity by layer (primary output)
│   ├── commitment_matrix.csv    # Commitment layer per (concept, language, variant)
│   └── failure_cases.csv        # Failure taxonomy instances
├── assets/
│   ├── tweet_three_panel.png    # Primary visualization (three-panel line chart)
│   ├── alignment_curve_avg_hi.png
│   ├── alignment_curve_avg_am.png
│   ├── annotated_rise_peak_collapse.png
│   └── concept_alignment_heatmap.png
├── viz/
│   ├── heatmap.py               # Commitment heatmap visualization
│   ├── alignment_curves.py      # Alignment curve line plots
│   └── annotated_curve.py       # Annotated rise-peak-collapse diagram
├── docs/
│   ├── adr/
│   │   ├── 001-framework.md     # HF Transformers activation extraction
│   │   ├── 002-model-loading.md # Sequential over parallel loading
│   │   ├── 003-stimuli.md       # FLORES-200 over machine translation
│   │   ├── 004-commitment-def.md # Commitment layer definition
│   │   └── 005-storage.md       # float16 over float32
│   └── failure_taxonomy.md      # 5 failure categories documented
├── model_loader.py              # HF Transformers model loading + activation extraction
├── build_stimuli.py             # Combines concept probes + FLORES into manifest
├── batch_runner.py              # Sequential activation extraction (Modal/local)
├── concept_alignment.py         # Primary analysis: alignment curves + commitment
├── failure_classifier.py        # Failure taxonomy classification
├── modal_config.py              # Modal compute configuration
├── app.py                       # Gradio interactive demo
├── p1_edge_cases.json           # 32 edge cases for language routing test suite
├── REPORT.md                    # Research report (arXiv-style)
├── PRODUCTION_DELTA.md          # Research-to-production gap analysis
├── pyproject.toml               # Dependencies (managed by uv)
└── README.md

Research Question

Primary: Does Tiny Aya develop language-agnostic concept representations mid-network, or does it just translate at the final layers? Does regional fine-tuning (Fire, Earth) increase or decrease cross-lingual concept sharing compared to Base?

Secondary: Where does Base outperform Fire/Earth and why? This produces a failure taxonomy for language routing edge cases.

Models

Variant  Region              Languages Emphasized
Base     Global              Broad multilingual coverage
Fire     South Asia          Hindi, Bengali, Tamil
Earth    Sub-Saharan Africa  Swahili, Amharic, Yoruba

Languages

English (en), Hindi (hi), Bengali (bn), Swahili (sw), Amharic (am), French (fr), Spanish (es), Arabic (ar), Yoruba (yo), Tamil (ta)

Outputs

Artifact                       Description
REPORT.md                      Full research report with methodology, results, and limitations
PRODUCTION_DELTA.md            Research-to-production gap analysis
docs/failure_taxonomy.md       5 failure categories for routing edge cases
p1_edge_cases.json             32 edge cases formatted for the language router test suite
results/alignment_curves.json  Cross-lingual alignment data (3 variants x 20 concepts x 9 languages x 37 layers)
results/commitment_matrix.csv  Commitment layer per (concept, language, variant)
assets/tweet_three_panel.png   Primary visualization: three-panel alignment curve line chart
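The commitment layer recorded in results/commitment_matrix.csv is defined in docs/adr/004-commitment-def.md. As a purely illustrative stand-in (the 0.5 threshold and the first-crossing rule below are assumptions for the sketch, not the ADR's actual definition), one simple rule is the first layer whose alignment reaches a threshold:

```python
def commitment_layer(curve, threshold=0.5):
    """Illustrative commitment rule: first layer whose alignment value
    meets `threshold`; returns None if the curve never reaches it."""
    for layer, value in enumerate(curve):
        if value >= threshold:
            return layer
    return None
```

Under this rule, a language whose alignment curve never rises (or rises only after the collapse) simply has no commitment layer, which is one way a failure case can show up in the matrix.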

How to Reproduce

Requirements

  • Python 3.11+
  • uv package manager
  • 24GB RAM minimum (for sequential model loading)
  • GPU recommended for batch extraction (Modal A10G or local CUDA/MPS)

Full Pipeline

# Clone and setup
git clone <repo-url>
cd aya-cross-lingual-probes
uv sync

# Run batch extraction (20 min on Modal A10G, 4-6 hrs on CPU)
modal run batch_runner.py

# Run analysis
uv run concept_alignment.py
uv run failure_classifier.py

# Generate visualizations
uv run viz/heatmap.py
uv run viz/alignment_curves.py

# Launch demo
uv run app.py --demo

Compute Requirements

Stage                  Hardware             Time
Data prep              Any CPU              < 5 min
Activation extraction  Modal A10G GPU       ~20 min
Activation extraction  Apple Silicon / CPU  4-6 hrs
Analysis               Any CPU              < 10 min
Visualization          Any CPU              < 5 min

Connection to Language Routing

This analysis feeds directly into a language routing system. Specifically:

  • p1_edge_cases.json provides test cases for the router's edge case handling
  • The failure taxonomy informs routing rules (when to fall back to Base)
  • Alignment curve analysis shows that routing adds value only for non-Latin-script languages
  • Final-layer collapse means routing decisions should not rely on output embeddings
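The routing implications above can be sketched as a hypothetical fallback rule. The language sets and variant names mirror the tables in this README, but the `route` function and its logic are illustrative, not the actual router:

```python
# Regional variants by emphasized language, per the Models table.
REGION_VARIANT = {
    "hi": "fire", "bn": "fire", "ta": "fire",
    "sw": "earth", "am": "earth", "yo": "earth",
}
# Latin-script languages where embedding-layer alignment is already
# near-maximal, so routing adds little. Yoruba is deliberately absent:
# it is the Latin-script exception noted in the Key Finding.
LATIN_SCRIPT = {"en", "fr", "es", "sw"}

def route(lang):
    """Hypothetical rule: fall back to Base for Latin-script languages,
    route other languages to their regional variant if one exists."""
    if lang in LATIN_SCRIPT:
        return "base"
    return REGION_VARIANT.get(lang, "base")
```

Languages outside both sets (e.g. Arabic here) default to Base, matching the taxonomy's "fall back to Base" guidance.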

License

MIT

Author

Saumil Srivastava
