# LLM Efficiency
This homework explores two techniques for making Large Language Model inference and fine-tuning more efficient, built on top of Karpathy's minGPT.
## Overview

The homework is organized in two parts:
| Part | Instructions | Topic | Key idea |
|---|---|---|---|
| 1 | kv_cache.md | KV Cache | Cache key/value states across decoding steps to avoid redundant computation |
| 2 | lora.md | LoRA | Freeze pre-trained weights and train only low-rank adapters for efficient fine-tuning |
Complete instructions: **homework.pdf**
## Part 1 – KV Cache
During autoregressive inference, a standard transformer re-encodes the full sequence at every step. KV caching stores the key and value projections from previous steps so that only the new token needs to be processed, reducing per-step attention cost from O(T²) to O(T). You will modify minGPT's CausalSelfAttention, Block, and GPT classes to thread a KV cache through the forward pass, implement cached generation, and benchmark the speedup.
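To make the idea concrete, here is a minimal single-head sketch of one cached decoding step. It is illustrative only: the function and cache names are made up, and the homework threads a similar cache through minGPT's `CausalSelfAttention` rather than a standalone helper.

```python
import torch

def cached_attention_step(q, k_new, v_new, cache):
    """One decoding step of single-head attention with a KV cache (sketch).

    q, k_new, v_new: (1, d) projections for the newly generated token.
    cache: dict holding keys/values of all previous tokens (empty at step 0).
    """
    if cache:  # append the new token's K/V to the cached ones
        k = torch.cat([cache["k"], k_new], dim=0)
        v = torch.cat([cache["v"], v_new], dim=0)
    else:
        k, v = k_new, v_new
    cache["k"], cache["v"] = k, v
    # Only the single new query attends over the T cached positions,
    # so the per-step cost is O(T) instead of O(T^2).
    att = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (1, T)
    return att @ v  # (1, d)

# Sanity check: stepping token-by-token with a cache matches full causal
# attention recomputed from scratch for the last position.
torch.manual_seed(0)
T, d = 5, 8
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
cache = {}
for t in range(T):
    out = cached_attention_step(Q[t:t+1], K[t:t+1], V[t:t+1], cache)
full = torch.softmax(Q[-1:] @ K.T / d ** 0.5, dim=-1) @ V
assert torch.allclose(out, full, atol=1e-6)
```

The equivalence check at the end mirrors what `demo_sort_kv.py` verifies at the model level: cached generation must produce exactly the same outputs as the uncached baseline.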
## Part 2 – LoRA
Standard fine-tuning updates all model parameters, which is expensive for large models. LoRA (Low-Rank Adaptation) freezes the pre-trained weights and injects a trainable low-rank decomposition into each target layer. At inference time, the adapter can be merged into the base weights, adding zero overhead. You will implement LoRALinear with merge/de-merge support, integrate it into minGPT's attention layers, and fine-tune the model to generalize to longer sorting sequences.
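A minimal sketch of the idea, assuming a simple merge/unmerge interface (the homework's actual `LoRALinear` signature may differ): the base weight is frozen, only the low-rank factors `A` and `B` train, and `merge` folds the adapter into the base weight so inference adds no overhead.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (sketch).

    Forward: y = x @ (W + (alpha/r) * B @ A).T + b, with W and b frozen.
    """
    def __init__(self, in_features, out_features, r=4, alpha=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # freeze pre-trained weights
        self.base.bias.requires_grad_(False)
        self.scale = alpha / r
        # B starts at zero, so the adapter initially changes nothing
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.merged = False

    def forward(self, x):
        y = self.base(x)
        if not self.merged:  # adapter applied on the side during training
            y = y + self.scale * (x @ self.A.T) @ self.B.T
        return y

    def merge(self):
        """Fold B @ A into the base weight for zero-overhead inference."""
        if not self.merged:
            self.base.weight.data += self.scale * self.B @ self.A
            self.merged = True

    def unmerge(self):
        if self.merged:
            self.base.weight.data -= self.scale * self.B @ self.A
            self.merged = False

# Merged and unmerged paths must agree.
layer = LoRALinear(16, 16)
layer.B.data.normal_()  # give the adapter a nonzero update
x = torch.randn(2, 16)
y_unmerged = layer(x)
layer.merge()
y_merged = layer(x)
assert torch.allclose(y_unmerged, y_merged, atol=1e-5)
```

Zero-initializing `B` is the standard LoRA trick: fine-tuning starts exactly at the pre-trained model and the adapter learns only the delta.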
## Setup

### Requirements
- Python >= 3.9
- uv (recommended package manager)
### Installation

Clone the repo and then run:

```sh
cd llm_efficiency
uv sync --extra devDependencies
```
Installed automatically via `uv sync`:

- **torch** – deep learning framework
- **numpy** – numerical computation
- **transformers** – tokenizer (used by minGPT utilities)
- **pytest** – test suite (dev dependency)
## Running

### Part 1 – KV Cache

```sh
uv run kv_cache/demo_sort_kv.py   # Train on sorting task, verify generate_kv matches generate
uv run kv_cache/benchmark.py      # Benchmark KV cache vs. baseline across model sizes
```

### Part 2 – LoRA

```sh
uv run lora/demo_sort_lora.py     # Pre-train, evaluate distribution shift, LoRA fine-tune
```

## Tests
Run all tests:
```sh
uv run pytest tests/test_kv_cache.py -v   # Part 1
uv run pytest tests/test_lora.py -v       # Part 2
```

Or run the full grading script at once:

```sh
./test_and_submit.sh
```

## Project Structure
```
llm_efficiency/
├── mingpt/                  # Karpathy's minGPT (unmodified)
│   ├── model.py             # GPT model definition
│   ├── trainer.py           # Training loop
│   └── utils.py             # Utilities
├── kv_cache/                # Part 1 – KV Cache
│   ├── kv_cache.md          # Part 1 description
│   ├── kv_cache.py          # KV cache implementation (to complete)
│   ├── demo_sort_kv.py      # Sorting task demo with KV cache
│   └── benchmark.py         # Latency benchmark
├── lora/                    # Part 2 – LoRA
│   ├── lora.md              # Part 2 description
│   ├── lora.py              # LoRALinear implementation (to complete)
│   └── demo_sort_lora.py    # Sorting task with LoRA fine-tuning (to complete)
├── practicals/              # Jupyter notebooks
│   ├── KV_cache_empty.ipynb
│   └── Lora_empty.ipynb
├── tests/                   # Test suite
│   ├── test_kv_cache.py
│   └── test_lora.py
├── test_and_submit.sh       # Full grading script
├── pyproject.toml
└── README.md                # This file
```
## License
Apache 2.0