
dataflowr/llm_efficiency

KV Cache & LoRA for minGPT

LLM Efficiency

This homework explores two techniques for making Large Language Model inference and fine-tuning more efficient, built on top of Karpathy's minGPT.

Overview

The homework is organized in two parts:

  • Part 1 (instructions: kv_cache.md) - KV Cache. Key idea: cache key/value states across decoding steps to avoid redundant computation.
  • Part 2 (instructions: lora.md) - LoRA. Key idea: freeze pre-trained weights and train only low-rank adapters for efficient fine-tuning.

📄 Complete instructions: homework.pdf

Part 1 - KV Cache

During autoregressive inference, a standard transformer re-encodes the full sequence at every step. KV caching stores the key and value projections from previous steps so that only the new token needs to be processed, reducing per-step attention cost from O(T²) to O(T). You will modify minGPT's CausalSelfAttention, Block, and GPT classes to thread a KV cache through the forward pass, implement cached generation, and benchmark the speedup.
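To make the idea concrete, here is a minimal numpy sketch of single-head attention with a KV cache, checked against full causal recomputation. This is illustrative only: the class and function names are hypothetical, and minGPT's real CausalSelfAttention is multi-head, batched, and written in PyTorch, but the caching logic is the same.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CachedAttention:
    """Hypothetical single-head attention that caches K/V across decode steps."""
    def __init__(self, d, rng):
        self.d = d
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        self.k_cache, self.v_cache = [], []   # one (d,) entry per past token

    def step(self, x):
        # Process ONE new token embedding x of shape (d,): O(T) work per step.
        q = x @ self.Wq
        self.k_cache.append(x @ self.Wk)      # only the new token's K/V are computed
        self.v_cache.append(x @ self.Wv)
        K = np.stack(self.k_cache)            # (T, d): keys of all tokens so far
        V = np.stack(self.v_cache)            # (T, d)
        att = softmax(q @ K.T / np.sqrt(self.d))
        return att @ V                        # attention output for the new token

def full_causal_attention(xs, m):
    # Reference path: re-encode the whole sequence with a causal mask.
    Q, K, V = xs @ m.Wq, xs @ m.Wk, xs @ m.Wv
    T = xs.shape[0]
    scores = Q @ K.T / np.sqrt(m.d)
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    return softmax(scores) @ V

rng = np.random.default_rng(0)
m = CachedAttention(d=8, rng=rng)
xs = rng.standard_normal((5, 8))
cached = np.stack([m.step(x) for x in xs])    # incremental decoding with cache
full = full_causal_attention(xs, m)           # full recompute at the last step
assert np.allclose(cached, full)              # same outputs, less work per step
```

The final assertion is the same exactness check the demo script performs between generate_kv and generate: caching changes the cost, not the result.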

Part 2 - LoRA

Standard fine-tuning updates all model parameters, which is expensive for large models. LoRA (Low-Rank Adaptation) freezes the pre-trained weights and injects a trainable low-rank decomposition into each target layer. At inference time, the adapter can be merged into the base weights, adding zero overhead. You will implement LoRALinear with merge/de-merge support, integrate it into minGPT's attention layers, and fine-tune the model to generalize to longer sorting sequences.

Setup

Requirements

  • Python >= 3.9
  • uv (recommended package manager)

Installation

Clone the repo and then run:

cd llm_efficiency
uv sync --extra dev

Dependencies

Installed automatically via uv sync:

  • torch - deep learning framework
  • numpy - numerical computation
  • transformers - tokenizer (used by minGPT utilities)
  • pytest - test suite (dev dependency)

Running

Part 1 - KV Cache

uv run kv_cache/demo_sort_kv.py        # Train on sorting task, verify generate_kv matches generate
uv run kv_cache/benchmark.py           # Benchmark KV cache vs. baseline across model sizes

Part 2 - LoRA

uv run lora/demo_sort_lora.py          # Pre-train, evaluate distribution shift, LoRA fine-tune

Tests

Run all tests:

uv run pytest tests/test_kv_cache.py -v    # Part 1
uv run pytest tests/test_lora.py -v        # Part 2

Or run the full grading script at once:

./test_and_submit.sh

Project Structure

llm_efficiency/
├── mingpt/                        # Karpathy's minGPT (unmodified)
│   ├── model.py                   # GPT model definition
│   ├── trainer.py                 # Training loop
│   └── utils.py                   # Utilities
├── kv_cache/                      # Part 1 - KV Cache
│   ├── kv_cache.md                # Part 1 description
│   ├── kv_cache.py                # KV cache implementation (to complete)
│   ├── demo_sort_kv.py            # Sorting task demo with KV cache
│   └── benchmark.py               # Latency benchmark
├── lora/                          # Part 2 - LoRA
│   ├── lora.md                    # Part 2 description
│   ├── lora.py                    # LoRALinear implementation (to complete)
│   └── demo_sort_lora.py          # Sorting task with LoRA fine-tuning (to complete)
├── practicals/                    # Jupyter notebooks
│   ├── KV_cache_empty.ipynb
│   └── Lora_empty.ipynb
├── tests/                         # Test suite
│   ├── test_kv_cache.py
│   └── test_lora.py
├── test_and_submit.sh             # Full grading script
├── pyproject.toml
└── README.md                      # This file

License

Apache 2.0

Created March 2, 2026
Updated March 9, 2026