Nano-vLLM
A lightweight vLLM implementation built from scratch.
Key Features
- ๐ Fase offline inference - Comparable inference speeds to vLLM
- ๐ Readable codebase - Clean implementation under 1,200 lines of Python code
- โก Optimization Suite - Prefix caching, Torch compilation, CUDA graph, etc
Installation
pip install git+https://github.com/GeeeekExplorer/nano-vllm.gitQuick Start
See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method.
Benchmark
See bench.py for benchmark.
Test Configuration:
- Hardware: RTX 4070
- Model: Qwen3-0.6B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100โ1024 tokens
- Output Length: Randomly sampled between 100โ1024 tokens
Performance Results:
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM | 133,966 | 98.95 | 1353.86 |
| Nano-vLLM | 133,966 | 101.90 | 1314.65 |
Test Configuration:
- Hardware: H800
- Model: Qwen3-8B
- Total Requests: 1024 sequences
- Input Length: Randomly sampled between 100โ1024 tokens
- Output Length: Randomly sampled between 100โ1024 tokens
Performance Results:
| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM | 583,802 | 98.67 | 5916.89 |
| Nano-vLLM | 583,802 | 86.73 | 6731.42 |