MrAta/nano-vllm

Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

🚀 Fast offline inference - Comparable inference speeds to vLLM
📖 Readable codebase - Clean implementation under 1,200 lines of Python code
⚡ Optimization Suite - Prefix caching, Torch compilation, CUDA graph, etc.

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method.
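As a rough illustration of the vLLM-style flow described above, here is a minimal sketch. It assumes the package is importable as `nanovllm`, that it exposes `LLM` and `SamplingParams`, and that `LLM.generate` returns one dict per prompt with a `"text"` key; none of these details are confirmed by this README, so treat example.py as the authoritative reference. The helper name `generate_texts`, the sampling values, and the placeholder model path are hypothetical.

```python
def generate_texts(model_path, prompts, temperature=0.6, max_tokens=256):
    """Run offline generation and return one completion string per prompt.

    Assumption: nanovllm exposes LLM and SamplingParams with a vLLM-like
    interface. Check example.py for the actual names and arguments.
    """
    # Imported lazily so this sketch can be loaded without nanovllm installed.
    from nanovllm import LLM, SamplingParams

    llm = LLM(model_path)
    params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)

    # Assumption: each output is a dict carrying the completion under "text".
    return [output["text"] for output in outputs]


# Hypothetical usage (requires a GPU and a local model directory):
# texts = generate_texts("/YOUR/MODEL/PATH", ["Hello, Nano-vLLM."])
# print(texts[0])
```

If the output format differs from this assumption, only the final list comprehension should need to change.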

Benchmark

See bench.py for the benchmark script.

Test Configuration:

Hardware: RTX 4070
Model: Qwen3-0.6B
Total Requests: 256 sequences
Input Length: Randomly sampled between 100–1024 tokens
Output Length: Randomly sampled between 100–1024 tokens
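A workload matching the configuration above could be generated along these lines. This is a sketch under stated assumptions, not the contents of bench.py: the function name `make_workload`, the fixed seed, the vocabulary bound of 10,000 token ids, and the choice of uniform sampling are all hypothetical.

```python
import random


def make_workload(num_seqs, min_len=100, max_len=1024, vocab_size=10_000, seed=0):
    """Build random prompts (as token-id lists) and per-request output lengths."""
    rng = random.Random(seed)  # assumed seed, for reproducible runs

    # Each prompt has a length drawn uniformly from [min_len, max_len]
    # and is filled with random token ids below vocab_size (assumed bound).
    prompts = [
        [rng.randrange(vocab_size) for _ in range(rng.randint(min_len, max_len))]
        for _ in range(num_seqs)
    ]

    # Each request gets its own output-length cap from the same range.
    output_lens = [rng.randint(min_len, max_len) for _ in range(num_seqs)]
    return prompts, output_lens


# 256 requests, as in the RTX 4070 configuration.
prompts, output_lens = make_workload(256)
```

Fixing the seed keeps the request mix identical across engines, which is what makes the output token counts directly comparable.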

Performance Results:

Inference Engine	Output Tokens	Time (s)	Throughput (tokens/s)
vLLM	133,966	98.95	1353.86
Nano-vLLM	133,966	101.90	1314.65

Test Configuration:

Hardware: H800
Model: Qwen3-8B
Total Requests: 1024 sequences
Input Length: Randomly sampled between 100–1024 tokens
Output Length: Randomly sampled between 100–1024 tokens

Performance Results:

Inference Engine	Output Tokens	Time (s)	Throughput (tokens/s)
vLLM	583,802	98.67	5916.89
Nano-vLLM	583,802	86.73	6731.42
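The throughput column in both tables is consistent with output tokens divided by elapsed time, up to rounding of the reported times. The sketch below recomputes it from the table values and derives Nano-vLLM's throughput relative to vLLM; the helper names are illustrative only.

```python
# (output tokens, time in seconds), copied from the two results tables.
RESULTS = {
    "RTX 4070 / Qwen3-0.6B": {
        "vLLM": (133_966, 98.95),
        "Nano-vLLM": (133_966, 101.90),
    },
    "H800 / Qwen3-8B": {
        "vLLM": (583_802, 98.67),
        "Nano-vLLM": (583_802, 86.73),
    },
}


def throughput(tokens, seconds):
    """Output tokens generated per second."""
    return tokens / seconds


def relative_throughput(setup):
    """Nano-vLLM throughput divided by vLLM throughput for one setup."""
    engines = RESULTS[setup]
    return throughput(*engines["Nano-vLLM"]) / throughput(*engines["vLLM"])


for setup in RESULTS:
    print(f"{setup}: Nano-vLLM at {relative_throughput(setup):.3f}x vLLM throughput")
```

By this measure Nano-vLLM is roughly 3% below vLLM on the RTX 4070 run and roughly 14% above it on the H800 run.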

Languages

Python 100.0%

MIT License

Created June 12, 2025

Updated June 12, 2025