MrAta/nano-vllm

Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

  • 🚀 Fast offline inference - Comparable inference speeds to vLLM
  • 📖 Readable codebase - Clean implementation under 1,200 lines of Python code
  • ⚡ Optimization Suite - Prefix caching, Torch compilation, CUDA graph, etc.

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Quick Start

See example.py for usage. The API mirrors vLLM's interface with minor differences in the LLM.generate method.
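As a rough sketch of what such a call might look like, assuming the package is importable as `nanovllm` and exposes `LLM` and `SamplingParams` the way vLLM does under `vllm`: the module name, the constructor arguments, the sampling values, and the helper `run_quick_start` are all assumptions made for illustration, so example.py remains the authoritative reference.

```python
def run_quick_start(model_path, prompts):
    """Generate completions for `prompts` (hypothetical helper, not part of the project)."""
    # Assumption: import path and class names mirror vLLM's.
    # Imported lazily so this function can be defined without the package installed.
    from nanovllm import LLM, SamplingParams

    llm = LLM(model_path)
    sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

    # The README says LLM.generate differs slightly from vLLM's, so the
    # return value is passed back untouched; inspect it before relying on its shape.
    return llm.generate(prompts, sampling_params)
```

Calling `run_quick_start("/YOUR/MODEL/PATH", ["Hello, Nano-vLLM."])` would need a CUDA-capable GPU and locally available model weights.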

Benchmark

See bench.py for the benchmark script.

Test Configuration:

  • Hardware: RTX 4070
  • Model: Qwen3-0.6B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100–1024 tokens
  • Output Length: Randomly sampled between 100–1024 tokens
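A workload of this shape could be drawn as in the sketch below. It is not taken from bench.py: the function name `sample_workload`, the fixed seed, the uniform distribution, and the inclusive bounds are assumptions made for illustration.

```python
import random

NUM_SEQS = 256               # total requests in this configuration
MIN_LEN, MAX_LEN = 100, 1024  # length bounds, assumed inclusive


def sample_workload(num_seqs=NUM_SEQS, seed=0):
    """Return per-request (input_lens, output_lens), sampled uniformly."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    input_lens = [rng.randint(MIN_LEN, MAX_LEN) for _ in range(num_seqs)]
    output_lens = [rng.randint(MIN_LEN, MAX_LEN) for _ in range(num_seqs)]
    return input_lens, output_lens


input_lens, output_lens = sample_workload()
print(len(input_lens), "requests,", sum(output_lens), "output tokens requested")
```

If every request runs to its sampled output length, the sum of `output_lens` is the total number of output tokens the engines have to produce, which is why both engines can be compared on an identical token count.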

Performance Results:

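| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|------------------|---------------|----------|-----------------------|
| vLLM             | 133,966       | 98.95    | 1353.86               |
| Nano-vLLM        | 133,966       | 101.90   | 1314.65               |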

Test Configuration:

  • Hardware: H800
  • Model: Qwen3-8B
  • Total Requests: 1024 sequences
  • Input Length: Randomly sampled between 100–1024 tokens
  • Output Length: Randomly sampled between 100–1024 tokens

Performance Results:

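| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|------------------|---------------|----------|-----------------------|
| vLLM             | 583,802       | 98.67    | 5916.89               |
| Nano-vLLM        | 583,802       | 86.73    | 6731.42               |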
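The throughput column in both result sets is output tokens divided by wall-clock time. The sketch below recomputes it from the reported figures and derives the Nano-vLLM to vLLM ratio for each setup; the helper names `throughput` and `relative_speed` are illustrative and do not come from bench.py.

```python
# Rows: (engine, output tokens, time in seconds, reported throughput in tokens/s)
RESULTS = {
    "RTX 4070 / Qwen3-0.6B": [
        ("vLLM", 133_966, 98.95, 1353.86),
        ("Nano-vLLM", 133_966, 101.90, 1314.65),
    ],
    "H800 / Qwen3-8B": [
        ("vLLM", 583_802, 98.67, 5916.89),
        ("Nano-vLLM", 583_802, 86.73, 6731.42),
    ],
}


def throughput(tokens, seconds):
    """Output tokens generated per second of wall-clock time."""
    return tokens / seconds


def relative_speed(rows):
    """Ratio of Nano-vLLM's reported throughput to vLLM's."""
    reported = {engine: tput for engine, _, _, tput in rows}
    return reported["Nano-vLLM"] / reported["vLLM"]


for setup, rows in RESULTS.items():
    for engine, tokens, seconds, reported in rows:
        recomputed = throughput(tokens, seconds)
        print(f"{setup} | {engine}: {recomputed:.2f} tokens/s (reported {reported})")
    print(f"{setup} | Nano-vLLM / vLLM = {relative_speed(rows):.3f}")
```

The recomputed values match the reported ones to within a fraction of a token per second, which is consistent with the times being rounded to two decimals. By the reported throughput, Nano-vLLM is roughly 3% slower than vLLM on the RTX 4070 run and roughly 14% faster on the H800 run.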

Languages

Python 100.0%

MIT License
Created June 12, 2025
Updated June 12, 2025