8 results for “topic:paged-attention”
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching
Implementation of PagedAttention from the vLLM paper - a breakthrough attention algorithm that treats the KV cache like virtual memory. Eliminates memory fragmentation, increases batch sizes, and dramatically improves LLM serving throughput.
High-performance On-Device MoA (Mixture of Agents) Engine in C++. Optimized for CPU inference with RadixCache & PagedAttention. (Tiny-MoA Native)
PagedAttention + Continuous Batching Inference Engine Prototype (Rust): Paged KV Cache Management & Dynamic Scheduling
LangChain integration for Parallel Context-of-Experts Decoding (PCED)
vLLM - High-throughput, memory-efficient LLM inference engine with PagedAttention, continuous batching, CUDA/HIP optimization, quantization (GPTQ/AWQ/INT4/INT8/FP8), tensor/pipeline parallelism, OpenAI-compatible API, multi-GPU/TPU/Neuron support, prefix caching, and multi-LoRA capabilities
🤖 Enhance task management with Tiny MoA, a GPU-free multi-agent system that plans, reasons, and collaborates efficiently in real time.
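Several of the projects above implement the same core idea: the KV cache is split into fixed-size blocks (pages), and each sequence keeps a block table mapping logical positions to physical blocks, much like a virtual-memory page table. A minimal sketch of that bookkeeping, with illustrative names and a made-up block size (this is not vLLM's actual API):

```python
# Sketch of PagedAttention-style KV-cache bookkeeping (illustrative only).
# Blocks are allocated lazily from a shared pool, so no sequence reserves
# memory it has not used yet -- this is what eliminates fragmentation and
# lets more sequences fit in one batch.

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache pool exhausted")
        return self.free.pop()

    def free_block(self, block_id):
        self.free.append(block_id)

class Sequence:
    """Tracks one request's logical-to-physical block table."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def release(self):
        # Return all blocks to the shared pool when the request finishes.
        for b in self.block_table:
            self.allocator.free_block(b)
        self.block_table = []
        self.num_tokens = 0

pool = BlockAllocator(num_blocks=8)
seq = Sequence(pool)
for _ in range(6):              # 6 tokens need ceil(6/4) = 2 blocks
    seq.append_token()
print(len(seq.block_table))     # -> 2
seq.release()
print(len(pool.free))           # -> 8, every block back in the pool
```

The attention kernel then gathers keys/values through the block table at compute time, so physically scattered blocks behave as one logically contiguous cache.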