89 results for “topic:kv-cache”
Supercharge Your LLM with the Fastest KV Cache Layer
A Redis server and distributed cluster implemented in Go
Unified KV Cache Compression Methods for Auto-Regressive Models
LLM KV cache compression made easy
LLM notes covering model inference, Transformer model structure, and LLM framework code analysis.
Implement Llama 3 inference step by step: grasp the core concepts, follow the derivation, and write the code.
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (NeurIPS 2025)
[NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.
Run larger LLMs with longer contexts on Apple Silicon by using differentiated precision for KV cache quantization. KVSplit enables 8-bit keys & 4-bit values, reducing memory by 59% with <1% quality loss. Includes benchmarking, visualization, and one-command setup. Optimized for M1/M2/M3 Macs with Metal support.
[ICLR'26] The official code implementation for "Cache-to-Cache: Direct Semantic Communication Between Large Language Models"
[Survey] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
HierarchicalKV is a part of NVIDIA Merlin and provides hierarchical key-value storage to meet RecSys requirements. The key capability of HierarchicalKV is to store key-value feature-embeddings on high-bandwidth memory (HBM) of GPUs and in host memory. It also can be used as a generic key-value storage.
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
[NeurIPS'25] KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems
📚 A curated list of Awesome Efficient dLLMs Papers with Codes
Alibaba Cloud's high-performance KVCache system for LLM inference, with components for global cache management, inference simulation (HiSim), and more.
Completion After Prompt Probability: make your LLM make a choice.
An implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT) variant. The implementation focuses on the model architecture and the inference process; the code is restructured and heavily commented to make the key parts of the architecture easy to understand.
Notes about LLaMA 2 model
Easy control for Key-Value Constrained Generative LLM Inference (https://arxiv.org/abs/2402.06262)
KV Cache & LoRA for minGPT
PiKV: KV Cache Management System for Mixture of Experts [Efficient ML System]
KV Cache Steering for Inducing Reasoning in Small Language Models
A high-performance RDMA distributed file system for fast LLM Inference and GPU Training
High-Performance KV Cache Storage Engine on CXL Shared Memory for LLM Inference
Cross-GPU KV Cache Marketplace
Implementation of Flash-DLM (paper: FlashDLM: Accelerating Diffusion Language Models via Efficient KV Caching and Guided Diffusion). Provides training-free methods to accelerate diffusion language model inference.
MemCloud is a distributed in-memory data store written in Rust. It allows nodes (such as macOS, Windows and Linux machines) on a local network to pool their RAM, creating a shared, ephemeral storage cloud.