35 results for “topic:inference-acceleration”
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized Attention achieves a 2–5× speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
[ICML 2025] SpargeAttention: A training-free sparse attention mechanism that accelerates inference for any model.
High-performance distributed multi-tier cache system. Built in Rust.
Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
Discrete Diffusion Forcing (D2F): dLLMs Can Do Faster-Than-AR Inference
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud, or on AI hardware.
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
KsanaDiT: High-Performance DiT (Diffusion Transformer) Inference Framework for Video & Image Generation
The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts" (EMNLP 2023).
[ICLR 2026] Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding
Implementation of ICCV 2025 paper "Growing a Twig to Accelerate Large Vision-Language Models".
DepthStream Accelerator: A TensorRT-optimized monocular depth estimation tool with ROS2 integration for C++. It offers high-speed, accurate depth perception, perfect for real-time applications in robotics, autonomous vehicles, and interactive 3D environments.
A mixed-precision GEMM with a quantize-and-reorder kernel.
The official implementation of the paper "Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts" (ICLR 2026).
The official repo for “Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models”.
Code for paper "TLEE: Temporal-wise and Layer-wise Early Exiting Network for Efficient Video Recognition on Edge Devices"
AURA: Augmented Representation for Unified Accuracy-aware Quantization
Official Repo for WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching
Code for paper "Joint Adaptive Resolution Selection and Conditional Early Exiting for Efficient Video Recognition on Edge Devices"
Convert and run scikit-learn MLPs on Rockchip NPU.
Code for paper "Deep Reinforcement Learning based Multi-task Automated Channel Pruning for DNNs"
40× faster AI inference: ONNX-to-TensorRT optimization with FP16/INT8 quantization, multi-GPU support, and deployment.
MAMBO-G: A training-free, magnitude-aware adaptive guidance framework for accelerating Classifier-Free Guidance (CFG). Dynamically mitigates early-step overshoot in flow-matching models (SD3.5, Qwen-Image, Wan2.1) to achieve a 3–4× speedup with high visual fidelity. Integrated into Hugging Face Diffusers.
Intelligent layer pruning toolkit for LLMs featuring iterative optimization, self-healing algorithms, and comprehensive benchmarking.
Reproduction of FastV (arXiv:2403.06764) on LLaVA-1.5-7B with INT4 quantization. 22% speedup via visual token pruning on 8GB VRAM.
Project for the "Symbolic and Evolutionary Artificial Intelligence" course at the University of Pisa.