35 results for “topic:inference-acceleration”
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized Attention achieves a 2–5× speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
[ICML 2025] SpargeAttention: A training-free sparse attention mechanism that accelerates inference for any model.
High-performance distributed multi-tier cache system. Built in Rust.
Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
Discrete Diffusion Forcing (D2F): dLLMs Can Do Faster-Than-AR Inference
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud, or on AI hardware.
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
KsanaDiT: High-Performance DiT (Diffusion Transformer) Inference Framework for Video & Image Generation
The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts" (EMNLP 2023).
[ICLR 2026] Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding
Implementation of ICCV 2025 paper "Growing a Twig to Accelerate Large Vision-Language Models".
DepthStream Accelerator: A TensorRT-optimized monocular depth estimation tool with ROS2 integration for C++. It offers high-speed, accurate depth perception, perfect for real-time applications in robotics, autonomous vehicles, and interactive 3D environments.
A mixed-precision GEMM with a quantize-and-reorder kernel.
The official implementation of the paper "Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts" (ICLR 2026).
The official repo for “Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models”.
Code for paper "TLEE: Temporal-wise and Layer-wise Early Exiting Network for Efficient Video Recognition on Edge Devices"
AURA: Augmented Representation for Unified Accuracy-aware Quantization
Official Repo for WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching
Code for paper "Joint Adaptive Resolution Selection and Conditional Early Exiting for Efficient Video Recognition on Edge Devices"
Convert and run scikit-learn MLPs on Rockchip NPU.
Code for paper "Deep Reinforcement Learning based Multi-task Automated Channel Pruning for DNNs"
40× faster AI inference: ONNX-to-TensorRT optimization with FP16/INT8 quantization, multi-GPU support, and deployment.
MAMBO-G: A training-free, magnitude-aware adaptive guidance framework for accelerating Classifier-Free Guidance (CFG). Dynamically mitigates early-step overshoot in flow-matching models (SD3.5, Qwen-Image, Wan2.1) to achieve a 3–4× speedup with high visual fidelity. Integrated into Hugging Face Diffusers.
Intelligent layer pruning toolkit for LLMs featuring iterative optimization, self-healing algorithms, and comprehensive benchmarking.
Reproduction of FastV (arXiv:2403.06764) on LLaVA-1.5-7B with INT4 quantization. 22% speedup via visual token pruning on 8GB VRAM.
Project for the "Symbolic and Evolutionary Artificial Intelligence" course at the University of Pisa.