86 results for “topic:inference-optimization”
High-efficiency floating-point neural network inference operators for mobile, server, and Web
Efficient Deep Learning Systems course materials (HSE, YSDA)
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
The Tensor Algebra SuperOptimizer for Deep Learning
Everything you need to know about LLM inference
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Batch normalization fusion for PyTorch. This repository is archived and no longer maintained.
Optimize the layer structure of Keras models to reduce computation time
A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
Accelerating Long-Context LLM Inference with Accuracy-Preserving Context Optimization in SGLang, vLLM, llama.cpp, RAG, and Agentic AI.
Blog posts, reading reports, and code examples for AGI/LLM-related knowledge.
Krasis is a hybrid LLM runtime focused on efficiently running larger models on consumer-grade, VRAM-limited hardware
Optimizing Monocular Depth Estimation with TensorRT: Model Conversion, Inference Acceleration, and 3D Reconstruction
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
LightTTS is a lightweight TTS inference framework optimized for CosyVoice2 and CosyVoice3, enabling fast and scalable speech synthesis in Python with support for stream and bistream modes.
Official code of Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis
Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low-Rank Adaptation (LoRA), and gain hands-on experience with Predibase’s LoRAX inference server.
Cross-platform, modular neural network inference library, small and efficient
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead without fine-tuning.
Official implementation of "SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching" (COLM 2025). A novel KV cache compression method that organizes cache at sentence level using semantic similarity.
Code for paper "EdgeKE: An On-Demand Deep Learning IoT System for Cognitive Big Data on Industrial Edge Devices"
A template for getting started writing code using GGML
Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.
Faster YOLOv8 inference: Optimize and export YOLOv8 models for faster inference using OpenVINO and NumPy 🔢
LLM-Rank: A graph-theoretic approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the accompanying paper.
TensorRT in Practice: Model Conversion, Extension, and Advanced Inference Optimization
Quantitative framework for measuring how conditioning effectiveness varies with noise level in diffusion model inference (SD 1.5 & SDXL)
Multimodal-OCR3 is an advanced Optical Character Recognition (OCR) application that leverages multiple state-of-the-art multimodal models to extract text from images.