86 results for “topic:inference-optimization”
High-efficiency floating-point neural network inference operators for mobile, server, and Web
Efficient Deep Learning Systems course materials (HSE, YSDA)
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
The Tensor Algebra SuperOptimizer for Deep Learning
Everything you need to know about LLM inference
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
Batch normalization fusion for PyTorch. This repository is archived and no longer maintained.
Optimize the layer structure of Keras models to reduce computation time
A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
Accelerating Long-Context LLM Inference with Accuracy-Preserving Context Optimization in SGLang, vLLM, llama.cpp, RAG, and Agentic AI.
Blog posts, reading reports, and code examples for AGI/LLM-related knowledge.
Krasis is a hybrid LLM runtime focused on efficiently running larger models on consumer-grade, VRAM-limited hardware
Optimizing Monocular Depth Estimation with TensorRT: Model Conversion, Inference Acceleration, and 3D Reconstruction
Run 70B+ LLMs on a single 4GB GPU — no quantization required.
LightTTS is a lightweight TTS inference framework optimized for CosyVoice2 and CosyVoice3, enabling fast and scalable speech synthesis in Python with support for stream and bistream modes.
Official code of Attention-MoA: Enhancing Mixture-of-Agents via Inter-Agent Semantic Attention and Deep Residual Synthesis
Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low-Rank Adaptation (LoRA), and gain hands-on experience with Predibase’s LoRAX inference server.
Cross-platform, modular neural network inference library, small and efficient
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead without fine-tuning.
Official implementation of "SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching" (COLM 2025). A novel KV cache compression method that organizes cache at sentence level using semantic similarity.
Code for paper "EdgeKE: An On-Demand Deep Learning IoT System for Cognitive Big Data on Industrial Edge Devices"
A template for getting started writing code using GGML
Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.
Faster YOLOv8 inference: Optimize and export YOLOv8 models for faster inference using OpenVINO and NumPy 🔢
LLM-Rank: A graph-theoretic approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the accompanying paper.
TensorRT in Practice: Model Conversion, Extension, and Advanced Inference Optimization
Quantitative framework for measuring how conditioning effectiveness varies with noise level in diffusion model inference (SD 1.5 & SDXL)
Multimodal-OCR3 is an advanced Optical Character Recognition (OCR) application that leverages multiple state-of-the-art multimodal models to extract text from images.