1,572 results for “topic:llm-inference”
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Find secrets with Gitleaks 🔑
This project shares the technical principles behind large language models, together with hands-on experience (LLM engineering and bringing LLM applications to production).
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI-compatible API endpoints in the cloud (see the client sketch after this list).
Official inference library for Mistral models
OpenVINO™ is an open source toolkit for optimizing and deploying AI inference
High-speed Large Language Model Serving for Local Deployment
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Plano is an AI-native proxy and data plane for agentic apps — with built-in orchestration, safety, observability, and smart LLM routing so you stay focused on your agents' core logic.
Open-source implementation of AlphaEvolve
Superduper: End-to-end framework for building custom AI applications and agents.
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
FlashInfer: Kernel Library for LLM Serving
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Eko (Eko Keeps Operating) - Build Production-ready Agentic Workflows with Natural Language - eko.fellou.ai
Performance-optimized AI inference on your GPUs. Unlock superior throughput by selecting and tuning engines like vLLM or SGLang.
Low-latency AI engine for mobile devices & wearables
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Optimizing inference proxy for LLMs
Sparsity-aware deep learning inference runtime for CPUs
RuVector is a High-Performance, Real-Time, Self-Learning Vector Graph Neural Network and Database built in Rust.
Distributed LLM inference. Connect home devices into a powerful cluster to accelerate LLM inference; more devices mean faster inference.
A portable accelerated SQL query, search, and LLM-inference engine, written in Rust, for data-grounded AI apps and agents.
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Lemonade helps users discover and run local AI apps by serving optimized LLMs right from their own GPUs and NPUs. Join our discord: https://discord.gg/5xXzkMu8Zk
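
Several of the servers above (OpenLLM, the Rust inference server, Lemonade) advertise OpenAI-compatible endpoints, meaning any OpenAI-style HTTP client can drive them. Below is a minimal sketch of such a request, assuming a local server at http://localhost:8000 serving a model named "llama-3"; the port, path prefix, and model name are placeholders for illustration, not values documented by any one project listed here.

```python
import requests

# Assumed local deployment details -- swap in your own server's
# base URL and model identifier.
BASE_URL = "http://localhost:8000/v1"
MODEL = "llama-3"

# POST to the standard OpenAI-style chat completions route.
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Explain paged attention in one sentence."}
        ],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()

# The response mirrors the OpenAI schema: choices -> message -> content.
print(resp.json()["choices"][0]["message"]["content"])
```

The same endpoint also works with the official OpenAI Python SDK by pointing its base_url at the local server, which is the usual sense in which these projects claim drop-in compatibility.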