1,572 results for “topic:llm-inference”
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Find secrets with Gitleaks 🔑
This project shares the technical principles behind large language models, together with hands-on experience (LLM engineering and bringing LLM applications to production).
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI-compatible API endpoints in the cloud (see the client sketch after this list).
Official inference library for Mistral models
OpenVINO™ is an open source toolkit for optimizing and deploying AI inference
High-speed Large Language Model Serving for Local Deployment
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Plano is an AI-native proxy and data plane for agentic apps — with built-in orchestration, safety, observability, and smart LLM routing so you stay focused on your agents' core logic.
Open-source implementation of AlphaEvolve
Superduper: End-to-end framework for building custom AI applications and agents.
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
FlashInfer: Kernel Library for LLM Serving
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Eko (Eko Keeps Operating) - Build Production-ready Agentic Workflows with Natural Language - eko.fellou.ai
Performance-optimized AI inference on your GPUs. Unlock superior throughput by selecting and tuning engines like vLLM or SGLang.
Low-latency AI engine for mobile devices & wearables
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Optimizing inference proxy for LLMs
Sparsity-aware deep learning inference runtime for CPUs
RuVector is a High-Performance, Real-Time, Self-Learning Vector Graph Neural Network and Database built in Rust.
Distributed LLM inference. Connect home devices into a powerful cluster to accelerate LLM inference; more devices mean faster inference.
A portable accelerated SQL query, search, and LLM-inference engine, written in Rust, for data-grounded AI apps and agents.
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Lemonade helps users discover and run local AI apps by serving optimized LLMs right from their own GPUs and NPUs. Join our discord: https://discord.gg/5xXzkMu8Zk
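
Several of the servers above (OpenLLM, the Rust inference server, Lemonade) advertise OpenAI-compatible endpoints, meaning any OpenAI-style HTTP client can drive them. Below is a minimal sketch of such a request, assuming a local server at http://localhost:8000 serving a model named "llama-3"; the port, path prefix, and model name are placeholders for illustration, not values documented by any one project listed here.

```python
import requests

# Assumed local deployment details -- swap in your own server's
# base URL and model identifier.
BASE_URL = "http://localhost:8000/v1"
MODEL = "llama-3"

# POST to the standard OpenAI-style chat completions route.
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Explain paged attention in one sentence."}
        ],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()

# The response mirrors the OpenAI schema: choices -> message -> content.
print(resp.json()["choices"][0]["message"]["content"])
```

The same endpoint also works with the official OpenAI Python SDK by pointing its base_url at the local server, which is the usual sense in which these projects claim drop-in compatibility.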