245 results for “topic:model-serving”
A high-throughput and memory-efficient inference and serving engine for LLMs
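This tagline matches vLLM, which exposes an offline Python API alongside its OpenAI-compatible server. A minimal sketch, assuming that match and using facebook/opt-125m as an arbitrary stand-in checkpoint:

```python
# Minimal offline-inference sketch with vLLM; the checkpoint is a stand-in.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.8, max_tokens=32)

llm = LLM(model="facebook/opt-125m")   # loads weights and allocates the KV cache
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)         # first completion for each prompt
```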
The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!
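This description matches BentoML; a hedged sketch of a service in its 1.2+ Python API, with placeholder names and logic, served locally via the `bentoml serve` CLI:

```python
# Hedged sketch of a BentoML (1.2+) service; names and logic are placeholders.
import bentoml

@bentoml.service(resources={"cpu": "1"})
class Echo:
    @bentoml.api
    def predict(self, text: str) -> str:
        # A real service would invoke a model here; this stub uppercases input.
        return text.upper()
```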
Standardized Distributed Generative and Predictive AI Inference Platform for Scalable, Multi-Framework Deployment on Kubernetes
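This is KServe's description; besides its Kubernetes CRDs, the project's Python SDK lets you wrap arbitrary code as a predictor. A hedged sketch (the predict() signature has shifted across SDK versions, so treat this as illustrative):

```python
# Hedged sketch of a custom KServe predictor via the `kserve` Python SDK.
from typing import Dict

import kserve

class EchoModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = True                   # report the model as ready to serve

    def predict(self, payload: Dict, headers: Dict[str, str] = None) -> Dict:
        # A real predictor runs inference here; this stub echoes the inputs.
        return {"predictions": payload.get("instances", [])}

if __name__ == "__main__":
    kserve.ModelServer().start([EchoModel("echo-model")])
```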
In this repository, I will share some useful notes and references about deploying deep learning-based models in production.
Olares: An Open-Source Personal Cloud to Reclaim Your Data
FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI job on any GPU cloud or on-premises cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Video Tutorials.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
A framework for efficient model inference with omni-modality models
🏕️ Reproducible development environment for humans and agents
AICI: Prompts as (Wasm) Programs
Community maintained hardware plugin for vLLM on Ascend
MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.
An open source DevOps tool from the CNCF for packaging and versioning AI/ML models, datasets, code, and configuration into an OCI Artifact.
Hopsworks - Data-Intensive AI platform with a Feature Store
The simplest way to serve AI/ML models in production
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
A throughput-oriented high-performance serving framework for LLMs
A highly optimized LLM inference acceleration engine for Llama and its variants.
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your machine's compute
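The dynamic-batching idea itself is easy to sketch independently of this repo's API: requests arriving within a short window are grouped so the model runs once per batch. An illustrative asyncio sketch with made-up limits:

```python
# Illustrative dynamic batching: group requests that arrive within MAX_WAIT_S,
# up to MAX_BATCH, and run one batched "model" call per group.
import asyncio

MAX_BATCH = 8
MAX_WAIT_S = 0.01
queue: asyncio.Queue = asyncio.Queue()

def model_forward(xs):
    return [x * 2 for x in xs]             # stand-in for one batched model call

async def batcher():
    while True:
        batch = [await queue.get()]        # block until the first request
        loop = asyncio.get_running_loop()
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH and (left := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(queue.get(), left))
            except asyncio.TimeoutError:
                break
        inputs, futures = zip(*batch)
        for fut, out in zip(futures, model_forward(list(inputs))):
            fut.set_result(out)            # wake each waiting caller

async def infer(x):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def main():
    worker = asyncio.create_task(batcher())
    print(await asyncio.gather(*(infer(i) for i in range(20))))
    worker.cancel()

asyncio.run(main())
```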
A scalable inference server for models optimized with OpenVINO™
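OpenVINO Model Server also speaks the KServe v2 REST protocol, so a request can be a plain HTTP POST. A hedged sketch in which the model name, port, and tensor shape are assumptions:

```python
# Hedged KServe-v2 REST call; endpoint, model name, and shape are assumptions.
import requests

payload = {
    "inputs": [{
        "name": "input_0",                 # must match the model's input tensor
        "shape": [1, 4],
        "datatype": "FP32",
        "data": [5.1, 3.5, 1.4, 0.2],
    }]
}

resp = requests.post(
    "http://localhost:8000/v2/models/my_model/infer",
    json=payload,
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["outputs"])              # v2 responses carry an "outputs" list
```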
Model Deployment at Scale on Kubernetes 🦄️
Serverless LLM Serving for Everyone.
Ollama for classical ML models. AOT compiler that turns XGBoost, LightGBM, scikit-learn, CatBoost & ONNX models into native C99 inference code. One command to load, one command to serve. 336x faster than Python inference.
A production-ready FastAPI skeleton app for serving machine learning models.
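The shape of such a skeleton is worth a sketch: a typed /predict endpoint over a stubbed model, with illustrative field names:

```python
# Minimal FastAPI serving skeleton; payload fields and the stub are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="model-server")

class PredictRequest(BaseModel):
    features: list[float]

class PredictResponse(BaseModel):
    prediction: float

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    # A real app loads a trained model at startup and calls it here;
    # this stub returns the mean of the input features.
    mean = sum(req.features) / max(len(req.features), 1)
    return PredictResponse(prediction=mean)
```

Run locally with `uvicorn main:app --reload` and POST JSON to /predict.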
Python + Inference: a model deployment library in Python. The simplest model inference server ever.
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future; PRs welcome).