awesomelistsio/awesome-ai-infrastructure
Awesome AI Infrastructure 
A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.
Contents
- Distributed Training
- Model Serving and Deployment
- MLOps and Automation
- Data Management
- Optimization Tools
- Infrastructure as Code
- Cloud Platforms
- Learning Resources
- Books
- Community
- Contribute
- License
Distributed Training
- Horovod - A distributed deep learning training framework for TensorFlow, Keras, and PyTorch.
- Ray - A framework for building scalable distributed applications, including distributed AI and reinforcement learning.
- PyTorch Distributed - Tools and libraries for distributed training in PyTorch.
- DeepSpeed - A deep learning optimization library that makes distributed training easy and efficient.
- MPI for Machine Learning - Using the Message Passing Interface (MPI) standard for distributed machine learning.
Model Serving and Deployment
- TensorFlow Serving - A flexible, high-performance serving system for machine learning models.
- TorchServe - A model serving framework for PyTorch, providing fast and efficient model deployment.
- NVIDIA Triton Inference Server - A scalable model serving platform supporting multiple frameworks.
- ONNX Runtime - A cross-platform, high-performance inference engine for running and serving ONNX models.
- Seldon Core - An open-source platform for deploying and monitoring machine learning models on Kubernetes.
- KServe (formerly KFServing) - A Kubernetes-based model serving platform that originated in the Kubeflow project and is now developed as an independent project.
MLOps and Automation
- MLflow - An open-source platform for managing the end-to-end machine learning lifecycle.
- Kubeflow - A platform for orchestrating machine learning workflows on Kubernetes.
- DVC (Data Version Control) - A tool for version control and reproducibility in machine learning projects.
- ZenML - An extensible MLOps framework for creating portable, production-ready machine learning pipelines.
- Airflow - A platform for orchestrating complex workflows, commonly used in machine learning pipelines.
- Metaflow - A human-centric framework for building and managing real-life data science projects, developed by Netflix.
Data Management
- Delta Lake - An open-source storage layer that brings reliability to data lakes.
- Apache Hudi - A data management framework that simplifies incremental data processing and streaming analytics.
- Feast - An open-source feature store for managing and serving machine learning features.
- Great Expectations - A tool for data validation and testing in machine learning workflows.
- lakeFS - An open-source data versioning platform that brings Git-like branching and commits to data lakes.
Optimization Tools
- NVIDIA TensorRT - A high-performance deep learning inference optimizer and runtime.
- Apache TVM - A deep learning compiler stack for optimizing models on various hardware backends.
- Intel OpenVINO - A toolkit for optimizing and deploying AI inference on Intel hardware.
- OctoML - An AI model optimization platform for efficient deployment on edge and cloud.
- Quantization-Aware Training (QAT) - A technique that simulates low-precision arithmetic during training so that the quantized model retains accuracy at inference time.
Infrastructure as Code
- Terraform - A tool for building, changing, and versioning infrastructure safely and efficiently.
- Pulumi - Infrastructure as code for deploying and managing cloud infrastructure using programming languages.
- Ansible - An open-source automation tool for provisioning and managing infrastructure.
- AWS CloudFormation - A service for automating AWS resource deployment and management.
- Google Deployment Manager - An infrastructure management tool for Google Cloud Platform.
Cloud Platforms
- Amazon SageMaker - A managed AWS service for building, training, and deploying machine learning models.
- Google AI Platform (now Vertex AI) - Google Cloud’s integrated environment for AI development and deployment.
- Azure Machine Learning - A cloud-based platform for training, deploying, and managing machine learning models.
- IBM Watson Studio - A suite of tools for data science, machine learning, and AI model development.
- Paperspace Gradient - A cloud platform for developing, training, and deploying machine learning models.
Learning Resources
- Coursera: MLOps Fundamentals - A course on MLOps best practices for machine learning projects.
- Google Cloud: ML Operations - Training resources on MLOps and model deployment.
- AWS SageMaker Workshops - Example projects and tutorials for using AWS SageMaker.
- Kubeflow Documentation - Official documentation and guides for using Kubeflow.
- PyTorch Distributed Training Guide - A tutorial on distributed training with PyTorch.
Books
- Machine Learning Engineering by Andriy Burkov - A book on building scalable machine learning infrastructure.
- Building Machine Learning Powered Applications by Emmanuel Ameisen - A guide to building robust ML applications in production.
- Designing Data-Intensive Applications by Martin Kleppmann - A comprehensive guide to building scalable and reliable data systems.
- Introducing MLOps by Mark Treveil and the Dataiku Team - A book on best practices for MLOps and model deployment.
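- Reliable Machine Learning by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, and Todd Underwood - A book on applying reliability engineering principles to machine learning systems in production.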
Community
- MLOps Community - A global community focused on MLOps and AI infrastructure.
- Reddit: r/MachineLearning - A subreddit for general machine learning discussion, including infrastructure and tools.
- Kubeflow Slack - A Slack community for discussing Kubeflow and machine learning pipelines.
- Paperspace Forums - A community forum for discussing machine learning infrastructure and tools.
- GitHub: MLOps Repositories - A collection of open-source MLOps projects on GitHub.
Contribute
Contributions are welcome. Please ensure your submission fully follows the requirements outlined in CONTRIBUTING.md, including formatting, scope alignment, and category placement.
Pull requests that do not adhere to the contribution guidelines may be closed.
License
This awesome list contains affiliate links.