35 results for “topic:fsdp”
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
Best practices & guides for writing distributed PyTorch training code
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
Repository for fine-tuning Qwen-Image
A PyTorch native library for training speculative decoding models
Meta Llama 3 GenAI real-world use cases: an end-to-end implementation guide
Llama-style transformer in PyTorch with multi-node / multi-GPU training. Includes pretraining, fine-tuning, DPO, LoRA, and knowledge distillation. Scripts for dataset mixing and training from scratch.
🦾💻🌐 distributed training & serverless inference at scale on RunPod
A comprehensive hands-on guide to building production-grade distributed applications with Ray - from distributed training and multimodal data processing to inference and reinforcement learning.
Fast and easy distributed model training examples.
A script for training ConvNeXt V2 on the CIFAR-10 dataset using FSDP for distributed training.
A simple and efficient implementation of the 671B DeepSeek V3 model, trainable with FSDP+EP on as few as 256 A100/H100 GPUs, targeted at the HuggingFace ecosystem
A minimal, hackable pre-training stack for GPT-style language models
Minimal yet high-performance code for pretraining LLMs. Attempts to implement some SOTA features; supports training via DeepSpeed, Megatron-LM, and FSDP. WIP
Billus LLM skills library: training, hyperparameter tuning, pruning, quantization, and engineering for LLM, vision-language, multimodal, and image-generation models
Implementations of some popular approaches for efficient deep learning training and inference
Framework, Model & Kernel Optimizations for Distributed Deep Learning - Data Hack Summit
This repository focuses on distributed and parallel computing with PyTorch, covering model parallelism, data parallelism, and advanced optimization techniques. It provides resources for scaling AI training and inference efficiently across multiple devices.
Scalable multimodal AI system combining FSDP, RLHF, and Inferentia optimization for customer insights generation.
Dataloading for JAX
Training Qwen3 to solve Wordle using SFT and GRPO
A foundational repository for setting up distributed training jobs using Kubeflow and PyTorch FSDP.
🎨 Generate high-quality images with the Qwen-Image model, a powerful text-to-image tool optimized for fast and efficient deployment on serverless architecture.
FSDP and DeepSpeed ZeRO distributed training template for large vision models
This repository showcases hands-on projects leveraging distributed multi-GPU training to fine-tune large language models (LLMs).
Mini-FSDP for PyTorch. Minimal single-node Fully Sharded Data Parallel wrapper with param flattening, grad reduce-scatter, AMP, and tiny GPT/BERT training examples.
High-performance RLHF/GRPO pipeline scaling Gemma 3 on GKE Ray Clusters (B200/H200) using NVIDIA NeMo-RL. Includes native FSDP checkpoint merging and zero-shot vLLM benchmarking.
Custom FSDP + LoRA training loop for Ministral 3 (axolotl workaround)
Comprehensive exploration of LLMs, including cutting-edge techniques and tools such as parameter-efficient fine-tuning (PEFT), quantization, zero redundancy optimizers (ZeRO), fully sharded data parallelism (FSDP), DeepSpeed, and Huggingface accelerate.
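Several of the repositories above (e.g. the Mini-FSDP entry) describe the core FSDP mechanics: parameters are flattened into one buffer, sharded evenly across ranks, and gradients are combined with a reduce-scatter so each rank keeps only the averaged gradient for the shard it owns. The sketch below is a toy, single-process simulation of that sharding arithmetic in plain Python, not real FSDP code; all function names are illustrative, and a real implementation would use `torch.distributed` collectives on separate processes.

```python
import math

WORLD_SIZE = 2  # hypothetical number of ranks in this simulation


def flatten(layers):
    """Concatenate per-layer parameter lists into one flat list,
    mimicking FSDP's flat-parameter buffer."""
    return [p for layer in layers for p in layer]


def shard(flat, world_size):
    """Split a flat list into equal contiguous shards, zero-padding
    the tail so every rank owns the same number of elements."""
    per_rank = math.ceil(len(flat) / world_size)
    padded = flat + [0.0] * (per_rank * world_size - len(flat))
    return [padded[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def reduce_scatter(grads_per_rank):
    """Simulated reduce-scatter: rank r receives the mean, over all
    ranks, of shard r of the full gradient."""
    world_size = len(grads_per_rank)
    result = []
    for r in range(world_size):
        shard_len = len(grads_per_rank[0][r])
        result.append([
            sum(grads_per_rank[k][r][i] for k in range(world_size)) / world_size
            for i in range(shard_len)
        ])
    return result


# Two layers of parameters, flattened then sharded across 2 ranks.
flat = flatten([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
param_shards = shard(flat, WORLD_SIZE)  # rank r owns param_shards[r]

# Each rank computes a full-length gradient on its own microbatch,
# shards it, then reduce-scatter leaves each rank with the averaged
# gradient for only the shard it owns.
grads_rank0 = shard([1.0] * len(flat), WORLD_SIZE)
grads_rank1 = shard([3.0] * len(flat), WORLD_SIZE)
owned_grads = reduce_scatter([grads_rank0, grads_rank1])
print(owned_grads[0])  # rank 0's averaged shard: [2.0, 2.0, 2.0]
```

After the reduce-scatter, each rank applies the optimizer step only to its own shard; an all-gather (not shown) would reassemble full parameters before the next forward pass.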