35 results for “topic:fsdp”
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
Best practices & guides for writing distributed PyTorch training code
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
Repository for fine-tuning Qwen-Image
A PyTorch native library for training speculative decoding models
Meta Llama 3 GenAI real-world use cases: an end-to-end implementation guide
Llama-style transformer in PyTorch with multi-node / multi-GPU training. Includes pretraining, fine-tuning, DPO, LoRA, and knowledge distillation. Scripts for dataset mixing and training from scratch.
🦾💻🌐 distributed training & serverless inference at scale on RunPod
A comprehensive hands-on guide to building production-grade distributed applications with Ray - from distributed training and multimodal data processing to inference and reinforcement learning.
Fast and easy distributed model training examples.
A script for training ConvNeXt V2 on the CIFAR-10 dataset using FSDP for distributed training.
A simple and efficient implementation of the 671B DeepSeek V3 model, trainable with FSDP+EP on as few as 256 A100/H100 GPUs, targeted at the HuggingFace ecosystem
A minimal, hackable pre-training stack for GPT-style language models
Minimal yet high-performance code for pretraining LLMs. Attempts to implement some SOTA features; supports training via DeepSpeed, Megatron-LM, and FSDP. WIP
Billus LLM skills library: training, hyperparameter tuning, pruning, quantization, and engineering for LLM, vision-language, multimodal, and image-generation models
Implementations of some popular approaches for efficient deep learning training and inference
Framework, Model & Kernel Optimizations for Distributed Deep Learning - Data Hack Summit
This repository focuses on distributed and parallel computing with PyTorch, covering model parallelism, data parallelism, and advanced optimization techniques. It provides resources for scaling AI training and inference efficiently across multiple devices.
Scalable multimodal AI system combining FSDP, RLHF, and Inferentia optimization for customer insights generation.
Dataloading for JAX
Training Qwen3 to solve Wordle using SFT and GRPO
A foundational repository for setting up distributed training jobs using Kubeflow and PyTorch FSDP.
🎨 Generate high-quality images with the Qwen-Image model, a powerful text-to-image tool optimized for fast and efficient deployment on serverless architecture.
FSDP and DeepSpeed ZeRO distributed training template for large vision models
This repository showcases hands-on projects leveraging distributed multi-GPU training to fine-tune large language models (LLMs).
Mini-FSDP for PyTorch. Minimal single-node Fully Sharded Data Parallel wrapper with param flattening, grad reduce-scatter, AMP, and tiny GPT/BERT training examples.
High-performance RLHF/GRPO pipeline scaling Gemma 3 on GKE Ray Clusters (B200/H200) using NVIDIA NeMo-RL. Includes native FSDP checkpoint merging and zero-shot vLLM benchmarking.
Custom FSDP + LoRA training loop for Ministral 3 (axolotl workaround)
Comprehensive exploration of LLMs, including cutting-edge techniques and tools such as parameter-efficient fine-tuning (PEFT), quantization, zero redundancy optimizers (ZeRO), fully sharded data parallelism (FSDP), DeepSpeed, and Huggingface accelerate.
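Several of the repositories above (e.g. the Mini-FSDP entry) describe the core FSDP mechanics: parameters are flattened into one buffer, sharded evenly across ranks, and gradients are combined with a reduce-scatter so each rank keeps only the averaged gradient for the shard it owns. The sketch below is a toy, single-process simulation of that sharding arithmetic in plain Python, not real FSDP code; all function names are illustrative, and a real implementation would use `torch.distributed` collectives on separate processes.

```python
import math

WORLD_SIZE = 2  # hypothetical number of ranks in this simulation


def flatten(layers):
    """Concatenate per-layer parameter lists into one flat list,
    mimicking FSDP's flat-parameter buffer."""
    return [p for layer in layers for p in layer]


def shard(flat, world_size):
    """Split a flat list into equal contiguous shards, zero-padding
    the tail so every rank owns the same number of elements."""
    per_rank = math.ceil(len(flat) / world_size)
    padded = flat + [0.0] * (per_rank * world_size - len(flat))
    return [padded[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def reduce_scatter(grads_per_rank):
    """Simulated reduce-scatter: rank r receives the mean, over all
    ranks, of shard r of the full gradient."""
    world_size = len(grads_per_rank)
    result = []
    for r in range(world_size):
        shard_len = len(grads_per_rank[0][r])
        result.append([
            sum(grads_per_rank[k][r][i] for k in range(world_size)) / world_size
            for i in range(shard_len)
        ])
    return result


# Two layers of parameters, flattened then sharded across 2 ranks.
flat = flatten([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
param_shards = shard(flat, WORLD_SIZE)  # rank r owns param_shards[r]

# Each rank computes a full-length gradient on its own microbatch,
# shards it, then reduce-scatter leaves each rank with the averaged
# gradient for only the shard it owns.
grads_rank0 = shard([1.0] * len(flat), WORLD_SIZE)
grads_rank1 = shard([3.0] * len(flat), WORLD_SIZE)
owned_grads = reduce_scatter([grads_rank0, grads_rank1])
print(owned_grads[0])  # rank 0's averaged shard: [2.0, 2.0, 2.0]
```

After the reduce-scatter, each rank applies the optimizer step only to its own shard; an all-gather (not shown) would reassemble full parameters before the next forward pass.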