189 results for “topic:dpo”
Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM/VLM!
MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical LLMs with a pipeline covering continued pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, ORPO, and GRPO.
Align Anything: Training All-modality Model with Feedback
A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions (HALOs).
[ICCV 2025] Official code of DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
🌾 OAT: A research-friendly framework for LLM online alignment, including reinforcement learning, preference learning, etc.
Easy and efficient fine-tuning of LLMs (supports Llama, Llama 2, Llama 3, Qwen, Baichuan, GLM, Falcon). Efficient quantized training and deployment of large models.
A Deep Learning NLP repository using TensorFlow, covering everything from text preprocessing to downstream tasks for recent models such as topic models, BERT, GPT, and LLMs.
A curated list of papers on reinforcement learning for video generation
Implementation for "Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs"
An Efficient "Factory" to Build Multiple LoRA Adapters
Train Large Language Models on MLX.
SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.
[CVPR 2025] Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization
[ICLR 2025] IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation
An RLHF Infrastructure for Vision-Language Models
Notus is a collection of LLMs fine-tuned with SFT, DPO, SFT+DPO, and/or other RLHF techniques, always keeping a data-first approach.
Pivotal Token Search
Technical analysis library for .NET
CosyVoice_DPO_NOTES: Supercharge Your CosyVoice Model with Cutting-Edge DPO Fine-Tuning!
This repository contains the code for SFT, RLHF, and DPO, designed for vision-based LLMs, including the LLaVA models and the LLaMA-3.2-vision models.
CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation (ICML2025)
A Survey of Direct Preference Optimization (DPO) (the DPO objective itself is sketched after this list)
[ICLR 2025] SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction
Reinforcement Learning Framework for Visual Generation
SyGra - Graph-oriented Synthetic Data Generation Pipeline
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
A medical question-answering system based on Qwen2 + SFT + DPO. The project uses custom SFTTrainer/DPOTrainer/TRPOTrainer classes for training and calls various knowledge-base tools (neo4j, milvus, LDA, etc.) to generate training data automatically. vLLM is used for inference and deployment of the trained model, which connects through the vLLM API to a RAG system built on an embedder + reranker. A multi-agent consultation system modeled on the MDAgents paper is also implemented, likewise accessible via the vLLM API. (See the TRL-style training sketch after this list.)
A travel agent based on Qwen2.5, fine-tuned with SFT + DPO/PPO/GRPO on a travel question-answer dataset; a mind map can be generated from the response. A RAG system is built on top of the tuned Qwen2.5 using prompt templates, tool use, a Chroma embedding database, and LangChain.
This training offers an intensive exploration of frontier reinforcement learning techniques for large language models (LLMs). We explore advanced topics such as Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and reasoning LLMs, and demonstrate practical applications such as fine-tuning.
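Many of the entries above (the HALOs library, Step-DPO, the DPO survey) revolve around the same objective. For orientation, here is a minimal sketch of the standard DPO loss in PyTorch; the function and argument names are illustrative, not taken from any repository listed here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss (Rafailov et al., 2023) -- illustrative sketch.

    Inputs are summed per-token log-probabilities of the chosen and
    rejected completions under the trained policy and under a frozen
    reference model; beta scales the implicit KL penalty.
    """
    # Implicit rewards: how far the policy has moved from the
    # reference model on each completion.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): maximize the reward margin of the
    # chosen completion over the rejected one.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```

The reference log-probabilities act as a baseline, so beta controls how far the policy may drift from the reference model while still learning to prefer chosen over rejected completions.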
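Several repositories in this list (e.g., the Qwen2 medical QA system and the Qwen2.5 travel agent) wire DPO into custom trainers. As a point of comparison only, the analogous setup with Hugging Face TRL's stock DPOTrainer looks roughly like this; the model and dataset names are placeholders, and the arguments follow recent TRL releases rather than those projects' custom code.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder model; any causal LM with a chat template works similarly.
model_name = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A preference dataset with "chosen"/"rejected" columns (placeholder).
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# beta is the same KL-strength knob as in the loss sketch above.
config = DPOConfig(output_dir="qwen2-dpo", beta=0.1)
trainer = DPOTrainer(model=model, args=config,
                     train_dataset=dataset,
                     processing_class=tokenizer)
trainer.train()
```

When no separate reference model is passed, TRL uses a frozen copy of the initial policy as the reference, which matches the loss formulation above.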