eliashossain001/grpo-finetune-deepseek-qwen3
GRPO Fine-Tuning on DeepSeek-R1-0528-Qwen3-8B
This repository provides a lightweight, end-to-end fine-tuning pipeline using GRPO (Group Relative Policy Optimization) on the DeepSeek-R1-0528-Qwen3-8B model, enhanced with Unsloth for parameter-efficient training.
Features
- GRPO training with custom reward functions
- Parameter-efficient fine-tuning (LoRA) using Unsloth
- Inference-ready with GGUF/merged/int4 export support
- Model evaluation via prompt completion comparisons
- Easily adaptable to your own dataset (just replace `data/train.json` with the required format)
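The custom reward functions follow the usual TRL-style GRPO convention: each function scores a batch of completions and returns one float per completion. The sketch below illustrates that shape with a format-checking reward; it is an assumption about the interface, not the actual code in `rewards/format_check.py`.

```python
import re

def format_reward(completions, **kwargs):
    """Reward completions that wrap their reasoning in <think>...</think> tags.

    Follows the TRL GRPOTrainer reward-function convention: takes the batch
    of completion strings and returns one float score per completion.
    (Illustrative sketch; the repo's format_check.py may differ.)
    """
    pattern = re.compile(r"<think>.*?</think>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]
```

The trainer can take a list of such functions and combine their scores, which is how multiple rewards (answer, format, language) guide the same run.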
Project Structure
```
reasoning_finetune/
├── configs/              # Configuration scripts for GRPO training
│   └── grpo_config.py
├── data/                 # Dataset handling and preprocessing
│   └── preprocess.py
├── model/                # Model loading, tokenizer setup, and save utilities
│   ├── load_model.py
│   └── save_model.py
├── rewards/              # Reward functions for GRPO
│   ├── answer_check.py
│   ├── format_check.py
│   └── language_check.py
├── train_module/         # Training logic
│   └── train.py
├── inference/            # Inference and comparison script
│   └── compare_outputs.py
├── export/               # Scripts to export LoRA/GGUF/merged weights
│   └── export_model.py
├── outputs/              # Training checkpoints (excluded via .gitignore)
├── utils/                # Utility scripts (regex, langid wrapper, etc.)
│   └── langid_utils.py
├── requirements.txt      # All dependencies listed here
├── main.py               # Entry point for model training
└── README.md
```
Installation
```bash
pip install -r requirements.txt
```

Usage
1. Prepare Dataset
Place your dataset in the `data/` folder. The format should be:
```json
[
  {
    "instruction": "Your prompt here",
    "input": "Additional context if needed",
    "output": "Expected reasoning response"
  }
]
```

2. Run Training
```bash
python main.py
```

3. Inference
```bash
python inference/compare_outputs.py
```

Export Options
Supports multiple formats:
- `merged_16bit` / `merged_4bit`: merged full weights in 16-bit or 4-bit
- `lora`: adapter-only saving
- `gguf`: export for llama.cpp compatibility (q4_k_m, q5_k_m, q8_0, f16, etc.)
Run `export/export_model.py` and specify your desired export method.
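The export script can be pictured as a thin dispatcher over Unsloth's save helpers. The helper below, and the `save_pretrained_*` calls shown in its comments, are an illustrative sketch based on Unsloth's documented save utilities, not the repo's actual `export_model.py`.

```python
# Sketch: choose keyword arguments for an Unsloth-style export.
# The method names mirror the options listed above.

EXPORT_METHODS = {
    "merged_16bit": {"save_method": "merged_16bit"},
    "merged_4bit": {"save_method": "merged_4bit"},
    "lora": {"save_method": "lora"},
}

def export_kwargs(method: str, quant: str = "q4_k_m") -> dict:
    """Return keyword arguments for the chosen export method."""
    if method == "gguf":
        # GGUF export takes a llama.cpp quantization method instead.
        return {"quantization_method": quant}
    if method not in EXPORT_METHODS:
        raise ValueError(f"unknown export method: {method}")
    return EXPORT_METHODS[method]

# Usage (requires a loaded Unsloth model and tokenizer; names assumed):
#   model.save_pretrained_merged("outputs/export", tokenizer,
#                                **export_kwargs("merged_16bit"))
#   model.save_pretrained_gguf("outputs/export", tokenizer,
#                              **export_kwargs("gguf", "q5_k_m"))
```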
Sample Output
| Prompt | Output Before Finetuning | Output After Finetuning |
|---|---|---|
| What is 17 + 25? | "Let me think... maybe 30?" | "The answer is 42." |
Notes
- This pipeline is lightweight and suitable for most consumer GPUs (LoRA+16bit).
- GRPO allows the use of multiple reward functions to guide fine-tuning.
- To train on your own dataset, simply replace the JSON file in `data/` and ensure it follows the structure shown above.
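A quick way to sanity-check a replacement file against the expected schema is a small standalone script like the one below (a sketch, not a script shipped in this repo):

```python
import json

# Keys every record must carry, per the dataset format above.
REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_dataset(path):
    """Check that the file is a JSON list of records with the
    instruction/input/output keys the pipeline expects."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("dataset must be a JSON list of records")
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return len(records)
```

Running it before training catches malformed records early instead of mid-run.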
License
MIT License.
Author
Elias Hossain
Machine Learning Researcher | PhD Student | AI x Reasoning Enthusiast
Happy fine-tuning!