
eliashossain001/grpo-finetune-deepseek-qwen3

GRPO Fine-Tuning on DeepSeek-R1-0528-Qwen3-8B

This repository provides a lightweight, end-to-end fine-tuning pipeline using GRPO (Group Relative Policy Optimization) on the DeepSeek-R1-0528-Qwen3-8B model, with Unsloth for parameter-efficient training.

Features

  • GRPO training with custom reward functions
  • Parameter-efficient fine-tuning (LoRA) using Unsloth
  • Inference-ready with GGUF/merged/int4 export support
  • Model evaluation via prompt completion comparisons
  • Easily adaptable to your own dataset (just replace data/train.json with the required format)
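As an illustration of how a GRPO reward function can be structured (the actual code in rewards/format_check.py may differ; GRPO trainers such as TRL's call each reward function with a batch of completions and expect one float per completion), a format-check reward might look like this sketch:

```python
import re

def format_reward(completions, **kwargs):
    """Reward completions that wrap their reasoning in <think>...</think>
    and then produce a final answer. Hypothetical sketch; the repo's
    rewards/format_check.py may use a different scheme."""
    pattern = re.compile(r"<think>.*?</think>\s*\S", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]
```

Returning a plain list of floats keeps the function drop-in compatible with trainers that average several such rewards per sample.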

๐Ÿ“ Project Structure

reasoning_finetune/
├── configs/                   # Configuration scripts for GRPO training
│   └── grpo_config.py
│
├── data/                      # Dataset handling and preprocessing
│   └── preprocess.py
│
├── model/                     # Model loading, tokenizer setup, and save utilities
│   ├── load_model.py
│   └── save_model.py
│
├── rewards/                   # Reward functions for GRPO
│   ├── answer_check.py
│   ├── format_check.py
│   └── language_check.py
│
├── train_module/              # Training logic
│   └── train.py
│
├── inference/                 # Inference and comparison script
│   └── compare_outputs.py
│
├── export/                    # Scripts to export LoRA/GGUF/merged weights
│   └── export_model.py
│
├── outputs/                   # Training checkpoints (excluded via .gitignore)
│
├── utils/                     # Utility scripts (regex, langid wrapper, etc.)
│   └── langid_utils.py
│
├── requirements.txt           # All dependencies listed here
├── main.py                    # Entry point for model training
└── README.md                  # To be added

📦 Installation

pip install -r requirements.txt

🧑‍🏫 Usage

1. Prepare Dataset

Place your dataset in the data/ folder. The format should be:

[
  {
    "instruction": "Your prompt here",
    "input": "Additional context if needed",
    "output": "Expected reasoning response"
  }
]
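A small loader that validates this structure before training can catch format mistakes early. This is an illustrative snippet, not code from the repo; the default path data/train.json follows the feature list above:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"instruction", "input", "output"}

def load_dataset(path="data/train.json"):
    """Load the training records and verify each one has the keys
    shown in the format above, raising early on malformed entries."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return records
```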

2. Run Training

python main.py
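Under the hood, main.py wires the modules above together. A minimal sketch of what that wiring can look like with Unsloth and TRL's GRPOTrainer, shown as a configuration fragment only: the repo's configs/grpo_config.py and train_module/train.py define the actual settings, and the hyperparameter values, `train_dataset`, and reward-function names below are placeholders, not the repo's code.

```python
# Illustrative wiring sketch, not the repo's actual main.py.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
    load_in_4bit=True,                     # fits consumer GPUs
)
model = FastLanguageModel.get_peft_model(  # attach LoRA adapters
    model, r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

config = GRPOConfig(
    output_dir="outputs",
    num_generations=4,                     # completions sampled per prompt
    max_completion_length=512,
    learning_rate=5e-6,
)
trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,           # built by data/preprocess.py
    reward_funcs=[format_reward, answer_reward],  # from rewards/
)
trainer.train()
```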

3. Inference

python inference/compare_outputs.py

💾 Export Options

Supports multiple formats:

  • merged_16bit / merged_4bit
  • lora adapter-only saving
  • gguf export for llama.cpp compatibility (q4_k_m, q5_k_m, q8_0, f16, etc.)

Run export/export_model.py and specify your desired export method.
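These method names map onto Unsloth's save utilities (save_pretrained_merged for merged or adapter-only weights, save_pretrained_gguf for GGUF). A hypothetical dispatch table, not the repo's actual export_model.py, might resolve the chosen method like this:

```python
# Hypothetical mapping from an export method name to the Unsloth save
# call and keyword arguments it would use; export/export_model.py may
# organize this differently.
EXPORT_METHODS = {
    "merged_16bit": ("save_pretrained_merged", {"save_method": "merged_16bit"}),
    "merged_4bit":  ("save_pretrained_merged", {"save_method": "merged_4bit"}),
    "lora":         ("save_pretrained_merged", {"save_method": "lora"}),
    "gguf":         ("save_pretrained_gguf",   {"quantization_method": "q4_k_m"}),
}

def resolve_export(method):
    """Return (save_function_name, kwargs) for a known export method."""
    if method not in EXPORT_METHODS:
        raise ValueError(f"unknown export method: {method}")
    return EXPORT_METHODS[method]
```

Other llama.cpp quantization types (q5_k_m, q8_0, f16, etc.) would slot in as alternative `quantization_method` values.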

🖼️ Sample Output

| Prompt           | Output Before Finetuning    | Output After Finetuning |
|------------------|-----------------------------|-------------------------|
| What is 17 + 25? | "Let me think... maybe 30?" | "The answer is 42."     |


💡 Notes

  • This pipeline is lightweight and suitable for most consumer GPUs (LoRA+16bit).
  • GRPO allows the use of multiple reward functions to guide fine-tuning.
  • To train on your own dataset, simply replace the JSON file in data/ and ensure it follows the structure shown above.
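On the multiple-reward point above, combining several checks into one score can be sketched as a weighted sum (illustrative only; trainers like TRL's GRPOTrainer also accept a list of reward functions directly, with optional weights in their config):

```python
def combine_rewards(reward_fns, completions, weights=None):
    """Weighted sum of per-completion scores from several reward
    functions (e.g. answer, format, and language checks)."""
    if weights is None:
        weights = [1.0] * len(reward_fns)
    per_fn = [fn(completions) for fn in reward_fns]
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_fn))
        for i in range(len(completions))
    ]
```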

📜 License

MIT License.


🙋‍♂️ Acknowledgements


👨‍💼 Author

Elias Hossain
Machine Learning Researcher | PhD Student | AI x Reasoning Enthusiast

GitHub

Happy fine-tuning! 🎯
