MLX Fine-Tuning
A production-ready framework for fine-tuning Large Language Models on Apple Silicon using MLX and LoRA (Low-Rank Adaptation).
Overview
This repository provides tools and examples for efficient LLM fine-tuning on Apple Silicon devices using:
- MLX Framework: Apple's machine learning framework optimized for Apple Silicon
- LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning technique
- Production-grade evaluation: LLM-as-a-Judge benchmarking system
Features
- Parameter-efficient fine-tuning with LoRA
- Support for models from 0.5B to 7B+ parameters
- Production-ready evaluation framework
- Memory-optimized training with gradient checkpointing
- Comprehensive documentation and examples
Quick Start
Installation
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install mlx-lm numpy scipy scikit-learn tqdm
Basic Usage
# Download a model
mlx_lm.convert --hf-path mlx-community/Qwen2.5-0.5B-Instruct-4bit -q
# Fine-tune with LoRA
python -m mlx_lm.lora \
--model mlx_model \
--train \
--data data/ \
--adapter-path adapters/my-model \
--iters 1000 \
--batch-size 4 \
--learning-rate 1e-5
# Generate with fine-tuned model
mlx_lm.generate --model mlx_model --adapter-path adapters/my-model \
--prompt "Your prompt here"
Understanding LoRA
What is LoRA?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that dramatically reduces the computational and memory requirements for adapting large language models to specific tasks.
How LoRA Works
Instead of fine-tuning all model parameters, LoRA injects trainable low-rank matrices into the model's architecture. Specifically, for a pre-trained weight matrix W, LoRA represents the weight update as:
W' = W + BA
Where:
- W is the frozen pre-trained weight matrix (dimensions d × k)
- B and A are trainable low-rank matrices (dimensions d × r and r × k)
- r is the rank, typically r << min(d, k)
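To see the decomposition in code, here is a minimal NumPy sketch of a LoRA-adapted linear layer (the dimensions, the α = 2r scaling, and the zero initialization of B are illustrative conventions, not values taken from this repository):

import numpy as np

# Toy dimensions (hypothetical): d x k frozen weight, rank-r update
d, k, r = 512, 512, 16
alpha = 32  # scaling factor, following the common alpha = 2r pattern

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))            # frozen pre-trained weight (never updated)
A = 0.01 * rng.normal(size=(r, k))     # trainable, small random init
B = np.zeros((d, r))                   # trainable, zero init so that BA = 0 at the start

def lora_forward(x):
    # Adapted linear layer: x @ (W + (alpha/r) * B @ A).T, without ever forming W'
    scale = alpha / r
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(4, k))            # a batch of 4 input vectors
y = lora_forward(x)                    # shape (4, d); identical to x @ W.T at initialization

Because B starts at zero, the adapted layer reproduces the base model exactly before training, and only A and B, r × (d + k) values per adapted matrix, receive gradients.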
Key Benefits
- Memory Efficiency: Only trains ~0.1-3% of parameters
  - 7B model: ~100M trainable params vs 7B frozen params
  - Reduces memory footprint by 3-10x
- Training Speed: Faster training due to fewer parameters
  - Reduced gradient computation
  - Less memory movement
  - Faster convergence on small datasets
- Storage: Adapters are tiny (10-100MB vs multi-GB models)
  - Easy to version and share
  - Multiple task-specific adapters for one base model
- Modularity: Swap adapters without reloading base model
  - One base model + multiple task adapters
  - No catastrophic forgetting
LoRA Parameters
Rank (r):
- Controls adapter capacity
- Typical values: 8, 16, 32, 64
- Higher rank = more capacity but more parameters
- Rule of thumb: Start with r=16
Alpha (α):
- Scaling factor for LoRA updates
- Typical values: 16, 32, 64
- Common pattern: α = 2r
- Controls how much LoRA influences the model
Target Modules:
- Which layers LoRA is applied to
- Common: Query, Key, Value, Dense layers
- More modules = more parameters but better adaptation
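To make the rank and target-module trade-offs above concrete, here is a small back-of-the-envelope helper. It is a sketch with hypothetical layer shapes (hidden size 2048, 24 layers), not a count for any specific checkpoint:

def lora_param_count(layer_shapes, rank):
    # Each adapted matrix of shape (d_out, d_in) adds rank * (d_out + d_in) trainable values
    return sum(rank * (d_out + d_in) for (d_out, d_in) in layer_shapes)

# Hypothetical example: query/key/value/dense projections of a 24-layer model
# with hidden size 2048 (numbers chosen for illustration only)
hidden = 2048
shapes = [(hidden, hidden)] * 4 * 24

for rank in (8, 16, 32, 64):
    print(f"rank={rank}: {lora_param_count(shapes, rank):,} trainable parameters")

At rank 16 this comes to roughly 6.3M trainable parameters, comfortably within the ~0.1-3% range quoted above for a model in the 1-2B class.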
Example Configuration
# Low-resource setting (16GB RAM)
rank = 8
alpha = 16
target_modules = ["query", "value"] # ~0.5% parameters
# Balanced setting (32GB RAM)
rank = 16
alpha = 32
target_modules = ["query", "key", "value", "dense"] # ~1-2% parameters
# High-capacity setting (64GB+ RAM)
rank = 64
alpha = 128
target_modules = ["query", "key", "value", "dense", "mlp"] # ~3-5% parameters
When to Use LoRA
Best For:
- Task-specific adaptation (classification, QA, summarization)
- Limited compute resources
- Multiple task variants from one base model
- Quick experimentation
Not Ideal For:
- Teaching completely new knowledge (use full fine-tuning)
- Dramatically changing model behavior
- When you have unlimited compute
Documentation
For comprehensive guides and best practices, see:
- MLX Fine-Tuning Research Guide - Complete research guide covering training configurations, loss curves, optimization strategies, and production deployment
- Embedding LoRA Guide - Specialized guide for fine-tuning embedding models with LoRA for semantic search and retrieval tasks
- Walmart-Amazon Training Guide - Real-world product matching use case with complete training pipeline
Project Structure
mlx-finetuning/
├── README.md # This file
├── CONTRIBUTING.md # Contributing guidelines
├── requirements.txt # Python dependencies
├── docs/
│ ├── MLX_FINETUNING_RESEARCH_GUIDE.md # Research guide & best practices
│ ├── EMBEDDING_LORA_GUIDE.md # Embedding model fine-tuning
│ └── WALMART_AMAZON_TRAINING_GUIDE.md # Product matching use case
├── scripts/
│ ├── run_finetune.py # Main fine-tuning script
│ ├── download_model.py # Download MLX models
│ ├── download_judgelm.py # Download JudgeLM dataset
│ ├── evaluate_judge_model.py # Evaluate judge models
│ ├── evaluate_model_detailed.py # Detailed evaluation
│ ├── llm_judge_benchmark.py # LLM-as-a-Judge benchmark
│ ├── train_embedding_lora.py # Embedding-specific training
│ ├── upload_to_hf.py # Upload to HuggingFace
│ ├── compare_models.py # Model comparison
│ └── test_judge.py # Test judge outputs
├── examples/
│ └── quick-start.sh # Quick start example
└── data/
└── README.md # Data format documentation
Training Guide
Data Format
Training data should be in JSONL format with conversational structure:
{"messages": [
{"role": "user", "content": "Question or prompt"},
{"role": "assistant", "content": "Expected response"}
]}
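Producing this format takes only a few lines of Python. The sketch below assumes your raw data is a list of (prompt, response) pairs and writes the train.jsonl/valid.jsonl split that mlx_lm expects inside the --data directory (verify the file names against your mlx-lm version):

import json
from pathlib import Path

# Hypothetical raw examples; substitute your own (prompt, response) pairs
pairs = [
    ("What does LoRA stand for?", "LoRA stands for Low-Rank Adaptation."),
    ("Name one benefit of LoRA.", "Adapters are small and easy to share."),
]

def write_jsonl(path, examples):
    # One chat-formatted JSON object per line
    with open(path, "w") as f:
        for prompt, response in examples:
            record = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]}
            f.write(json.dumps(record) + "\n")

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
split = max(1, int(0.9 * len(pairs)))      # simple 90/10 train/validation split
write_jsonl(data_dir / "train.jsonl", pairs[:split])
write_jsonl(data_dir / "valid.jsonl", pairs[split:])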
Hyperparameter Guidelines
Learning Rate:
- Small models (0.5-1B): 1e-5 to 5e-5
- Medium models (3-7B): 1e-5 to 2e-5
- Large models (13B+): 5e-6 to 1e-5
Batch Size:
- Limited by memory
- 0.5B models: 16-64
- 3-7B models: 4-16
- Adjust based on available RAM
Iterations:
- Small datasets (<1K): 500-1000 iterations
- Medium datasets (1-10K): 1000-3000 iterations
- Large datasets (10K+): 3000-10000 iterations
Sequence Length:
- Shorter = faster training, less memory
- Balance task requirements vs resources
- Typical: 512 (fast), 1024 (balanced), 2048 (long context)
Memory Optimization
Gradient Checkpointing:
--grad-checkpoint # Trades compute for memory
Reduce Batch Size:
--batch-size 4 # Start small, increase if memory allows
Limit Layers:
--num-layers 8 # Only fine-tune last N layers
Shorter Sequences:
--max-seq-length 512 # Reduce if hitting OOM
Evaluation
Statistical Metrics
Basic accuracy and loss metrics are computed during training and validation.
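Concretely, the loss reported during training is token-level cross-entropy and the accuracy is next-token accuracy. A framework-agnostic NumPy sketch of both metrics, with random arrays standing in for real model outputs:

import numpy as np

def token_metrics(logits, targets):
    # logits: (tokens, vocab) unnormalized scores; targets: (tokens,) integer token ids
    shifted = logits - logits.max(axis=-1, keepdims=True)            # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    loss = -log_probs[np.arange(len(targets)), targets].mean()       # cross-entropy
    accuracy = (logits.argmax(axis=-1) == targets).mean()            # next-token accuracy
    return loss, accuracy

rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 32000))        # stand-in for one validation batch
targets = rng.integers(0, 32000, size=128)
print(token_metrics(logits, targets))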
LLM-as-a-Judge
For evaluating generative quality, we provide an LLM-as-a-Judge framework:
python scripts/llm_judge_benchmark.py \
--judge-model your-model \
--judge-adapter your-adapters \
--evaluator-model reference-model \
--test-data data/test.jsonl
This evaluates:
- Response quality and coherence
- Task-specific performance
- Reasoning ability
- Overall model utility
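At its core, an LLM-as-a-Judge loop is a rubric prompt plus score parsing. The sketch below is not the interface of scripts/llm_judge_benchmark.py; it only illustrates the pattern, with a hard-coded reply standing in for the judge model's output:

import re

JUDGE_TEMPLATE = """You are an impartial judge. Rate the assistant's answer
from 1 (poor) to 10 (excellent) for correctness, coherence, and helpfulness.

Question: {question}
Answer: {answer}

Reply with "Score: <number>" followed by a one-sentence justification."""

def build_judge_prompt(question, answer):
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def parse_score(judge_reply):
    # Extract the numeric score; return None if the judge did not follow the format
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judge_reply)
    return float(match.group(1)) if match else None

prompt = build_judge_prompt("What does LoRA stand for?", "Low-Rank Adaptation.")
# In the real benchmark this reply would come from the judge model; here it is stubbed
judge_reply = "Score: 9. The answer is correct and concise."
print(parse_score(judge_reply))   # 9.0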
Examples
Fine-tuning a Judge Model
# Download model
mlx_lm.convert --hf-path mlx-community/Qwen2.5-0.5B-Instruct-4bit -q
# Train
python -m mlx_lm.lora \
--model mlx_model \
--train \
--data data/judgelm \
--adapter-path adapters/judge \
--iters 2000 \
--batch-size 16 \
--learning-rate 2e-5 \
--num-layers 8
# Evaluate
python scripts/llm_judge_benchmark.py \
--judge-model mlx_model \
--judge-adapter adapters/judge \
--evaluator-model mlx_model \
--max-samples 100
Multi-Task Adaptation
# Train task-specific adapters
python -m mlx_lm.lora --data task1/ --adapter-path adapters/task1 --train
python -m mlx_lm.lora --data task2/ --adapter-path adapters/task2 --train
# Use different adapters with same base model
mlx_lm.generate --model base --adapter-path adapters/task1 --prompt "..."
mlx_lm.generate --model base --adapter-path adapters/task2 --prompt "..."
Supported Models
Any model from the MLX Community on Hugging Face is supported:
- Qwen2.5: 0.5B, 1.5B, 3B, 7B
- Gemma: 2B, 7B
- Phi: 2B, 3B
- SmolLM: 135M, 360M, 1.7B
- Llama: Various sizes
Find models at: https://huggingface.co/mlx-community
Performance Tips
Training Speed
- Reduce sequence length: Shorter sequences train faster
- Increase batch size: Better GPU utilization (if memory allows)
- Use fewer layers: --num-layers 4 for quick experiments
- Disable progress bars: --verbose False for batch jobs
Memory Optimization
- Enable gradient checkpointing: --grad-checkpoint
- Reduce batch size: Start with 4, increase gradually
- Limit trainable layers: --num-layers 8
- Use 4-bit quantized models: Significantly reduces memory
Quality Improvements
- More training iterations: Allow model to converge
- Higher rank: --lora-rank 32 for more capacity
- Learning rate tuning: Try 5e-6, 1e-5, 2e-5, 5e-5
- More training data: Quality and quantity matter
Troubleshooting
Out of Memory (OOM)
# Try these in order:
1. --grad-checkpoint
2. --batch-size 4
3. --num-layers 4
4. --max-seq-length 512
5. Use a smaller base model
Poor Performance
# Diagnose:
1. Check loss curve (should decrease steadily)
2. Evaluate on validation set frequently
3. Try different learning rates
4. Ensure data quality and format
5. Increase training iterations
Slow Training
# Speed up:
1. Reduce --max-seq-length
2. Increase --batch-size (if memory allows)
3. Use --num-layers for quick experiments
4. Remove --grad-checkpoint if memory allows
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure code follows style guidelines
- Submit a pull request
License
MIT License - see LICENSE file for details
Citation
If you use this framework in your research, please cite:
@software{mlx_finetuning,
title = {MLX Fine-Tuning: Production-Ready LoRA Training for Apple Silicon},
year = {2025},
url = {https://github.com/rachittshah/mlx-finetuning}
}
References
- LoRA Paper - Hu et al., 2021
- MLX Framework
- MLX-LM
- Parameter-Efficient Fine-Tuning
Acknowledgments
Built with:
- MLX by Apple
- MLX-LM by the MLX community
- LoRA technique by Microsoft Research