# ChessFM 🧠♟️

A 1.5B parameter model that plays chess by reasoning, not memorizing.
## 💡 The Idea

Most chess bots play by brute-force search. ChessFM plays by thinking out loud:

```
<think>
The opponent's queen threatens my f7 pawn.
If I castle now, I lose material.
Better to block with Nf6 first.
</think>
Nf6
```

The model explains why it's making a move, like a chess tutor, not a calculator.
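Splitting the reasoning from the move is a simple parse; the sketch below is a hypothetical helper (not part of the repo) that also doubles as the "valid `<think>` tags" check used in the goals:

```python
import re

# Reasoning inside <think>...</think>, followed by the move itself.
THINK_RE = re.compile(r"<think>(.*?)</think>\s*(\S+)", re.DOTALL)

def parse_response(text: str):
    """Split a model response into (reasoning, move).

    Returns (None, None) when the <think> format is violated, which is
    exactly the event counted by the reasoning-format metric.
    """
    m = THINK_RE.search(text)
    if not m:
        return None, None
    return m.group(1).strip(), m.group(2).strip()

reasoning, move = parse_response(
    "<think>\nThe opponent's queen threatens my f7 pawn.\n</think>\nNf6"
)
# move == "Nf6"
```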
## 🎯 Goals

| Metric | Target |
|---|---|
| Elo rating | 1200+ (beat most LLMs) |
| Illegal-move rate | < 5% |
| Reasoning format | > 95% valid `<think>` tags |
### Benchmarks

| Model | Elo |
|---|---|
| GPT-4o | ~1050 |
| Gemini Pro | ~1050 |
| Claude Sonnet | ~1000 |
| ChessFM (target) | 1200 |
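Under the standard logistic Elo model, a measured score against an opponent of known strength implies a rating gap, which is how a fixed-size eval (e.g. 500 games) turns into an Elo number. A minimal sketch; the 0.64 score and 1100 reference rating in the comment are illustrative, not measured:

```python
import math

def elo_diff_from_score(score: float) -> float:
    """Invert the logistic Elo expectation E = 1 / (1 + 10**(-d/400)).

    `score` is average points per game (win=1, draw=0.5) against a
    reference opponent; returns the implied rating difference d.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

# e.g. scoring 64% against an opponent rated ~1100 implies roughly
# 1100 + elo_diff_from_score(0.64) ~= 1200
```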
## 🔬 Approach

### Phase 1: SFT Bootstrap

Train on reasoning traces to teach the model chess fundamentals and the `<think>` format.
### Phase 2: Direct GRPO (Reinforcement Learning)

Train directly on chess games using verifiable rewards (legal/illegal moves, win/lose outcomes). Stockfish provides the reward signal for curriculum learning.
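A sketch of what such a verifiable reward could look like. The shaping values (-1.0 for an illegal move, +0.1 legality bonus) are illustrative assumptions, not the repo's actual numbers, and in practice `legal_moves` and `game_result` would come from the engine rather than being passed in:

```python
def grpo_reward(move: str, legal_moves: set, game_result=None) -> float:
    """Verifiable reward sketch: legality first, game outcome second.

    game_result: +1.0 win, 0.0 draw, -1.0 loss, or None mid-game.
    """
    if move not in legal_moves:
        return -1.0            # verifiable: the engine rejects the move
    reward = 0.1               # survived the legality check
    if game_result is not None:
        reward += game_result  # terminal outcome of the rollout
    return reward
```

Because both signals (legality, game outcome) are checkable by Stockfish, there is no learned reward model to hack, which is the appeal of GRPO-style RL here.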
### Phase 3: Curriculum Learning

Progressive difficulty: Random → Stockfish L1 → Stockfish L3
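The promotion rule for this ladder can be as simple as a win-rate threshold; the 60% trigger below is a hypothetical choice, not a number from the project:

```python
def opponent_for(win_rate: float, current: str) -> str:
    """Promote the opponent once the model wins consistently.

    Hypothetical schedule for the Random -> Stockfish L1 -> L3 ladder:
    advance when the rolling win rate crosses 60%.
    """
    ladder = ["random", "stockfish_l1", "stockfish_l3"]
    i = ladder.index(current)
    if win_rate >= 0.6 and i < len(ladder) - 1:
        return ladder[i + 1]
    return current
```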
## 🛠️ Stack

| Component | Tool | Purpose |
|---|---|---|
| Base model | Qwen-2.5-3B-Instruct | Best format adherence in benchmarks |
| Training | unsloth | 2× faster, 60% less VRAM |
| Inference | vLLM | Fast game rollouts |
| Chess engine | Stockfish 16 | Reward signal + validation |
| Hardware | RTX 4090 (RunPod) | 24 GB VRAM |
## 📊 Training Pipeline

```
┌──────────────────────────────────────────────────────────┐
│                                                          │
│  FEN Position                                            │
│       │                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│  │  SFT on     │    │  GRPO vs    │    │    Elo      │   │
│  │  reasoning  │ →  │  Stockfish  │ →  │    Eval     │   │
│  │  traces     │    │  curriculum │    │  (500       │   │
│  │  (185 smpl) │    │             │    │   games)    │   │
│  └─────────────┘    └─────────────┘    └─────────────┘   │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
## 🎁 Bonus Features (After v1)

- Socratic Structure → Force reasoning into `<threat_scan>`, `<candidates>`, and `<verification>` tags
- Negative Data → Train on mistake-then-correction examples
- Puzzle Training → Tactical curriculum from Lichess puzzles
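The Socratic structure could be enforced with a format check analogous to the `<think>` validator; a minimal sketch, assuming the three tags must each open, close, and appear in order:

```python
import re

SOCRATIC_TAGS = ("threat_scan", "candidates", "verification")

def has_socratic_structure(text: str) -> bool:
    """True if all three tags appear, properly closed, in the expected order."""
    pattern = "".join(rf"<{t}>.*?</{t}>\s*" for t in SOCRATIC_TAGS)
    return re.search(pattern, text, re.DOTALL) is not None
```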
## 💰 Cost Estimate

| Phase | Time | Cost |
|---|---|---|
| Setup & baseline | 4 hr | $1.80 |
| SFT training | 4 hr | $1.80 |
| GRPO training | 20 hr | $9.00 |
| **Total** | ~28 hr | ~$13 |
Yes, you can train a chess-playing LLM for the price of lunch.
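The table is consistent with a flat GPU rate; a quick check of the arithmetic (the $0.45/hr rate is inferred from the table, not quoted from RunPod):

```python
RATE_PER_HOUR = 1.80 / 4            # implied RTX 4090 rate: $0.45/hr
phase_hours = {"setup_baseline": 4, "sft": 4, "grpo": 20}

total_hours = sum(phase_hours.values())    # 28
total_cost = total_hours * RATE_PER_HOUR   # 12.6, i.e. "~$13"
```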
## 📚 References

- GRPO paper → our RL algorithm
- Qwen2.5-Math → base model architecture
- DeepSeek-R1 → self-correction patterns
- Dynomight chess → regurgitation technique
## 📍 Roadmap

See the full [ChessFM Roadmap](chess_fm_roadmap.md) for detailed implementation steps.
## 🗂️ Project Structure

```
chess-fm/
├── README.md                  # This file
├── chess_fm_roadmap.md        # Detailed implementation plan
├── requirements.txt           # All dependencies
├── setup_env.sh               # Environment setup script
│
├── data_generation/           # SFT data generation
│   ├── README.md
│   ├── fetch_elite_data.py    # Fetch FENs from Lichess
│   ├── download_positions.py  # Generate diverse positions
│   ├── convert_to_training.py
│   ├── positions.txt          # 25k elite FENs
│   └── all_sft_data.jsonl     # 185 deduplicated samples
│
├── training/                  # SFT training
│   ├── README.md
│   └── train_sft.py
│
├── rl/                        # Reinforcement learning
│   ├── README.md
│   ├── chess_env.py           # Chess environment
│   ├── rewards.py             # Reward functions
│   ├── train_grpo.py          # GRPO training
│   └── tests/
│
├── benchmarks/                # Evaluation & baselines
│   └── phase0/
│       ├── BASELINE_REPORT.md
│       ├── STRATEGY.md
│       ├── benchmark_models.py
│       └── run_benchmark_mlx.py
│
├── scripts/                   # Utilities
│   ├── audit_tokenizer.py
│   └── download_models.py
│
└── tests/                     # Unit tests
    └── test_data_generation.py
```
## 📄 License

MIT