ArmaanSethi/chess-fm

Foundation Model for Chess (of sorts)

ChessFM 🧠♟️

A 1.5B parameter model that plays chess by reasoning, not memorizing.


💡 The Idea

Most chess bots play by brute-force search. ChessFM plays by thinking out loud:

<think>
The opponent's queen threatens my f7 pawn.
If I castle now, I lose material.
Better to block with Nf6 first.
</think>
Nf6

The model explains why it's making a move, like a chess tutor rather than a calculator.
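As a rough illustration of the output format above, the sketch below splits a response into its reasoning and its move. The regular expression and the `parse_response` name are assumptions made for this example, not code from the repository, and the check covers format only, not move legality.

```python
import re

# Hypothetical pattern: one <think>...</think> block followed by a single move token.
RESPONSE_RE = re.compile(
    r"^\s*<think>(?P<reasoning>.*?)</think>\s*(?P<move>\S+)\s*$",
    re.DOTALL,
)


def parse_response(text):
    """Return (reasoning, move) if the text matches the format, else None."""
    match = RESPONSE_RE.match(text)
    if match is None:
        return None
    return match.group("reasoning").strip(), match.group("move")


example = """<think>
The opponent's queen threatens my f7 pawn.
If I castle now, I lose material.
Better to block with Nf6 first.
</think>
Nf6"""

reasoning, move = parse_response(example)
print(move)
```

A response with no `<think>` block, or with extra text after the move, returns `None` and would count as a format failure.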


🎯 Goals

| Metric | Target |
| --- | --- |
| Elo Rating | 1200+ (beat most LLMs) |
| Illegal Move Rate | < 5% |
| Reasoning Format | > 95% valid <think> tags |
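One way to check a run against these three targets is sketched below. The function names and the input shape (one boolean per generated move) are assumptions for illustration; only the thresholds come from the table.

```python
def rate(flags):
    """Fraction of True values in a list of booleans (0.0 for an empty list)."""
    return sum(flags) / len(flags) if flags else 0.0


def check_goals(elo, illegal_flags, format_flags):
    """Compare measured values with the targets from the Goals table.

    illegal_flags: True where the generated move was illegal.
    format_flags:  True where the response had valid <think> tags.
    """
    return {
        "elo": elo >= 1200,
        "illegal_move_rate": rate(illegal_flags) < 0.05,
        "reasoning_format": rate(format_flags) > 0.95,
    }


# Made-up numbers for a 100-move sample: 3 illegal moves, 96 well-formed responses.
report = check_goals(
    elo=1210,
    illegal_flags=[True] * 3 + [False] * 97,
    format_flags=[True] * 96 + [False] * 4,
)
print(report)
```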

Benchmarks

| Model | Elo |
| --- | --- |
| GPT-4o | ~1050 |
| Gemini Pro | ~1050 |
| Claude Sonnet | ~1000 |
| ChessFM (target) | 1200 |
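To put the 150-point gap in context, the standard Elo expected-score formula can be applied to these ratings. The sketch below is a generic illustration; how the repository actually estimates Elo is not stated here, and `performance_rating` is a simple single-opponent inversion chosen for this example.

```python
import math


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B (0..1)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def performance_rating(opponent_rating, score):
    """Rating implied by an average score (strictly between 0 and 1)
    against one opponent of known rating."""
    return opponent_rating + 400.0 * math.log10(score / (1.0 - score))


# A 1200-rated player against a ~1050-rated one.
print(round(expected_score(1200, 1050), 3))
```

Under this formula, a model at the 1200 target would be expected to score roughly 70% against a ~1050-rated opponent.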

🔬 Approach

Phase 1: SFT Bootstrap

Train on reasoning traces to teach the model chess fundamentals and <think> format.

Phase 2: Direct GRPO (Reinforcement Learning)

Train directly on chess games using verifiable rewards (legal/illegal, win/lose).
Stockfish provides the reward signal for curriculum learning.
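The core of GRPO is that several completions are sampled for the same position and each reward is normalized against its own group. The sketch below shows that step with a toy legality reward. The reward values, the `legal_moves` set, and both function names are illustrative assumptions; the contents of the repository's `rewards.py` are not shown here, and the win/lose term is left out for brevity.

```python
from statistics import mean, pstdev


def move_reward(move, legal_moves, format_ok):
    """Toy verifiable reward: -1.0 for a bad format or illegal move, +1.0 otherwise.

    legal_moves is assumed to come from a chess engine or rules library.
    """
    if not format_ok or move not in legal_moves:
        return -1.0
    return 1.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its sampling group.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one position (made-up moves).
legal = {"Nf6", "e5", "d5"}
samples = [("Nf6", True), ("Qh9", True), ("e5", True), ("d5", False)]
rewards = [move_reward(m, legal, ok) for m, ok in samples]
print(group_relative_advantages(rewards))
```

When every completion in a group gets the same reward, all advantages are zero and that group contributes no learning signal, which is one reason a graded opponent schedule helps.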

Phase 3: Curriculum Learning

Progressive difficulty: Random → Stockfish L1 → Stockfish L3
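A minimal sketch of such a schedule is shown below. The promotion rule (advance once the win rate over a sliding window reaches a threshold), the default window size, and the threshold are all assumptions for illustration; the source names only the three stages.

```python
from collections import deque

# Stage names mirror the progression above; the identifiers are hypothetical.
STAGES = ["random", "stockfish_l1", "stockfish_l3"]


class Curriculum:
    """Advance to the next opponent once the recent win rate is high enough."""

    def __init__(self, window=50, promote_at=0.6):
        self.stage_index = 0
        self.window = window
        self.promote_at = promote_at
        self.results = deque(maxlen=window)

    @property
    def opponent(self):
        return STAGES[self.stage_index]

    def record(self, won):
        """Store one game result and return the opponent for the next game."""
        self.results.append(bool(won))
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window >= self.promote_at:
            if self.stage_index < len(STAGES) - 1:
                self.stage_index += 1
                self.results.clear()  # start a fresh window at the new stage
        return self.opponent


curriculum = Curriculum(window=4, promote_at=0.75)
for won in [True, True, True, False]:
    curriculum.record(won)
print(curriculum.opponent)
```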


🛠️ Stack

| Component | Tool | Purpose |
| --- | --- | --- |
| Base Model | Qwen-2.5-3B-Instruct | Best format adherence in benchmarks |
| Training | unsloth | 2x faster, 60% less VRAM |
| Inference | vLLM | Fast game rollouts |
| Chess Engine | Stockfish 16 | Reward signal + validation |
| Hardware | RTX 4090 (RunPod) | 24GB VRAM |

📊 Training Pipeline

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   FEN Position                                              │
│        ↓                                                    │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐   │
│   │  SFT on     │ →   │   GRPO vs   │ →   │   Elo       │   │
│   │  reasoning  │     │  Stockfish  │     │   Eval      │   │
│   │  traces     │     │  curriculum │     │   (500      │   │
│   │  (185 smpl) │     │             │     │   games)    │   │
│   └─────────────┘     └─────────────┘     └─────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

🚀 Bonus Features (After v1)

  • Socratic Structure: Force reasoning into <threat_scan>, <candidates>, <verification> tags
  • Negative Data: Train on mistake-then-correction examples
  • Puzzle Training: Tactical curriculum from Lichess puzzles

💰 Cost Estimate

| Phase | Time | Cost |
| --- | --- | --- |
| Setup & Baseline | 4 hr | $1.80 |
| SFT Training | 4 hr | $1.80 |
| GRPO Training | 20 hr | $9.00 |
| Total | ~28 hr | ~$13 |

Yes, you can train a chess-playing LLM for the price of lunch.
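The totals in the table follow from a flat hourly rate. The sketch below reproduces the arithmetic; the $0.45/hr figure is inferred from the table ($1.80 for 4 hours) rather than quoted from a price list, and actual GPU rental prices vary.

```python
# Hourly rate implied by the table ($1.80 / 4 hr); an inference, not a quoted price.
HOURLY_RATE_USD = 0.45

phase_hours = {
    "Setup & Baseline": 4,
    "SFT Training": 4,
    "GRPO Training": 20,
}

total_hours = sum(phase_hours.values())
total_cost = total_hours * HOURLY_RATE_USD

for phase, hours in phase_hours.items():
    print(f"{phase}: {hours} hr, ${hours * HOURLY_RATE_USD:.2f}")
print(f"Total: {total_hours} hr, ${total_cost:.2f}")
```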


📋 Roadmap

See the full ChessFM Roadmap for detailed implementation steps.


🗂️ Project Structure

chess-fm/
├── README.md                 # This file
├── chess_fm_roadmap.md       # Detailed implementation plan
├── requirements.txt          # All dependencies
├── setup_env.sh              # Environment setup script
│
├── data_generation/          # SFT data generation
│   ├── README.md
│   ├── fetch_elite_data.py   # Fetch FENs from Lichess
│   ├── download_positions.py # Generate diverse positions
│   ├── convert_to_training.py
│   ├── positions.txt         # 25k elite FENs
│   └── all_sft_data.jsonl    # 185 deduplicated samples
│
├── training/                 # SFT training
│   ├── README.md
│   └── train_sft.py
│
├── rl/                       # Reinforcement learning
│   ├── README.md
│   ├── chess_env.py          # Chess environment
│   ├── rewards.py            # Reward functions
│   ├── train_grpo.py         # GRPO training
│   └── tests/
│
├── benchmarks/               # Evaluation & baselines
│   └── phase0/
│       ├── BASELINE_REPORT.md
│       ├── STRATEGY.md
│       ├── benchmark_models.py
│       └── run_benchmark_mlx.py
│
├── scripts/                  # Utilities
│   ├── audit_tokenizer.py
│   └── download_models.py
│
└── tests/                    # Unit tests
    └── test_data_generation.py

📄 License

MIT