# nanochat-VLM

*A minimal, hackable Vision-Language Model built on Karpathy's nanochat: image understanding and multimodal chat, trained end-to-end for under $200 in compute.*
nanochat-VLM extends the original nanochat idea to multimodality: training a small but complete Vision-Language Model (VLM) with the same philosophy of minimal code, full ownership, and clear learning value (ideally at a similar training cost).

The primary goal of this repo is to train a VLM end-to-end using nanochat-style scripts (tokenizer → LM → multimodal training), not just to demo inference. A small set of notebooks is included as a secondary, exploratory path for understanding and validating VLM components in isolation.

This project is designed for learning, research, and hackability, not scale.
## Goals
- Rebuild the core nanochat LLM pipeline (tokenization → transformer → training).
- Extend it to a Vision-Language Model using a vision encoder + fusion.
- Train the model script-first, nanochat-style (not notebook-first).
- Keep the system transparent, minimal, and easy to modify.
- Enable experimentation on limited compute (single GPU or small server).
## Learning & Training Milestones

### Primary Path (Main Objective)
- **Tokenizer (Text)**
  - Train and evaluate a BPE tokenizer on large-scale text.
  - Cache and reuse tokenizer artifacts across machines.
- **Base Language Model (Text-only)**
  - Train a GPT-style decoder-only transformer from scratch.
  - Validate with loss curves and standard benchmarks.
- **Vision Encoder Integration**
  - Add a pretrained vision encoder (CLIP / ViT).
  - Project visual features into the LM embedding space.
- **Multimodal Training**
  - Fuse image tokens with text tokens.
  - Train the VLM end-to-end (nanochat-style scripts).
- **Multimodal Chat & Evaluation**
  - Enable image + text chat.
  - Evaluate on small VLM datasets (captioning, VQA-style tasks).
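To make the tokenizer milestone concrete, the core of BPE training fits in a few lines of plain Python. This is an illustrative sketch only, not the repo's implementation (the repo trains its tokenizer with the Rust-based `rustbpe` for speed):

```python
# Illustrative BPE training loop: repeatedly merge the most frequent adjacent
# symbol pair. The repo's rustbpe tokenizer does this at scale; this sketch
# only shows the core idea.
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    # start from characters: each word is a tuple of symbols with a frequency
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = train_bpe("low low low lower lowest", num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

A production tokenizer adds byte-level fallback, special tokens, and a fast merge data structure on top of this greedy loop.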
### Secondary Path (Notebooks: Supporting, Not Primary)

A small set of notebooks is provided in `notebooks/` to:
- inspect vision tokenization,
- test projector and fusion logic,
- debug multimodal forward passes,
- and run minimal alignment experiments.
These notebooks do not replace the main training pipeline and are not the primary goal of the project. They exist to support understanding and correctness before scaling training.
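Much of what these notebooks exercise reduces to a small sketch: the projector is a learned linear map, and fusion (in the simplest design) is concatenation along the sequence axis. The NumPy snippet below is illustrative only, with made-up placeholder dimensions; in the actual training pipeline these would be trainable model layers fed by a real vision encoder:

```python
# Sketch of the projector + fusion step: features from a frozen vision encoder
# are linearly projected into the LM embedding space, then concatenated in
# front of the text token embeddings. All dimensions are placeholders.
import numpy as np

rng = np.random.default_rng(0)

d_vision, d_model = 512, 256   # vision feature dim -> LM embedding dim
n_img, n_txt = 16, 8           # image patch tokens, text tokens

W_proj = 0.02 * rng.normal(size=(d_vision, d_model))   # the "projector"
img_feats = rng.normal(size=(n_img, d_vision))         # vision encoder output
txt_embs = rng.normal(size=(n_txt, d_model))           # LM token embeddings

img_tokens = img_feats @ W_proj                         # project into LM space
fused = np.concatenate([img_tokens, txt_embs], axis=0)  # [image ; text] sequence

print(fused.shape)  # (24, 256)
```

Checking shapes like `fused.shape` at each stage is exactly the kind of correctness test the notebooks are meant for before committing GPU hours to training.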
## Folder Structure
```
nanochat-VLM/
├── dev/               # experiments, data prep, and utilities
├── nanochat_vlm/      # core library (tokenizer, dataset, model components)
├── rustbpe/           # Rust-based BPE tokenizer
├── src/               # main training & evaluation scripts (primary path)
│   ├── tok_train.py   # tokenizer training
│   ├── tok_eval.py    # tokenizer evaluation
│   ├── base_train.py  # text-only LM pretraining
│   ├── mid_train.py   # chat & multimodal alignment
│   ├── chat_sft.py    # supervised fine-tuning
│   ├── chat_eval.py   # evaluation
│   └── chat_web.py    # chat UI
├── notebooks/         # secondary, exploratory VLM notebooks
│   ├── vlm_token.ipynb
│   ├── vlm_projector.ipynb
│   ├── vlm_fusion.ipynb
│   └── ...
├── tests/             # small unit tests
├── speedrun.sh        # end-to-end nanochat-style training script
└── README.md          # project overview
```
## Setup
```bash
# create a virtual environment
python -m venv vlm-venv

# activate (Windows PowerShell)
.\vlm-venv\Scripts\Activate.ps1

# activate (Linux / macOS)
source vlm-venv/bin/activate

# install dependencies (requirements.txt will be added soon)
pip install -r requirements.txt
```

## License
MIT License
Copyright (c) 2025 Masoud Jafaripour
Based on work by Andrej Karpathy (nanochat, MIT License)
## Vision
Train a small but real VLM, end-to-end, with code simple enough to understand and modify:
multimodality made teachable, not magical.