A mini GPT-style language model trained on Wikipedia content

Mini GPT-Style Language Model

This repository contains an end-to-end implementation of a mini GPT-style language model built from scratch. It covers dataset preprocessing, tokenizer training, model design, and evaluation, showcasing the fundamentals of autoregressive text generation using Transformers.

📌 Project Overview

  • Implemented a causal language model (CLM) based on the Transformer decoder architecture.
  • Trained a custom Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 20,000 (a training sketch follows this list).
  • Dataset: 100,000 cleaned English Wikipedia articles (≈861K training sequences).
  • Frameworks: TensorFlow/Keras, Keras-NLP, Hugging Face Tokenizers.
  • Training environment: Google Colab (for prototyping) + Kaggle GPUs (for full training).
  • Achieved a validation perplexity of ~100 after 2 epochs (~13M-parameter model).
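
As a rough illustration of the tokenizer step, the sketch below trains a 20k-vocabulary BPE tokenizer with the Hugging Face Tokenizers library. The special-token strings, the corpus file name, and the example sentence are assumptions for illustration; the notebook defines its own.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE tokenizer trained from scratch on the cleaned Wikipedia text
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=20_000,
    special_tokens=["[PAD]", "[UNK]", "[BOS]"],  # PAD/UNK/BOS as listed above
)
tokenizer.train(files=["wiki_clean.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("bpe_wiki_20k.json")

# Encoded paragraphs are later packed into fixed-length sequences for CLM training
ids = tokenizer.encode("Albert Einstein was a theoretical physicist.").ids
```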

🚀 Features

  • Data preprocessing: removes wiki markup, cleans text, splits into paragraphs.
  • Tokenizer training: trains BPE tokenizer from scratch (20k vocab, BOS/PAD/UNK tokens).
  • Dataset preparation: converts text to fixed-length token sequences for CLM training.
  • Model architecture (a build sketch follows this list):
    • 4 Transformer decoder layers
    • 256-dim embeddings, 4 attention heads, 1024 feedforward dim
    • ~13.45M parameters
  • Evaluation:
    • Quantitative: loss & perplexity
    • Qualitative: generated text samples
  • Multi-GPU support via tf.distribute.MirroredStrategy.
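
The dimensions above can be assembled into a compact decoder-only model with Keras-NLP; the sketch below also shows the MirroredStrategy scope used for multi-GPU training. The sequence length, optimizer, and variable names are assumptions rather than values taken from the notebook.

```python
import tensorflow as tf
import keras_nlp

VOCAB_SIZE, SEQ_LEN = 20_000, 128   # SEQ_LEN is an assumed sequence length
EMBED_DIM, NUM_HEADS, FF_DIM, NUM_LAYERS = 256, 4, 1024, 4

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inputs = tf.keras.Input(shape=(SEQ_LEN,), dtype="int32")
    x = keras_nlp.layers.TokenAndPositionEmbedding(
        vocabulary_size=VOCAB_SIZE,
        sequence_length=SEQ_LEN,
        embedding_dim=EMBED_DIM,
    )(inputs)
    for _ in range(NUM_LAYERS):
        # With no cross-attention input, TransformerDecoder applies causal masking
        x = keras_nlp.layers.TransformerDecoder(
            intermediate_dim=FF_DIM, num_heads=NUM_HEADS
        )(x)
    outputs = tf.keras.layers.Dense(VOCAB_SIZE)(x)   # next-token logits
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```

With these settings the parameter count works out to roughly 13.4M, dominated by the input embedding and output projection (20,000 × 256 each).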

📂 Repository Structure

├── mini_gpt.ipynb               # Jupyter Notebook with full implementation
├── mini_gpt_documentation.pdf   # Project report with detailed explanation
└── README.md                    # Project description and usage guide

📊 Results

  • Loss (Epoch 2): 4.61
  • Perplexity: 100.76 (the exponential of the loss; see the note after this list)
  • Generated Examples:
    • Prompt: "The future of AI"
      Output: repetitive but syntactically fluent text.
    • Prompt: "Albert Einstein was"
      Output: semantically plausible sentences with named entities.
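
The perplexity above is simply the exponential of the mean token-level cross-entropy loss, so the two numbers are consistent:

```python
import math

val_loss = 4.61            # mean cross-entropy (nats per token) at epoch 2
print(math.exp(val_loss))  # ≈ 100.5, in line with the reported perplexity of 100.76
```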

๐Ÿ” Key Learnings

  • Demonstrated the fundamentals of building GPT-like models from scratch.
  • Showed the trade-offs between model size and compute resources.
  • Highlighted importance of tokenization, dataset quality, and decoding strategies.

๐Ÿ› ๏ธ Setup & Usage

  1. Clone this repo
    git clone https://github.com/meghanaNanuvala/mini-gpt.git
    cd mini-gpt
  2. Open and run the notebook
    jupyter notebook mini_gpt.ipynb
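
Before running the notebook, make sure the frameworks listed in the overview are available. The repository does not pin exact packages or versions, so the following install command is an assumption about what the notebook imports:

    pip install tensorflow keras-nlp tokenizers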
    
    

🔮 Future Enhancements

  • Increase model depth (6โ€“8 layers) and embedding dimensions for richer representations.
  • Train for more epochs to improve coherence and reduce repetition.
  • Use advanced decoding strategies (Top-k / Top-p sampling) instead of greedy decoding (a minimal sampling helper is sketched after this list).
  • Add dropout/normalization layers for better generalization.
  • Experiment with diverse datasets (dialogue, narratives, QA) for more creative generations.
  • Explore relative positional encoding or longer sequence lengths.
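
As a concrete pointer for the decoding item above, a minimal top-k sampling helper might look like the sketch below; the function name, default k, and temperature handling are illustrative rather than taken from the notebook. It would replace the greedy argmax over the model's next-token logits at each generation step.

```python
import tensorflow as tf

def sample_top_k(next_token_logits, k=40, temperature=1.0):
    """Sample one token id from the k most likely candidates (illustrative sketch)."""
    logits = next_token_logits / temperature
    values, indices = tf.math.top_k(logits, k=k)      # keep only the k best logits
    choice = tf.random.categorical(values[None, :], num_samples=1)[0, 0]
    return int(indices[choice])
```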

📚 References

  • Vaswani et al. (2017). Attention is All You Need.
  • Brown et al. (2020). Language Models are Few-Shot Learners.
  • Hugging Face Datasets & Tokenizers
  • Keras-NLP Documentation
  • TensorFlow MirroredStrategy

✨ Author

Meghana Nanuvala

Master's student in Computer Science, Indiana University
Research interests: AI/ML, Generative Models, Agentic AI, Full-Stack Software Engineering