
PyTorch implementation of GPT-1

GPT-1

This repository contains a PyTorch implementation of the GPT-1 model introduced
by OpenAI in the paper Improving Language Understanding with Unsupervised
Learning. It includes source code for the model, as well as code for
preprocessing the training data and for the pre-training/fine-tuning process.

Setup

All modules required to run the code are listed in the env.yml file in the
confs directory. You can create the conda environment with:

conda env create -f confs/env.yml

Data Collection

The original BookCorpus dataset used to pretrain GPT is no longer distributed.
However, this repository provides several resources for recreating or
downloading a similar dataset.

Preprocessing

GPT uses byte pair encoding (BPE) for tokenization. The preprocessing directory
contains a script for training a byte pair encoding tokenizer (train_bpe.py)
and another for tokenizing a dataset with the trained tokenizer
(tokenize_dataset.py).

Training the tokenizer

The train_bpe.py script takes an input file containing a list of filepaths to
the text files to train on. I used a randomly selected 10% sample of my
downloaded BookCorpus dataset (about 1700 books). You can create the required
input file using the following command:

find [BookCorpus filepath]/epubtxt -iname "*.txt" | shuf | head -n 1700 > files.txt

Then the tokenizer can be trained as follows:

python -m preprocessing.train_bpe -i files.txt \
                                  -o checkpoints/tokenizer \
                                  -m 40000 \
                                  -n 10

This trains a byte pair encoding tokenizer with 40000 merges, disregarding any
vocabulary words that appear fewer than 10 times in the dataset.
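
For intuition about what the merge count and frequency cutoff control, here is
a toy Python sketch of a BPE training loop. It is an illustration only, not the
actual code in train_bpe.py; the function name and the way the cutoff is
applied are assumptions.

from collections import Counter

def train_bpe_toy(words, num_merges, min_freq):
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    # Drop rare words, mirroring the -n frequency cutoff described above.
    vocab = Counter({w: f for w, f in vocab.items() if f >= min_freq})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]
        merges.append((a, b))
        # Rewrite every word, fusing occurrences of the chosen pair.
        new_vocab = Counter()
        for word, f in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += f
        vocab = new_vocab
    return merges

Each recorded merge becomes a vocabulary entry, so -m 40000 bounds the size of
the learned subword vocabulary.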

Tokenizing the dataset

The tokenize_dataset.py script takes in an input file containing a list of
filepaths to text files to tokenize. You can create the required input file
using the following command:

find [BookCorpus filepath]/epubtxt -iname "*.txt" > files.txt

Next, tokenize the dataset:

python -m preprocessing.tokenize_dataset -c checkpoints/tokenizer \
                                         -i files.txt \
                                         -o data/pretrain/tokenized \
                                         -l 1024 \
                                         -j 8

This command tokenizes the files listed in files.txt and places the tokenized
versions in the tokenized directory. Lines are split into segments of the
length specified by the -l flag. During training, the TokenIDDataset class
returns random sequence-size segments of each line, so be sure to set the line
length to be greater than the sequence size you intend to use in your model
instance.
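
To make the line-length requirement concrete, here is a hedged sketch of how
such a dataset class might behave; the real TokenIDDataset's interface and
details may differ.

import random
import torch
from torch.utils.data import Dataset

class RandomSegmentDataset(Dataset):
    """Toy stand-in for TokenIDDataset: yields a random fixed-size
    window from each pre-tokenized line (assumed behavior)."""

    def __init__(self, lines, seq_size):
        # Each element of `lines` is a list of token IDs of length -l.
        self.lines = lines
        self.seq_size = seq_size

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        line = self.lines[idx]
        # A random window only exists if the line is at least seq_size long;
        # this is why -l must exceed the model's sequence size.
        start = random.randint(0, len(line) - self.seq_size)
        return torch.tensor(line[start:start + self.seq_size], dtype=torch.long)

With a line length of 1024 and a sequence size of 128, every line admits many
distinct windows, which is the point of the guidance above.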

Training

Both pre-training and fine-tuning can be performed with the train.py script.
If a checkpoint directory is specified with the -ch flag, training will
continue from that checkpoint.
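
Resuming typically follows the standard PyTorch pattern. The sketch below is
hypothetical: it assumes a checkpoint dictionary holding model and optimizer
state, which may not match this repository's actual checkpoint layout.

import torch

def load_checkpoint(path, model, optimizer):
    # Hypothetical checkpoint layout: model/optimizer state plus step count.
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state.get("step", 0)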

The parameters I used for pretraining are in the pretrain.yml file in the
confs directory. All model parameters are the same as those in Improving
Language Understanding with Unsupervised Learning, with the exception of
sequence size and batch size: I set the sequence size to 128 rather than 512,
and the batch size to 32 rather than 64, in order to train on a single GPU.
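
For reference, the sketch below summarizes the GPT-1 hyperparameters from the
paper together with the two overrides described above. The key names are
illustrative and need not match those in pretrain.yml.

# GPT-1 hyperparameters from the paper, with this repo's two overrides.
# (Illustrative summary; the actual keys in confs/pretrain.yml may differ.)
pretrain_params = {
    "n_layers": 12,     # transformer decoder blocks
    "n_heads": 12,      # attention heads per block
    "d_model": 768,     # hidden/embedding dimension
    "d_ff": 3072,       # feed-forward inner dimension
    "dropout": 0.1,
    "seq_size": 128,    # paper uses 512; reduced here for a single GPU
    "batch_size": 32,   # paper uses 64; reduced here for a single GPU
}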

Text Generation

Text generation is implemented using top-k sampling and can be performed with
the generate.py script. All generation parameters are located in the
generate.yml file in the confs folder.
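
Top-k sampling keeps only the k highest-scoring next tokens, renormalizes
their probabilities, and samples from that truncated distribution. Below is a
minimal PyTorch sketch of the idea; generate.py's actual implementation and
parameter names may differ.

import torch
import torch.nn.functional as F

def sample_top_k(logits, k, temperature=1.0):
    # logits: (vocab_size,) scores for the next token.
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)
    # Renormalize over the k candidates and sample one of them.
    probs = F.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()

Calling a function like this in a loop, feeding each sampled token back into
the model, produces the generated sequence.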
