neemiasbsilva/developing-nanoGPT2-fineweb
Developing a custom nano GPT-2 from scratch using PyTorch, trained on the FineWeb dataset.
Table of contents
About
Developing a custom nano GPT-2 from scratch using PyTorch, trained on the EduFineWeb dataset. This repository is based on reproducing the OpenAI GPT-2 paper, using the training hyper-parameters from the OpenAI GPT-3 paper. The dataset used was FineWeb 🍷 (the smallest version, around 10B GPT-2 tokens).
Example of the dataset used for the training and evaluation phases. For more details about the dataset, you can visit the HuggingFace FineWeb page.
Note: these experiments were based on Andrej Karpathy's work, nanoGPT.
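For reference, the GPT-2 (124M) architecture and the GPT-3 training hyper-parameters mentioned above can be summarized in a small config object. The numeric values below come from the published GPT-2 and GPT-3 papers (the 125M-parameter row of the latter); the class and field names are illustrative sketches, not this repository's actual `src/configs` API:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # GPT-2 (124M) architecture, as described in the GPT-2 paper
    block_size: int = 1024    # maximum sequence length
    vocab_size: int = 50257   # GPT-2 BPE vocabulary size
    n_layer: int = 12         # number of transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # embedding dimension

@dataclass
class TrainConfig:
    # Training hyper-parameters from the GPT-3 paper (125M model row)
    max_lr: float = 6e-4             # peak learning rate
    betas: tuple = (0.9, 0.95)       # AdamW betas
    weight_decay: float = 0.1
    batch_tokens: int = 524288       # ~0.5M tokens per optimization step

cfg = GPTConfig()
print(cfg.n_embd // cfg.n_head)  # per-head dimension: 64
```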
Project Organization
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         e.g. `1.0-nbs-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for custom_nanogpt2_fineweb
│                         and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── src                <- Source code for use in this project.
    │
    ├── __init__.py    <- Makes src a Python module
    │
    ├── data           <- Scripts to manage the data
    │   └── manager_data.py
    │
    ├── configs        <- Configs for the data, training, and the GPT model
    │   ├── setup.py
    │   └── config.yaml
    │
    ├── model          <- Scripts to build the GPT-2 model
    │   ├── transformer_blocks.py
    │   └── gpt2_model.py
    │
    ├── train.py       <- Script to train the GPT-2 model
    └── generate.py    <- Script to generate answers from the custom trained GPT-2 model
Train Resources
Training was conducted on a system with four NVIDIA GeForce RTX 3090 GPUs, an Intel® Core™ i7-10700 CPU running at 2.90 GHz, and 130 GB of RAM.
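With four GPUs, a run like this is typically launched with `torchrun`, which sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for each process. The helper below is a stdlib-only sketch of how a training script can detect that launch mode; the function name is hypothetical and the repository's actual `train.py` may organize this differently:

```python
import os

def detect_ddp():
    """Detect a torchrun-style distributed launch from environment variables.

    Returns (is_ddp, rank, local_rank, world_size). Under a launch such as
    `torchrun --nproc_per_node=4 src/train.py`, each process would then call
    torch.distributed.init_process_group(backend="nccl") and pin itself to
    device f"cuda:{local_rank}".
    """
    is_ddp = "RANK" in os.environ and "WORLD_SIZE" in os.environ
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return is_ddp, rank, local_rank, world_size

is_ddp, rank, local_rank, world_size = detect_ddp()
# Only rank 0 should log metrics and write checkpoints.
master_process = rank == 0
```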
Usage for text generation
Clone the repository and create a conda environment:
conda env create --name envname --file=environments.yml
Download the model file available at the link below and place it in the models/ directory (download this model checkpoint).
After that, open the file config_inf.yaml (src/config/config_inference.yaml) and set the message you want (e.g. message: "Hello GPT, can you explain what is machine learning?")
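A minimal sketch of what that inference config might contain. Only the `message` key is documented above; the remaining keys are hypothetical placeholders:

```yaml
# Illustrative sketch of src/config/config_inference.yaml.
# Only `message` is documented in this README; the other keys are hypothetical.
message: "Hello GPT, can you explain what is machine learning?"
checkpoint_path: "models/"   # hypothetical: where the downloaded checkpoint lives
max_new_tokens: 128          # hypothetical sampling settings
temperature: 1.0
```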
And finally, to run the inference (no GPU required), just type this command:
python generate.py
The generated text will be stored in reports/generation.json:
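For illustration, that file can be read back with the standard json module. The keys shown ("message" and "generation") are assumptions, since the actual schema of generation.json is not shown in this README:

```python
import json
from pathlib import Path

# Hypothetical schema: the real reports/generation.json may use other keys.
path = Path("reports/generation.json")
if path.exists():
    record = json.loads(path.read_text())
    print(record.get("message"))     # the prompt taken from the config
    print(record.get("generation"))  # the model's generated continuation
```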
