This repo is an implementation of Andrej Karpathy's GPT tutorial series, packaged as a Python library. The core architecture and training approach follow Karpathy's excellent educational content, with additional packaging and infrastructure work to make it more maintainable and reusable.
This project takes Karpathy's tutorial code and adds infrastructure to make it more maintainable and testable:
- Package the implementation as a proper Python library with CLI tools
- Add type hints and documentation for better code clarity
- Include unit tests to ensure reliability
- Support various hardware configurations (MPS, older GPUs, Colab, etc)
- Implement efficient data loading and preprocessing for larger datasets
- Maintain the educational value while making it a little easier to use
The core GPT-2 implementation and training methodology remain true to Karpathy's original work.
This project implements a GPT-2 style transformer model, following Karpathy's tutorial. This section contains a high-level overview of the key components and additions.
- Implements the standard GPT-2 architecture with transformer blocks (sketched after this list) containing:
- Multi-head causal self-attention
- Layer normalization
- MLP blocks with GELU activation
- Configurable model sizes matching OpenAI's GPT-2 variants (from S at 124M to XL at 1.5B), plus even smaller XSS variants for testing
- Shared input/output embeddings for parameter efficiency
- Support for loading pretrained GPT-2 weights from HuggingFace
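For reference, a transformer block along these lines can be sketched in PyTorch. This is a simplified illustration using nn.MultiheadAttention, not the library's actual implementation:

import torch
import torch.nn as nn

class Block(nn.Module):
    """One GPT-2 style transformer block: pre-norm causal attention plus pre-norm MLP."""

    def __init__(self, n_embd: int, n_head: int) -> None:
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(approximate="tanh"),  # GPT-2 uses the tanh approximation of GELU
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True marks positions a token is NOT allowed to attend to (the future).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln_2(x))
        return x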
- Efficient data loading with preprocessing and sharding:
- Pre-tokenizes datasets using tiktoken (GPT-2 tokenizer)
- Shards large datasets into manageable chunks
- Supports streaming for large datasets
- Implements gradient accumulation for larger effective batch sizes (see the training-loop sketch after this list)
- Learning rate scheduling with warmup
- Gradient clipping for stable training
- Support for different compute devices (CUDA, MPS, CPU)
- Model compilation for improved performance where available
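For illustration, a training loop combining gradient accumulation, a warmup-plus-cosine learning rate schedule, and gradient clipping might look roughly like this. The hyperparameter defaults and the model(x, y) -> (logits, loss) signature are assumptions for the sketch, not the library's exact code:

import math
import torch

def train(model, train_batches, max_steps=19073, max_lr=6e-4, min_lr=6e-5,
          warmup_steps=715, grad_accum_steps=32, grad_clip=1.0):
    """Illustrative loop: gradient accumulation, warmup + cosine LR decay, gradient clipping."""
    # e.g. 32 micro-batches of 16 x 1024 tokens ~= 0.5M tokens per optimizer step
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)
    for step in range(max_steps):
        optimizer.zero_grad()
        for _ in range(grad_accum_steps):
            x, y = next(train_batches)            # one micro-batch of token ids
            _, loss = model(x, y)                 # assumes the model returns (logits, loss)
            (loss / grad_accum_steps).backward()  # gradients sum across micro-batches
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        # Linear warmup followed by cosine decay down to min_lr.
        if step < warmup_steps:
            lr = max_lr * (step + 1) / warmup_steps
        else:
            ratio = (step - warmup_steps) / (max_steps - warmup_steps)
            lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * ratio))
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.step()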
- Built-in support for:
- TinyShakespeare (for testing and quick experiments)
- FineWebEdu (10B token educational dataset)
- HellaSwag (for model evaluation)
- Extensible dataset loading system using HuggingFace datasets
- Text generation with configurable parameters
- Model evaluation on benchmark datasets
- Support for different sampling strategies (see the top-k sampling example below)
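As an illustration of one sampling strategy, a top-k sampling loop looks roughly like this. The sketch assumes a model that maps token ids to logits of shape (batch, time, vocab); it is not the library's exact code:

import torch
import tiktoken

def sample(model, prompt: str, max_length: int = 30, top_k: int = 50) -> str:
    enc = tiktoken.get_encoding("gpt2")
    tokens = torch.tensor([enc.encode(prompt)])
    while tokens.size(1) < max_length:
        with torch.no_grad():
            logits = model(tokens)[:, -1, :]        # logits for the last position only
        probs = torch.softmax(logits, dim=-1)
        topk_probs, topk_idx = torch.topk(probs, top_k, dim=-1)
        # Sample within the top-k probabilities, then map back to vocabulary ids.
        next_token = topk_idx.gather(-1, torch.multinomial(topk_probs, 1))
        tokens = torch.cat([tokens, next_token], dim=1)
    return enc.decode(tokens[0].tolist())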
My model trained with this library is available on huggingface at allenporter/gpt2
It is recommended to use a virtual environment manager such as uv for installing Python packages. Here is an example of how to install uv and create an environment:
$ sudo snap install astral-uv --classic
$ uv venv --python 3.13
$ source .venv/bin/activate
The following sections give examples of how to set up the package, either for command line use or for local development.
This will install the nano-gpt package from PyPI:
$ uv pip install nano-gpt
These are the steps for local development. First clone the git repo, then install the project dependencies:
$ uv pip install -r requirements_dev.txt
You may come across machines that already have PyTorch installed, such as those from a GPU hosting provider. You can create a virtual environment like this to preserve the system packages:
$ python3 -m venv venv --system-site-packages
$ source venv/bin/activate
$ pip install -r requirements_dev.txt
Another example: when using a Jetson Orin with the PyTorch container dustynv/pytorch:2.1-r36.2.0, you can set up the environment with these commands:
$ apt install python3.10-venv
$ python3.10 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements_dev.txt
$ pip install "numpy<2"
$ pip install /opt/torch-2.1.0-cp310-cp310-linux_aarch64.whl
Training on this device would take about 8 days, by the way, so I'd only use it for inference or local testing.
Verify that you have the accelerator you expect:
$ python3
>>> import torch
>>> torch.cuda.is_available()
True
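On Apple Silicon, you can check for the MPS backend instead:
$ python3
>>> import torch
>>> torch.backends.mps.is_available()
True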
This example will download the pretrained gpt2 and sample from it with the given prefix:
$ nano-gpt sample --pretrained=gpt2 --seed=42 --max-length=30 "Hello, I'm a language model,"
> Hello, I'm a language model, which means I'm familiar with it, but I'm not fluent in that. Well, with that said,
> Hello, I'm a language model, and the syntax, to make use of it, is pretty good. So why do you have that and not
> Hello, I'm a language model, I'm doing this work in Python, and then I'm writing code for Haskell.
So we can
> Hello, I'm a language model, and you're making assumptions about my use of them. I'm not a natural language learner. I'm
> Hello, I'm a language model, well, I'm from Java and have to write a programming language for it. I have my own vocabulary because
This example will evaluate hellaswag against the pretrained gpt2:
$ nano-gpt --debug eval --pretrained=gpt2 --validation-steps=0
DEBUG:nano_gpt.hellaswag_eval:hellaswag: hellaswag: accuracy: 1.0000 | total: 1 | correct: 1
DEBUG:nano_gpt.hellaswag_eval:hellaswag: hellaswag: accuracy: 0.5000 | total: 2 | correct: 1
DEBUG:nano_gpt.hellaswag_eval:hellaswag: hellaswag: accuracy: 0.6667 | total: 3 | correct: 2
DEBUG:nano_gpt.hellaswag_eval:hellaswag: hellaswag: accuracy: 0.5000 | total: 4 | correct: 2
DEBUG:nano_gpt.hellaswag_eval:hellaswag: hellaswag: accuracy: 0.4000 | total: 5 | correct: 2
DEBUG:nano_gpt.hellaswag_eval:hellaswag: hellaswag: accuracy: 0.3333 | total: 6 | correct: 2
DEBUG:nano_gpt.hellaswag_eval:hellaswag: hellaswag: accuracy: 0.2857 | total: 7 | correct: 2
DEBUG:nano_gpt.hellaswag_eval:hellaswag: hellaswag: accuracy: 0.2500 | total: 8 | correct: 2
DEBUG:nano_gpt.hellaswag_eval:hellaswag: hellaswag: accuracy: 0.2222 | total: 9 | correct: 2
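For reference, HellaSwag evaluation typically works by scoring each of the four candidate endings with the model and picking the one with the lowest average loss on the ending tokens, accumulating a running accuracy as in the log above. A rough sketch of that idea, assuming a model that maps token ids to logits (not the library's exact code):

import torch
import torch.nn.functional as F

def pick_ending(model, enc, ctx: str, endings: list[str]) -> int:
    """Return the index of the ending the model finds most likely (lowest mean loss)."""
    losses = []
    for ending in endings:
        ctx_tokens = enc.encode(ctx)
        end_tokens = enc.encode(" " + ending)
        tokens = torch.tensor([ctx_tokens + end_tokens])
        with torch.no_grad():
            logits = model(tokens)                   # assumes (B, T, vocab_size) logits
        # Score only the ending tokens: token t is predicted by the logits at position t-1.
        end_logits = logits[0, len(ctx_tokens) - 1 : -1]
        targets = torch.tensor(end_tokens)
        losses.append(F.cross_entropy(end_logits, targets).item())
    return int(torch.tensor(losses).argmin())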
This will download the HuggingFace dataset into your ~/.cache directory as 13 shard files of about 2.15GB each, totaling around 27GB of disk for the entire dataset. It will then tokenize the dataset to prepare it for training, storing the result in ./dataset_cache. This creates shard files with 100 million tokens per shard, about 100 files for 10B total tokens. Each shard file is about 100MB, so the total token cache is about 10GB on disk. We don't load the entire dataset into RAM, but re-read each shard in each worker as we iterate through the training dataset.
$ DATASET=finewebedu
$ nano-gpt prepare_dataset --dataset=${DATASET} --splits=train,validation
The tokenization step takes about 15 seconds per shard on a beefy Lambda Labs machine, so about 25 minutes in total. The process currently appears to be I/O bound reading and writing the shards, so there is room for future improvement there.
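For reference, the tokenize-and-shard step described above looks roughly like this. This is an illustrative sketch; the dataset id, file names, and shard layout are assumptions based on the description, not the library's exact code:

import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
eot = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]   # document delimiter
shard_size = 100_000_000   # ~100M tokens per shard, per the description above
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True)

buf: list[int] = []
shard_index = 0
for doc in ds:
    buf.append(eot)
    buf.extend(enc.encode_ordinary(doc["text"]))
    while len(buf) >= shard_size:
        # GPT-2 token ids (vocab 50257) fit in uint16, keeping shards compact on disk.
        shard = np.array(buf[:shard_size], dtype=np.uint16)
        np.save(f"dataset_cache/finewebedu_train_{shard_index:06d}.npy", shard)
        buf = buf[shard_size:]
        shard_index += 1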
Checkpoints will be saved in checkpoints/ by default, every 5k steps.
This will train a new GPT-2 124M parameter model on 10B tokens, using ~0.5M tokens per optimizer step (with gradient accumulation if needed).
$ nano-gpt train --dataset=finewebedu --device=cuda --sequence-length=1024 --micro-batch-size=16
This is the recipe used for a real training run, with an increased micro batch size that fits on a beefy machine across 8 GPUs:
$ torchrun --standalone --nproc_per_node=8 `which nano-gpt` train --dataset=${DATASET} --micro-batch-size=32 --hellaswag_samples=250
The run takes about 390ms per step. 10B tokens / 500k tokens per step = 20k steps, and 20k steps * 390ms ≈ 2.2 hours.
You can examine the training log results using the notebook script/logs_analysis.ipynb.
Note that in this training run, HellaSwag was not evaluated on the entire dataset, so the results are higher than expected. I recommend adjusting the HellaSwag evaluation steps to cover the entire dataset for future runs.
This is an example of sampling from the trained model checkpoint after 10B tokens.
$ nano-gpt sample --checkpoint=./checkpoint_019072.bin --device=mps
> Hello, I'm a language model, you're doing your application, I've put your main program and you want to model. Here are some things
> Hello, I'm a language model, so let's have a look at a few very old and popular dialects with some basic information about some of
> Hello, I'm a language model, but I also use a number of core vocabulary from the Python language and some data structures from
the web to
> Hello, I'm a language model, so this is about building a language to help my students to express themselves in all possible situations when they are in
> Hello, I'm a language model, who wrote my first 'hello' and never used it, but my first 'hello' can't be in
You can export a safetensors model from the PyTorch checkpoint. This will create a model.safetensors file and a config.json in the export directory:
$ nano-gpt export --checkpoint ./checkpoint_019072.bin --export-dir export --device=cpu
$ jq '.n_ctx' export/config.json
1024
This export is written in the same format as the OpenAI GPT-2 export and can be used with the --pretrained command line flag:
$ nano-gpt sample --pretrained=./export
> Hello, I'm a language model, you're doing your application, I've put your main program and you want to model. Here are some things
> Hello, I'm a language model, so let's have a look at a few very old and popular dialects with some basic information about some of
> Hello, I'm a language model, but I also use a number of core vocabulary from the Python language and some data structures from
the web to
> Hello, I'm a language model, so this is about building a language to help my students to express themselves in all possible situations when they are in
> Hello, I'm a language model, who wrote my first 'hello' and never used it, but my first 'hello' can't be in
You can upload your exported model to huggingface and use it like any other pretrained model.
$ huggingface-cli login
$ HUGGINGFACE_USER=`huggingface-cli whoami`
$ echo $HUGGINGFACE_USER
allenporter
$ huggingface-cli upload ${HUGGINGFACE_USER}/gpt2 export/ .
This will allow you to load the model from the huggingface repo:
$ nano-gpt sample --pretrained=allenporter/gpt2
> Hello, I'm a language model, you're doing your application, I've put your main program and you want to model. Here are some things
> Hello, I'm a language model, so let's have a look at a few very old and popular dialects with some basic information about some of
> Hello, I'm a language model, but I also use a number of core vocabulary from the Python language and some data structures from
the web to
> Hello, I'm a language model, so this is about building a language to help my students to express themselves in all possible situations when they are in
> Hello, I'm a language model, who wrote my first 'hello' and never used it, but my first 'hello' can't be in
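Since the export follows the GPT-2 format, the uploaded model may also load directly with the HuggingFace transformers library. This is an assumption based on the format compatibility, not something the project guarantees:

import torch
from transformers import AutoModelForCausalLM, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # the export may not include tokenizer files
model = AutoModelForCausalLM.from_pretrained("allenporter/gpt2")
inputs = tokenizer("Hello, I'm a language model,", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_length=30, do_sample=True, top_k=50)
print(tokenizer.decode(output[0]))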
This project is managed with scruft.