HW 5 - Large Language Model

Team Members -

  1. Ankit Shibusam - ashibusa@andrew.cmu.edu
  2. Atharva Anand Joshi - atharvaa@andrew.cmu.edu
  3. Ketan Ramaneti - kramanet@andrew.cmu.edu

Start Training

  1. Install the required dependencies on your local machine:
pip install torch numpy transformers datasets tiktoken wandb tqdm pytorch-ignite
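Before launching a long run, it can be worth sanity-checking the environment. The snippet below is purely illustrative and not part of the repository; it just confirms that PyTorch, CUDA, and tiktoken import and work.

import torch
import tiktoken

print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
enc = tiktoken.get_encoding("gpt2")   # GPT-2 BPE encoding; prepare.py may use a different one
print("tokenized length:", len(enc.encode("hello world")))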
  2. Generate the train and val data by running data/openwebtext/prepare.py. This script fetches the OpenWebText dataset, performs a train-val split, and then tokenizes the text at the sub-word level using tiktoken. Finally, it saves the processed train and val data in the data/ folder.
$ python3 data/openwebtext/prepare.py
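The exact on-disk format is whatever prepare.py writes; in nanoGPT-style pipelines the token IDs typically end up in train.bin / val.bin as uint16 arrays that are later read back through a memmap. The sketch below is an illustration of how such files are usually consumed, not code from this repository; the file names, dtype, and sizes are assumptions and should be checked against the actual script.

import numpy as np
import torch

block_size = 1024   # context length; illustrative value
batch_size = 8      # illustrative value

# Assumed output path and dtype; verify against data/openwebtext/prepare.py.
data = np.memmap('data/openwebtext/train.bin', dtype=np.uint16, mode='r')

def get_batch():
    # Sample random windows and build (input, target) pairs shifted by one token.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y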
  3. Run the pretraining by simply running the train script. The training configuration is set in the config.py file.
python3 train.py
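The variable names below are assumptions meant only to illustrate the kind of knobs a nanoGPT-style config.py usually exposes; check the actual file in the repository for the real names and defaults.

# Illustrative config values (names are assumptions, not the repo's actual fields)
out_dir = 'out'                        # where checkpoints get written
batch_size = 12                        # sequences per optimization step
block_size = 1024                      # context length in tokens
n_layer, n_head, n_embd = 12, 12, 768  # model size
learning_rate = 6e-4                   # peak learning rate
max_iters = 600000                     # total optimization steps
wandb_log = False                      # enable Weights & Biases logging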
  4. For the fine-tuning tasks, set up the data by running the commands below:
python data/cnn_dailymail/prepare.py

python data/squad/prepare.py
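Both prepare scripts presumably pull their corpora through the Hugging Face datasets library (it is in the dependency list); the snippet below shows roughly what that looks like and which fields the two datasets expose. It is illustrative only and not the repository's actual preprocessing code.

from datasets import load_dataset

cnn = load_dataset("cnn_dailymail", "3.0.0")   # summarization: "article" -> "highlights"
squad = load_dataset("squad")                  # QA: "context" + "question" -> "answers"
print(cnn["train"][0]["article"][:200])
print(squad["train"][0]["question"])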
  5. Set the required file names and config variables in finetune_config.py and config.py. Most fields can be left untouched, but the file paths will need to point to your data and checkpoints.

  6. The trained model checkpoints can be downloaded from this directory - https://drive.google.com/drive/folders/13nobcjJdx2svWk4mJ8Xj_gO3p9V9I4AZ?usp=sharing
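To use a downloaded checkpoint, load it with torch.load and restore the weights into the model. The exact dictionary layout depends on how train.py saves checkpoints, so the key names below are assumptions; inspect the file first.

import torch

ckpt = torch.load("ckpt.pt", map_location="cpu")   # path to the downloaded checkpoint
# Assumed layout: either a bare state_dict or a dict holding one under a "model" key.
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
# model = GPT(config)              # hypothetical model class; instantiate with the matching config
# model.load_state_dict(state_dict)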

TODO

  1. Set up support for distributed training.

  2. Write code for sequential unfreezing (see the sketch below).
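For the second TODO item, a minimal sketch of what sequential (gradual) unfreezing usually looks like in PyTorch. Nothing below exists in the repository yet; layer_groups would typically be the transformer blocks ordered from the output side toward the input side.

import torch.nn as nn

def sequentially_unfreeze(model: nn.Module, layer_groups, stage: int):
    # Freeze everything, then re-enable gradients for one more group per fine-tuning stage.
    for p in model.parameters():
        p.requires_grad = False
    for group in layer_groups[: stage + 1]:
        for p in group.parameters():
            p.requires_grad = True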

References

NanoGPT - https://github.com/karpathy/nanoGPT/tree/master. This repository was used as a reference for building the LLM.

About

Train a small LLM on the OpenWebText dataset.
