This repository is used to train a chess-playing transformer on UCI move sequences. Key features:
- Automated setup with Makefile
- Distributed training using Pytorch Lightning
- Containerized environment for fast-deployment via Docker or Apptainer
- Configuration management via Hydra
- Training monitoring via Weights & Biases
- Start/stop/error notifications via Ntfy.sh, plus the ability to remotely interrupt training through the same Ntfy topic
Prerequisites:
- GNU Make (shortcut commands for building/running the container)
- Docker Engine (Installation Guide): follow the installation steps for your Linux distribution, or use Docker Desktop on Windows
- NVIDIA Container Toolkit (Installation Guide): follow the Installation and Configuration steps
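Once the toolkit is installed, a quick sanity check is to run `nvidia-smi` through Docker; the CUDA image tag below is only an example and can be swapped for any recent `nvidia/cuda` base image:

```bash
# Confirm Docker can see the GPU through the NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```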
Move Docker's default data directory (only if needed)

On my system, I have plenty of free space at `/home` but very little in Docker's default data directory. Run the following commands to make Docker store its data in a different location.
- Shut down the Docker service:

  ```bash
  sudo systemctl stop docker docker.socket
  sudo systemctl status docker
  ```
- Move the data to the new path (if it's not already there):

  ```bash
  sudo mkdir -p /etc/docker
  sudo rsync -avxP /var/lib/docker/ /home/docker/
  echo '{ "data-root": "/home/docker" }' | sudo tee /etc/docker/daemon.json
  ```
- Restart the Docker service:

  ```bash
  sudo systemctl restart docker
  ```
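  To confirm the move worked, you can ask the daemon where its data root now lives; it should print the new path:

  ```bash
  # Should print /home/docker after the restart
  docker info --format '{{ .DockerRootDir }}'
  ```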
- Clone the repository. First, clone the repository into your desired project directory:

  ```bash
  git clone https://github.com/austinleedavis/train-transformer.git
  cd train-transformer
  ```
- Build the Docker container. To set up your development environment, build the Docker container with all required dependencies:

  ```bash
  make docker-build
  ```
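  If you prefer not to use Make, the target is assumed to wrap a standard `docker build`; a rough, hedged equivalent (the actual recipe in the Makefile may differ) is:

  ```bash
  # Manual build roughly equivalent to `make docker-build` (assumption: the image is
  # tagged after the project directory name, matching the `docker run` example below)
  docker build -t "$(basename "$(pwd)"):latest" .
  ```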
Once the build completes, there are three options for accessing the development environment.
If you use VS Code, you can work inside the container with Dev Containers:
- Install the Dev Containers extension.
- Open the project in VS Code.
- Open the command palette (Ctrl+Shift+P / Cmd+Shift+P) and select `Dev Containers: Reopen in Container`.
This will start a development session inside the Docker environment.
You can also execute non-interactive scripts inside the Docker container.
Run:

```bash
docker run --rm -v $(pwd):/workspace $(basename $(pwd)):latest bash -c "./scripts/train.sh"
```
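For GPU training you will typically also want to expose the GPUs and pass your `.env` file (described further below). A hedged variant of the same command, assuming the container expects the repository mounted at `/workspace`:

```bash
# Run the training script with GPU access and the environment variables from .env
docker run --rm --gpus all --env-file .env \
  -v "$(pwd)":/workspace \
  "$(basename "$(pwd)")":latest \
  bash -c "./scripts/train.sh"
```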
For systems that use Apptainer instead of Docker (e.g., managed HPC clusters), it's easiest to push the container image to Docker Hub (requires an account), then pull it to the cluster.
Follow these steps:
- Allocate a compute node (if required). Some clusters require an allocation before running GPU workloads:

  ```bash
  salloc --time=1:00:00 --gres=gpu:1
  ```

  Once granted, note the assigned node and connect to it:

  ```bash
  ssh <assigned_node>
  ```
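  If you lose track of which node was assigned, a standard Slurm query lists your jobs and their nodes (exact output format depends on your cluster's configuration):

  ```bash
  # List your running/pending jobs; the last column shows the assigned node(s)
  squeue -u "$USER"
  ```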
- Load the required modules. Ensure Apptainer and CUDA are available:

  ```bash
  module load apptainer
  module load cuda/cuda-12.4.0
  ```
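  A quick check that the modules took effect (this assumes the CUDA module puts `nvcc` on your PATH, which is typical but not guaranteed):

  ```bash
  # Both commands should print version information if the modules loaded correctly
  apptainer --version
  nvcc --version
  ```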
- Pull the container image. After pushing your image to Docker Hub, pull it onto the cluster:

  ```bash
  apptainer pull docker://<your_username>/train-transformer
  ```
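  For reference, the push from your workstation might look like the sketch below. It assumes the local image was tagged after the project directory by `make docker-build`; `<your_username>` is a placeholder for your Docker Hub account.

  ```bash
  # On your workstation (not the cluster): tag the local image and push it to Docker Hub
  docker tag "$(basename "$(pwd)")":latest <your_username>/train-transformer:latest
  docker login
  docker push <your_username>/train-transformer:latest
  ```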
- Run the container and check GPU access:

  ```bash
  apptainer run --nv ~/containers/train-transformer_Latest.sif
  ```

  Once inside the container (you should see an `Apptainer>` prompt), verify GPU availability:

  ```bash
  Apptainer> nvidia-smi
  ```
If the GPUs are recognized, you're all set!
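From there, a non-interactive run is also possible. The sketch below is hedged: it assumes the repository is cloned on the cluster, that the training entry point is `scripts/train.sh` as elsewhere in this README, and that binding the repo to `/workspace` matches the container's expectations.

```bash
# Execute the training script inside the image with GPU support,
# binding the cloned repository into the container
apptainer exec --nv --bind "$(pwd)":/workspace \
  ~/containers/train-transformer_Latest.sif \
  bash -c "cd /workspace && ./scripts/train.sh"
```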
You should create a `.env` file to store several environment variables. For example:

```bash
WANDB_API_KEY=...               # alternatively, log in to wandb inside the container
NTFY_TOPIC=<your_topic_here>    # the topic to which you will publish/subscribe notifications
HYDRA_CONFIG_PATH=configs       # the path to your Hydra configurations (best practice: keep a config folder outside the git repository)
```
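The Docker example above passes these variables with `--env-file .env`; the project's scripts may also load the file themselves. If you need the variables in your current shell (e.g., on an HPC login node), one common, generic way to export them is:

```bash
# Export every variable defined while sourcing .env into the current shell
set -a
source .env
set +a
```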
The repository is organized as follows:

```
.
├── configs/
|   └── ...            # configuration templates. Check train.yaml first
├── scripts/
|   └── ...            # bash/slurm scripts to facilitate job execution
├── src/
|   └── ...            # base classes and code for the project
├── Dockerfile
├── Makefile
├── pyproject.toml
├── README.md
└── requirements.txt
```
Motivation: Training is a peculiar thing. Sometimes it goes great, and sometimes it doesn't. Unfortunately, when deploying jobs to a compute cluster, I'm rarely sitting at the terminal monitoring training progress. Instead, my job is typically not considered "high-priority", and I must wait minutes or hours before it starts. When the job does start, I want to know ASAP so I can monitor progress and quickly re-submit the job if it fails. This is why I integrated the NtfyCallback into my training scripts.
If you set the `NTFY_TOPIC` environment variable, the training script will send a notification to that topic when training starts, stops, or when an error occurs. (See the examples below.) Additionally, if you use the Weights & Biases logger, the notification will include the URL to the wandb run.
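Ntfy.sh topics are plain HTTP endpoints, so you can verify your topic from any terminal; `<your_topic_here>` below is the placeholder from your `.env` file:

```bash
# Publish a test message to your topic
curl -d "hello from the cluster" "ntfy.sh/<your_topic_here>"

# Subscribe from a terminal and watch notifications arrive as JSON
curl -s "ntfy.sh/<your_topic_here>/json"
```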
You can also interrupt the training process at any time using Ntfy, as a manual form of early stopping.
When training begins, the NtfyCallback spins up a thread and subscribes to your `NTFY_TOPIC`.
To stop a run at any moment, send the `hh-mm-ss` timestamp from the run name (e.g., "16-29-15" in the image above) to your `NTFY_TOPIC` via your phone or the desktop interface, and the monitoring thread will signal to PyTorch Lightning that it should stop the training process.
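From a terminal, sending the stop message is a single publish; how the message is matched against the run name is determined by the NtfyCallback implementation, so the timestamp below is only an example:

```bash
# Send the run's hh-mm-ss timestamp to your topic to request a stop
curl -d "16-29-15" "ntfy.sh/<your_topic_here>"
```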
- Configurations: Modify `configs/train.yaml` to adjust training settings and paths.
- Logs & Checkpoints: Stored in the `outputs/` folder, organized by the date/time of each run.
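Because the project uses Hydra, individual settings can also be overridden on the command line without editing `train.yaml`. The sketch below assumes `scripts/train.sh` forwards its arguments to the Hydra entry point, and the config keys shown are placeholders; check `configs/train.yaml` for the real group and field names:

```bash
# Override selected config values for a single run (keys are illustrative)
./scripts/train.sh trainer.max_epochs=20

# Redirect this run's outputs using a standard Hydra override
./scripts/train.sh hydra.run.dir=outputs/my-experiment
```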