
Train Transformers

Key Features

This repository is used to train a chess-playing transformer on UCI (Universal Chess Interface) move sequences. It provides:

  • Automated setup with Makefile
  • Distributed training using PyTorch Lightning
  • Containerized environment for fast deployment via Docker or Apptainer
  • Configuration management via Hydra
  • Training monitoring via Weights & Biases
  • Start/stop/error notifications via Ntfy.sh, plus the ability to interrupt training remotely through the same Ntfy topic

Prerequisites

  1. GNU Make (shortcut command for building/running the container)
  2. Docker Engine (Installation Guide): Follow installation steps for your Linux distribution, or use Docker Desktop for Windows
  3. NVIDIA Container Toolkit (Installation Guide): Follow steps for Installation and Configuration
Move Docker's default data-dir (only if needed)

On my system, I have plenty of free space under /home but very little in Docker's default data directory. If you're in the same situation, run the following commands to make Docker store its data in a different directory.

  1. Shut down the Docker service

    sudo systemctl stop docker docker.socket
    sudo systemctl status docker
  2. Move the data to the new path (if it's not already there) and point Docker at it via daemon.json

    sudo mkdir -p /etc/docker
    sudo rsync -avxP /var/lib/docker/ /home/docker/
    echo '{
      "data-root": "/home/docker"
    }' | sudo tee /etc/docker/daemon.json
  3. Restart the Docker services

    sudo systemctl restart docker
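
Before restarting, you can sanity-check that the daemon.json written above is valid JSON and points at the new data root. This is an optional sketch; it assumes `python3` is available (it is on most Linux distributions) and only prints the configured path:

```shell
# Parse /etc/docker/daemon.json and print the configured data root.
# Prints a note instead if the file has not been written yet.
if [ -f /etc/docker/daemon.json ]; then
  python3 -c 'import json; print(json.load(open("/etc/docker/daemon.json"))["data-root"])'
else
  echo "no daemon.json found"
fi
```

After the restart, `docker info --format '{{ .DockerRootDir }}'` should report the same path.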


Installation

  1. Clone the repository into your desired project directory:

    git clone https://github.com/austinleedavis/train-transformer.git
    cd train-transformer
  2. Build the Docker container with all required dependencies:

    make docker-build

    Once the build completes, you have multiple options for accessing the environment.

Accessing the Development Environment

There are three ways to access the development environment.

Option 1: VS Code DevContainer

If you use VS Code, you can work inside the container with Dev Containers:

  1. Install the Dev Containers extension.
  2. Open the project in VS Code.
  3. Open the command palette (Ctrl+Shift+P / Cmd+Shift+P) and select:
    Dev Containers: Reopen in Container
    

This will start a development session inside the Docker environment.

Option 2: Running Scripts with Docker

You can send non-interactive scripts to be executed inside the Docker container.


Run:

docker run --rm -v "$(pwd)":/workspace "$(basename "$(pwd)"):latest" bash -c "./scripts/train.sh"
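
The image tag in the command above is derived from the current directory name, so the command must be run from the repository root. A small sketch of that derivation (the `train-transformer` name assumes you cloned into the default directory):

```shell
# basename of the working directory becomes the image name;
# run from the repo root this yields "train-transformer:latest"
image="$(basename "$(pwd)"):latest"
echo "$image"
```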

Option 3: Using Apptainer (For Cluster Environments)

For systems that use Apptainer instead of Docker (e.g., managed HPC clusters), it's easiest to push the container image to Docker Hub (requires an account), then pull it to the cluster.


Follow these steps:

  1. Allocate a compute node (if required). Some clusters require an allocation before running GPU workloads:

    salloc --time=1:00:00 --gres=gpu:1

    Once granted, note the assigned node and connect to it:

    ssh <assigned_node>
  2. Load Required Modules. Ensure Apptainer and CUDA are available:

    module load apptainer
    module load cuda/cuda-12.4.0
  3. Pull the container image. After pushing your image to Docker Hub (see above), pull it onto the cluster:

    apptainer pull docker://<your_username>/train-transformer
  4. Run the Container and Check GPU Access

    apptainer run --nv ~/containers/train-transformer_latest.sif

    Once inside the container (you should see an Apptainer> prompt), verify GPU availability:

    Apptainer> nvidia-smi

    If the GPUs are recognized, you're all set!
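
Step 3 above assumes the image has already been pushed to Docker Hub. A minimal sketch of that push, run on the machine where you built the image; `DOCKERHUB_USER` is a placeholder and the `latest` tag is an assumption:

```shell
# Sketch: tag the locally built image under your Docker Hub namespace and
# push it. DOCKERHUB_USER is a placeholder -- set it to your own account.
DOCKERHUB_USER="<your_username>"
if [ "$DOCKERHUB_USER" != "<your_username>" ]; then
  docker tag train-transformer:latest "$DOCKERHUB_USER/train-transformer:latest"
  docker push "$DOCKERHUB_USER/train-transformer:latest"
fi
```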


Environment Variables

You should create a .env file at the project root to store several environment variables. For example:

WANDB_API_KEY=... # alternatively, log in to wandb within the container.
NTFY_TOPIC=<your_topic_here> # the topic to which you will publish/subscribe notifications
HYDRA_CONFIG_PATH=configs # the path to your hydra configurations. (Best practice: use a config folder outside the git repository)
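
The training scripts read these values from the environment, so the .env file has to be loaded before launching a run. A minimal sketch for loading it in a plain shell, assuming simple `KEY=value` lines with no spaces:

```shell
# Export every variable defined in .env into the current shell.
if [ -f .env ]; then
  set -a        # auto-export all variables assigned while this is on
  source .env
  set +a
  echo "notifications will go to: $NTFY_TOPIC"
fi
```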

Project Structure

.
├── configs/
│   └── ...       # configuration templates; check train.yaml first
├── scripts/
│   └── ...       # bash/Slurm scripts to facilitate job execution
├── src/
│   └── ...       # base classes and code for the project
├── Dockerfile
├── Makefile
├── pyproject.toml
├── README.md
└── requirements.txt

Using Ntfy

Motivation: Training is a peculiar thing. Sometimes it goes great, and sometimes it doesn't. Unfortunately, when deploying jobs to a compute cluster, I'm rarely sitting at the terminal monitoring training progress. Instead, my job is typically not considered "high-priority", and I must wait minutes or hours before it starts. When the job does start, I want to know ASAP so I can monitor progress and quickly re-submit a job if it failed. This is why I integrated the NtfyCallback into my training scripts.

If you set the NTFY_TOPIC environment variable, the training script sends a notification to that topic when training starts, finishes, or errors out. (See the examples below.) Additionally, if you use the Weights & Biases logger, the notification includes the URL of the wandb run.

Start, Finish, and Error notifications received on the Android app.


You can also interrupt the training process at any time using Ntfy, as a manual form of early stopping. When training begins, the NtfyCallback spins up a thread and subscribes to your NTFY_TOPIC. To stop a run, send the hh-mm-ss timestamp from the run name (e.g., "16-29-15" in the image above) to your NTFY_TOPIC from your phone or the desktop interface, and the monitoring thread will signal PyTorch Lightning to stop training.
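
Publishing the stop message doesn't require the app: ntfy accepts a plain HTTP POST to the topic URL, so a sketch like the following works from any shell (the "16-29-15" timestamp is the example run above, and NTFY_TOPIC must be set):

```shell
# Send the run's hh-mm-ss timestamp to the topic; the NtfyCallback's
# subscriber thread sees the message and signals Lightning to stop.
if [ -n "$NTFY_TOPIC" ]; then
  curl -d "16-29-15" "https://ntfy.sh/$NTFY_TOPIC"
fi
```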

Notes

  • Configurations: Modify configs/train.yaml to adjust training settings and paths.
  • Logs & Checkpoints: Stored in the outputs/ folder, organized by the date/time of each run.
