This repository is used to train a chess-playing transformer on UCI move sequences. Key features:
- Automated setup with Makefile
- Distributed training using Pytorch Lightning
- Containerized environment for fast-deployment via Docker or Apptainer
- Configuration management via Hydra
- Training monitoring via Weights & Biases
- Start/stop/error notifications via Ntfy.sh, plus the ability to remotely interrupt training through the same Ntfy topic
Prerequisites:
- GNU Make (shortcut commands for building/running the container)
- Docker Engine (Installation Guide): follow the installation steps for your Linux distribution, or use Docker Desktop on Windows
- NVIDIA Container Toolkit (Installation Guide): follow the Installation and Configuration steps
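Once the toolkit is installed, a quick sanity check is to run `nvidia-smi` through Docker; the CUDA image tag below is only an example and can be swapped for any recent `nvidia/cuda` base image:

```bash
# Confirm Docker can see the GPU through the NVIDIA Container Toolkit
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```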
Move Docker's default data directory (only if needed)

On my system, I have plenty of free space at `/home` but very little in Docker's default data directory. Run the following commands to make Docker store its data in a different location.
- Shut down the Docker service:

  ```bash
  sudo systemctl stop docker docker.socket
  sudo systemctl status docker
  ```
- Move the data to the new path (if it's not already there):

  ```bash
  sudo mkdir -p /etc/docker
  sudo rsync -avxP /var/lib/docker/ /home/docker/
  echo '{ "data-root": "/home/docker" }' | sudo tee /etc/docker/daemon.json
  ```
- Restart the Docker service:

  ```bash
  sudo systemctl restart docker
  ```
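  To confirm the move worked, you can ask the daemon where its data root now lives; it should print the new path:

  ```bash
  # Should print /home/docker after the restart
  docker info --format '{{ .DockerRootDir }}'
  ```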
- Clone the repository. First, clone the repository into your desired project directory:

  ```bash
  git clone https://github.com/austinleedavis/train-transformer.git
  cd train-transformer
  ```
- Build the Docker container. To set up your development environment, build the Docker container with all required dependencies:

  ```bash
  make docker-build
  ```
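  If you prefer not to use Make, the target is assumed to wrap a standard `docker build`; a rough, hedged equivalent (the actual recipe in the Makefile may differ) is:

  ```bash
  # Manual build roughly equivalent to `make docker-build` (assumption: the image is
  # tagged after the project directory name, matching the `docker run` example below)
  docker build -t "$(basename "$(pwd)"):latest" .
  ```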
Once the build completes, there are three options for accessing the development environment.
If you use VS Code, you can work inside the container with Dev Containers:
- Install the Dev Containers extension.
- Open the project in VS Code.
- Open the command palette (Ctrl+Shift+P / Cmd+Shift+P) and select `Dev Containers: Reopen in Container`.
This will start a development session inside the Docker environment.
You can also execute non-interactive scripts inside the Docker container.
Run:

```bash
docker run --rm -v $(pwd):/workspace $(basename $(pwd)):latest bash -c "./scripts/train.sh"
```
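For GPU training you will typically also want to expose the GPUs and pass your `.env` file (described further below). A hedged variant of the same command, assuming the container expects the repository mounted at `/workspace`:

```bash
# Run the training script with GPU access and the environment variables from .env
docker run --rm --gpus all --env-file .env \
  -v "$(pwd)":/workspace \
  "$(basename "$(pwd)")":latest \
  bash -c "./scripts/train.sh"
```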
For systems that use Apptainer instead of Docker (e.g., managed HPC clusters), it's easiest to push the container image to Docker Hub (requires an account), then pull it to the cluster.
Follow these steps:
- Allocate a compute node (if required). Some clusters require an allocation before running GPU workloads:

  ```bash
  salloc --time=1:00:00 --gres=gpu:1
  ```

  Once granted, note the assigned node and connect to it:

  ```bash
  ssh <assigned_node>
  ```
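  If you lose track of which node was assigned, a standard Slurm query lists your jobs and their nodes (exact output format depends on your cluster's configuration):

  ```bash
  # List your running/pending jobs; the last column shows the assigned node(s)
  squeue -u "$USER"
  ```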
- Load the required modules. Ensure Apptainer and CUDA are available:

  ```bash
  module load apptainer
  module load cuda/cuda-12.4.0
  ```
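  A quick check that the modules took effect (this assumes the CUDA module puts `nvcc` on your PATH, which is typical but not guaranteed):

  ```bash
  # Both commands should print version information if the modules loaded correctly
  apptainer --version
  nvcc --version
  ```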
- Pull the container image. After pushing your image to Docker Hub, pull it onto the cluster:

  ```bash
  apptainer pull docker://<your_username>/train-transformer
  ```
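  For reference, the push from your workstation might look like the sketch below. It assumes the local image was tagged after the project directory by `make docker-build`; `<your_username>` is a placeholder for your Docker Hub account.

  ```bash
  # On your workstation (not the cluster): tag the local image and push it to Docker Hub
  docker tag "$(basename "$(pwd)")":latest <your_username>/train-transformer:latest
  docker login
  docker push <your_username>/train-transformer:latest
  ```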
- Run the container and check GPU access:

  ```bash
  apptainer run --nv ~/containers/train-transformer_Latest.sif
  ```

  Once inside the container (you should see an `Apptainer>` prompt), verify GPU availability:

  ```bash
  Apptainer> nvidia-smi
  ```
If the GPUs are recognized, you're all set!
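From there, a non-interactive run is also possible. The sketch below is hedged: it assumes the repository is cloned on the cluster, that the training entry point is `scripts/train.sh` as elsewhere in this README, and that binding the repo to `/workspace` matches the container's expectations.

```bash
# Execute the training script inside the image with GPU support,
# binding the cloned repository into the container
apptainer exec --nv --bind "$(pwd)":/workspace \
  ~/containers/train-transformer_Latest.sif \
  bash -c "cd /workspace && ./scripts/train.sh"
```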
You should create a `.env` file to store several environment variables. For example:

```bash
WANDB_API_KEY=...               # alternatively, log in to wandb inside the container
NTFY_TOPIC=<your_topic_here>    # the topic to which you will publish/subscribe notifications
HYDRA_CONFIG_PATH=configs       # the path to your Hydra configurations (best practice: keep a config folder outside the git repository)
```
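The Docker example above passes these variables with `--env-file .env`; the project's scripts may also load the file themselves. If you need the variables in your current shell (e.g., on an HPC login node), one common, generic way to export them is:

```bash
# Export every variable defined while sourcing .env into the current shell
set -a
source .env
set +a
```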
The repository is organized as follows:

```
.
├── configs/
|   └── ...            # configuration templates. Check train.yaml first
├── scripts/
|   └── ...            # bash/slurm scripts to facilitate job execution
├── src/
|   └── ...            # base classes and code for the project
├── Dockerfile
├── Makefile
├── pyproject.toml
├── README.md
└── requirements.txt
```
Motivation: Training is a peculiar thing. Sometimes it goes great, and sometimes it doesn't. Unfortunately, when deploying jobs to a compute cluster, I'm rarely sitting at the terminal monitoring training progress. Instead, my job is typically not considered "high-priority", and I must wait minutes or hours before it starts. When the job does start, I want to know ASAP so I can monitor progress and quickly re-submit the job if it fails. This is why I integrated the NtfyCallback into my training scripts.
If you set the `NTFY_TOPIC` environment variable, the training script will send a notification to that topic when training starts, stops, or when an error occurs. (See the examples below.) Additionally, if you use the Weights & Biases logger, the notification will include the URL to the wandb run.
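Ntfy.sh topics are plain HTTP endpoints, so you can verify your topic from any terminal; `<your_topic_here>` below is the placeholder from your `.env` file:

```bash
# Publish a test message to your topic
curl -d "hello from the cluster" "ntfy.sh/<your_topic_here>"

# Subscribe from a terminal and watch notifications arrive as JSON
curl -s "ntfy.sh/<your_topic_here>/json"
```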
You can also interrupt the training process at any time using Ntfy, as a manual form of early stopping.
When training begins, the NtfyCallback spins up a thread and subscribes to your `NTFY_TOPIC`.
To stop a run at any moment, send the `hh-mm-ss` timestamp from the run name (e.g., "16-29-15" in the image above) to your `NTFY_TOPIC` via your phone or the desktop interface, and the monitoring thread will signal to PyTorch Lightning that it should stop the training process.
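From a terminal, sending the stop message is a single publish; how the message is matched against the run name is determined by the NtfyCallback implementation, so the timestamp below is only an example:

```bash
# Send the run's hh-mm-ss timestamp to your topic to request a stop
curl -d "16-29-15" "ntfy.sh/<your_topic_here>"
```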
- Configurations: Modify `configs/train.yaml` to adjust training settings and paths.
- Logs & Checkpoints: Stored in the `outputs/` folder, organized by the date/time of each run.
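Because the project uses Hydra, individual settings can also be overridden on the command line without editing `train.yaml`. The sketch below assumes `scripts/train.sh` forwards its arguments to the Hydra entry point, and the config keys shown are placeholders; check `configs/train.yaml` for the real group and field names:

```bash
# Override selected config values for a single run (keys are illustrative)
./scripts/train.sh trainer.max_epochs=20

# Redirect this run's outputs using a standard Hydra override
./scripts/train.sh hydra.run.dir=outputs/my-experiment
```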