Compute Canada
This page serves as internal documentation for setting up the project on one of Compute Canada's clusters.
The following bash script can be saved to `setup.sh` and run on a Compute Canada cluster to set the project up. Once the script has finished running, the project and all of its dependencies will be installed under `$SCRATCH`, and the virtual environment created at `$ENV` will be active.
setup.sh

```bash
#!/bin/bash

PROJECT_NAME="t2t"
ENV="$HOME/$PROJECT_NAME"

module load python/3.7 cuda/10.1
# if on cedar: module load python/3.7 nixpkgs/16.09 intel/2019.3 cuda/10.1

# Create and activate a virtual environment
virtualenv --no-download "$ENV"
source "$ENV/bin/activate"
pip install --no-index --upgrade pip

# Install AllenNLP from source
cd "$SCRATCH"
git clone https://github.com/allenai/allennlp.git
cd allennlp
pip install --editable .
cd ..

# Install the project
git clone https://github.com/JohnGiorgi/t2t.git
cd t2t
pip install --editable .
```
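Note that `setup.sh` activates the virtual environment it creates, and environment changes only persist in your current shell if the script is sourced rather than executed:

```bash
source setup.sh
```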
There is one final step: because the compute nodes are air-gapped, you must download the pre-trained model on a login node. To do this, run the following:
```bash
PROJECT_NAME="t2t"
ENV="$HOME/$PROJECT_NAME"
OUTPUT="$SCRATCH/$PROJECT_NAME"

module load python/3.7 cuda/10.1
# if on cedar: module load python/3.7 nixpkgs/16.09 intel/2019.3 cuda/10.1
source "$ENV/bin/activate"
python
```

```python
>>> from transformers import AutoModel, AutoTokenizer
>>> pretrained_transformer_model_name = "distilroberta-base"
>>> model = AutoModel.from_pretrained(pretrained_transformer_model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(pretrained_transformer_model_name)
>>> exit()
```
This will download and cache the weights of the chosen `pretrained_transformer_model_name`. See here for an up-to-date list of pre-trained checkpoints.
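Equivalently, the download can be done without opening an interactive interpreter (a one-liner sketch, assuming the same `distilroberta-base` checkpoint as above):

```bash
python -c "from transformers import AutoModel, AutoTokenizer; AutoModel.from_pretrained('distilroberta-base'); AutoTokenizer.from_pretrained('distilroberta-base')"
```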
To install NVIDIA's Apex and take advantage of mixed precision training, submit the following script with `sbatch install_apex.sh`. Note that different clusters have different GPUs available, and you choose the GPU type with the `#SBATCH --gres` directive. See here for detailed instructions.
install_apex.sh

```bash
#!/bin/bash
# Requested resources
#SBATCH --mem=2G
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
# Wall time and job details
#SBATCH --time=0:30:00
#SBATCH --job-name=install-apex
# Emails me when job starts, ends or fails
#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=END,FAIL

PROJECT_NAME="t2t"
ENV="$HOME/$PROJECT_NAME"
WORK="$SCRATCH"

module load python/3.7 httpproxy cuda/10.1
# if on cedar: module load python/3.7 httpproxy nixpkgs/16.09 intel/2019.3 cuda/10.1
source "$ENV/bin/activate"

cd "$WORK"
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
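Once the job completes, you can run a quick smoke test from an interactive session to confirm the build (a sketch, assuming the same modules and environment as above; `amp_C` is one of the extensions compiled when `--cuda_ext` is passed, so importing it checks that the CUDA extensions actually built):

```bash
module load python/3.7 cuda/10.1
source "$HOME/t2t/bin/activate"
# Fails with an ImportError if the extensions did not build
python -c "from apex import amp; import amp_C; print('apex OK')"
```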
Once `setup.sh` has been run successfully, the script `train.sh` can be submitted with `sbatch train.sh` to schedule a training job.
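For example, after submitting you can check on the job with Slurm's standard tools:

```bash
sbatch train.sh     # prints something like: Submitted batch job 12345678
squeue -u "$USER"   # list your queued and running jobs
```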
A few things of note:

- All hyperparameters are selected in the Jsonnet config file at `$CONFIG_FILEPATH`. If you want to make quick changes at the command line, you can use `--overrides`, e.g. `--overrides "{'data_loader.batch_size': 16}"`, to modify the config in place.
- All output is saved to `$OUTPUT`.
- Tensorboard logs are written to `$OUTPUT/log`, so you can call `tensorboard --logdir $OUTPUT/log` to view them. Note, however, that the compute nodes are air-gapped, so you will need to copy `$OUTPUT/log` to a login node, or to your local computer, before running `tensorboard` (see the sketch after this list).
- In `train.sh`, there is an example call with `salloc`, which you can use to request the job interactively (for things like debugging). These jobs should be short.
- Different clusters have different GPUs available, and you choose the GPU type with the `#SBATCH --gres` directive (see the example after this list). See here for detailed instructions.
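To view the Tensorboard logs locally, one approach is to copy them down from the cluster first (a sketch; the username, login host, and path are placeholders you will need to adapt):

```bash
# Run on your local machine, not on the cluster
rsync -avz <username>@<cluster>.computecanada.ca:<path-to-$OUTPUT>/log ./t2t-log
tensorboard --logdir ./t2t-log
```

For `--gres`, the GPU type can also be requested explicitly. The type names below are examples only and vary by cluster, so check the cluster documentation first:

```bash
#SBATCH --gres=gpu:1        # one GPU of any available type
#SBATCH --gres=gpu:v100l:1  # one GPU of a specific type, e.g. a V100L on Cedar
```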
train.sh

```bash
#!/bin/bash
# Requested resources
#SBATCH --mem=16G
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
# Wall time and job details
#SBATCH --time=7:00:00
#SBATCH --job-name=t2t-train
# Emails me when job starts, ends or fails
#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=END,FAIL
# Use this command to run the same job interactively
# salloc --mem=16G --cpus-per-task=6 --gres=gpu:1 --time=1:00:00

PROJECT_NAME="t2t"
ENV="$HOME/$PROJECT_NAME"
OUTPUT="$SCRATCH/$PROJECT_NAME"
# Location of the cloned repository (where setup.sh cloned it)
WORK="$SCRATCH/$PROJECT_NAME"

# Path to the AllenNLP config
CONFIG_FILEPATH="$WORK/configs/contrastive.jsonnet"
# Directory to save model, vocabulary and training logs
SERIALIZED_DIR="$OUTPUT/tmp"

# Load the required modules and activate the environment
module load python/3.7 cuda/10.1
source "$ENV/bin/activate"

cd "$WORK"

# Run the job
allennlp train "$CONFIG_FILEPATH" \
  --serialization-dir "$SERIALIZED_DIR" \
  --include-package t2t
```
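After submission, Slurm writes the job's stdout and stderr to `slurm-<jobid>.out` in the directory you submitted from, which is handy for following training progress:

```bash
tail -f slurm-<jobid>.out   # replace <jobid> with the ID printed by sbatch
```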
To evaluate on SentEval, you will first need to install it. Then, in the following script, set `SENTEVAL_DIR` and `SERIALIZED_DIR`, the path to the serialized AllenNLP model you want to evaluate (see the usage sketch after the script). A file, `senteval_results.json`, will be saved to `SERIALIZED_DIR`. This JSON contains detailed scores from each task along with aggregate scores.
run_senteval_allennlp.sh

```bash
#!/bin/bash
# Requested resources
#SBATCH --mem=50G
#SBATCH --cpus-per-task=6
#SBATCH --gres=gpu:1
# Wall time and job details
#SBATCH --time=5:00:00
#SBATCH --job-name=senteval-allennlp
# Emails me when job starts, ends or fails
#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=END,FAIL
# Use this command to run the same job interactively
# salloc --mem=50G --cpus-per-task=6 --gres=gpu:1 --time=1:00:00

PROJECT_NAME="t2t"
ENV="$HOME/$PROJECT_NAME"
OUTPUT="$SCRATCH/$PROJECT_NAME"
WORK="$SCRATCH/$PROJECT_NAME"

# Set path to AllenNLP archive and SentEval
SENTEVAL_DIR="SentEval"
SERIALIZED_DIR=""  # set this to the serialized AllenNLP model you want to evaluate

# Load the required modules and activate the environment
module load python/3.7 cuda/10.1
source "$ENV/bin/activate"

cd "$WORK"

# Run the job
python scripts/run_senteval.py allennlp "$SENTEVAL_DIR" "$SERIALIZED_DIR" \
  --output-filepath "$SERIALIZED_DIR/senteval_results.json" \
  --cuda-device 0 \
  --predictor-name "contrastive" \
  --include-package t2t \
  --verbose
```
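Like the other job scripts, this one is submitted with `sbatch` once `SERIALIZED_DIR` has been filled in. After the job finishes, the results file can be pretty-printed for a quick look (a sketch, assuming the paths above):

```bash
sbatch run_senteval_allennlp.sh
# Once the job has finished:
python -m json.tool "$SERIALIZED_DIR/senteval_results.json" | less
```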