Welcome to PACE! PACE stands for Partnership for an Advanced Computing Environment. PACE is one of the HPC clusters available for GT researchers and students to use for intensive computational tasks. The official PACE website can be found here. This is meant to be a living document. If you would like to contribute to this document, please make a PR!
For a student to use PACE, they need a GT PI to allow them access to their free account. Instructions on the PACE website can be found here.
You have to do the following steps:
- If the PI does not have a PACE account set up, then the PI needs to contact pace-support@oit.gatech.edu to request a free tier account on the Phoenix cluster. This would be useful for the Dolphin Project if they want to use Scott's account.
- If the PI already has a PACE account (like Thad), then you need to email pace-support@oit.gatech.edu with the following information:
  - User’s GT Account name (e.g. gburdell3)
  - User’s name
  - Authorizing person’s (“PI”) GT Account name
  - PI’s name
  - PI’s direct approval for the account
Access to PACE is only done via ssh. You need to have the GT VPN Client installed to access PACE whether or not you are on eduroam. Instructions on installing the GT VPN Client are provided within the PACE documentation.
The command to ssh into PACE is:
ssh <gtusername>@login-phoenix-slurm.pace.gatech.edu
An example with the username gburdell3 looks like:
ssh gburdell3@login-phoenix-slurm.pace.gatech.edu
The PACE documentation provides additional information for this step. Remember that you need to follow the instructions for the Phoenix Cluster, not the Hive Cluster.
You can also use your favorite IDE to access PACE via ssh. VS Code and PyCharm both support remote development over ssh.
Changing the Anaconda symlink is necessary to make sure that you can use Anaconda. Many Anaconda environments take up multiple GBs of space, which makes it impractical to keep them in your personal storage on PACE, since that is restricted to only 10 GB. To fix this, we have a centralized conda environment storage under the project storage, which has a 1 TB capacity. You can point your ~/.conda at it using the following command:
ln -s /storage/coda1/p-ts133/0/shared/.conda ~/.conda
Multiple people in your project will be using PACE. It becomes important that these collaborators can view the same codebase. If you were to save your code in your individual folder on PACE, your teammates could not access that code. To fix this, we make the default directory the shared directory, which anyone in your PI's team can access:
rm ~/p-ts133-0 # Remove the old symlink
ln -s /storage/coda1/p-ts133/0/shared ~/p-ts133-0
These folder names will change if your PI is not Thad.
The PACE documentation covers a lot about how to use PACE and get familiar with the modules available on it. Here are some important links to specific parts of the PACE documentation:
- Phoenix Cluster Documentation
- Example Jobs with SLURM
- PACE Software Stack
- Open OnDemand for Interactive Jobs
- PACE Computational Resources
Important notes on PACE documentation:
- Many parts of the documentation reference PBS scripts. These are deprecated since SLURM is now being used. Ignore these PBS scripts.
- PACE documentation is behind on how to set up reproducible environments. I sent PACE staff my documentation on setting up environments, which they still haven't incorporated. The environment setup section in this document is all the documentation that I know of for how to set up reproducible environments on PACE.
- We are only using the Phoenix Cluster. Information about the Hive, ICE, or Firebird Cluster should be ignored.
Each user will always have at least one account, which is the free tier of your professor. However, it's possible for you to have more than one account if you work under multiple PIs or a professor has a paid account in PACE. You can find this information using the pace-quota command:
Gathering storage and job accounting information for user: gdeshpande9
** Please note that the information and display format of this tool **
** is subject to change and should *not* be used for scripting. **
================================================================================================================
Welcome to the Phoenix Cluster!
================================================================================================================
* Your Name (as PACE knows it) : Gururaj Mahendra Deshpande
* UserID : 1019709
* Username : gdeshpande9
* Your Email (for PACE contact) : gurudesh@gatech.edu
================================================================================================================
Phoenix Storage with Individual User Quota
================================================================================================================
Filesystem                               Usage (GB)    Limit      %  File Count    Limit      %
Home:/storage/home/hcoda1/9/gdeshpande9         4.6     10.0  45.7%       42625  1000000   4.3%
Scratch:/storage/scratch1/9/gdeshpande9      2913.8  15360.0  19.0%      864256  1000000  86.4%
================================================================================================================
Phoenix Storage with Research Group Quota
================================================================================================================
Filesystem                               Usage (GB)    Limit      %  File Count    Limit      %
/storage/coda1/p-ts133/0                      293.2   1024.0  28.6%     1624115        0   0.0%
================================================================================================================
Job Charge Account Balances
================================================================================================================
Name            Balance  Reserved  Available
gts-ts133         52.95      0.00      52.95
gts-ts133-paid  9795.27      0.00    9795.27
On the last two lines, you can see that I have two accounts: gts-ts133 and gts-ts133-paid. gts-ts133 is the free tier account. It is important to know which account to use since you have to specify the correct account when you schedule jobs.
Each month, regardless of the amount remaining in the free-tier account, the balance for the free-tier is set back to $68.00.
You might be asking what the "Reserved" column is for. The Reserved column shows how much money has been claimed for a job in progress. If you want to see how much money your job claimed without doing the math, you can submit the job, then immediately run pace-quota, see how much was reserved, and cancel your job if needed. More information on jobs is discussed in the Running Jobs on PACE section.
Storage on PACE is a bit complicated. There are three types of storage on PACE, each serving a different purpose.
You can always find how much storage each of these buckets is holding using the pace-quota command.
The personal storage is storage that you as the user can claim to your name. It's pretty small at 10 GB, and can get filled very quickly. It is for this reason that I don't use the personal storage at all.
The project storage is storage that is reserved for your PI and your PI's group. The limit is 1 TB, which might sound like a lot, but really isn't. I recommend that you use the project space to store code and some of the intermediate result files from jobs.
The scratch storage is probably the most interesting of all the storage options. It can hold 15 TB of data. However, there are two catches. First, scratch storage is limited to you as the user. This means that if you have multiple collaborators, then each collaborator would have to store the same data in their own scratch storage. Second, data in the scratch storage is wiped after 60 days. It is important that you have all the important data in your scratch directory backed up externally. For PopSign, we use Ebisu to store all of our data.
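Since scratch data is wiped after 60 days, it can help to list files that are getting close to that limit before they disappear. The following is a minimal sketch, not a PACE tool: it assumes the wipe is based on file modification time (an assumption; check the PACE documentation for the actual policy), and the function name find_stale_files is made up for illustration.

```python
import os
import time


def find_stale_files(root, max_age_days, now=None):
    """Return paths under root whose modification time is older than max_age_days.

    Assumption: a file's age is judged by its modification time (mtime).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    stale.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(stale)
```

For example, calling find_stale_files on your scratch directory with max_age_days=50 would list the files that have not been modified in the last 50 days, which are the ones to back up first.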
PACE recommends that you use Globus to transfer files to and from PACE. I have tried this before, but found it very buggy. I personally recommend using scp or rsync to transfer files to and from PACE.
In general, the structure for using rsync is as such:
rsync -a --progress <SOURCE COMPUTER>:<SOURCE FILES/FOLDER> <DESTINATION COMPUTER>:<DESTINATION FOLDER>
rsync cannot copy between two remote hosts, so run the command on one of the two computers and leave off the <COMPUTER>: prefix for that side.
Here's an example command, run on Ebisu (an external computer in Thad's lab), to transfer a file from Ebisu to PACE:
rsync -a --progress /home/gdeshpande9/hello.txt gdeshpande9@login-phoenix-slurm.pace.gatech.edu:/storage/home/hcoda1/9/gdeshpande9/scratch
This transfers hello.txt from Ebisu to PACE.
Using scp is very similar. I won't cover scp since rsync was already covered. If interested, you can find more information on scp here.
You cannot use conda directly once you log into PACE. You need to load the conda module first. The same process applies to any other module that PACE offers. You can reference the PACE documentation to see all the offered software.
To load conda, run the command module load anaconda3. You should see that the base environment is now active. You must create your own environment; you cannot use the base environment since it is not writable. From here, all the regular conda commands to create, activate, and install packages into environments will work as standard. To avoid unexpected issues, please make sure to follow the intro steps.
PACE doesn't have every module in the world available to use. An example of this is HTK. This means that we have to provide a container that includes this software and give it to PACE so we can run experiments with our environment but still leverage the Phoenix Cluster. This process is straightforward, but requires back-and-forth with PACE Support and is a multi-month process. If you have specific requirements for an environment, first check whether the existing environments that I put on PACE satisfy your requirements. More information about these existing containers can be found here.
If you really need to create a new environment, here are the steps that you would need to follow:
- Create a Docker image with your requirements
- Publish this Docker image to Docker Hub
- Send an email to PACE OIT Support requesting a new container. The email should contain the following:
  - Your Name
  - PI's Name
  - Link to Docker image on Docker Hub
  - Link/upload of your Dockerfile used to create the image
  - Legitimate reasons why you need this container and why PACE cannot natively reproduce this environment with their modules. If this reason cannot be provided, your request will be rejected.
- Ask for the container to be placed in the project storage.
PACE will create an apptainer container. Apptainer is very similar to Docker, but it retains user permissions. Apptainer can also be interchanged with singularity (that was the original name).
For more information, please contact me for specifics.
This is the most important section of this entire document. We'll talk about how to run jobs and get computation from PACE. There are two ways to schedule resources from PACE: interactive and non-interactive jobs. Interactive jobs are the easier of the two.
Interactive jobs are the easiest way to get started with PACE.
PACE has put a lot of effort into adding Open OnDemand to their platform, which provides a nice UI that handles everything neatly. Interactive jobs handle most kinds of work well, except for job arrays and sometimes environment setup with Apptainer (outside of the terminal/command line option), which non-interactive jobs handle better. You can go here to access Open OnDemand for the Phoenix Cluster. Note that you will have to sign in through SSO and be on the GT VPN to access it.
Here is what the home page looks like:
You have four tabs at the top. To create an interactive job, click on the Interactive Apps tab. The dropdown options are Jupyter, Matlab, RStudio, Interactive Shell. For this demonstration, let's assume that I will create a Jupyter Server on PACE.
Here are the options that I have when creating a Jupyter Server:
You can choose between 4 Anaconda environments. Two of them are TensorFlow environments and another is a PyTorch environment. I personally haven't been able to use the default Anaconda option, so I use a different conda environment that I had already set up.
You can change the account. It uses the default free-tier option.
Quality of Service allows you to trade off price and wait time. If you select inferno, this means that you want to be in the regular queue and pay full price. If you select embers, it means that you are willing to stay in the queue for longer but have either no charge or minimal charge. I always use inferno.
In Node Type, you can specify if you want a GPU node. There are 3 types of GPUs you can get: a V100, an RTX 6000, and an A100.
You can specify the number of nodes you want. I always select 1 for interactive jobs, since it's difficult to manage multiple nodes in an interactive job.
You can also specify the number of cores per node. The maximum is 24 cores.
Memory per core, multiplied by the number of cores, determines the amount of RAM your job receives on a node.
Number of hours dictates how long you want the job to run for.
After starting the job, you will see this:
This means that your request is in the queue.
After some time, the page will change to this:
You can now access your Jupyter Server. It will give you the exact same experience as working with Jupyter locally. This same experience applies to the other options available in Open OnDemand such as VS Code, Terminal, etc.
I personally don't use Open OnDemand much. The reason is that I use the VS Code remote access feature, which already gives me exactly what I'm looking for. I also mainly schedule non-interactive jobs. I believe that Open OnDemand is useful, and I have used it before as well. It's a great introduction to PACE for beginners. Its ease of use is why PACE promotes it so much: it draws researchers to PACE in the easiest form possible.
Another way to start an interactive job is to use the salloc command. The salloc command allows you to stay within your current terminal in PACE and enter the machines requested for the job. The advantage of this is that you don't need to leave your terminal. The disadvantage is that this is less powerful compared to the options offered in PACE's Open OnDemand instance, and it also requires decent SLURM knowledge.
salloc, like sbatch, requires flags to set the configuration for the job. sbatch and salloc share the same flags, so if you have flags defined in an .sbatch file, you can adapt them for salloc right away. This section is written assuming that you have knowledge of sbatch. Please read about sbatch in the non-interactive job section.
As an example, let's use an existing sbatch configuration and convert it so that it can be used with salloc. In the example single node CPU job configuration, the job configuration was defined as the following:
#SBATCH --job-name=Task
#SBATCH --account=gts-ts133
#SBATCH --nodes=1 --ntasks-per-node=24
#SBATCH --time=2:00:00
#SBATCH --qos=inferno
#SBATCH --output=./logs/task.out
#SBATCH --mail-type=NONE
To use salloc with the same job configuration, copy all flags except for --job-name, --output, and --mail-type, since salloc gives you an interactive shell in your terminal once the allocation is granted instead of running a script in the background like sbatch. The final salloc command would be the following:
salloc --account=gts-ts133 --nodes=1 --ntasks-per-node=24 --time=2:00:00 --qos=inferno
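The conversion described above is mechanical, so it can be sketched in a few lines of Python. This is an illustration only: the function name sbatch_to_salloc is made up, and it assumes every option is written in the --flag=value form on #SBATCH lines, as in the examples in this document.

```python
import shlex

# Flags that only make sense for a batch job, per the text above.
BATCH_ONLY_FLAGS = {"--job-name", "--output", "--mail-type"}


def sbatch_to_salloc(sbatch_text):
    """Build an salloc command from the #SBATCH lines of an .sbatch file."""
    flags = []
    for line in sbatch_text.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        # A single #SBATCH line may carry several flags.
        for flag in shlex.split(line[len("#SBATCH"):]):
            name = flag.split("=", 1)[0]
            if name not in BATCH_ONLY_FLAGS:
                flags.append(flag)
    return shlex.join(["salloc"] + flags)


example = """#!/bin/bash
#SBATCH --job-name=Task
#SBATCH --account=gts-ts133
#SBATCH --nodes=1 --ntasks-per-node=24
#SBATCH --time=2:00:00
#SBATCH --qos=inferno
#SBATCH --output=./logs/task.out
#SBATCH --mail-type=NONE
"""
print(sbatch_to_salloc(example))
```

Run on the example configuration, this prints the same salloc command as the one written out by hand above.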
As mentioned before, there are some limitations to using Open OnDemand and interactive jobs. The two main drawbacks are the inability to effectively manage job arrays and the inability to use your own container. Non-interactive jobs allow you to control SLURM, the scheduler that PACE uses to get you the compute you request. Open OnDemand also uses SLURM, but you control it through a nice-looking UI with the options I mentioned earlier. We'll take a look at how to use SLURM directly to do more complicated tasks.
The single node CPU job is one of the simplest jobs to schedule on PACE. At minimum, this requires one file, but it can require two files if you are using apptainer to create a controlled environment.
The two main files are the .sbatch file and the .sh file. The .sbatch file is how you configure SLURM to get the configuration you want. The .sh file actually executes the task at hand. Here is what an example .sbatch file would look like:
#!/bin/bash
#SBATCH --job-name=Task
#SBATCH --account=gts-ts133
#SBATCH --nodes=1 --ntasks-per-node=24
#SBATCH --time=2:00:00
#SBATCH --qos=inferno
#SBATCH --output=./logs/task.out
#SBATCH --mail-type=NONE
cd $SLURM_SUBMIT_DIR
apptainer exec \
--bind SOURCE:DESTINATION_IN_CONTAINER \
/storage/coda1/p-ts133/0/containers/copycat-gpu-cuda10.2-cudnn7.sif \
sh task.sh
The configurations are defined in the #SBATCH lines. We define the job name, account, nodes, time, the quality of service, where to store the output log, and whether we want to receive emails or not.
The cd $SLURM_SUBMIT_DIR line tells the job to enter the directory where the job was submitted from.
The line below that does three things:
- Initializes the apptainer environment
- Binds folders from PACE into the container. --bind allows read and write access
- Runs the task.sh file
Here's what the task.sh file looks like:
#!/bin/sh
cd /CODE-DIRECTORY || exit
python3 task.py
Here, the script changes directory to where our code lives and runs a Python script. Remember that you have to use the path that's inside the container, not the PACE path.
Inside the Templates folder, I provide a template of how to create a single node CPU job. These two files are provided.
Running single node GPU jobs is a simple change in the .sbatch file. Let's see what the change looks like:
#!/bin/bash
#SBATCH --job-name=Task
#SBATCH --account=gts-ts133
#SBATCH --nodes=1 --ntasks-per-node=12 --gres=gpu:V100:1
#SBATCH --time=2:00:00
#SBATCH --qos=inferno
#SBATCH --output=./logs/task.out
#SBATCH --mail-type=NONE
cd $SLURM_SUBMIT_DIR
singularity exec \
--bind SOURCE:DESTINATION_IN_CONTAINER \
/storage/coda1/p-ts133/0/containers/copycat-gpu-cuda10.2-cudnn7.sif \
sh task.sh
The only line that was changed was this one:
#SBATCH --nodes=1 --ntasks-per-node=12 --gres=gpu:V100:1
Here, we add an extra flag to schedule a V100 GPU. For each V100 GPU, we can only reserve 12 CPU cores. For each RTX 6000 GPU, we can only reserve 6 CPU cores. This makes sense, as we can reserve 2 V100 GPUs and get 24 cores, or 4 RTX 6000 GPUs and get 24 cores. Either way, we cannot get more than 24 cores.
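The core limits above are easy to get wrong when editing the .sbatch file by hand, so here is a small sketch that computes them. The per-GPU core counts and the 24-core cap come from the paragraph above; the function names are made up for illustration, and only the V100 resource name is used in the generated line because that is the only one shown in this document.

```python
# Cores that can be reserved per GPU, as described above.
CORES_PER_GPU = {"V100": 12, "RTX 6000": 6}
MAX_CORES_PER_NODE = 24


def cores_for(gpu_type, num_gpus):
    """Return the number of CPU cores that go with num_gpus GPUs of gpu_type."""
    cores = CORES_PER_GPU[gpu_type] * num_gpus
    if cores > MAX_CORES_PER_NODE:
        raise ValueError(
            f"{num_gpus} x {gpu_type} would need {cores} cores, "
            f"but a node has at most {MAX_CORES_PER_NODE}"
        )
    return cores


def v100_resource_line(num_gpus, nodes=1):
    """Build the #SBATCH resource line for a V100 job."""
    cores = cores_for("V100", num_gpus)
    return f"#SBATCH --nodes={nodes} --ntasks-per-node={cores} --gres=gpu:V100:{num_gpus}"


print(v100_resource_line(1))
```

With one V100 this reproduces the line shown above, and asking for three V100s on a node raises an error instead of producing a request that exceeds 24 cores.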
Running multi node GPU jobs is a simple change in the .sbatch file. Let's see what the change looks like:
#!/bin/bash
#SBATCH --job-name=Task
#SBATCH --account=gts-ts133
#SBATCH --nodes=3 --ntasks-per-node=24 --gres=gpu:V100:2
#SBATCH --time=2:00:00
#SBATCH --qos=inferno
#SBATCH --output=./logs/task.out
#SBATCH --mail-type=NONE
cd $SLURM_SUBMIT_DIR
singularity exec \
--bind SOURCE:DESTINATION_IN_CONTAINER \
/storage/coda1/p-ts133/0/containers/copycat-gpu-cuda10.2-cudnn7.sif \
sh task.sh
The only line that was changed was this one:
#SBATCH --nodes=3 --ntasks-per-node=24 --gres=gpu:V100:2
Compared to the previous .sbatch file, we add 2 GPUs per node, but also reserve 3 nodes in total. This means that in this one job alone, we have reserved 6 GPUs. If you want to do distributed deep learning training, PyTorch provides an easy way to handle these multiple node situations. PyTorch Lightning makes it even easier.
Before we get started with how to schedule job arrays, we should answer these two very important questions:
- What is a Job Array?
- Why would anyone use one?
A job array is a way to schedule multiple nodes at the same time. This uses a fundamentally different technique than scheduling one job where you request multiple nodes. See the diagram below for the difference:
Here, with the simple one request for multiple nodes, we create one job where we get multiple nodes. However, your code must be able to access each of those nodes and manage computations on them. This is most often done with something called MPI (Message Passing Interface), which lets you send and receive messages between computers. However, implementing MPI can take time, and in Thad's lab, we unfortunately don't have that much time.
A job array is different because you submit one request. Then, SLURM submits multiple requests for the same node configuration. These nodes all have an identical environment setup and compute configuration. Note that not all of the job array nodes may be available at the same time, since each individual job array node request goes through the queue. However, it's much easier to manipulate job arrays since you can batch your input based on the job array node you are on. This can be done by accessing the SLURM_ARRAY_TASK_ID environment variable.
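As a rough illustration of batching input by array node, the sketch below reads SLURM_ARRAY_TASK_ID and picks out that node's share of a list. The helper names are made up, and it assumes the array is numbered starting from 1; outside of a job array the variable is not set, so a default is used.

```python
import os


def current_task_id(default=1):
    """Read SLURM_ARRAY_TASK_ID, falling back to default outside a job array."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


def items_for_task(items, task_id, num_tasks):
    """Return the share of items handled by a 1-based array task ID.

    Task 1 takes items 0, num_tasks, 2 * num_tasks, ...; task 2 takes
    items 1, num_tasks + 1, ...; and so on.
    """
    if not 1 <= task_id <= num_tasks:
        raise ValueError(f"task_id must be between 1 and {num_tasks}")
    return items[task_id - 1::num_tasks]


videos = [f"video_{i}.mp4" for i in range(10)]
print(items_for_task(videos, current_task_id(), num_tasks=3))
```

Every item ends up in exactly one node's share, and the shares differ in size by at most one item.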
There are many cases where you have a process
However, we will find limitations even in concurrency. Multiprocessing, for example, depends on the number of cores your processor has. Even the number of cores on one processor is limited based on power draw and other thermal problems that arise with too many cores.
We can then come to a realization that we need more than one computer to help us. We can split
You can also think of this as a way of constructing an enormous computer with
Let me give you a real example and get away from the variables for a second. Let's say that I have 200,000 videos in .mp4 format, and I want to convert each of them to a 4D NumPy array (number of frames, height, width, RGB channels) to speed up loading into my PyTorch model. On average, we can say that the conversion of one video to a NumPy array takes 2 minutes. In Thad's Lab, we have a computer called Ebisu that has 32 cores. If we were to parallelize this operation across all 32 cores (assuming that Ebisu had infinite memory to store all the videos in RAM), it would take the following amount of time to complete:
$$\frac{200{,}000 \text{ videos} \times 2 \text{ minutes}}{32 \text{ cores}} = 12{,}500 \text{ minutes}$$
12,500 minutes is about 8.7 days.
If we were to use a job array where I can schedule 100 of these 24 core nodes, we can see it would take the following amount of time to complete:
174 minutes is around 3 hours.
We can see how using a job array makes things faster!
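The comparison above is simple division, and a quick sketch makes it easy to redo for your own workload. It assumes perfect parallelism with no queueing or startup overhead. With all 24 cores on each of 100 nodes counted as workers, the estimate comes out to roughly 167 minutes; the 174 minutes quoted above would match counting 23 worker cores per node, though that reading is a guess.

```python
def estimated_minutes(num_items, minutes_per_item, total_cores):
    """Ideal wall-clock minutes when the work is spread evenly over total_cores."""
    return num_items * minutes_per_item / total_cores


single_machine = estimated_minutes(200_000, 2, 32)   # Ebisu: 32 cores
job_array = estimated_minutes(200_000, 2, 100 * 24)  # 100 nodes x 24 cores

print(f"Ebisu:     {single_machine:,.0f} minutes ({single_machine / 60 / 24:.1f} days)")
print(f"Job array: {job_array:,.0f} minutes ({job_array / 60:.1f} hours)")
```

Either way, the job array turns a job of more than a week into one of about three hours.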
We want our job arrays to follow this process
First, we batch all of our data into text files. This part of the process should happen very quickly, as we are only scanning the filesystem for files to process and creating text files with a roughly equal number of input files.
Then, we run the job array. In each node, we read in the batch file associated with that node using the SLURM_ARRAY_TASK_ID (note that this starts at 1 and not 0). Then, in each node, we parallelize across the input files. We can then see our outputs.
Let's see how the .sbatch file has changed:
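The batching step can be sketched as follows. This is one possible way to do it, not the exact script used in the lab: the function name write_batches and the batch_<N>.txt naming are made up for illustration, with N starting at 1 so that it lines up with SLURM_ARRAY_TASK_ID.

```python
import os


def write_batches(paths, num_batches, out_dir):
    """Split paths into num_batches roughly equal text files, one path per line.

    Files are named batch_1.txt ... batch_<num_batches>.txt so that the number
    matches a job array numbered from 1.
    """
    os.makedirs(out_dir, exist_ok=True)
    batch_files = []
    for i in range(num_batches):
        batch_path = os.path.join(out_dir, f"batch_{i + 1}.txt")
        with open(batch_path, "w") as f:
            # Every num_batches-th path goes to this batch.
            for path in paths[i::num_batches]:
                f.write(path + "\n")
        batch_files.append(batch_path)
    return batch_files
```

For a job array with 100 nodes, you would call this once with num_batches=100 before submitting the job, then have each node read only its own batch file.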
#!/bin/bash
#SBATCH --job-name=Job-Array-Template
#SBATCH --account=gts-ts133
#SBATCH --nodes=1 --ntasks-per-node=24
#SBATCH --time=4:00:00
#SBATCH --qos=inferno
#SBATCH --output=logs/Task%a.out
#SBATCH --mail-type=NONE
#SBATCH --array=1-100
cd $SLURM_SUBMIT_DIR
apptainer exec --nv \
--bind SOURCE:DESTINATION \
/storage/coda1/p-ts133/0/containers/copycat-gpu-cuda10.2-cudnn7.sif \
sh run_task.sh
Here, we have added an extra #SBATCH line to specify that we want 100 nodes in our job array. How has the .sh file changed?
#!/bin/sh
cd /CODE-DIRECTORY || exit
python3 task.py \
--jobArrayNum "${SLURM_ARRAY_TASK_ID}"
Here, we now pass the SLURM_ARRAY_TASK_ID as an argument into our script.
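On the Python side, a task.py that accepts this argument could look roughly like the sketch below. Only the --jobArrayNum flag comes from the script above; the --batchDir flag, the batch_<N>.txt naming, and the process_one placeholder are assumptions made for illustration.

```python
import argparse
import os
from multiprocessing import Pool


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Process one job array batch.")
    parser.add_argument("--jobArrayNum", type=int, required=True,
                        help="Value of SLURM_ARRAY_TASK_ID for this node")
    # Hypothetical flag: directory holding the batch_<N>.txt files.
    parser.add_argument("--batchDir", default="batches")
    return parser.parse_args(argv)


def load_batch(batch_dir, job_array_num):
    """Read the input paths listed in this node's batch file."""
    batch_path = os.path.join(batch_dir, f"batch_{job_array_num}.txt")
    with open(batch_path) as f:
        return [line.strip() for line in f if line.strip()]


def process_one(input_path):
    """Placeholder for the real per-file work (e.g. converting one video)."""
    return input_path.upper()


def main(argv=None):
    args = parse_args(argv)
    inputs = load_batch(args.batchDir, args.jobArrayNum)
    # Parallelize across the input files within this node.
    with Pool() as pool:
        return pool.map(process_one, inputs)
```

In a real task.py you would end the file with the usual main guard that calls main(), so that python3 task.py --jobArrayNum 5 processes the fifth batch.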
In the Templates folder, I provide an outline on how to run a job array with these files.
squeue -A <ACCOUNT> - Checks jobs related to a PACE account
scancel <JOB_ID> - Cancels a job on PACE
module load anaconda3 - Loads Anaconda (so you can create your own conda environments)
module spider - Shows all PACE modules and their versions
pace-quota - Gives information on PACE Storage and money in accounts
You can find me on the CCG Discord @ Gururaj Deshpande. My discord username is @sjxjx. My GT email address is gurudesh@gatech.edu