HPC clusters always have a job scheduler that dispatches the jobs submitted to a queue. You can refer to the official documentation below for how to use the job scheduler.
- PBS Pro: Documentation
- SLURM: Documentation
In general, HPC clusters provide a development node for you to test your code. For example, you can use the dgx-dev
queue on NSCC to run an interactive job and debug your code on one node with 4 GPUs.
Development nodes usually have resource limitations such as a wall-time limit. It is also annoying when your terminal gets stuck, as you then have to quit and re-submit the job. Thus, I recommend using JupyterLab to debug your job. This has several advantages:
- You can have multiple terminals
- You can edit your files
- Your job keeps running even if you disconnect
If you are not a vim enthusiast (even though I highly recommend it, as you can turn it into a full-fledged IDE by installing plugins), you can use JupyterLab for convenience.
You can refer to the NSCC notes on how to set up JupyterLab.
You can refer to the following documentation on how to use an IDE on clusters.
- VS Code: documentation
- PyCharm: documentation
Sometimes you may need to run many experiments where each experiment only takes several minutes, so it is definitely not a good idea to submit a separate job script for each one. For example, I may only want to profile the peak memory usage of one iteration of my training process (see the sketch after the list below). In this case, it is recommended to use JupyterLab for experiments. This has several advantages:
- You don't have to worry about program errors when you run many experiments. For example, if you submit a job directly to the scheduler and your program runs out of memory, the job will get stuck. In JupyterLab, you can simply terminate the program with `Ctrl+C` and continue with your next try.
- JupyterLab runs inside the job dispatched by the scheduler, so its environment inherits the variables passed by the scheduler, for example the `PBS_NODEFILE` variable of PBS Pro.
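For instance, here is a minimal sketch of the kind of quick experiment this is convenient for: it prints the node list from `PBS_NODEFILE` and profiles the peak GPU memory of a single training iteration (the linear model and batch size are hypothetical placeholders, not from any particular project):

```python
import os
import torch

# Inside a JupyterLab session launched by the scheduler, PBS_NODEFILE is
# inherited from the PBS Pro environment.
nodefile = os.environ.get("PBS_NODEFILE")
if nodefile:
    with open(nodefile) as f:
        print("Allocated nodes:", sorted(set(f.read().split())))

# Profile the peak memory of one training iteration of a toy model.
device = torch.device("cuda")
model = torch.nn.Linear(4096, 4096).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
batch = torch.randn(64, 4096, device=device)

torch.cuda.reset_peak_memory_stats(device)
loss = model(batch).sum()
loss.backward()
optimizer.step()
torch.cuda.synchronize(device)
print(f"Peak memory: {torch.cuda.max_memory_allocated(device) / 2**20:.1f} MiB")
```

If one try runs out of memory, you just interrupt the kernel and adjust the settings, without losing the allocated nodes.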
A network interface needs to be specified for distributed training, to tell the communication library which interface to rely on. For example, when running PyTorch distributed, you may find that your script gets stuck at the initialization of the default process group. Usually this is because PyTorch does not know which interface to use for cross-node communication, or some of the discovered interfaces have issues. You can specify the network interface for communication by setting the environment variable `NCCL_SOCKET_IFNAME` or `GLOO_SOCKET_IFNAME`, depending on the backend of your choice.
You can check the available network interfaces with the `ifconfig` command, or use PyRoute2 if that command is not available. You can also get the host address this way.
```python
from pyroute2 import NDB

ndb = NDB(log='debug')
print(ndb.addresses.summary())
```
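If you only need the addresses of one specific interface, a sketch along these lines should also work (the interface name `ib0` is an assumption; replace it with one from the summary above):

```python
from pyroute2 import IPRoute

# Print the IP addresses assigned to a given interface.
with IPRoute() as ipr:
    for addr in ipr.get_addr(label='ib0'):
        print(addr.get_attr('IFA_ADDRESS'))
```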
For example, on NSCC you can find `ib0` in the list; it is the InfiniBand interface, and you can set `NCCL_SOCKET_IFNAME=ib0` in your script. Based on my tests, if you do not set this environment variable, the PyTorch initialization will get stuck.
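As a minimal sketch (the `ib0` name and the `env://` rendezvous are assumptions, so adapt them to your cluster and launcher), you can set the variable right before initializing the process group:

```python
import os
import torch.distributed as dist

# Point NCCL at the InfiniBand interface; keep any value already exported
# by the job script.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")
# For the gloo backend, set GLOO_SOCKET_IFNAME instead.

# Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are provided by
# your launcher (e.g. torchrun) or exported in the job script.
dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} / world size {dist.get_world_size()} is ready")
```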
Use ProxyJump to connect directly to the compute node (this requires a job already running on that node). Add entries like the following to your SSH config (usually `~/.ssh/config`):
```
Host computenode_hostname
    HostName computenode_hostname
    User username
    ProxyJump loginnode_hostname
    ServerAliveInterval 60

Host loginnode_hostname
    HostName loginnode_hostname
    User username
```
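With this configuration, `ssh computenode_hostname` from your local machine tunnels through the login node straight to the compute node, so tools such as VS Code Remote-SSH can attach to the compute node directly.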