Several researchers have run into problems when running older software (including Docker images) on the A100 GPUs. This discussion tracks what we've learned and our current recommendations.
The central issue is that the A100s have CUDA compute capability 8.0, which requires CUDA >= 11.0 (with an exception discussed below). Running software or a container built with CUDA < 11.0 can cause errors that are difficult to diagnose, because nothing in the error makes it obvious that the CUDA version incompatibility is the root cause. #10 shows an example failure with our TensorFlow example.
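To make the version rule concrete, here is a minimal Python sketch that checks whether a CUDA toolkit version natively targets a given compute capability. The table lists only a few capability/toolkit pairs, and the names `MIN_CUDA_FOR_CAPABILITY`, `parse_version`, and `cuda_supports_gpu` are hypothetical helpers for illustration, not part of any library.

```python
# Minimum CUDA toolkit version that first targeted each compute capability.
# Deliberately incomplete: only a few common data-center/consumer GPUs.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta, e.g. V100
    (7, 5): (10, 0),  # Turing, e.g. T4
    (8, 0): (11, 0),  # Ampere, e.g. A100
    (8, 6): (11, 1),  # Ampere, e.g. A40
}


def parse_version(text):
    """Turn '11.2' or '11.2.1' into (11, 2); a bare '8' becomes (8, 0)."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def cuda_supports_gpu(cuda_version, capability):
    """Return True if this CUDA version natively targets the capability.

    Raises KeyError for capabilities missing from the table above.
    """
    required = MIN_CUDA_FOR_CAPABILITY[parse_version(capability)]
    return parse_version(cuda_version) >= required


if __name__ == "__main__":
    # A CUDA 10.1 container on an A100 (capability 8.0) is the failing case.
    print(cuda_supports_gpu("10.1", "8.0"))
    print(cuda_supports_gpu("11.2", "8.0"))
```

This only captures the native-support rule; it does not account for the PTX exception discussed below.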
One solution is to update the software environment or container to use a newer CUDA version. #11 demonstrates doing this with PyTorch run via conda, and I have a similar cryoDRGN example. A researcher running bonito may have solved their problem by using a Singularity container with CUDA 11.2.
The exception noted above is that @bbockelm found NVIDIA documentation describing how software built with an older CUDA can run on newer GPUs using PTX, the intermediate code that the driver can compile just in time for the newer hardware. Groups compiling software or building containers could use this to keep an older CUDA and still run on newer hardware. It could also provide a quick way to check whether a black-box container is expected to work on newer GPUs.
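As a rough illustration of that check, the sketch below applies the compatibility rules as I understand them from NVIDIA's documentation: native GPU code (`sm_XY`) runs on devices with the same major capability and an equal or higher minor version, while embedded PTX (`compute_XY`) can be JIT-compiled for any equal or higher capability. The function names are hypothetical, and the list of embedded targets is assumed to have been gathered separately (for example by inspecting the binary with a tool such as `cuobjdump`).

```python
def parse_target(target):
    """Split 'sm_70' or 'compute_70' into (kind, major, minor).

    Assumes plain two- or three-digit targets; suffixed targets such as
    'sm_90a' are not handled in this sketch.
    """
    kind, digits = target.split("_")
    return kind, int(digits[:-1]), int(digits[-1])


def expected_to_run(embedded_targets, device_capability):
    """Return True if any embedded target is usable on the device.

    embedded_targets: iterable like ['sm_70', 'compute_70']
    device_capability: (major, minor), e.g. (8, 0) for an A100
    """
    dev_major, dev_minor = device_capability
    for target in embedded_targets:
        kind, major, minor = parse_target(target)
        if kind == "sm":
            # Native code: same major version, minor not newer than device.
            if major == dev_major and minor <= dev_minor:
                return True
        elif kind == "compute":
            # PTX: forward compatible via JIT compilation by the driver.
            if (major, minor) <= (dev_major, dev_minor):
                return True
    return False


if __name__ == "__main__":
    a100 = (8, 0)
    print(expected_to_run(["sm_70"], a100))                # native Volta code only
    print(expected_to_run(["sm_70", "compute_70"], a100))  # Volta code plus PTX
```

A binary that ships only native code for older architectures fails this check, while one that also embeds PTX passes, which matches the exception described above.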
In terms of documentation, we should improve our recommendations for how to set submit file requirements when a job runs with CUDA < 11.0. Requiring `CUDACapability < 8` would keep such jobs off the A100s and could help eliminate these errors.
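One way this recommendation might look in practice is sketched below: a small Python helper that produces the GPU-related submit file lines based on the CUDA version a job uses. It assumes an HTCondor-style submit file with `request_gpus` and a `requirements` expression. The `CUDACapability` attribute comes from the suggestion above, and `gpu_submit_lines` is a hypothetical name.

```python
def gpu_submit_lines(cuda_version):
    """Return GPU-related submit file lines for a job using cuda_version.

    Assumption: jobs built with CUDA < 11.0 should avoid GPUs with
    compute capability 8.0 or higher (such as the A100s).
    """
    major = int(cuda_version.split(".")[0])
    lines = ["request_gpus = 1"]
    if major < 11:
        # Steer the job away from A100-class GPUs.
        lines.append("requirements = (CUDACapability < 8)")
    return lines


if __name__ == "__main__":
    # Lines to paste into the submit file for a CUDA 10.1 container.
    print("\n".join(gpu_submit_lines("10.1")))
```

Jobs using CUDA 11.0 or newer get no extra requirement here, so they remain eligible for every GPU in the pool.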
@ChristinaLK are there other researcher issues related to A100 compatibility we can summarize here?