A100 and CUDA compatibility issues #12

agitter · 2021-09-28T14:17:06Z

Several researchers have run into problems when trying to run older software (including Docker images) on the A100 GPUs. This discussion tracks what we've learned and our current recommendations.

The central issue is that the A100s have CUDA compute capability 8.0, which requires CUDA >= 11.0 (with one exception, discussed below). Running software or a container built with CUDA < 11.0 can produce errors that are difficult to diagnose, because nothing in the error makes it obvious that the CUDA version incompatibility is the root cause. #10 gives an example failure on our TensorFlow example.
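To make the version rule concrete, here is a minimal Python sketch of the check. Only the 8.0 -> 11.0 entry comes from this discussion; the other table entries are my additions from NVIDIA's release history, and the function names are hypothetical. It covers only native support and ignores the PTX exception.

```python
# Minimum CUDA toolkit release with native support for each compute capability.
# The (8, 0) -> (11, 0) entry is from this discussion; the rest are assumptions
# added for illustration (Pascal, Volta, Turing).
MIN_CUDA_FOR_CAPABILITY = {
    (6, 0): (8, 0),
    (7, 0): (9, 0),
    (7, 5): (10, 0),
    (8, 0): (11, 0),
}


def parse_version(text):
    """Turn a string such as '10.2' or '11.0.3' into a (major, minor) tuple."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def cuda_supports_gpu(cuda_version, capability):
    """Return True if this CUDA release natively supports the given capability."""
    required = MIN_CUDA_FOR_CAPABILITY[parse_version(capability)]
    return parse_version(cuda_version) >= required


if __name__ == "__main__":
    for cuda in ("10.2", "11.0", "11.2"):
        print(cuda, "supports A100 natively:", cuda_supports_gpu(cuda, "8.0"))
```

A container that reports CUDA 10.2 fails this check for the A100s, which is the situation behind the hard-to-diagnose errors described above.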

One solution is to update the software environment or container to use a newer CUDA version. #11 demonstrates doing this with PyTorch run via conda. I also have a similar cryoDRGN example. A researcher running bonito may have solved their problem by using a Singularity container with CUDA 11.2.

The exception noted above: @bbockelm found NVIDIA documentation that describes how software built with an older CUDA can run on newer GPUs using PTX. Groups compiling software or building containers could use this to stay on an older CUDA and still run on newer hardware. It could also provide a quick way to check whether a black box container is expected to work on newer GPUs.
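As a rough illustration of that quick check, the sketch below classifies a list of compiled architectures against a GPU's compute capability. It assumes the usual NVIDIA naming, where `sm_XY` entries are native binaries (usable on GPUs with the same major version and an equal or higher minor version) and `compute_XY` entries are PTX that the driver can JIT-compile for any GPU with capability XY or higher. The function name and return labels are hypothetical.

```python
def classify_arch_support(arch_list, capability):
    """Return 'native', 'ptx-jit', or 'unsupported' for a (major, minor) GPU.

    arch_list holds strings such as 'sm_70' (native binary) or 'compute_70'
    (embedded PTX). Entries that do not match that pattern are ignored.
    """
    major, minor = capability
    ptx_ok = False
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if len(digits) < 2 or not digits.isdigit():
            continue  # skip names this sketch does not understand
        # The last digit is the minor version; everything before it is the major.
        a_major, a_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm" and a_major == major and a_minor <= minor:
            return "native"
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            ptx_ok = True
    return "ptx-jit" if ptx_ok else "unsupported"


if __name__ == "__main__":
    a100 = (8, 0)
    print(classify_arch_support(["sm_60", "sm_70"], a100))
    print(classify_arch_support(["sm_60", "sm_70", "compute_70"], a100))
    print(classify_arch_support(["sm_70", "sm_80"], a100))
```

For PyTorch specifically, `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()` should supply the two inputs, though I have not tried that inside the containers discussed here. A "ptx-jit" result only says the framework's own kernels include PTX; bundled libraries built for an older CUDA may still fail.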

In terms of documentation, we should improve our recommendations for how to set submit file requirements when a job runs with CUDA < 11.0. Requiring CUDACapability < 8 in those submit files would keep such jobs off the A100s and could help eliminate these errors.
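One possible shape for that recommendation, sketched in Python: a helper that emits the extra submit file line only when the job's CUDA version is older than 11. The helper names are hypothetical, and the rule is deliberately conservative because it ignores the PTX exception. `request_gpus` and `requirements` are standard HTCondor submit commands, and `CUDACapability` is the attribute named above.

```python
def gpu_requirements(cuda_version):
    """Return a requirements expression for a job built with this CUDA version.

    Jobs with CUDA < 11 are steered away from compute capability 8.0 GPUs.
    An empty string means no extra restriction is needed.
    """
    major = int(cuda_version.strip().split(".")[0])
    if major < 11:
        return "(CUDACapability < 8)"
    return ""


def submit_lines(cuda_version):
    """Build the GPU-related lines of a submit file for this CUDA version."""
    lines = ["request_gpus = 1"]
    expression = gpu_requirements(cuda_version)
    if expression:
        lines.append("requirements = " + expression)
    return lines


if __name__ == "__main__":
    # Example: a container built with CUDA 10.1
    print("\n".join(submit_lines("10.1")))
```

If a submit file already has a `requirements` line, the expression would need to be combined with it using `&&` rather than added as a second line.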

@ChristinaLK are there other researcher issues related to A100 compatibility we can summarize here?
