
The desired behavior of EasyBuild for --cuda-compute-capabilities is ill defined #4770

Open
casparvl opened this issue Feb 21, 2025 · 2 comments

casparvl commented Feb 21, 2025

The issue

I've considered this before and written things out here. I'll repeat the key points and propose improvements.

The fundamental issue is that a single --cuda-compute-capabilities value does not capture the complexity of CUDA code compilation.

CUDA code is compiled in two stages. First, it is compiled into an intermediate representation, PTX, which can be considered assembly for a virtual GPU architecture. A virtual architecture is defined by the set of capabilities and features it provides to the application. PTX is essentially a text format; there is no binary encoding at this stage. In the second stage, the intermediate representation is compiled into device code, i.e. a binary for a real GPU architecture. See this documentation and this image.
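To make the two stages concrete, they can also be run by hand; a minimal sketch, assuming a source file hello.cu (normally nvcc drives both steps internally):

nvcc -arch=compute_90 -ptx hello.cu -o hello.ptx   # stage 1: PTX for the compute_90 virtual architecture
ptxas -arch=sm_90 hello.ptx -o hello.cubin         # stage 2: device code (cubin) for the real sm_90 architecture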

To complicate things further, CUDA supports JIT compilation from PTX code into device code at runtime, see this section and this image. That requires the PTX code to be embedded in the binary, which is optional.

As a result, it is unclear what behavior one should expect when setting, for example, --cuda-compute-capabilities=8.0,9.0. Probably everyone expects device code to be built for the real architectures sm_80 and sm_90. What is undefined is which virtual architecture is used to get there. E.g. both

nvcc hello.cu --gpu-architecture=compute_80 --gpu-code=sm_80,sm_90 -o hello

and

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 -o hello

will give a binary containing device code for the architectures sm_80 and sm_90. The difference is that in the second case the sm_90 device code is compiled from the compute_90 virtual architecture, whereas in the first case the compute_80 virtual architecture is used for both. The second invocation could produce more performant code if features unique to the compute_90 virtual architecture, not present in compute_80, are used.

Another thing that is undefined is which PTX code, if any, will be shipped in the binary. If the build does

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 -o hello

there will be no PTX code in the binary. But the build could also do

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 --generate-code=arch=compute_90,code=compute_90 -o hello

or even

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 --generate-code=arch=compute_80,code=compute_80 --generate-code=arch=compute_90,code=compute_90 -o hello

(although there is little point in shipping both the CC80 and CC90 PTX code if the CC80 and CC90 device code is already present).
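As an aside, which device code and PTX actually ended up in a binary can be inspected with cuobjdump from the CUDA toolkit; a quick sketch, assuming the binary above is called hello:

cuobjdump --list-elf hello   # lists the embedded device code (cubins), e.g. for sm_80 and sm_90
cuobjdump --list-ptx hello   # lists the embedded PTX, if any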

Of course, even if we clearly define what we want --cuda-compute-capabilities to mean in EasyBuild, we may not always be able to achieve that with the build system. Some codes can e.g. only be compiled for a single architecture, so in that case --cuda-compute-capabilities=8.0,9.0 is 'impossible' anyway. Or there might be codes (e.g. the CUDA-Samples) for which you pass the desired compute capabilities as an argument, and what happens with them is then defined by their own build system (in the case of CUDA-Samples, several binaries are not compiled with the capabilities passed to the SMS argument of their make command). But those issues are not fundamentally different from e.g. a build system that doesn't properly respect an optarch setting: we try to patch them, and if we can't, we document the limitation in the EasyConfig/EasyBlock, and that's it.

Proposed solution

There are a couple of actions needed here:

  1. Improve the documentation on --cuda-compute-capabilities to state what it should do. For example (and I think this IS what it should do): it creates device code (i.e. binary code for a real GPU architecture) for each requested compute capability, using the virtual architecture of that same compute capability. I.e. for 8.0,9.0 it will try to pass the flags --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90. We can at least check which device code is generated with a sanity check, as is being implemented in add a CUDA device code sanity check #4692 (unfortunately, we can't check which virtual compute capability was used to generate it).
  2. We need to create a robust way to pass those flags. Currently, we rely on being able to somehow ask the build system to build for a certain arch. Unfortunately, there is no equivalent to how we set certain CFLAGS based on optarch (which tends to work for pretty much any build system), as there are no NVIDIA-specific flags that are conventionally respected by build systems (i.e. there is no standardized NVCCFLAGS or similar). And we can't add them to CFLAGS, as that would make regular C-compiler invocations fail, since they won't recognize those arguments. One way would be to make an nvcc compiler wrapper and inject the flags that way (see the sketch after this list); this is probably the most robust and generic way to get these flags passed to all nvcc invocations. Note that the build system might pass contradictory flags, which may cause the wrapped flags to be ignored, or worse, conflict and cause failures. That's something we'll have to try, see how often it happens, and see if it can be dealt with (e.g. patch the build system to omit its own flags). Also note that something similar happens for CPU builds: many build systems pass -O<something> flags that sometimes override what EasyBuild users set via optarch. The one advantage we have with compiler wrappers is that we can make sure our flags always come first, or last, whichever causes the build-system flags to be overridden.
  3. Create a --cuda-ptx-architectures flag: PTX code will be included in the binary for these virtual architectures. E.g. if --cuda-ptx-architectures=8.0 is set, EasyBuild will try to make sure CUDA code is compiled with --generate-code=arch=compute_80,code=compute_80.
  4. Expand the sanity check proposed in add a CUDA device code sanity check #4692 so that it fails if the PTX for a requested virtual architecture is not present in one of the final CUDA binaries. Also, optionally, add a strict-ptx-sanity-check option to make the sanity check also fail if too many virtual architectures are present.
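A minimal sketch of the wrapper idea from point 2 (the wrapper path and the EB_NVCC_GENCODE_FLAGS variable are hypothetical names, not existing EasyBuild functionality):

#!/bin/bash
# Hypothetical nvcc wrapper: inject the code-generation flags chosen by EasyBuild
# (e.g. "--generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90"),
# then forward all original arguments to the real nvcc. The variable is left unquoted
# on purpose so it splits into multiple flags.
exec /path/to/real/nvcc ${EB_NVCC_GENCODE_FLAGS} "$@"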

ocaisa commented Feb 21, 2025

There are a couple of environment variables that we can probably leverage here (from https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#nvcc-environment-variables): NVCC_PREPEND_FLAGS and NVCC_APPEND_FLAGS.


casparvl commented Feb 21, 2025

Great, that would mean we don't need compiler wrappers; we just need to set the appropriate NVCC_*_FLAGS whenever something uses CUDA and --cuda-compute-capabilities and/or --cuda-ptx-architectures is set. That should be much easier to implement.
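For illustration, for --cuda-compute-capabilities=8.0,9.0 plus --cuda-ptx-architectures=9.0 that could boil down to exporting something along these lines before the build (a sketch of the idea, not the actual implementation):

export NVCC_APPEND_FLAGS='--generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 --generate-code=arch=compute_90,code=compute_90'   # device code for sm_80 and sm_90, plus compute_90 PTX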

boegel added this to the 4.x milestone Feb 26, 2025