
The desired behavior of EasyBuild for --cuda-compute-capabilities is ill defined #4770

Open
casparvl opened this issue Feb 21, 2025 · 2 comments

casparvl commented Feb 21, 2025

The issue

I've considered this before and written things out here. I'll repeat the key points and propose improvements.

The fundamental issue is that a single --cuda-compute-capabilities value does not capture the complexity of CUDA code compilation.

CUDA code is compiled in two stages. First, it is compiled into an intermediate representation, PTX, which can be considered assembly for a virtual GPU architecture. A virtual architecture is defined by the set of capabilities and features it provides to the application. PTX is essentially a text format; there is no binary encoding at this stage. In the second stage, the intermediate representation is compiled into device code, i.e. a binary for a real GPU architecture. See this documentation and this image.
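To make the two stages concrete, they can also be run by hand; a minimal sketch, assuming a source file hello.cu (normally nvcc drives both steps internally):

nvcc -arch=compute_90 -ptx hello.cu -o hello.ptx   # stage 1: PTX for the compute_90 virtual architecture
ptxas -arch=sm_90 hello.ptx -o hello.cubin         # stage 2: device code (cubin) for the real sm_90 architecture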

To complicate things further, CUDA supports JIT compilation from PTX code into device code at runtime, see this section and this image. That requires the PTX code to be embedded in the binary, which is optional.

As a result, it is unclear what behavior one should expect when setting, for example, --cuda-compute-capabilities=8.0,9.0. Probably everyone expects device code to be built for the real architectures sm_80 and sm_90. What is undefined is which virtual architecture is used to get there. E.g. both

nvcc hello.cu --gpu-architecture=compute_80 --gpu-code=sm_80,sm_90 -o hello

and

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 -o hello

will give a binary containing device code for the architectures sm_80 and sm_90. The difference is that in the second case the sm_90 device code is compiled from the compute_90 virtual architecture, whereas in the first case the compute_80 virtual architecture is used for both. The second invocation could produce more performant code if features unique to the compute_90 virtual architecture, not present in compute_80, are used.

Another thing that is undefined is which PTX code, if any, will be shipped in the binary. If the build does

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 -o hello

there will be no PTX code in the binary. But the build could also do

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 --generate-code=arch=compute_90,code=compute_90 -o hello

or even

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 --generate-code=arch=compute_80,code=compute_80 --generate-code=arch=compute_90,code=compute_90 -o hello

(although there is little point in shipping both the CC80 and CC90 PTX code if the CC80 and CC90 device code is already present).
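As an aside, which device code and PTX actually ended up in a binary can be inspected with cuobjdump from the CUDA toolkit; a quick sketch, assuming the binary above is called hello:

cuobjdump --list-elf hello   # lists the embedded device code (cubins), e.g. for sm_80 and sm_90
cuobjdump --list-ptx hello   # lists the embedded PTX, if any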

Of course, even if we clearly define what we want --cuda-compute-capabilities to mean in EasyBuild, we may not always be able to achieve that with the build system. Some codes can e.g. only be compiled for a single architecture, so in that case --cuda-compute-capabilities=8.0,9.0 is 'impossible' anyway. Or there might be codes (e.g. the CUDA-Samples) for which you pass the desired compute capabilities as an argument, and what happens with them is then defined by their own build system (in the case of CUDA-Samples, several binaries are not compiled with the capabilities passed to the SMS argument of their make command). But those issues are not fundamentally different from e.g. a build system that doesn't properly respect an optarch setting: we try to patch them, and if we can't, we document the limitation in the EasyConfig/EasyBlock, and that's it.

Proposed solution

There are a couple of actions needed here:

  1. Improve the documentation on --cuda-compute-capabilities to state what it should do. For example (and I think this IS what it should do): it creates device code (i.e. binary code for a real GPU architecture) for each requested compute capability, using the virtual architecture of that same compute capability. I.e. for 8.0,9.0 it will try to pass the flags --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90. We can at least check which device code is generated with a sanity check, as is being implemented in add a CUDA device code sanity check #4692 (unfortunately, we can't check which virtual compute capability was used to generate it).
  2. We need to create a robust way to pass those flags. Currently, we rely on being able to somehow ask the build system to build for a certain arch. Unfortunately, there is no equivalent to how we set certain CFLAGS based on optarch (which tends to work for pretty much any build system), as there are no NVIDIA-specific flags that are conventionally respected by build systems (i.e. there is no standardized NVCCFLAGS or similar). And we can't add them to CFLAGS, as that would make regular C-compiler invocations fail, since they won't recognize those arguments. One way would be to make an nvcc compiler wrapper and inject the flags that way (see the sketch after this list); this is probably the most robust and generic way to get these flags passed to all nvcc invocations. Note that the build system might pass contradictory flags, which may cause the wrapped flags to be ignored, or worse, conflict and cause failures. That's something we'll have to try, see how often it happens, and see if it can be dealt with (e.g. patch the build system to omit its own flags). Also note that something similar happens for CPU builds: many build systems pass -O<something> flags that sometimes override what EasyBuild users set via optarch. The one advantage we have with compiler wrappers is that we can make sure our flags always come first, or last, whichever causes the build-system flags to be overridden.
  3. Create a --cuda-ptx-architectures flag: PTX code will be included in the binary for these virtual architectures. E.g. if --cuda-ptx-architectures=8.0 is set, EasyBuild will try to make sure CUDA code is compiled with --generate-code=arch=compute_80,code=compute_80.
  4. Expand the sanity check proposed in add a CUDA device code sanity check #4692 so that it fails if the PTX for a requested virtual architecture is not present in one of the final CUDA binaries. Also, optionally, add a strict-ptx-sanity-check option to make the sanity check also fail if too many virtual architectures are present.
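A minimal sketch of the wrapper idea from point 2 (the wrapper path and the EB_NVCC_GENCODE_FLAGS variable are hypothetical names, not existing EasyBuild functionality):

#!/bin/bash
# Hypothetical nvcc wrapper: inject the code-generation flags chosen by EasyBuild
# (e.g. "--generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90"),
# then forward all original arguments to the real nvcc. The variable is left unquoted
# on purpose so it splits into multiple flags.
exec /path/to/real/nvcc ${EB_NVCC_GENCODE_FLAGS} "$@"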

ocaisa commented Feb 21, 2025

There are a couple of environment variables that we can probably leverage here (from https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#nvcc-environment-variables): NVCC_PREPEND_FLAGS and NVCC_APPEND_FLAGS.


casparvl commented Feb 21, 2025

Great, that would mean we don't need compiler wrappers; we just need to set the appropriate NVCC_*_FLAGS whenever something uses CUDA and --cuda-compute-capabilities and/or --cuda-ptx-architectures is set. That should be much easier to implement.
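For illustration, for --cuda-compute-capabilities=8.0,9.0 plus --cuda-ptx-architectures=9.0 that could boil down to exporting something along these lines before the build (a sketch of the idea, not the actual implementation):

export NVCC_APPEND_FLAGS='--generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 --generate-code=arch=compute_90,code=compute_90'   # device code for sm_80 and sm_90, plus compute_90 PTX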

boegel added this to the 4.x milestone Feb 26, 2025