The desired behavior of EasyBuild for `--cuda-compute-capabilities` is ill defined #4770
The issue
I've considered this before, and written things out previously. I'll repeat the key points here, and propose improvements.
The fundamental issue is that a single `--cuda-compute-capabilities` setting does not represent the complexity of CUDA code compilation. CUDA code is compiled in two stages. First, it is compiled into an intermediate representation, PTX, which can be considered assembly for a virtual GPU architecture. A virtual architecture is defined by the set of capabilities and features it provides to the application. PTX is essentially a text format; there is no binary encoding here. In the second stage, the intermediate representation is compiled into device code, i.e. a binary for a real GPU architecture. See this documentation and this image.
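To make the two stages concrete, here is a minimal sketch using `nvcc` directly (the file names are placeholders, not from the original issue):

```bash
# Stage 1: compile CUDA source to PTX for the compute_80 *virtual* architecture.
# The result is human-readable text, not a binary.
nvcc -arch=compute_80 -ptx vecadd.cu -o vecadd.ptx

# Stage 2: compile that PTX into device code (a binary) for the sm_80 *real* architecture.
nvcc -arch=sm_80 -cubin vecadd.ptx -o vecadd.cubin
```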
To complicate things further, CUDA supports JIT compilation from the PTX code into device code, see this section and this image. That requires the PTX code to be included in the binary - which is optional.
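For example, a single invocation along these lines produces `sm_80` device code and additionally embeds the `compute_80` PTX, so the binary can be JIT-compiled at load time on a newer GPU (again a sketch, with a placeholder file name):

```bash
# code=[sm_80,compute_80]: ship the sm_80 device code *and* the compute_80 PTX.
# Dropping compute_80 from the list yields a binary without PTX, i.e. no JIT fallback.
nvcc --generate-code='arch=compute_80,code=[sm_80,compute_80]' vecadd.cu -o vecadd
```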
As a result, it is unclear what behavior one should expect when setting, for example, `--cuda-compute-capabilities=8.0,9.0`. Probably everyone expects device code to be built for the real architectures `sm_80` and `sm_90`. What is undefined is which virtual architecture is used to get there. E.g. both `--generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_80,code=sm_90` and `--generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90` will give a binary for the architectures `sm_80` and `sm_90`. The difference is that in the second case, the intermediate representation used to compile the `sm_90` device code was the `compute_90` virtual architecture, whereas in the first case the `compute_80` virtual architecture is used. The second code could be more performant, if features unique to the `compute_90` virtual architecture are used that are not present in the `compute_80` virtual architecture.

Another thing that is undefined is which, if any, PTX code will be shipped in the binary. If it does `--generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90`, there will be no PTX code in the binary. But it could also do `--generate-code=arch=compute_80,code=[sm_80,compute_80] --generate-code=arch=compute_90,code=sm_90`, or even `--generate-code=arch=compute_80,code=[sm_80,compute_80] --generate-code=arch=compute_90,code=[sm_90,compute_90]` (although there is little point in shipping both CC80 and CC90 PTX codes if both the CC80 and CC90 device code is present).
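Which of these happened can at least be inspected after the fact, e.g. with `cuobjdump` from the CUDA toolkit (shown here on a hypothetical binary `vecadd`):

```bash
# List the embedded device code (cubin) parts, one per real architecture:
cuobjdump --list-elf vecadd

# List the embedded PTX parts, one per virtual architecture
# (no output means no PTX was shipped, so no JIT compilation is possible):
cuobjdump --list-ptx vecadd
```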
Of course, note that even if we clearly define what we want `--cuda-compute-capabilities` to mean in EasyBuild, we may not always be able to achieve that with the build system. Some codes can e.g. only be compiled for a single architecture, so in that case a `--cuda-compute-capabilities=8.0,9.0` is 'impossible' anyway. Or there might be codes (e.g. the CUDA-Samples) for which you pass the desired compute capabilities as an argument, and they then do something with it that is defined by their own build system (in the case of CUDA-Samples, several binaries are not compiled with the capabilities passed to the `SMS` argument of their `make` command). But those issues are not fundamentally different from e.g. a build system that doesn't properly respect an `optarch` setting: we try to patch them, and if we can't, we document the limitation in the EasyConfig/EasyBlock, and that's it.

Proposed solution
There are a couple of actions needed here:
- Clearly document `--cuda-compute-capabilities` to state what this should do. For example (and I think this IS what it should do): it will create device code (i.e. binary code for a real GPU architecture) for each requested compute capability, using the virtual architecture from that same compute capability. I.e. for `8.0,9.0` it will try to pass the flags `--generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90`. We will be able to check at least which device code is generated with a sanity check, as is being implemented in "add a CUDA device code sanity check" #4692 (unfortunately, we can't check for the virtual compute capability that was used to do so).
- Figure out how to inject the required flags. We can't do this the way we inject `CFLAGS` based on the `optarch` setting (which tends to work for pretty much any build system), as there are no NVIDIA-specific flags that are conventionally respected by build systems (i.e. there is no standardized `NVCCFLAGS` or something). And we can't add them to `CFLAGS`, as that would make regular C-compiler invocations fail, since they won't know those arguments. One way could be to make an `nvcc` compiler wrapper and inject the flags that way (a minimal sketch follows this list). This is probably the most robust and generic way to get these flags passed to all `nvcc` invocations. Note that the build system might pass contradictory flags, which may cause the wrapped flags to be ignored, or worse, conflict and cause failures. That's something we'll have to try, see how often it happens, and see if it can be dealt with (e.g. patch the build system to omit its own flags). Also note that something similar happens for CPU: many build systems pass `-O<something>` flags that sometimes overwrite what EB users set in `optarch`. The one advantage we have with compiler wrappers is that we can make sure our flags always come first, or last, whichever causes the build-system flags to be overwritten.
- Introduce a `--cuda-ptx-architectures` flag. PTX code will be included in the binary for these virtual architectures. E.g. if `--cuda-ptx-architectures=8.0`, EasyBuild will try to make sure CUDA code is compiled with `--generate-code=arch=compute_80,code=compute_80`.
- Add a `strict-ptx-sanity-check` option to make the sanity check also fail if too many virtual architectures are present.
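As a sketch of the wrapper idea from the second item (the path and variable names are hypothetical; the real implementation would generate the flag list from `--cuda-compute-capabilities` and `--cuda-ptx-architectures`):

```bash
#!/bin/bash
# Hypothetical nvcc wrapper: the real nvcc is assumed to have been moved aside.
# Inject the --generate-code flags derived from --cuda-compute-capabilities=8.0,9.0
# and --cuda-ptx-architectures=8.0 ahead of the build system's own arguments;
# flip the order if a given build system's flags should take precedence instead.
EB_NVCC=/opt/software/CUDA/bin/nvcc.orig
EB_GENCODE_FLAGS=(
  "--generate-code=arch=compute_80,code=sm_80"        # device code for CC 8.0
  "--generate-code=arch=compute_90,code=sm_90"        # device code for CC 9.0
  "--generate-code=arch=compute_80,code=compute_80"   # PTX for CC 8.0
)
exec "$EB_NVCC" "${EB_GENCODE_FLAGS[@]}" "$@"
```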