[NVPTX] bad binary since cuda-11.3 #54633
Comments
I do not think that's going to help you. The problem is that NVIDIA almost never fixes bugs in released CUDA versions, so if the bug is not fixed in CUDA-11.6.2, you will need to wait for a new CUDA release and hope that they fix it there. If you could create a small reproducer that could be used as a test one could compile and run, that would help me file a bug with NVIDIA and get it fixed. Ideally a pure CUDA source, without having to use OpenMP. A reasonably small PTX source which assembles to something obviously wrong at the SASS level would work, too.
@Artem-B thank you for the comment. Yes, as a workaround I'd like to explicitly specify ptxas from a different CUDA version, namely point to the 11.2 one while using 11.4 for all the rest. However, right now if clang recognizes 11.4, it outputs the assembly file with PTX version 7.4, and ptxas from 11.2 rejects it, saying 7.4 is not supported. So if I can force outputting the PTX assembly files with a 7.2 version label, I can implement a workaround for now: ptxas from 11.2 while using CUDA 11.4 for all the rest. I have not had much luck creating a small reproducer so far. Will try SASS.
You could try overriding the PTX version clang emits. If you use a newer ptxas, extra options should not be needed, but the produced cubin may not be compatible with the older CUDA runtime.
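A sketch of what that workaround might look like on the command line, assuming the NVPTX target feature `+ptx72` controls the emitted `.version` directive and that `--cuda-path`/`--ptxas-path` are available in this clang; flag spellings and plumbing may vary by clang version:

```sh
# Hedged sketch: emit PTX ISA 7.2 so the older ptxas from CUDA 11.2 accepts
# it, while the rest of the build stays on the newer toolkit. The exact way
# -Xclang flags reach the device compilation may vary by clang version.
clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
  -Xclang -target-feature -Xclang +ptx72 \
  --cuda-path=/usr/local/cuda-11.4 \
  --ptxas-path=/usr/local/cuda-11.2/bin/ptxas \
  app.cpp -o app
```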
Would you be able to reduce the reproducer? It would be great to figure out what triggers the issue and, if it's indeed a miscompile in ptxas, report it to NVIDIA so they can get it fixed.
This issue affects complex number types and structures. GradType, the type involved in the failing assignment, is basically a struct with 3 doubles.
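A minimal sketch of the failing pattern, with illustrative names borrowed from the discussion below (not the actual QMCPACK code, which is linked further down):

```cpp
#include <cstdio>

// Hypothetical stand-in for QMCPACK's GradType: a struct of 3 doubles.
struct GradType {
  double data[3];
};

// Sketch of the problem spot: copy a GradType produced by a reduction into
// a per-walker output array, then print it to observe the bad component.
void store_and_check(GradType* ratioGradRef_list_ptr, long iw,
                     const GradType& ratioGradRef_local) {
  ratioGradRef_list_ptr[iw] = ratioGradRef_local;  // the miscompiled copy
  std::printf("%g %g %g\n",
              ratioGradRef_list_ptr[iw].data[0],
              ratioGradRef_list_ptr[iw].data[1],
              ratioGradRef_list_ptr[iw].data[2]);  // one component prints as 0
}
```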
And, as originally posted, the assignment succeeds with CUDA 11.2.2 and fails with any later version of CUDA. The issue can be seen by adding printf's immediately after the assignment: the miscompiled component prints as zero.
The type of the index in the outer loop over nw seems to matter: with a 32-bit index type the code is correct, and with a 64-bit index type it is not.
The PTX looks fine, but the incorrect code is visible in the SASS. One oddity is the load of the index variable to do the offset computation for ratioGradRef_list_ptr. With a 32-bit index variable, it does a 32-bit load. With a 64-bit index variable, it does a 128-bit load (???). In some versions of CUDA, a 64-bit load here would result in correct behavior because it wouldn't overwrite a certain register, but I think that might be accidental (at least assuming the semantics of SASS based on mnemonic name and structure). The PTX from clang for the 'good' case (32-bit index variable):
For the 'bad' case (64-bit index variable), the corresponding PTX for the index variable:
Now, the SASS for the 'bad' case (64-bit index variable), using CUDA 11.6:
Notice at the end, there are 2 64-bit loads into R8 and R10, but 3 64-bit stores from R6, R8, and R10. The load into R6 is missing. Now the SASS for the 'good' case (32-bit index variable), using nvdisasm from CUDA 11.6:
At the end, there is a 64-bit load into R6 and a 128-bit load into R8 and R10, and 3 64-bit stores from R6, R8, and R10.
The function where the faulty assignment-after-reduction occurs starts here: https://github.com/QMCPACK/qmcpack/blob/bb8c8872ff710260c28a1937ceefb7d3e06737ca/src/QMCWaveFunctions/Fermion/MultiDiracDeterminant.2.cpp#L663
This may be a ptxas bug. The SASS snippets appear to do a bit more than the PTX snippets. Can you post complete PTX and SASS disassembly for the function for good/bad cases on gist.github.com? Maybe LLVM IR as well; it makes it easier to reproduce the issue if it turns out to be on the LLVM side. Just in case, which clang/LLVM version are you using?
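For reference, one plausible way to collect those artifacts (a sketch; the exact names of the `--save-temps` output files vary by clang version):

```sh
# --save-temps keeps the intermediate device files: LLVM IR (.ll/.bc),
# PTX (.s), and the cubin. File names below are illustrative.
clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --save-temps -c src.cpp
# Disassemble the device cubin to SASS:
nvdisasm src-openmp-nvptx64-nvidia-cuda.cubin > src.sass
# Or dump SASS from a linked binary:
cuobjdump -sass app > app.sass
```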
These code fragments used clang version 18.0.0 (https://github.com/llvm/llvm-project.git 80c01dd)
PTX and SASS for the function MultiDiracDeterminant::mw_evaluateDetsAndGradsForPtclMove: https://gist.github.com/markdewing/12143bb6679c977a5191280fc909f31e (look for the variable ratioGradRef_local). These use CUDA 12.2 and clang version 18.0.0 (https://github.com/llvm/llvm-project.git 6f5b372)
Additional data point: A version of the code that uses floats instead of doubles has a similar issue. Assigning the first two values (64 bits worth of data) after the reduction works correctly, but assigning the third value in the array does not (the value ends up as zero).
A few more things to try:
On a side note, the generated PTX appears to use local stores. It's possible that with more aggressive optimization settings (specifically an increased loop-unroll threshold) the function control flow could be simplified, which may avoid triggering the problem in ptxas; see the sketch below.
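A hedged sketch of one way to try that, assuming `-mllvm` options are forwarded to the device compilation (the plumbing may differ by clang version):

```sh
# Raise LLVM's loop-unroll threshold in the hope that simpler control flow
# in the generated PTX avoids triggering the ptxas bug.
clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
  -O3 -mllvm -unroll-threshold=1000 src.cpp -o app
```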
The issue appears to be fixed in CUDA 12.3. My current guess is that it is a bug in combining loads: two consecutive 64-bit loads get combined into a single 128-bit load, but the pass picks the wrong variable or register. When the index variable is the same size as the underlying data type of the other variable, both are in the same bucket of loads considered for combining, and the optimization pass can pick the wrong one. If the index variable is not the same size, it is no longer in the same bucket, and the pass can't pick the wrong one. I did create a reproducer; the essential feature involved the type of the loop index variable.
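A sketch of the shape such a reproducer might take, in pure CUDA as requested earlier in the thread; this is illustrative only, not the reproducer that was actually filed:

```cpp
struct GradType { double data[3]; };

// Hypothetical trigger for the suspected load-combining bug: a 64-bit loop
// index (same width as the doubles being loaded) plus a three-double copy
// after a reduction-style loop.
__global__ void reduce_and_store(GradType* out, const double* in, long n) {
  GradType local = {{0.0, 0.0, 0.0}};
  for (long iw = 0; iw < n; ++iw) {  // 64-bit index: the 'bad' case
    local.data[0] += in[3 * iw + 0];
    local.data[1] += in[3 * iw + 1];
    local.data[2] += in[3 * iw + 2];
  }
  out[blockIdx.x] = local;  // the copy whose loads get miscombined
}
```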
Thank you for figuring it out. I wish NVIDIA would provide more public info about known issues and which ones are fixed in a particular release, so we don't stumble around in the dark debugging the issues they already know about.
My original issue has been resolved. Thanks to @markdewing for figuring out all the low-level stuff.
My application produces wrong numbers when using a CUDA >= 11.3 toolchain (checked up to 11.6).
Source and assembly code: badcubin.zip
Both sides of the failing assignment are std::complex. The bad binary causes the imaginary part of the left-hand side to come out as 0.
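A hypothetical minimal pattern matching that description (not the actual application code):

```cpp
#include <complex>
#include <cstdio>

// OpenMP target region assigning one std::complex<double> to another.
// With the bad cubin, the imaginary part of lhs comes back as 0.
int main() {
  std::complex<double> lhs(0.0, 0.0);
  const std::complex<double> rhs(1.0, 2.0);
  #pragma omp target map(tofrom : lhs) map(to : rhs)
  { lhs = rhs; }
  std::printf("%g %g\n", lhs.real(), lhs.imag());  // expect: 1 2
  return 0;
}
```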
Using --save-temps, I compared the assembly files from CUDA 11.2 and 11.3; they differ only by the PTX version directive. If I compile the whole application with the CUDA 11.3 toolchain, the test fails. Since my application uses OpenMP offload, the nvptx pass invokes ptxas; if I use ptxas from CUDA 11.2 to generate the cubin for the failing file while all the rest uses CUDA 11.3, my test passes.
So my guess is that the nvptx backend and ptxas (>= 7.3) have some incompatibility that causes the bad binary. I'll just leave my analysis here; hopefully backend experts will have more ideas.
Q: Is there a way to force clang to generate assembly files with a different PTX version? That way, combined with --ptxas-path, I can use an alternative ptxas while the rest remains with the primary CUDA toolkit I need to use.