[CUDA] LLVM optimization passes can ignore cuda memory barrier #1258
Summary of the problem after investigation: the problem is not reproduced for CUDA only because the store and load instructions take an addrspacecast constant expression as an argument rather than the global itself, and GlobalsAA is simply not taught to deal with this addrspacecast: https://godbolt.org/z/aVGixD

AR: the problem should be fixed in the LLVM project (https://github.com/llvm/llvm-project/tree/master/llvm), and as far as I understand @Naghasan is going to work on this. Preparing and committing a proper fix to https://github.com/llvm/llvm-project/tree/master/llvm can take some time, so a workaround was committed to intel/llvm to enable the hierarchical parallelism tests on the PTX backend.
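As a minimal sketch of the difference (typed-pointer syntax; the global name and types are illustrative, not taken from the actual reproducer), compare a direct access to a shared-memory global with an access through an addrspacecast constant expression:

```llvm
@wg_copy = internal addrspace(3) global i32 undef, align 4

define void @direct(i32 %v) {
  ; Direct use of the global: GlobalsAA can relate this access to @wg_copy.
  store i32 %v, i32 addrspace(3)* @wg_copy, align 4
  ret void
}

define void @through_cast(i32 %v) {
  ; Access through an addrspacecast constant expression: GlobalsAA does not
  ; look through the cast, so the access is not associated with @wg_copy,
  ; and the problematic hoist happens not to fire on the CUDA path.
  store i32 %v, i32* addrspacecast (i32 addrspace(3)* @wg_copy to i32*), align 4
  ret void
}
```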
We investigated this issue and explored possible solutions, but in the end came to the conclusion that the issue is not specific to the NVPTX backend and is a more general problem for synchronization intrinsics on hierarchically parallel architectures. We also found an LLVM mailing list thread which raised the same problem in the GVN pass with regular LLVM IR synchronization intrinsics: https://discourse.llvm.org/t/bug-gvn-memdep-bug-in-the-presence-of-intrinsics/59402 We raised this as a topic for discussion in the second LLVM GPU working group meeting (https://docs.google.com/document/d/1m_oSe1HwtWdQ2JUmMRTAVHbUS7Dv4MRsqptiYcgK6iI/edit#heading=h.xgjl2srtytjt), attended by @jchlanda. It was believed that this issue had already been fixed in the past, and that it should be addressed by barrier intrinsics having the nosync attribute and by the GVN pass recognising this. Johannes Doerfert said that he would open a ticket on the LLVM GitHub and might be able to look into it.
Tagging this issue as in progress by 3rd party, since it is being investigated by the upstream LLVM community; we will monitor this and report back on any progress.
Closing this issue, as it has been addressed by LLVM issue llvm/llvm-project#54851.
The LowerWGScope pass generates copies between private and shared memory. The logic is to share a private value from the leader work item with the other work items through shared memory. Example in pseudo code:
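The original pseudo-code snippet is not included here; as a rough illustration, the generated pattern could look like the following LLVM IR sketch (the names, the leader check, and the barrier intrinsic are assumptions for illustration, not the actual output of the pass):

```llvm
@wg_copy = internal addrspace(3) global i32 undef, align 4

declare void @llvm.nvvm.barrier0()
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define void @share_from_leader(i32 %private_val) {
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %is_leader = icmp eq i32 %tid, 0
  br i1 %is_leader, label %leader, label %join

leader:                                   ; only the leader stores its value
  store i32 %private_val, i32 addrspace(3)* @wg_copy, align 4
  br label %join

join:                                     ; barrier, then every work item loads
  call void @llvm.nvvm.barrier0()
  %shared_val = load i32, i32 addrspace(3)* @wg_copy, align 4
  ; ... %shared_val is used by all work items ...
  ret void
}
```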
The generated load/store operations are not supposed to be moved across the memory barrier, but barrier intrinsics like @llvm.nvvm.barrier0() are not handled specially by LLVM middle-end passes and are recognized only by the PTX backend. So, middle-end optimizations can perform code movement that results in the load happening before the store. For example, GVN can perform LoadPRE based on GlobalsAA:
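As a hedged illustration of the hazard (continuing the sketch above; this is not the attached before_gvn/after_gvn IR), if alias analysis reports that the barrier call does not modify @wg_copy, LoadPRE can hoist the load into the non-leader path, so it executes before the barrier and before the leader's store:

```llvm
@wg_copy = internal addrspace(3) global i32 undef, align 4

declare void @llvm.nvvm.barrier0()
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define void @share_from_leader(i32 %private_val) {
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %is_leader = icmp eq i32 %tid, 0
  br i1 %is_leader, label %leader, label %entry.join_crit_edge

entry.join_crit_edge:
  ; The load has been hoisted above the barrier for non-leader work items,
  ; so they may observe @wg_copy before the leader has stored to it.
  %shared_val.pre = load i32, i32 addrspace(3)* @wg_copy, align 4
  br label %join

leader:
  store i32 %private_val, i32 addrspace(3)* @wg_copy, align 4
  br label %join

join:
  %shared_val = phi i32 [ %shared_val.pre, %entry.join_crit_edge ], [ %private_val, %leader ]
  call void @llvm.nvvm.barrier0()
  ; ... %shared_val is used by all work items ...
  ret void
}
```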
It turns out that LLVM does not really have barrier intrinsics; it requires the "fence" instruction.
It looks like, for example, the barrier intrinsic llvm.nvvm.barrier0() is only recognized by the PTX backend. In the LLVM middle end, a call to it just looks like a regular LLVM intrinsic.
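For comparison (an illustrative sketch, not taken from the issue), the intrinsic is just an external call as far as the middle end is concerned, while fence is the generic IR construct whose ordering semantics middle-end passes do model:

```llvm
; To middle-end passes this is only an opaque call with whatever attributes
; its declaration carries; only the PTX backend lowers it to bar.sync.
declare void @llvm.nvvm.barrier0()

; The generic way to express memory ordering in LLVM IR is the fence
; instruction (the syncscope string shown here is illustrative).
define void @publish(i32 addrspace(3)* %p, i32 %v) {
  store i32 %v, i32 addrspace(3)* %p, align 4
  fence syncscope("workgroup") release
  ret void
}
```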
I have attached a real example.
Input IR:
before_gvn.txt
Output IR:
after_gvn.txt