[CUDA] LLVM optimization passes can ignore cuda memory barrier #1258
Summary of the problem after investigation: the problem is not reproduced for CUDA only because the store and load instructions take an addrspacecast constant expression as an argument rather than the global itself, and GlobalsAA is simply not taught to deal with this addrspacecast: https://godbolt.org/z/aVGixD

AR: the problem should be fixed in the LLVM project (https://github.com/llvm/llvm-project/tree/master/llvm), and as far as I understand @Naghasan is going to work on this. Preparing and committing a proper fix to https://github.com/llvm/llvm-project/tree/master/llvm can take some time, so a workaround was committed to intel/llvm to enable the hierarchical parallelism tests on the PTX backend.
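As a minimal sketch of the difference (typed-pointer syntax; the global name and types are illustrative, not taken from the actual reproducer), compare a direct access to a shared-memory global with an access through an addrspacecast constant expression:

```llvm
@wg_copy = internal addrspace(3) global i32 undef, align 4

define void @direct(i32 %v) {
  ; Direct use of the global: GlobalsAA can relate this access to @wg_copy.
  store i32 %v, i32 addrspace(3)* @wg_copy, align 4
  ret void
}

define void @through_cast(i32 %v) {
  ; Access through an addrspacecast constant expression: GlobalsAA does not
  ; look through the cast, so the access is not associated with @wg_copy,
  ; and the problematic hoist happens not to fire on the CUDA path.
  store i32 %v, i32* addrspacecast (i32 addrspace(3)* @wg_copy to i32*), align 4
  ret void
}
```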
We investigated this issue and explored possible solutions, but in the end came to the conclusion that the issue is not specific to the NVPTX backend and is a more general problem for synchronization intrinsics on hierarchically parallel architectures. We also found an LLVM mailing list thread which raised the same problem in the GVN pass with regular LLVM IR synchronization intrinsics: https://discourse.llvm.org/t/bug-gvn-memdep-bug-in-the-presence-of-intrinsics/59402 We raised this as a topic for discussion in the second LLVM GPU working group meeting (https://docs.google.com/document/d/1m_oSe1HwtWdQ2JUmMRTAVHbUS7Dv4MRsqptiYcgK6iI/edit#heading=h.xgjl2srtytjt), attended by @jchlanda. It was believed that this issue had already been fixed in the past, and that it should be addressed by barrier intrinsics having the nosync attribute and by the GVN pass recognising this. Johannes Doerfert said that he would open a ticket on the LLVM GitHub and might be able to look into it.
Tagging this issue as in progress by 3rd party, since it is being investigated by the upstream LLVM community; we will monitor this and report back on any progress.
Closing this issue, as it has been addressed by LLVM issue llvm/llvm-project#54851.
The LowerWGScope pass generates copies between private and shared memory. The logic is to share a private value from the leader work item with the other work items through shared memory. Example in pseudo code:
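The original pseudo-code snippet is not included here; as a rough illustration, the generated pattern could look like the following LLVM IR sketch (the names, the leader check, and the barrier intrinsic are assumptions for illustration, not the actual output of the pass):

```llvm
@wg_copy = internal addrspace(3) global i32 undef, align 4

declare void @llvm.nvvm.barrier0()
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define void @share_from_leader(i32 %private_val) {
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %is_leader = icmp eq i32 %tid, 0
  br i1 %is_leader, label %leader, label %join

leader:                                   ; only the leader stores its value
  store i32 %private_val, i32 addrspace(3)* @wg_copy, align 4
  br label %join

join:                                     ; barrier, then every work item loads
  call void @llvm.nvvm.barrier0()
  %shared_val = load i32, i32 addrspace(3)* @wg_copy, align 4
  ; ... %shared_val is used by all work items ...
  ret void
}
```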
The generated load/store operations are not supposed to be moved across the memory barrier, but barrier intrinsics like @llvm.nvvm.barrier0() are not handled specially by LLVM middle-end passes and are recognized only by the PTX backend. So, middle-end optimizations can perform code movement that results in the load happening before the store. For example, GVN can perform LoadPRE based on GlobalsAA:
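As a hedged illustration of the hazard (continuing the sketch above; this is not the attached before_gvn/after_gvn IR), if alias analysis reports that the barrier call does not modify @wg_copy, LoadPRE can hoist the load into the non-leader path, so it executes before the barrier and before the leader's store:

```llvm
@wg_copy = internal addrspace(3) global i32 undef, align 4

declare void @llvm.nvvm.barrier0()
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

define void @share_from_leader(i32 %private_val) {
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %is_leader = icmp eq i32 %tid, 0
  br i1 %is_leader, label %leader, label %entry.join_crit_edge

entry.join_crit_edge:
  ; The load has been hoisted above the barrier for non-leader work items,
  ; so they may observe @wg_copy before the leader has stored to it.
  %shared_val.pre = load i32, i32 addrspace(3)* @wg_copy, align 4
  br label %join

leader:
  store i32 %private_val, i32 addrspace(3)* @wg_copy, align 4
  br label %join

join:
  %shared_val = phi i32 [ %shared_val.pre, %entry.join_crit_edge ], [ %private_val, %leader ]
  call void @llvm.nvvm.barrier0()
  ; ... %shared_val is used by all work items ...
  ret void
}
```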
It turns out that LLVM does not really have barrier intrinsics; it requires the "fence" instruction.
It looks like, for example, the barrier intrinsic llvm.nvvm.barrier0() is only recognized by the PTX backend. In the LLVM middle end, a call to it just looks like a regular LLVM intrinsic.
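For comparison (an illustrative sketch, not taken from the issue), the intrinsic is just an external call as far as the middle end is concerned, while fence is the generic IR construct whose ordering semantics middle-end passes do model:

```llvm
; To middle-end passes this is only an opaque call with whatever attributes
; its declaration carries; only the PTX backend lowers it to bar.sync.
declare void @llvm.nvvm.barrier0()

; The generic way to express memory ordering in LLVM IR is the fence
; instruction (the syncscope string shown here is illustrative).
define void @publish(i32 addrspace(3)* %p, i32 %v) {
  store i32 %v, i32 addrspace(3)* %p, align 4
  fence syncscope("workgroup") release
  ret void
}
```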
I have attached a real example.
Input IR:
before_gvn.txt
Output IR:
after_gvn.txt