Improve memory ordering of sync-free kernels #1344
Conversation
Kudos, SonarCloud Quality Gate passed!
I would like to have some small documentation and explanation inside the memory.cuh file to clarify things.
Also, I feel like the names of the functions aren't accurate, as they do different things depending on the architecture.
But I am a fan of these new functions!
Thanks for addressing all my comments.
LGTM!
}
}
__threadfence();
group::tiled_partition<subwarp_size>(group::this_thread_block()).sync();
does it need a warp sync here?
Yes, since only a single lane is waiting for the data here, we need to make sure the other threads wait here as well. It might be necessary to keep a threadfence here too, though IIRC syncwarp tends to do that implicitly; or at least all threads in the warp use the same cache, so any cache flush or similar done on one thread should affect all threads.
the other threads also have the same dependency here, right?
They inherit the dependency from lane 0 because they wait for it.
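As a rough, self-contained sketch of the pattern being discussed (the function name wait_until_ready and the flag convention are assumptions, not the PR's exact code): lane 0 spins on a flag published by another block, and the fence plus subwarp sync then lets every other lane observe the published data as well.

#include <cooperative_groups.h>

namespace cg = cooperative_groups;

template <int subwarp_size>
__device__ void wait_until_ready(const volatile int* flag)
{
    auto subwarp = cg::tiled_partition<subwarp_size>(cg::this_thread_block());
    if (subwarp.thread_rank() == 0) {
        while (*flag != 1) {
            // spin until the producer publishes its result
        }
    }
    __threadfence();
    subwarp.sync();  // the other lanes wait here and inherit the dependency
}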
    return __nvvm_get_smem_pointer(ptr);
#else
    uint32 smem_ptr;
    asm("{{ .reg .u64 smem_ptr; cvta.to.shared.u64 smem_ptr, %1; cvt.u32.u64 "
Maybe it is a stupid question: according to https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#incorrect-optimization, volatile ensures the asm is not deleted or moved. I think the location of this PTX does not affect anything, but what if it is deleted? Or can it only be deleted in combination with other optimizations, or when the asm has no output?
It can only be deleted if the optimizer manages to remove the dependency on the value the assembly computes. Since the following load/store is volatile, it cannot be optimized away.
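A minimal sketch of that relationship (the names and exact PTX here are illustrative, not necessarily the PR's code): the address-conversion asm is not volatile and could only be dropped if its result were unused, but its result feeds the volatile load that follows, so the compiler has to keep both.

__device__ __forceinline__ int load_shared_sketch(const int* ptr)
{
    unsigned smem_ptr;
    // not volatile: a pure address computation, removable only if its result is unused
    asm("{ .reg .u64 t; cvta.to.shared.u64 t, %1; cvt.u32.u64 %0, t; }"
        : "=r"(smem_ptr)
        : "l"(ptr));
    int result;
    // volatile: must not be removed or reordered, which keeps smem_ptr (and the asm above) alive
    asm volatile("ld.volatile.shared.s32 %0, [%1];"
                 : "=r"(result)
                 : "r"(smem_ptr)
                 : "memory");
    return result;
}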
#include "common/cuda_hip/components/memory.hpp.inc" | ||
|
||
|
||
__device__ __forceinline__ int32 load_relaxed_shared(const int32* ptr) |
The following code is repeated for every type and shared/non-shared combination with the corresponding PTX (except for complex). Maybe macros like LOAD_ACQUIRE(TYPE, PTX_TYPE) could generate both load_* and load_*_shared, which may require moving the CUDA_ARCH check out of this kind of macro.
Ah, you did that in Python. Isn't a macro enough for that?
I think a macro would be worse in terms of readability, since we are building strings out of many different components. Python allows us to at least give everything names in the template and parameter set (i.e. both where we define the macro and where we call it).
A macro would still give names to the parameters, although you would need to spell out every instantiation manually instead of generating them from a for loop.
The Python approach was an issue for me because of the weak link between the generated code and its source. I first reviewed this file and tried to figure out whether a combination was missing, and only then realized there is a separate Python file for it. There is no strong connection between the Python script and the generated code, especially since this is the final code and not an intermediate state.
Could you at least add a comment noting that the code is generated by the Python file?
What I mean is that in Python, you have named arguments, e.g. space(ptx_space_suffix=".shared", ...), while in preprocessor macros you only have the argument order, e.g. SPACE(.shared, ...), which is harder to maintain and read.
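For comparison, a hypothetical preprocessor-based version of one of these generators could look like the sketch below (the macro name, its parameters, and the PTX are assumptions, not the PR's actual approach); every argument is positional, and every combination has to be instantiated by hand, which is the readability concern raised above.

#define DECLARE_LOAD_RELAXED(_type, _ptx_type, _reg_constraint)              \
    __device__ __forceinline__ _type load_relaxed(const _type* ptr)          \
    {                                                                        \
        _type result;                                                        \
        asm volatile("ld.volatile.global." _ptx_type " %0, [%1];"            \
                     : "=" _reg_constraint(result)                           \
                     : "l"(ptr)                                              \
                     : "memory");                                            \
        return result;                                                       \
    }

// each combination is spelled out manually, with purely positional arguments
DECLARE_LOAD_RELAXED(int, "s32", "r")
DECLARE_LOAD_RELAXED(long long, "s64", "l")

Adding the _shared variants and the per-architecture PTX would add further positional parameters to each call site, which is where the named arguments of the Python template stay easier to read.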
#include "common/cuda_hip/components/memory.hpp.inc" | ||
|
||
|
||
__device__ __forceinline__ int32 load_relaxed_shared(const int32* ptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
macro should still give name for the parameter although you need to specify all instantiation manually not from for loop.
The python was an issue for me about the generated code and source.
I first review this file and try to figure out whether there's a missing combination. Then figure out there's another python file for it. There's no strong connection between python and generated code especially when it is the final code not the intermediate state.
Could you at least add some comment about the code is generated by the python file?
- const-correctness
- add doc to generic-to-shared ptr conversion
- improve generation script readability

Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
Co-authored-by: Thomas Grützmacher <thomas.gruetzmacher@kit.edu>
- update asm type annotations
- fix incorrect store

Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
I'll go ahead and merge this already, since only the DPC++ and OpenMP pipelines are outstanding, and those files were unmodified. Then we can move the other PRs forward soon.
Kudos, SonarCloud Quality Gate passed!
This adds load_relaxed(_shared), load_acquire(_shared), store_relaxed(_shared), and store_release(_shared) functions to provide limited atomic load/store support on NVIDIA GPUs.
TODO
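As a hedged usage sketch (not part of the PR): one way these primitives could be combined in a sync-free producer/consumer step, assuming the load_acquire/store_release/load_relaxed/store_relaxed signatures described above and that the consuming block is resident while it spins.

__global__ void produce_consume(int* data, int* flag, int* out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        store_relaxed(data, 42);  // payload; relaxed ordering is sufficient
        store_release(flag, 1);   // publish the payload
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (load_acquire(flag) != 1) {
            // spin until the producer has published
        }
        *out = load_relaxed(data);  // guaranteed to observe 42
    }
}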