Improve memory ordering of sync-free kernels #1344
Conversation
Kudos, SonarCloud Quality Gate passed!
I would like to have some small documentation and explanation inside the memory.cuh file to clarify things.
Also, I feel like the names of the functions aren't accurate, as they do different things depending on the architecture.
But I am a fan of these new functions!
Thanks for addressing all my comments.
LGTM!
}
}
__threadfence();
group::tiled_partition<subwarp_size>(group::this_thread_block()).sync();
does it need a warp sync here?
Yes, since only a single lane is waiting for the data here, we need to make sure the other threads wait here as well. It might be necessary to keep a threadfence here too, though IIRC syncwarp tends to do that implicitly; or at least all threads in the warp use the same cache, so any cache flush or similar done on one thread should affect all threads.
the other threads also have the same dependency here, right?
They inherit the dependency from lane 0 because they wait for it.
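As a rough, self-contained sketch of the pattern being discussed (the function name wait_until_ready and the flag convention are assumptions, not the PR's exact code): lane 0 spins on a flag published by another block, and the fence plus subwarp sync then lets every other lane observe the published data as well.

#include <cooperative_groups.h>

namespace cg = cooperative_groups;

template <int subwarp_size>
__device__ void wait_until_ready(const volatile int* flag)
{
    auto subwarp = cg::tiled_partition<subwarp_size>(cg::this_thread_block());
    if (subwarp.thread_rank() == 0) {
        while (*flag != 1) {
            // spin until the producer publishes its result
        }
    }
    __threadfence();
    subwarp.sync();  // the other lanes wait here and inherit the dependency
}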
    return __nvvm_get_smem_pointer(ptr);
#else
    uint32 smem_ptr;
    asm("{{ .reg .u64 smem_ptr; cvta.to.shared.u64 smem_ptr, %1; cvt.u32.u64 "
Maybe it is a stupid question: according to https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#incorrect-optimization, volatile ensures the asm is not deleted or moved. I think the location of this PTX does not affect anything, but what if it is deleted? Or can it only be deleted in combination with other optimizations, or when the asm has no output?
It can only be deleted if the optimizer manages to remove the dependency on the value the assembly computes. Since the following load/store is volatile, it cannot be optimized away.
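A minimal sketch of that relationship (the names and exact PTX here are illustrative, not necessarily the PR's code): the address-conversion asm is not volatile and could only be dropped if its result were unused, but its result feeds the volatile load that follows, so the compiler has to keep both.

__device__ __forceinline__ int load_shared_sketch(const int* ptr)
{
    unsigned smem_ptr;
    // not volatile: a pure address computation, removable only if its result is unused
    asm("{ .reg .u64 t; cvta.to.shared.u64 t, %1; cvt.u32.u64 %0, t; }"
        : "=r"(smem_ptr)
        : "l"(ptr));
    int result;
    // volatile: must not be removed or reordered, which keeps smem_ptr (and the asm above) alive
    asm volatile("ld.volatile.shared.s32 %0, [%1];"
                 : "=r"(result)
                 : "r"(smem_ptr)
                 : "memory");
    return result;
}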
#include "common/cuda_hip/components/memory.hpp.inc" | ||
|
||
|
||
__device__ __forceinline__ int32 load_relaxed_shared(const int32* ptr) |
The following code is repeated for every type and shared/non-shared combination with the corresponding PTX (except for complex). Maybe macros like LOAD_ACQUIRE(TYPE, PTX_TYPE) could generate both load_* and load_*_shared, which may require moving the CUDA_ARCH check out of this kind of macro.
Ah, you did that in Python. Isn't a macro enough for that?
I think a macro would be worse in terms of readability, since we are building strings out of many different components. Python allows us to at least give everything names in the template and parameter set (i.e. both where we define the macro and where we call it).
A macro would still give names to the parameters, although you would need to spell out every instantiation manually instead of generating them from a for loop.
The Python approach was an issue for me because of the weak link between the generated code and its source. I first reviewed this file and tried to figure out whether a combination was missing, and only then realized there is a separate Python file for it. There is no strong connection between the Python script and the generated code, especially since this is the final code and not an intermediate state.
Could you at least add a comment noting that the code is generated by the Python file?
What I mean is that in Python, you have named arguments, e.g. space(ptx_space_suffix=".shared", ...), while in preprocessor macros you only have the argument order, e.g. SPACE(.shared, ...), which is harder to maintain and read.
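For comparison, a hypothetical preprocessor-based version of one of these generators could look like the sketch below (the macro name, its parameters, and the PTX are assumptions, not the PR's actual approach); every argument is positional, and every combination has to be instantiated by hand, which is the readability concern raised above.

#define DECLARE_LOAD_RELAXED(_type, _ptx_type, _reg_constraint)              \
    __device__ __forceinline__ _type load_relaxed(const _type* ptr)          \
    {                                                                        \
        _type result;                                                        \
        asm volatile("ld.volatile.global." _ptx_type " %0, [%1];"            \
                     : "=" _reg_constraint(result)                           \
                     : "l"(ptr)                                              \
                     : "memory");                                            \
        return result;                                                       \
    }

// each combination is spelled out manually, with purely positional arguments
DECLARE_LOAD_RELAXED(int, "s32", "r")
DECLARE_LOAD_RELAXED(long long, "s64", "l")

Adding the _shared variants and the per-architecture PTX would add further positional parameters to each call site, which is where the named arguments of the Python template stay easier to read.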
#include "common/cuda_hip/components/memory.hpp.inc" | ||
|
||
|
||
__device__ __forceinline__ int32 load_relaxed_shared(const int32* ptr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
macro should still give name for the parameter although you need to specify all instantiation manually not from for loop.
The python was an issue for me about the generated code and source.
I first review this file and try to figure out whether there's a missing combination. Then figure out there's another python file for it. There's no strong connection between python and generated code especially when it is the final code not the intermediate state.
Could you at least add some comment about the code is generated by the python file?
- const-correctness
- add doc to generic-to-shared ptr conversion
- improve generation script readability

Co-authored-by: Marcel Koch <marcel.koch@kit.edu>
Co-authored-by: Thomas Grützmacher <thomas.gruetzmacher@kit.edu>
- update asm type annotations
- fix incorrect store

Co-authored-by: Yuhsiang M. Tsai <yhmtsai@gmail.com>
I'll go ahead and merge this already, since only the DPC++ and OpenMP pipelines are outstanding, and those files were unmodified. Then we can move the other PRs forward soon.
Kudos, SonarCloud Quality Gate passed!
This adds load_relaxed(_shared), load_acquire(_shared), store_relaxed(_shared), and store_release(_shared) functions to provide limited atomic load/store support on NVIDIA GPUs.
TODO
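As a hedged usage sketch (not part of the PR): one way these primitives could be combined in a sync-free producer/consumer step, assuming the load_acquire/store_release/load_relaxed/store_relaxed signatures described above and that the consuming block is resident while it spins.

__global__ void produce_consume(int* data, int* flag, int* out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        store_relaxed(data, 42);  // payload; relaxed ordering is sufficient
        store_release(flag, 1);   // publish the payload
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (load_acquire(flag) != 1) {
            // spin until the producer has published
        }
        *out = load_relaxed(data);  // guaranteed to observe 42
    }
}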