I tried to use a weaker memory fence in the custom allreduce implementation. This is the code: 4cdb581
I will briefly explain the principle of my code; this is the program flow chart:
Step 3 in the purple block has an acquire pattern, and step 5 in the purple block has a release pattern.
The order 3->4->5 is guaranteed by acquire & release.
1 may be reordered into the 3-4-5 region (acquire does not stop earlier operations from sinking below it), but it must stay before 5 (release), so it does not affect the subsequent end-flag check.
6 may be reordered into the 3-4-5 region (release does not stop later operations from rising above it), but it must stay after 3 (acquire), so it does not affect the earlier start-flag check.
There is no fence between 2 and 3, but 2 must be globally visible before 3 can exit its loop (an implicit 2->3 ordering), and then 3(2)->4->5 become globally visible in order.
There is no fence between 5 and 7, but 5 must be globally visible before 7 can exit its loop (an implicit 5->7 ordering). Combined with the 4->5 constraint, this guarantees the data pull has completed before 7.
Report of performance regression
Baseline (no memory fence):
My Code:
My code seems to add 1~3us of latency, which comes from the memory fences.
Misc discussion on performance
I also tried this plan: #8410 (comment)
This is the performance:
I think we have to solve #8410 first, confirm whether the memory-order problem really exists, and then look at this implementation. @youkaichao @hanzhi713
from here: #8404
TensorRT-LLM uses acquire for every ld and release for every st. I tested it; it causes about 6us of latency.
https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu#L162