I tried to use a weaker memory fence in the custom allreduce implementation. This is the code: 4cdb581
I will briefly explain the principle of my code; this is the program flow chart:
Step 3 in the purple block has an acquire pattern, and step 5 in the purple block has a release pattern.
The order 3->4->5 is guaranteed by acquire & release.
1 may be reordered into the 3-4-5 region (acquire does not stop earlier operations from sinking below it), but it must stay before 5 (release), so it does not affect the subsequent end-flag check.
6 may be reordered into the 3-4-5 region (release does not stop later operations from rising above it), but it must stay after 3 (acquire), so it does not affect the earlier start-flag check.
There is no fence between 2 and 3, but 2 must be globally visible before 3 can exit its loop (an implicit 2->3 ordering), and then 3(2)->4->5 become globally visible in order.
There is no fence between 5 and 7, but 5 must be globally visible before 7 can exit its loop (an implicit 5->7 ordering). Combined with the 4->5 constraint, this guarantees the data pull has completed before 7.
Report of performance regression
Baseline (no memory fence):
My Code:
My code seems to add 1~3us of latency, which comes from the memory fences.
Misc discussion on performance
I also tried this plan: #8410 (comment)
This is the performance:
I think we have to solve #8410 first, confirm whether the memory-order problem really exists, and then look at this implementation. @youkaichao @hanzhi713
from here: #8404
TensorRT-LLM uses acquire for every ld and release for every st. I tested it; it causes about 6us of latency.
https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu#L162