[Performance]: Add weaker memory fence for custom allreduce #8457

Closed
HydraQYH opened this issue Sep 13, 2024 · 0 comments
HydraQYH commented Sep 13, 2024

Follow-up from #8404:

I tried using a weaker memory fence in the custom allreduce implementation. This is the code: 4cdb581

To briefly explain the principle of my code, here is a program flow chart:
[Flow chart of the program, steps 1-7; steps 3 and 5 are in the purple block]

Step 3 in the purple block has an acquire pattern; step 5 in the purple block has a release pattern (see the sketch after the list below).

  • The order 3->4->5 is guaranteed by the acquire & release.
  • 1 may be reordered in between 3/4/5 (because of the acquire), but it must stay before 5 (the release), so it does not affect the later end-flag check.
  • 6 may be reordered in between 3/4/5 (because of the release), but it must stay after 3 (the acquire), so it does not affect the earlier start-flag check.
  • There is no fence between 2 and 3, but 2 must be globally visible before 3 can exit its spin loop (an implicit 2->3 ordering), and then 3(2)->4->5 become globally visible in order.
  • There is no fence between 5 and 7, but 5 must be globally visible before 7 can exit its spin loop (an implicit 5->7 ordering). Combined with the 4->5(7) constraint, this guarantees that the data pull has already finished.
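A minimal sketch of the flag protocol described above, written with libcu++ system-scope atomics rather than the inline PTX in my commit; the function and flag names here are made up for illustration and are not the actual vLLM kernel. The point is that only the start-flag load (step 3) needs acquire and only the end-flag store (step 5) needs release, while the data pull (step 4) can stay as plain accesses:

```cuda
#include <cuda/atomic>
#include <cstdint>

// Hypothetical flag/data layout, for illustration only.
__device__ void pull_and_publish(const float* peer_data, float* dst, int n,
                                 uint32_t* peer_start_flag, uint32_t* my_end_flag,
                                 uint32_t round) {
  cuda::atomic_ref<uint32_t, cuda::thread_scope_system> start(*peer_start_flag);
  cuda::atomic_ref<uint32_t, cuda::thread_scope_system> end(*my_end_flag);

  // Step 3: acquire spin on the peer's start flag. The acquire prevents the
  // data reads in step 4 from being hoisted above this point.
  while (start.load(cuda::memory_order_acquire) != round) { }

  // Step 4: pull the peer's data (plain accesses, ordered only by the flag ops).
  for (int i = threadIdx.x; i < n; i += blockDim.x) dst[i] = peer_data[i];

  // Step 5: release store of the end flag. The release orders the step-4
  // accesses before the flag becomes visible to peers.
  end.store(round, cuda::memory_order_release);
}
```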

Report of performance regression

Baseline (no memory fence):
[benchmark screenshot]
My code:
[benchmark screenshot]
My code adds about 1~3 us of latency, which comes from the memory fences.

Misc discussion on performance

I also tried this plan: #8410 (comment)
This is the performance:
[benchmark screenshot]

I think we should first resolve #8410 and confirm whether the memory-ordering problem really exists, and then look at this implementation.
@youkaichao @hanzhi713

TensorRT-LLM uses acquire for every ld and release for every st. I tested it; it adds about 6 us of latency:
https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu#L162
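For contrast, this is roughly the shape of that per-access pattern via inline PTX (a rough sketch, not the actual TensorRT-LLM source; helper names are illustrative, and the acquire/release qualifiers require sm_70+ / PTX ISA 6.0+):

```cuda
#include <cstdint>

// Acquire load at system scope: later accesses by this thread cannot be
// reordered before this load.
__device__ __forceinline__ uint32_t ld_acquire_sys(uint32_t* addr) {
  uint32_t v;
  asm volatile("ld.acquire.sys.global.u32 %0, [%1];" : "=r"(v) : "l"(addr) : "memory");
  return v;
}

// Release store at system scope: earlier accesses by this thread become
// visible before this store is observed by a peer.
__device__ __forceinline__ void st_release_sys(uint32_t* addr, uint32_t v) {
  asm volatile("st.release.sys.global.u32 [%0], %1;" ::"l"(addr), "r"(v) : "memory");
}
```

Applying these qualifiers to every load and store pays the ordering cost on each access, which is consistent with the ~6 us overhead above, whereas the proposal here fences only the start/end flags.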

HydraQYH added the performance (Performance-related issues) label Sep 13, 2024