
Optimizing data parallel Fuse-Allreduce-Overlapping #48092

Merged · 58 commits merged into PaddlePaddle:develop on Nov 29, 2022

Conversation

@JZ-LIANG (Contributor) commented on Nov 17, 2022

PR types

Performance optimization

PR changes

Others

Describe

Update 1: Change the synchronization in DP overlapping from stream synchronization to event record/wait, which should reduce the synchronization overhead in scheduling.
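
A minimal CUDA sketch of the idea (an assumed, self-contained example, not Paddle's actual implementation; `backward_kernel` and `fused_allreduce_stub` are placeholder kernels): ordering between the compute and communication streams is enforced on the device with an event instead of blocking the host with `cudaStreamSynchronize`.

```cuda
// Hedged sketch: replace a host-side stream sync with a device-side
// event record/wait, so the CPU scheduling thread is never blocked.
#include <cuda_runtime.h>

// Stand-ins for a backward kernel and the fused allreduce.
__global__ void backward_kernel(float* grad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] = 1.0f;
}
__global__ void fused_allreduce_stub(float* grad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] *= 0.125f;   // pretend "average over 8 ranks"
}

int main() {
    const int n = 1 << 20;
    float* grad;
    cudaMalloc(&grad, n * sizeof(float));

    cudaStream_t compute_stream, comm_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&comm_stream);

    cudaEvent_t grad_ready;
    cudaEventCreateWithFlags(&grad_ready, cudaEventDisableTiming);

    backward_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(grad, n);

    // Old scheme: block the host until the compute stream drains, then
    // launch the communication. The scheduling thread stalls on every sync.
    // cudaStreamSynchronize(compute_stream);

    // New scheme: record an event on the compute stream and make the comm
    // stream wait for it on the device; the host keeps enqueueing work.
    cudaEventRecord(grad_ready, compute_stream);
    cudaStreamWaitEvent(comm_stream, grad_ready, 0);
    fused_allreduce_stub<<<(n + 255) / 256, 256, 0, comm_stream>>>(grad, n);

    cudaDeviceSynchronize();
    cudaFree(grad);
    return 0;
}
```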

Update 2: Improve the after-allreduce sync to allow fuller overlapping.
DP overlapping needs two synchronizations (sketched in the example after this list):

  1. before-allreduce sync: the allreduce communication must wait for the gradient computation to finish.
  2. after-allreduce sync: any later use of a gradient must wait for its allreduce to finish.

However, when overlapping is combined with fusing, the coalescence of gradients into a fused buffer removes the individual allreduced gradients' dependencies from the data-flow graph.
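
A rough illustration of the two synchronizations (again an assumed sketch with placeholder kernels, not the actual pass): the before-allreduce sync makes the communication stream wait for the produced gradients, and the after-allreduce sync makes later consumers wait for the communication. Because of coalescence, only the fused buffer is visible to the graph, so the framework has to work out which later kernels are the real consumers.

```cuda
// Hedged sketch: two gradients g0/g1 coalesced into one fused buffer,
// with the two synchronizations a fused DP allreduce needs.
#include <cuda_runtime.h>

__global__ void backward_g0(float* fused, int n) {      // writes fused[0 .. n)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fused[i] = 1.0f;
}
__global__ void backward_g1(float* fused, int n) {      // writes fused[n .. 2n)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fused[n + i] = 2.0f;
}
__global__ void fused_allreduce_stub(float* fused, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 2 * n) fused[i] *= 0.125f;
}
__global__ void optimizer_stub(float* fused, int n) {   // a later consumer of the grads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 2 * n) fused[i] -= 0.01f;
}

int main() {
    const int n = 1 << 20;
    float* fused;                         // g0 and g1 live side by side in one buffer
    cudaMalloc(&fused, 2 * n * sizeof(float));

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);
    cudaEvent_t grads_ready, allreduce_done;
    cudaEventCreateWithFlags(&grads_ready, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&allreduce_done, cudaEventDisableTiming);

    backward_g0<<<(n + 255) / 256, 256, 0, compute>>>(fused, n);
    backward_g1<<<(n + 255) / 256, 256, 0, compute>>>(fused, n);

    // 1. before-allreduce sync: comm waits until both gradients are produced.
    cudaEventRecord(grads_ready, compute);
    cudaStreamWaitEvent(comm, grads_ready, 0);
    fused_allreduce_stub<<<(2 * n + 255) / 256, 256, 0, comm>>>(fused, n);

    // 2. after-allreduce sync: anything that reads the gradients must wait
    //    for the allreduce. After coalescence the graph no longer names g0/g1
    //    individually, so their consumers must be re-derived by the framework.
    cudaEventRecord(allreduce_done, comm);
    cudaStreamWaitEvent(compute, allreduce_done, 0);
    optimizer_stub<<<(2 * n + 255) / 256, 256, 0, compute>>>(fused, n);

    cudaDeviceSynchronize();
    cudaFree(fused);
    return 0;
}
```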

The common solution is to perform the after-allreduce sync right after the fused allreduce (as was done before this PR, and as the Paddle Parallel Executor does). But this leads to insufficient overlapping:
The CPU timeline: there is a "wait" right after nccl-sumarray.
[screenshot: CPU timeline]

The GPU timeline: insufficient overlapping; the allreduce can only overlap with the SumArray computation, not with the LayerNorm backward.
[screenshot: GPU timeline]

In this PR, we resolve the exact data-flow dependencies after fusing. Instead of performing the after-allreduce sync immediately after the allreduce, we place that wait where it is actually needed (and as late as possible), which allows fuller overlapping, as sketched in the example below.
The CPU timeline: the after-allreduce sync is moved from right after the allreduce to where it is actually needed.
[screenshot: CPU timeline]

The GPU timeline: the allreduce now fully overlaps with the later LayerNorm backward kernel.
[screenshot: GPU timeline]
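
The placement difference can be sketched as follows (an assumed example with placeholder kernels such as `layernorm_backward_stub` and `optimizer_stub`; the before-allreduce sync is omitted to keep the focus on the wait placement): moving `cudaStreamWaitEvent` past the kernels that do not read the fused gradients lets the allreduce overlap with them.

```cuda
// Hedged sketch: where the after-allreduce wait is placed. layernorm_backward
// does NOT read the fused gradients, so with the wait pushed past it, the
// allreduce overlaps with layernorm_backward on the GPU timeline.
#include <cuda_runtime.h>

__global__ void layernorm_backward_stub(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 3.0f;           // independent of the fused gradients
}
__global__ void fused_allreduce_stub(float* grad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] *= 0.125f;
}
__global__ void optimizer_stub(float* grad, int n) {   // first real consumer of grads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] -= 0.01f;
}

int main() {
    const int n = 1 << 20;
    float *grad, *act;
    cudaMalloc(&grad, n * sizeof(float));
    cudaMalloc(&act, n * sizeof(float));

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);
    cudaEvent_t allreduce_done;
    cudaEventCreateWithFlags(&allreduce_done, cudaEventDisableTiming);

    fused_allreduce_stub<<<(n + 255) / 256, 256, 0, comm>>>(grad, n);
    cudaEventRecord(allreduce_done, comm);

    // Before this PR: the wait sits right after the allreduce, so even kernels
    // that never touch the fused gradients are serialized behind it.
    // cudaStreamWaitEvent(compute, allreduce_done, 0);

    // Overlaps with the allreduce, since it does not read the fused gradients.
    layernorm_backward_stub<<<(n + 255) / 256, 256, 0, compute>>>(act, n);

    // After this PR: the wait is moved as late as possible, right before the
    // first kernel that actually reads the allreduced gradients.
    cudaStreamWaitEvent(compute, allreduce_done, 0);
    optimizer_stub<<<(n + 255) / 256, 256, 0, compute>>>(grad, n);

    cudaDeviceSynchronize();
    cudaFree(grad);
    cudaFree(act);
    return 0;
}
```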

| GPT3-1.3B-dp8, gbz=64 | Mem (MB) | tokens/sec |
| --- | --- | --- |
| Before PR | 25014 | 88503 |
| After PR | 25252 | 95420 |

Throughput improves by about 7.8%, with roughly a 1% increase in memory.

To reproduce this performance, two other PRs are also needed:

  1. #48308: support exe ctx in Comm op
  2. #48454: disable redundant dependency and prior comm op in standalone exe

@JZ-LIANG JZ-LIANG changed the title [Auto Parallel-Optimization] Adapt Data Parallel for Graph executor [Auto Parallel-Optimization] Optimizing data parallel Fuse-Allreduce-Overlapping when uses Graph executor Nov 28, 2022
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel-Optimization] Optimizing data parallel Fuse-Allreduce-Overlapping when uses Graph executor [Auto Parallel Optimization] Optimizing data parallel Fuse-Allreduce-Overlapping Nov 28, 2022
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel Optimization] Optimizing data parallel Fuse-Allreduce-Overlapping [Auto Parallel Perf] Optimizing data parallel Fuse-Allreduce-Overlapping Nov 28, 2022
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel Perf] Optimizing data parallel Fuse-Allreduce-Overlapping [Auto Parallel Performance] Optimizing data parallel Fuse-Allreduce-Overlapping Nov 28, 2022
@aoyulong (Contributor) left a comment:

LGTM

@JZ-LIANG JZ-LIANG merged commit 23e5b25 into PaddlePaddle:develop Nov 29, 2022
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel Performance] Optimizing data parallel Fuse-Allreduce-Overlapping Optimizing data parallel Fuse-Allreduce-Overlapping Jun 25, 2024