
Optimizing data parallel Fuse-Allreduce-Overlapping #48092

Merged · 58 commits merged into PaddlePaddle:develop on Nov 29, 2022

Conversation

@JZ-LIANG (Contributor) commented on Nov 17, 2022

PR types

Performance optimization

PR changes

Others

Describe

Update 1: Change the synchronization in DP overlapping from stream synchronization to event record/wait, which should reduce the synchronization overhead in scheduling.
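
A minimal CUDA sketch of the idea (an assumed, self-contained example, not Paddle's actual implementation; `backward_kernel` and `fused_allreduce_stub` are placeholder kernels): ordering between the compute and communication streams is enforced on the device with an event instead of blocking the host with `cudaStreamSynchronize`.

```cuda
// Hedged sketch: replace a host-side stream sync with a device-side
// event record/wait, so the CPU scheduling thread is never blocked.
#include <cuda_runtime.h>

// Stand-ins for a backward kernel and the fused allreduce.
__global__ void backward_kernel(float* grad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] = 1.0f;
}
__global__ void fused_allreduce_stub(float* grad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] *= 0.125f;   // pretend "average over 8 ranks"
}

int main() {
    const int n = 1 << 20;
    float* grad;
    cudaMalloc(&grad, n * sizeof(float));

    cudaStream_t compute_stream, comm_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&comm_stream);

    cudaEvent_t grad_ready;
    cudaEventCreateWithFlags(&grad_ready, cudaEventDisableTiming);

    backward_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(grad, n);

    // Old scheme: block the host until the compute stream drains, then
    // launch the communication. The scheduling thread stalls on every sync.
    // cudaStreamSynchronize(compute_stream);

    // New scheme: record an event on the compute stream and make the comm
    // stream wait for it on the device; the host keeps enqueueing work.
    cudaEventRecord(grad_ready, compute_stream);
    cudaStreamWaitEvent(comm_stream, grad_ready, 0);
    fused_allreduce_stub<<<(n + 255) / 256, 256, 0, comm_stream>>>(grad, n);

    cudaDeviceSynchronize();
    cudaFree(grad);
    return 0;
}
```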

Update 2: Improve the after-allreduce sync to allow fuller overlapping.
DP overlapping needs two synchronizations (sketched in the example after this list):

  1. before-allreduce sync: the allreduce communication must wait for the gradient computation to finish.
  2. after-allreduce sync: any later use of a gradient must wait for its allreduce to finish.

However, when overlapping is combined with fusing, the coalescence of gradients into a fused buffer removes the individual allreduced gradients' dependencies from the data-flow graph.
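
A rough illustration of the two synchronizations (again an assumed sketch with placeholder kernels, not the actual pass): the before-allreduce sync makes the communication stream wait for the produced gradients, and the after-allreduce sync makes later consumers wait for the communication. Because of coalescence, only the fused buffer is visible to the graph, so the framework has to work out which later kernels are the real consumers.

```cuda
// Hedged sketch: two gradients g0/g1 coalesced into one fused buffer,
// with the two synchronizations a fused DP allreduce needs.
#include <cuda_runtime.h>

__global__ void backward_g0(float* fused, int n) {      // writes fused[0 .. n)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fused[i] = 1.0f;
}
__global__ void backward_g1(float* fused, int n) {      // writes fused[n .. 2n)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fused[n + i] = 2.0f;
}
__global__ void fused_allreduce_stub(float* fused, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 2 * n) fused[i] *= 0.125f;
}
__global__ void optimizer_stub(float* fused, int n) {   // a later consumer of the grads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 2 * n) fused[i] -= 0.01f;
}

int main() {
    const int n = 1 << 20;
    float* fused;                         // g0 and g1 live side by side in one buffer
    cudaMalloc(&fused, 2 * n * sizeof(float));

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);
    cudaEvent_t grads_ready, allreduce_done;
    cudaEventCreateWithFlags(&grads_ready, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&allreduce_done, cudaEventDisableTiming);

    backward_g0<<<(n + 255) / 256, 256, 0, compute>>>(fused, n);
    backward_g1<<<(n + 255) / 256, 256, 0, compute>>>(fused, n);

    // 1. before-allreduce sync: comm waits until both gradients are produced.
    cudaEventRecord(grads_ready, compute);
    cudaStreamWaitEvent(comm, grads_ready, 0);
    fused_allreduce_stub<<<(2 * n + 255) / 256, 256, 0, comm>>>(fused, n);

    // 2. after-allreduce sync: anything that reads the gradients must wait
    //    for the allreduce. After coalescence the graph no longer names g0/g1
    //    individually, so their consumers must be re-derived by the framework.
    cudaEventRecord(allreduce_done, comm);
    cudaStreamWaitEvent(compute, allreduce_done, 0);
    optimizer_stub<<<(2 * n + 255) / 256, 256, 0, compute>>>(fused, n);

    cudaDeviceSynchronize();
    cudaFree(fused);
    return 0;
}
```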

The common solution is to perform the after-allreduce sync right after the fused allreduce (as was done before this PR, and as the Paddle Parallel Executor does). But this leads to insufficient overlapping:
The CPU timeline: there is a "wait" right after nccl-sumarray.
[screenshot: CPU timeline]

The GPU timeline: insufficient overlapping; the allreduce can only overlap with the SumArray computation, not with the LayerNorm backward.
[screenshot: GPU timeline]

In this PR, we resolve the exact data-flow dependencies after fusing. Instead of performing the after-allreduce sync immediately after the allreduce, we place that wait where it is actually needed (and as late as possible), which allows fuller overlapping, as sketched in the example below.
The CPU timeline: the after-allreduce sync is moved from right after the allreduce to where it is actually needed.
[screenshot: CPU timeline]

The GPU timeline: the allreduce now fully overlaps with the later LayerNorm backward kernel.
[screenshot: GPU timeline]
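
The placement difference can be sketched as follows (an assumed example with placeholder kernels such as `layernorm_backward_stub` and `optimizer_stub`; the before-allreduce sync is omitted to keep the focus on the wait placement): moving `cudaStreamWaitEvent` past the kernels that do not read the fused gradients lets the allreduce overlap with them.

```cuda
// Hedged sketch: where the after-allreduce wait is placed. layernorm_backward
// does NOT read the fused gradients, so with the wait pushed past it, the
// allreduce overlaps with layernorm_backward on the GPU timeline.
#include <cuda_runtime.h>

__global__ void layernorm_backward_stub(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 3.0f;           // independent of the fused gradients
}
__global__ void fused_allreduce_stub(float* grad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] *= 0.125f;
}
__global__ void optimizer_stub(float* grad, int n) {   // first real consumer of grads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] -= 0.01f;
}

int main() {
    const int n = 1 << 20;
    float *grad, *act;
    cudaMalloc(&grad, n * sizeof(float));
    cudaMalloc(&act, n * sizeof(float));

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);
    cudaEvent_t allreduce_done;
    cudaEventCreateWithFlags(&allreduce_done, cudaEventDisableTiming);

    fused_allreduce_stub<<<(n + 255) / 256, 256, 0, comm>>>(grad, n);
    cudaEventRecord(allreduce_done, comm);

    // Before this PR: the wait sits right after the allreduce, so even kernels
    // that never touch the fused gradients are serialized behind it.
    // cudaStreamWaitEvent(compute, allreduce_done, 0);

    // Overlaps with the allreduce, since it does not read the fused gradients.
    layernorm_backward_stub<<<(n + 255) / 256, 256, 0, compute>>>(act, n);

    // After this PR: the wait is moved as late as possible, right before the
    // first kernel that actually reads the allreduced gradients.
    cudaStreamWaitEvent(compute, allreduce_done, 0);
    optimizer_stub<<<(n + 255) / 256, 256, 0, compute>>>(grad, n);

    cudaDeviceSynchronize();
    cudaFree(grad);
    cudaFree(act);
    return 0;
}
```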

| GPT3-1.3B-dp8, gbz=64 | Mem (MB) | tokens/sec |
| --- | --- | --- |
| Before PR | 25014 | 88503 |
| After PR | 25252 | 95420 |

Throughput improves by about 7.8%, with roughly a 1% increase in memory.

To reproduce this performance, two other PRs are also needed:

  1. #48308: support exe ctx in Comm op
  2. #48454: disable redundant dependency and prior comm op in standalone exe

@JZ-LIANG JZ-LIANG changed the title [Auto Parallel-Optimization] Adapt Data Parallel for Graph executor [Auto Parallel-Optimization] Optimizing data parallel Fuse-Allreduce-Overlapping when uses Graph executor Nov 28, 2022
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel-Optimization] Optimizing data parallel Fuse-Allreduce-Overlapping when uses Graph executor [Auto Parallel Optimization] Optimizing data parallel Fuse-Allreduce-Overlapping Nov 28, 2022
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel Optimization] Optimizing data parallel Fuse-Allreduce-Overlapping [Auto Parallel Perf] Optimizing data parallel Fuse-Allreduce-Overlapping Nov 28, 2022
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel Perf] Optimizing data parallel Fuse-Allreduce-Overlapping [Auto Parallel Performance] Optimizing data parallel Fuse-Allreduce-Overlapping Nov 28, 2022
@aoyulong (Contributor) left a comment:

LGTM

@JZ-LIANG JZ-LIANG merged commit 23e5b25 into PaddlePaddle:develop Nov 29, 2022
@JZ-LIANG JZ-LIANG changed the title [Auto Parallel Performance] Optimizing data parallel Fuse-Allreduce-Overlapping Optimizing data parallel Fuse-Allreduce-Overlapping Jun 25, 2024