
[AutoParallel] Support pipeline parallelism backward non-computation clip. #58449

Merged

Conversation

GhostScreaming
Contributor

@GhostScreaming GhostScreaming commented Oct 27, 2023

PR types

New features

PR changes

Others

Description

Pcard-73145

Support computation clipping for non-computation ranks in the backward pass of pipeline parallelism. For the forward-pass support, see PR 58126; for constructing forward/backward with paddle::distributed::reshard, see PR 58238. The key point is special handling of uninitialized Tensors when building the backward graph.

  1. For the IsRunAutoParallel() case, skip the FillZeroForEmptyGradInput handling (a minimal sketch follows this list).
  2. SetGradInMeta handles the PP case specially.
  3. GradTensorHolder::add handles the PP case specially, so that edges between backward nodes are not left unconnected.
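
Below is a minimal, hypothetical Python sketch of the code-generation side of item 1, mirroring the eager_gen.py diff discussed later in this conversation. The helper name gen_fill_zero_str and its signature are illustrative assumptions; only the generated `if (!IsRunAutoParallel())` guard and the FillZeroForEmptyGradInput call come from the PR.

# Hypothetical helper (not the actual eager_gen.py code): build the generated C++
# snippet that zero-fills an empty grad input only when auto-parallel is not
# running, so non-computation pipeline ranks skip the fill entirely.
def gen_fill_zero_str(indent: str, fwd_position: int, is_plain_tensor: bool) -> str:
    if is_plain_tensor:
        call = (
            f"egr::EagerUtils::FillZeroForEmptyGradInput("
            f"&grads[{fwd_position}][0], input_metas[{fwd_position}][0]);"
        )
    else:
        call = (
            f"egr::EagerUtils::FillZeroForEmptyGradInput("
            f"&grads[{fwd_position}], input_metas[{fwd_position}]);"
        )
    # {{ and }} emit literal braces in the generated C++.
    return (
        f"{indent}if (!IsRunAutoParallel()) {{\n"
        f"{indent}{indent}{call}\n"
        f"{indent}}}\n"
    )

print(gen_fill_zero_str("  ", 0, True))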

@paddle-bot

paddle-bot bot commented Oct 27, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Oct 27, 2023
wanghuancoder
wanghuancoder previously approved these changes Oct 30, 2023
Comment on lines 2227 to 2232
fill_zero_str += f"{indent}if (!IsRunAutoParallel()) {{\n{indent}{indent}egr::EagerUtils::FillZeroForEmptyOptionalGradInput(&grads[{fwd_position}][0], input_metas[{fwd_position}][0]);\n{indent}}}"
else:
if IsPlainTensorType(ttype):
fill_zero_str += f"{indent}egr::EagerUtils::FillZeroForEmptyGradInput(&grads[{fwd_position}][0], input_metas[{fwd_position}][0]);\n"
fill_zero_str += f"{indent}if (!IsRunAutoParallel()) {{\n{indent}{indent}egr::EagerUtils::FillZeroForEmptyGradInput(&grads[{fwd_position}][0], input_metas[{fwd_position}][0]);\n{indent}}}"
else:
fill_zero_str += f"{indent}egr::EagerUtils::FillZeroForEmptyGradInput(&grads[{fwd_position}], input_metas[{fwd_position}]);\n"
fill_zero_str += f"{indent}if (!IsRunAutoParallel()) {{\n{indent}{indent}egr::EagerUtils::FillZeroForEmptyGradInput(&grads[{fwd_position}], input_metas[{fwd_position}]);\n{indent}}}"
Contributor

If this `if` is needed in every case, it could be implemented inside FillZeroForEmptyOptionalGradInput and FillZeroForEmptyGradInput instead. Putting it in gen.py will become harder and harder to maintain, and the generated code is also harder to read.

Contributor Author

The IsRunAutoParallel() method is bound to the GradNodeBase class, so keeping the condition in the nodes.cc code is a bit cleaner. I have adjusted the code-generation logic in eager_gen.py to make the generated code more readable, thx~

+            self.mp_loss, _, _ = self.run_dynamic(
+                PPDemoNet(self.w0, self.w1, self._pp_mesh0, self._pp_mesh1),
+                is_pp=True,
             self.mp_loss, self.mp_w0, self.mp_w1 = self.run_dynamic(
Contributor

Why is this mp_loss here?

# modify test_semi_auto_parallel_hybrid_strategy.py `setUp` function,
# just set num_of_devices=8, nnode =1 and _changeable_envs = {"backend": ["gpu"]}
# to test it.
# self.dp_mp_pp_demo_net()
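
For reference, a hypothetical sketch of the setUp change described in the comment above; only num_of_devices=8, nnode=1 and _changeable_envs = {"backend": ["gpu"]} come from the comment, while the class name, base class, and the skipped placeholder test are assumptions.

import unittest


# Hypothetical sketch; the real test_semi_auto_parallel_hybrid_strategy.py derives
# from Paddle's distributed test base class and launches worker processes.
class TestSemiAutoParallelHybridStrategy(unittest.TestCase):
    def setUp(self):
        # Settings named in the review comment: 8 devices on one node, GPU only.
        self.num_of_devices = 8
        self.nnode = 1
        self._changeable_envs = {"backend": ["gpu"]}

    def test_dp_mp_pp_demo_net(self):
        # Placeholder: the real case runs the dp-mp-pp demo net and needs 8 GPUs.
        self.skipTest("sketch only; requires 8 GPUs")


if __name__ == "__main__":
    unittest.main()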
Contributor

This unit test originally also ran in CPU simulation mode; will mixing the three strategies here fail on CI?

Contributor Author

At present, pp's send/recv does not support execution on CPU, so it would fail.

wanghuancoder
wanghuancoder previously approved these changes Oct 31, 2023
@GhostScreaming GhostScreaming merged commit c569297 into PaddlePaddle:develop Nov 2, 2023
XieYunshen pushed a commit that referenced this pull request Nov 2, 2023
GhostScreaming added a commit to GhostScreaming/Paddle that referenced this pull request Nov 2, 2023
GhostScreaming added a commit that referenced this pull request Nov 2, 2023
…clip. (#58609)

* [AutoParallel] Support paddle.distributed.reshard construct GradNode,
which is needed for pipeline parallel.

* Fix problem of CI, and fix pp testcase as review comments advising.

* Fix including files problem.

* Polish paddle.distributed.reshard implementation according to review comments.

* Fix some problems.

* Polish code.

* Fix problem of failed testcase.

* Move reshard function to tensor_utils.h, as files in phi/core is
not allowed to include files in phi/api.

* Add forgetting file.

* Fix some compilation problem.

* Remove useless PADDLE_WITH_DISTRIBUTE conditional compilation.

* Remove useless PADDLE_WITH_DISTRIBUTE conditional compilation.

* Fix problem of WITH_PYTHON=OFF compilation option.

* Fix bug of conditional compilation.

* [AutoParallel] Support pipeline parallel backward. Both pp single
strategy and dp-mp-pp hybrid strategy are verified. As CI machine
only has 2 cards and dp-mp-pp strategy needs 9 GPU cards, such case
will be added in testcase later.

* Polish pipeline parallel backward implementation.

* Remove useless modification.

* Add MLP dp-mp-pp hybrid strategy testcase, it can't be run on
CI Machine now as it needs 8 gpus.

* Remove useless modification.

* Fix problem of Tensor double free and polish code.

* Fix problem of ReshardOutputPartialAxisToReplicated.

* Revert "Revert "[AutoParallel] Support pipeline parallelism backward non-computation clip. (#58449)" (#58601)"

This reverts commit 79e24ec.
@paddle-bot paddle-bot bot removed the contributor External developers label Nov 3, 2023
zeroRains pushed a commit to zeroRains/Paddle that referenced this pull request Nov 8, 2023
…clip. (PaddlePaddle#58449)

zeroRains pushed a commit to zeroRains/Paddle that referenced this pull request Nov 8, 2023
zeroRains pushed a commit to zeroRains/Paddle that referenced this pull request Nov 8, 2023
…clip. (PaddlePaddle#58609)

danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
…clip. (PaddlePaddle#58449)

danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
…clip. (PaddlePaddle#58609)
