XPU multi-card support eager mode #47445
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open source project!
Force-pushed from 0cc9fd0 to 4b18929
const std::vector<phi::DenseTensor>& inputs);

// TODO(zhangxiaoci): XPU do not support event query for now
// bool IsCompleted();
Suggest emitting a warning here for now.
ok
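The suggestion above (warn instead of leaving `IsCompleted()` commented out) could look roughly like the sketch below. This is an illustrative stand-in, not the real Paddle code: `IsCompletedStub` and the warn-once mechanism are assumptions, and the conservative `false` return is one possible fallback.

```cpp
#include <atomic>
#include <iostream>

// Sketch: event query is unsupported on XPU for now, so instead of
// silently removing IsCompleted(), warn once and report "not completed"
// so callers fall back to a blocking Wait(). Names are illustrative.
inline bool IsCompletedStub() {
  static std::atomic<bool> warned{false};
  if (!warned.exchange(true)) {
    std::cerr << "WARNING: XPU does not support event query; "
                 "IsCompleted() always returns false." << std::endl;
  }
  return false;  // conservative fallback: callers must use Wait()
}
```

The warning fires only on the first call, so hot polling loops are not flooded with output.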
using Place = paddle::platform::Place;

// BKCL funcs use separate communication stream by default
class ProcessGroupBKCL : public ProcessGroup {
BKCL has a stream concept, so this should inherit from ProcessGroupStream.
ok
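The inheritance change agreed on above can be sketched as follows. This is a simplified stand-in for the real Paddle class hierarchy: the method signatures and the single-rank `AllReduce` body are assumptions made so the sketch is self-contained.

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for the real Paddle classes: a backend whose
// collectives run on a dedicated communication stream derives from a
// stream-aware base instead of the plain ProcessGroup.
class ProcessGroup {
 public:
  virtual ~ProcessGroup() = default;
  virtual std::string Backend() const = 0;
};

// Stream-aware layer: adds the notion of which stream collectives use.
class ProcessGroupStream : public ProcessGroup {
 public:
  // sync_op=false would mean "enqueue on the comm stream and return".
  virtual void AllReduce(std::vector<float>* data, bool sync_op) = 0;
};

class ProcessGroupBKCL : public ProcessGroupStream {
 public:
  std::string Backend() const override { return "BKCL"; }
  void AllReduce(std::vector<float>* data, bool sync_op) override {
    // Single-process stand-in: a real implementation would launch a
    // BKCL all_reduce on the communication stream here.
    (void)sync_op;
    (void)data;  // identity reduce for one rank
  }
};
```

With this layering, stream-specific plumbing (synchronous vs. asynchronous enqueue) lives in `ProcessGroupStream` and each backend only supplies the actual collective launch.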
const std::vector<phi::DenseTensor>& inputs);

std::shared_ptr<Store> store_;
std::shared_ptr<BKCLCommManager> BKCL_comm_;
Is this variable unused?
ok
std::unordered_map<std::string, std::vector<std::unique_ptr<XPUContext>>>
    places_to_ctx_;

std::set<int> used_place_ids_;
This one also seems to be unused.
Please check whether Split and Concat under collective/Utils.h also need corresponding changes.
LGTM overall
#include "paddle/fluid/platform/device/xpu/xpu_info.h"
#include "paddle/fluid/platform/device_context.h"
#include "paddle/fluid/platform/place.h"
#include "paddle/phi/common/place.h"
Including the platform header should be enough; the phi one can be removed.
ok
import numpy as np
import paddle.distributed as dist
import paddle.fluid as fluid
from paddle.fluid.dygraph.nn import Linear
Suggest switching to paddle.nn.Linear here; the fluid Linear is about to be removed.
ok
import numpy as np
import paddle.distributed as dist
import paddle.fluid as fluid
from paddle.fluid.dygraph.nn import Linear
Same as above.
    shape=[out_dim, in_dim], dtype="float32"
)

# just for test sync_params_buffers
Is this comment still useful? Should it be removed?
This block is needed; the comment was just left in by mistake.
It looks like SplitTensor needs to change; I don't see anywhere that ConcatTensor is called.
@chenwhql any suggestions here? Should the process group depend on the functor under phi or the one under fluid?
The phi one is better; the fluid one will be removed later. It is only a shell now anyway, which calls the phi one internally.
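What the Split/Concat helpers discussed above do can be sketched conceptually as follows. The real code operates on `phi::DenseTensor` on device memory via the phi concat/split functors; this self-contained host version on `std::vector<float>` only illustrates the partition-and-reassemble shape, and the function names are illustrative.

```cpp
#include <cassert>
#include <vector>

// Conceptual sketch of SplitTensor: partition a flat buffer into
// equal-sized per-rank chunks. Real code splits a phi::DenseTensor
// along an axis on the device.
std::vector<std::vector<float>> SplitTensorSketch(
    const std::vector<float>& t, int nranks) {
  assert(nranks > 0 && t.size() % nranks == 0);
  const std::size_t chunk = t.size() / nranks;
  std::vector<std::vector<float>> out;
  for (int r = 0; r < nranks; ++r) {
    out.emplace_back(t.begin() + r * chunk, t.begin() + (r + 1) * chunk);
  }
  return out;
}

// Conceptual sketch of ConcatTensor: reassemble per-rank chunks in
// rank order into one flat buffer.
std::vector<float> ConcatTensorSketch(
    const std::vector<std::vector<float>>& parts) {
  std::vector<float> out;
  for (const auto& p : parts) out.insert(out.end(), p.begin(), p.end());
  return out;
}
```

Concat is the exact inverse of Split, which is why a change to one usually forces a matching change to the other.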
LGTM
2. ProcessGroupBKCL inherits from ProcessGroupStream
LGTM
#include "paddle/phi/kernels/funcs/concat_and_split_functor.h"

#include "paddle/fluid/platform/device_context.h"
This one was missed; it should use the Context under phi. Suggest another PR to fix it.
OK
LGTM
LGTM
Your PR has been merged into the Paddle repository. Please watch for the subsequent test results.
if (barrier_) {
  // If we use the work to do barrier, we should block cpu
  platform::XPUDeviceGuard guard(place_.GetDeviceId());
  xpu_wait();
Doesn't xpu_wait need a stream here?
The goal here is to block the host, similar to cudaDeviceSynchronize. Another PR is needed later; exactly how to implement it still needs thought, and it should be resolved together with the other synchronization issues.
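The "pin the device, then block the host" pattern in the snippet above can be sketched like this. The XPU runtime calls are mocked with globals so the sketch is self-contained; the mock bodies, the global counters, and `BarrierBlockHost` are all assumptions, not the real runtime behavior.

```cpp
// Sketch of the barrier pattern: an RAII guard selects the right card,
// then a device-wide wait blocks the CPU until all queued work drains,
// analogous to cudaDeviceSynchronize. Runtime calls are mocked below.
static int g_current_device = 0;  // mock "current device" state
static int g_pending_work = 3;    // mock count of queued device work

inline void xpu_set_device(int id) { g_current_device = id; }  // mock
inline void xpu_wait() { g_pending_work = 0; }  // mock: drain all queues

class XPUDeviceGuard {
 public:
  explicit XPUDeviceGuard(int id) : prev_(g_current_device) {
    xpu_set_device(id);
  }
  ~XPUDeviceGuard() { xpu_set_device(prev_); }  // restore on scope exit
 private:
  int prev_;
};

void BarrierBlockHost(int device_id) {
  XPUDeviceGuard guard(device_id);  // ensure we wait on the right card
  xpu_wait();                       // device-wide sync: CPU blocks here
}
```

Because the guard restores the previous device in its destructor, the barrier cannot leak a device switch to the caller even if waiting throws.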
@@ -325,6 +376,17 @@ void EagerGroup::ConcatTensors(const platform::Place &place) {
        "Paddle can't concat grad tensors since it's not compiled with "
        "CUSTOM_DEVICE,"
        "Please recompile or reinstall Paddle with CUSTOM_DEVICE support."));
#endif
  } else if (platform::is_xpu_place(place)) {
#if defined(PADDLE_WITH_XPU_BKCL)
Can BKCL now be enabled by default, without going through the macro check?
It should be possible; I'll clean that up when there's time.
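The macro-guarded dispatch under discussion can be reduced to this shape. The `PADDLE_WITH_XPU_BKCL` name comes from the diff above; defining it unconditionally here and the `ConcatChosenBackend` helper are illustrative assumptions for the sketch.

```cpp
#include <string>

// Assume the build enables BKCL, mimicking "enabled by default".
#define PADDLE_WITH_XPU_BKCL

// Sketch of the place-based dispatch: today the XPU branch only exists
// when PADDLE_WITH_XPU_BKCL is defined; if BKCL were always compiled
// in, the #if/#else could be deleted along with the error branch.
std::string ConcatChosenBackend(bool is_xpu_place) {
  if (is_xpu_place) {
#if defined(PADDLE_WITH_XPU_BKCL)
    return "bkcl";   // XPU concat path compiled in
#else
    return "error";  // would raise: Paddle not compiled with BKCL
#endif
  }
  return "cpu";  // non-XPU fallback path
}
```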
    xdims_list,
    split_list,
    axis);
PADDLE_ENFORCE_EQ(
These can all later be changed to the PADDLE_ENFORCE_XDNN_xxxxx macros to save lines of code.
ok
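The idea behind the suggested macro family can be sketched as follows: a one-line check of an XDNN-style return code replaces a multi-line `PADDLE_ENFORCE_EQ`. The macro name, `XDNN_SUCCESS`, and `fake_split_op` are illustrative stand-ins, not the real Paddle definitions.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

constexpr int XDNN_SUCCESS = 0;  // illustrative success code

// Sketch of an XDNN-check macro: evaluate the call once, and on a
// nonzero return code throw with the stringified call and the code,
// replacing a hand-written multi-line PADDLE_ENFORCE_EQ at each site.
#define ENFORCE_XDNN_SUCCESS(call)                                     \
  do {                                                                 \
    int r__ = (call);                                                  \
    if (r__ != XDNN_SUCCESS) {                                         \
      std::ostringstream oss__;                                        \
      oss__ << "XDNN call `" << #call << "` failed with code " << r__; \
      throw std::runtime_error(oss__.str());                           \
    }                                                                  \
  } while (0)

// Illustrative XDNN-style function that returns an error code.
inline int fake_split_op(bool ok) { return ok ? XDNN_SUCCESS : -1; }
```

Each call site shrinks to `ENFORCE_XDNN_SUCCESS(fake_split_op(...));`, and the stringified expression in the error message identifies which kernel call failed.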
PR types: New features
PR changes: Others
Describe: XPU multi-card support eager mode dygraph