
XPU multi-card support eager mode #47445

Merged (10 commits) on Nov 10, 2022
Conversation

@XiaociZhang (Contributor) commented Oct 28, 2022

PR types

New features

PR changes

Others

Describe

Add eager mode (dygraph) support for XPU multi-card training.

@paddle-bot bot commented Oct 28, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@paddle-bot bot added the "contributor (External developers)" and "status: proposed" labels Oct 28, 2022
@XiaociZhang force-pushed the eager branch 5 times, most recently from 0cc9fd0 to 4b18929 on November 3, 2022 06:47
@XiaociZhang changed the title from "XPU support eager mode" to "XPU multi-card support eager mode" Nov 3, 2022
const std::vector<phi::DenseTensor>& inputs);

// TODO(zhangxiaoci): XPU does not support event query for now
// bool IsCompleted();
Contributor: I suggest emitting a warning here for now.

Author (XiaociZhang): ok
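A minimal sketch of the suggested fallback, assuming a hypothetical IsCompleted() method on the work object and glog-style LOG(WARNING), which Paddle uses elsewhere; the exact class, signature, and return policy are assumptions:

// Sketch only: event query is unavailable on XPU, so warn and fall back to a
// conservative answer instead of leaving the method unimplemented.
bool IsCompleted() {
  LOG(WARNING) << "XPU does not support event query; "
                  "IsCompleted() conservatively reports false.";
  return false;
}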

using Place = paddle::platform::Place;

// BKCL funcs use separate communication stream by default
class ProcessGroupBKCL : public ProcessGroup {
Contributor: BKCL has a stream concept, so this should inherit from ProcessGroupStream.

Author (XiaociZhang): ok
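A minimal sketch of the suggested change, assuming the ProcessGroupStream base class from the same collective directory; the constructor parameters shown are illustrative, not the exact signature:

// Sketch only: derive from ProcessGroupStream so BKCL collectives reuse the
// separate-communication-stream machinery instead of reimplementing it.
class ProcessGroupBKCL : public ProcessGroupStream {
 public:
  ProcessGroupBKCL(const std::shared_ptr<Store>& store,
                   int rank,
                   int size,
                   int gid)
      : ProcessGroupStream(rank, size, gid), store_(store) {}

 private:
  std::shared_ptr<Store> store_;
};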

const std::vector<phi::DenseTensor>& inputs);

std::shared_ptr<Store> store_;
std::shared_ptr<BKCLCommManager> BKCL_comm_;
Contributor: Is this variable unused?

Author (XiaociZhang): ok

std::unordered_map<std::string, std::vector<std::unique_ptr<XPUContext>>>
places_to_ctx_;

std::set<int> used_place_ids_;
Contributor: This one also looks unused.

@LiYuRio (Contributor) left a comment:

Please check whether Split and Concat in collective/Utils.h also need corresponding changes.

@chenwhql (Contributor) left a comment:

LGTM overall

#include "paddle/fluid/platform/device/xpu/xpu_info.h"
#include "paddle/fluid/platform/device_context.h"
#include "paddle/fluid/platform/place.h"
#include "paddle/phi/common/place.h"
Contributor: Including the platform header should be enough; the phi one can be removed.

Author (XiaociZhang): ok

import numpy as np
import paddle.distributed as dist
import paddle.fluid as fluid
from paddle.fluid.dygraph.nn import Linear
Contributor: I suggest switching this to paddle.nn.Linear; the fluid Linear is about to be removed.

Author (XiaociZhang): ok

import numpy as np
import paddle.distributed as dist
import paddle.fluid as fluid
from paddle.fluid.dygraph.nn import Linear
Contributor: Same as above.

shape=[out_dim, in_dim], dtype="float32"
)

# just for test sync_params_buffers
Contributor: Is this comment still needed, or should it be removed?

Author (XiaociZhang): This block is needed; I just forgot to delete the comment.

@gongweibao requested a review from LiYuRio November 4, 2022 08:54
@XiaociZhang (Contributor Author) commented Nov 4, 2022

(Replying to @LiYuRio's question about Split and Concat in collective/Utils.h.)

It looks like SplitTensor needs to change; I don't see anywhere that ConcatTensor is called.
By the way, for XPU it is enough to switch the call sites from distributed::SplitFunctor to operators::math::SplitFunctor. Do other platforms have a similar issue? Should we later unify on the functions under math?

@LiYuRio (Contributor) commented Nov 4, 2022

(Replying to the SplitFunctor discussion above.) @chenwhql, any suggestions here: should ProcessGroup depend on the functor under phi, or the one under fluid?

@chenwhql (Contributor) commented Nov 4, 2022

The one under phi is better. The fluid one will be removed later; it is now just a shell that calls the phi implementation internally.
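A minimal sketch of what the phi-based call site could look like, assuming the phi::funcs::SplitFunctor interface declared in paddle/phi/kernels/funcs/concat_and_split_functor.h; the helper name SplitDenseTensor and the axis choice are illustrative:

#include "paddle/phi/kernels/funcs/concat_and_split_functor.h"

// Sketch only: split `input` along axis 0 into pre-sized output shards using
// the phi functor rather than the fluid wrapper slated for removal.
template <typename DeviceContext, typename T>
void SplitDenseTensor(const DeviceContext& dev_ctx,
                      const phi::DenseTensor& input,
                      std::vector<phi::DenseTensor*>* outputs) {
  // The functor reads the shapes of `refs` to decide how to partition input.
  std::vector<const phi::DenseTensor*> refs(outputs->begin(), outputs->end());
  phi::funcs::SplitFunctor<DeviceContext, T> split_functor;
  split_functor(dev_ctx, input, refs, /*axis=*/0, outputs);
}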

@chenwhql previously approved these changes Nov 8, 2022

@chenwhql (Contributor) left a comment:

LGTM

@LiYuRio (Contributor) left a comment:

LGTM


#include "paddle/phi/kernels/funcs/concat_and_split_functor.h"

#include "paddle/fluid/platform/device_context.h"
@chenwhql (Contributor) commented Nov 9, 2022:

This one was missed; it should use the Context under PHI. I suggest another PR to fix it.

Author (XiaociZhang): OK
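A small sketch of the suggested follow-up, assuming phi::XPUContext is declared in paddle/phi/backends/xpu/xpu_context.h; whether the fluid header can be dropped outright depends on what else the file uses:

// Sketch only: depend on the phi context directly instead of the fluid one.
// Before:
#include "paddle/fluid/platform/device_context.h"
// After:
#include "paddle/phi/backends/xpu/xpu_context.h"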

@XiaoguangHu01 (Contributor) left a comment:

LGTM

@jzhang533 (Contributor) left a comment:

LGTM

@QingshuChen merged commit 3b91f8f into PaddlePaddle:develop Nov 10, 2022
@paddle-bot bot commented Nov 10, 2022

Your PR has been merged into the repository. An official integration test will be conducted later. Stay tuned.

if (barrier_) {
// If we use the work to do a barrier, we should block the CPU
platform::XPUDeviceGuard guard(place_.GetDeviceId());
xpu_wait();
Contributor: Doesn't xpu_wait need a stream here?

Author (XiaociZhang): The goal here is to block the host, similar to cudaDeviceSynchronize. A follow-up PR is needed; exactly how to implement it still needs a closer look, and it will be resolved together with the other synchronization issues.
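A short sketch of the distinction being raised, assuming the XPU runtime also offers an xpu_wait(stream) form alongside the no-argument call used above; whether the no-argument form synchronizes enough is exactly what the follow-up PR needs to settle (comm_ctx is an illustrative name):

// Sketch only: two synchronization scopes.
xpu_wait(comm_ctx->stream());  // block the host until one specific stream drains
xpu_wait();                    // no stream argument: waits on the default
                               // stream; used here to block the host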

@@ -325,6 +376,17 @@ void EagerGroup::ConcatTensors(const platform::Place &place) {
"Paddle can't concat grad tensors since it's not compiled with "
"CUSTOM_DEVICE,"
"Please recompile or reinstall Paddle with CUSTOM_DEVICE support."));
#endif
} else if (platform::is_xpu_place(place)) {
#if defined(PADDLE_WITH_XPU_BKCL)
Contributor: Can BKCL now be enabled by default, without the macro check?

Author (XiaociZhang): It should be possible; I will tidy this up later when I have time.

xdims_list,
split_list,
axis);
PADDLE_ENFORCE_EQ(
Contributor: Checks like this can later be changed to the PADDLE_ENFORCE_XDNN_xxxxx macros to save lines of code.

Author (XiaociZhang): ok
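A minimal before/after sketch, assuming Paddle's PADDLE_ENFORCE_XDNN_SUCCESS macro from paddle/fluid/platform/device/xpu/enforce_xpu.h; r stands for the XDNN return code checked above, and the error message text is illustrative:

// Before: manual status check on the XDNN return code r.
PADDLE_ENFORCE_EQ(
    r,
    XPU_SUCCESS,
    platform::errors::External("XPU split kernel failed, error code: %d", r));

// After: the dedicated macro performs the same check in one line.
PADDLE_ENFORCE_XDNN_SUCCESS(r, "split");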
