XPU multi-card support eager mode #47445
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open source project!
Force-pushed from 0cc9fd0 to 4b18929
const std::vector<phi::DenseTensor>& inputs);

// TODO(zhangxiaoci): XPU do not support event query for now
// bool IsCompleted();
Suggest emitting a warning here for now.
ok
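The suggestion above (warn instead of leaving `IsCompleted()` commented out) could look roughly like the sketch below. This is an illustrative stand-in, not the real Paddle code: `IsCompletedStub` and the warn-once mechanism are assumptions, and the conservative `false` return is one possible fallback.

```cpp
#include <atomic>
#include <iostream>

// Sketch: event query is unsupported on XPU for now, so instead of
// silently removing IsCompleted(), warn once and report "not completed"
// so callers fall back to a blocking Wait(). Names are illustrative.
inline bool IsCompletedStub() {
  static std::atomic<bool> warned{false};
  if (!warned.exchange(true)) {
    std::cerr << "WARNING: XPU does not support event query; "
                 "IsCompleted() always returns false." << std::endl;
  }
  return false;  // conservative fallback: callers must use Wait()
}
```

The warning fires only on the first call, so hot polling loops are not flooded with output.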
using Place = paddle::platform::Place;

// BKCL funcs use separate communication stream by default
class ProcessGroupBKCL : public ProcessGroup {
BKCL has a stream concept, so this should inherit from ProcessGroupStream.
ok
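The inheritance change agreed on above can be sketched as follows. This is a simplified stand-in for the real Paddle class hierarchy: the method signatures and the single-rank `AllReduce` body are assumptions made so the sketch is self-contained.

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for the real Paddle classes: a backend whose
// collectives run on a dedicated communication stream derives from a
// stream-aware base instead of the plain ProcessGroup.
class ProcessGroup {
 public:
  virtual ~ProcessGroup() = default;
  virtual std::string Backend() const = 0;
};

// Stream-aware layer: adds the notion of which stream collectives use.
class ProcessGroupStream : public ProcessGroup {
 public:
  // sync_op=false would mean "enqueue on the comm stream and return".
  virtual void AllReduce(std::vector<float>* data, bool sync_op) = 0;
};

class ProcessGroupBKCL : public ProcessGroupStream {
 public:
  std::string Backend() const override { return "BKCL"; }
  void AllReduce(std::vector<float>* data, bool sync_op) override {
    // Single-process stand-in: a real implementation would launch a
    // BKCL all_reduce on the communication stream here.
    (void)sync_op;
    (void)data;  // identity reduce for one rank
  }
};
```

With this layering, stream-specific plumbing (synchronous vs. asynchronous enqueue) lives in `ProcessGroupStream` and each backend only supplies the actual collective launch.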
const std::vector<phi::DenseTensor>& inputs);

std::shared_ptr<Store> store_;
std::shared_ptr<BKCLCommManager> BKCL_comm_;
Is this variable unused?
ok
std::unordered_map<std::string, std::vector<std::unique_ptr<XPUContext>>>
    places_to_ctx_;

std::set<int> used_place_ids_;
This one also seems to be unused.
Please check whether Split and Concat under collective/Utils.h also need corresponding changes.
LGTM overall
#include "paddle/fluid/platform/device/xpu/xpu_info.h"
#include "paddle/fluid/platform/device_context.h"
#include "paddle/fluid/platform/place.h"
#include "paddle/phi/common/place.h"
Including the platform header should be enough; the phi one can be removed.
ok
import numpy as np
import paddle.distributed as dist
import paddle.fluid as fluid
from paddle.fluid.dygraph.nn import Linear
Suggest switching to paddle.nn.Linear here; the fluid Linear is about to be removed.
ok
import numpy as np
import paddle.distributed as dist
import paddle.fluid as fluid
from paddle.fluid.dygraph.nn import Linear
Same as above.
    shape=[out_dim, in_dim], dtype="float32"
)

# just for test sync_params_buffers
Is this comment still useful? Should it be removed?
This block is needed; the comment was just left in by mistake.
It looks like SplitTensor needs to change; I don't see anywhere that ConcatTensor is called.
@chenwhql any suggestions here? Should the process group depend on the functor under phi or the one under fluid?
The phi one is better; the fluid one will be removed later. It is only a shell now anyway, which calls the phi one internally.
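What the Split/Concat helpers discussed above do can be sketched conceptually as follows. The real code operates on `phi::DenseTensor` on device memory via the phi concat/split functors; this self-contained host version on `std::vector<float>` only illustrates the partition-and-reassemble shape, and the function names are illustrative.

```cpp
#include <cassert>
#include <vector>

// Conceptual sketch of SplitTensor: partition a flat buffer into
// equal-sized per-rank chunks. Real code splits a phi::DenseTensor
// along an axis on the device.
std::vector<std::vector<float>> SplitTensorSketch(
    const std::vector<float>& t, int nranks) {
  assert(nranks > 0 && t.size() % nranks == 0);
  const std::size_t chunk = t.size() / nranks;
  std::vector<std::vector<float>> out;
  for (int r = 0; r < nranks; ++r) {
    out.emplace_back(t.begin() + r * chunk, t.begin() + (r + 1) * chunk);
  }
  return out;
}

// Conceptual sketch of ConcatTensor: reassemble per-rank chunks in
// rank order into one flat buffer.
std::vector<float> ConcatTensorSketch(
    const std::vector<std::vector<float>>& parts) {
  std::vector<float> out;
  for (const auto& p : parts) out.insert(out.end(), p.begin(), p.end());
  return out;
}
```

Concat is the exact inverse of Split, which is why a change to one usually forces a matching change to the other.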
LGTM
2. ProcessGroupBKCL inherits from ProcessGroupStream
LGTM
#include "paddle/phi/kernels/funcs/concat_and_split_functor.h"

#include "paddle/fluid/platform/device_context.h"
This one was missed; it should use the Context under phi. Suggest another PR to fix it.
OK
LGTM
LGTM
Your PR has been merged into the Paddle repository. Please watch for the subsequent test results.
if (barrier_) {
  // If we use the work to do barrier, we should block cpu
  platform::XPUDeviceGuard guard(place_.GetDeviceId());
  xpu_wait();
Doesn't xpu_wait need a stream here?
The goal here is to block the host, similar to cudaDeviceSynchronize. Another PR is needed later; exactly how to implement it still needs thought, and it should be resolved together with the other synchronization issues.
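The "pin the device, then block the host" pattern in the snippet above can be sketched like this. The XPU runtime calls are mocked with globals so the sketch is self-contained; the mock bodies, the global counters, and `BarrierBlockHost` are all assumptions, not the real runtime behavior.

```cpp
// Sketch of the barrier pattern: an RAII guard selects the right card,
// then a device-wide wait blocks the CPU until all queued work drains,
// analogous to cudaDeviceSynchronize. Runtime calls are mocked below.
static int g_current_device = 0;  // mock "current device" state
static int g_pending_work = 3;    // mock count of queued device work

inline void xpu_set_device(int id) { g_current_device = id; }  // mock
inline void xpu_wait() { g_pending_work = 0; }  // mock: drain all queues

class XPUDeviceGuard {
 public:
  explicit XPUDeviceGuard(int id) : prev_(g_current_device) {
    xpu_set_device(id);
  }
  ~XPUDeviceGuard() { xpu_set_device(prev_); }  // restore on scope exit
 private:
  int prev_;
};

void BarrierBlockHost(int device_id) {
  XPUDeviceGuard guard(device_id);  // ensure we wait on the right card
  xpu_wait();                       // device-wide sync: CPU blocks here
}
```

Because the guard restores the previous device in its destructor, the barrier cannot leak a device switch to the caller even if waiting throws.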
@@ -325,6 +376,17 @@ void EagerGroup::ConcatTensors(const platform::Place &place) {
        "Paddle can't concat grad tensors since it's not compiled with "
        "CUSTOM_DEVICE,"
        "Please recompile or reinstall Paddle with CUSTOM_DEVICE support."));
#endif
  } else if (platform::is_xpu_place(place)) {
#if defined(PADDLE_WITH_XPU_BKCL)
Can BKCL now be enabled by default, without going through the macro check?
It should be possible; I'll clean that up when there's time.
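The macro-guarded dispatch under discussion can be reduced to this shape. The `PADDLE_WITH_XPU_BKCL` name comes from the diff above; defining it unconditionally here and the `ConcatChosenBackend` helper are illustrative assumptions for the sketch.

```cpp
#include <string>

// Assume the build enables BKCL, mimicking "enabled by default".
#define PADDLE_WITH_XPU_BKCL

// Sketch of the place-based dispatch: today the XPU branch only exists
// when PADDLE_WITH_XPU_BKCL is defined; if BKCL were always compiled
// in, the #if/#else could be deleted along with the error branch.
std::string ConcatChosenBackend(bool is_xpu_place) {
  if (is_xpu_place) {
#if defined(PADDLE_WITH_XPU_BKCL)
    return "bkcl";   // XPU concat path compiled in
#else
    return "error";  // would raise: Paddle not compiled with BKCL
#endif
  }
  return "cpu";  // non-XPU fallback path
}
```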
    xdims_list,
    split_list,
    axis);
PADDLE_ENFORCE_EQ(
These can all later be changed to the PADDLE_ENFORCE_XDNN_xxxxx macros to save lines of code.
ok
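The idea behind the suggested macro family can be sketched as follows: a one-line check of an XDNN-style return code replaces a multi-line `PADDLE_ENFORCE_EQ`. The macro name, `XDNN_SUCCESS`, and `fake_split_op` are illustrative stand-ins, not the real Paddle definitions.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

constexpr int XDNN_SUCCESS = 0;  // illustrative success code

// Sketch of an XDNN-check macro: evaluate the call once, and on a
// nonzero return code throw with the stringified call and the code,
// replacing a hand-written multi-line PADDLE_ENFORCE_EQ at each site.
#define ENFORCE_XDNN_SUCCESS(call)                                     \
  do {                                                                 \
    int r__ = (call);                                                  \
    if (r__ != XDNN_SUCCESS) {                                         \
      std::ostringstream oss__;                                        \
      oss__ << "XDNN call `" << #call << "` failed with code " << r__; \
      throw std::runtime_error(oss__.str());                           \
    }                                                                  \
  } while (0)

// Illustrative XDNN-style function that returns an error code.
inline int fake_split_op(bool ok) { return ok ? XDNN_SUCCESS : -1; }
```

Each call site shrinks to `ENFORCE_XDNN_SUCCESS(fake_split_op(...));`, and the stringified expression in the error message identifies which kernel call failed.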
PR types: New features
PR changes: Others
Describe: XPU multi-card support eager mode dygraph