[AutoParallel] Fix PHI API inplace output code generation. #59133
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
… fix_inplace_api
paddle/phi/api/lib/data_transform.cc
Outdated
if (ReshardIsNeeded(dist_tensor->dist_attr(), dist_attr[i])) {
  if (need_reshard) {
Is there actually a difference between these two? ReshardIsNeeded and need_reshard look like nearly identical names.
ReshardIsNeeded means a reshard is required when the input dist_tensor->dist_attr() and the input dist_attr[i] are inconsistent. need_reshard is an input parameter: at the PHI API layer it records whether the current API has SPMD rules. If it does, the Output does not need another reshard, because the Output DistAttr derived by InferSPMD is correct and the Output local tensor produced by the kernel already has the correct shape; only the Output's dist_attr needs to be set.
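For illustration, here is a minimal self-contained sketch of how the two checks compose; the types and the function signature are toy stand-ins, not the real PHI classes:

#include <iostream>
#include <string>
#include <vector>

// Toy stand-in for phi::distributed::TensorDistAttr.
struct DistAttr {
  std::string placement;  // e.g. "shard" or "replicated"
};

// A reshard is needed when the current and target dist_attrs disagree.
bool ReshardIsNeeded(const DistAttr& current, const DistAttr& target) {
  return current.placement != target.placement;
}

// need_reshard is decided one level up, in the generated PHI API code:
// true  -> the API has no SPMD rules, so the inplace output must really
//          be resharded back to its original dist_attr;
// false -> the API has SPMD rules, the kernel already produced a local
//          tensor of the correct shape, so only the dist_attr is set.
void SetInplaceOutputCorrectDistAttr(std::vector<DistAttr>& outputs,
                                     const std::vector<DistAttr>& target,
                                     bool need_reshard) {
  for (size_t i = 0; i < outputs.size(); ++i) {
    if (ReshardIsNeeded(outputs[i], target[i])) {
      if (need_reshard) {
        std::cout << "reshard output " << i << " to "
                  << target[i].placement << "\n";
      }
      outputs[i] = target[i];  // in this toy, both paths end here
    }
  }
}

int main() {
  std::vector<DistAttr> outs{{"replicated"}};
  SetInplaceOutputCorrectDistAttr(outs, {{"shard"}}, /*need_reshard=*/true);
  std::cout << outs[0].placement << "\n";  // prints "shard"
}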
I'll open a follow-up PR later to rename it.
Fixed, thx~
VLOG(6) << "SetInplaceOutputCorrectDistAttr input "
        << tensors[i].name() << " set its dist_attr from "
        << dist_tensor->dist_attr() << " to " << dist_attr[i];
dist_tensor->unsafe_set_dist_attr(dist_attr[i]);
In the inplace case, if the output's dist_attr is simply discarded, can the correctness of the result still be guaranteed?
Only APIs with SPMD rules reach this branch. Since the inplace output shares its dist_tensor with the input, we cannot assign the spmd_info result to the output earlier, otherwise resharding the input would go wrong. Setting the output's dist_attr is therefore deferred to the end, which plays a role similar to SetReplicatedDistAttrForOutput.
to use_general_spmd_rule.
… fix_inplace_api
LGTM
auto dist_t = std::make_shared<phi::distributed::DistTensor>(phi::DDim(),
                                                             dist_attr);
auto dist_t = std::make_shared<phi::distributed::DistTensor>(
    phi::DDim(), paddle::get<0>(dist_attr));
In principle this should go through the PADDLE_GET family of macros, or be wrapped in a try/catch; suggest fixing it in the next PR.
Sure, I'll fix that together in my next PR~
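For background, paddle::get throws on a type mismatch, and the suggestion is to route such accesses through the PADDLE_GET macros or an explicit try/catch so the failure is reported cleanly. A stand-alone illustration of the pattern, using std::variant in place of paddle's variant type:

#include <iostream>
#include <variant>

int main() {
  std::variant<int, double> v = 3.14;
  try {
    // Mirrors paddle::get<0>(dist_attr): throws if the active
    // alternative is not the one requested.
    int x = std::get<int>(v);
    std::cout << x << "\n";
  } catch (const std::bad_variant_access&) {
    std::cerr << "variant does not hold the requested type\n";
  }
  return 0;
}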
PR types
Bug fixes
PR changes
Others
Description
Pcard-73145
Corrects the handling of inplace outputs in the PHI API. Currently, operators without sharding inference (SPMD) rules reshard their inputs to the replicated state by default, so the resulting outputs are replicated as well. For an inplace output, this can leave the input modified into an unexpected replicated state. For example, in adamw_ the parameters are marked as shard at initialization (tensor-parallel sharding of a Linear layer's weights); after the first iteration the weights degrade to replicated and cannot be restored to their previous state. For this case, the fallback rule must reshard the inplace output back to its initial dist_attr. For inplace outputs that do have sharding inference rules, the SetKernelDistOutput function must not set the output's distributed attributes, because input and output share the same dist_tensor; the correct dist_attr is set at the very end, and the output does not need to be resharded.

Example code (adamw_, default fallback sharding rule):
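The original description attaches the generated code for this case; it is not reproduced in this text, so the following is only a hand-written toy sketch of the fallback behavior (all names and types are illustrative, not the generated PHI API source):

#include <iostream>
#include <string>

struct DistAttr { std::string placement; };
struct DistTensor { DistAttr dist_attr; };  // inplace in/out share this

// Toy fallback path for an API without SPMD rules, e.g. adamw_.
void adamw_fallback(DistTensor& param) {
  DistAttr origin = param.dist_attr;  // e.g. {"shard"}
  param.dist_attr = {"replicated"};   // fallback reshards inputs
  // ... kernel runs on the replicated local tensor ...
  // Fix in this PR: reshard the inplace output back to its original
  // dist_attr instead of leaving it replicated (need_reshard == true).
  param.dist_attr = origin;
}

int main() {
  DistTensor weight{{"shard"}};
  adamw_fallback(weight);
  std::cout << weight.dist_attr.placement << "\n";  // prints "shard"
}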
Example code (add_, with a sharding inference rule):
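Again in place of the originally attached generated code, a toy sketch of the SPMD-rule path (illustrative names only):

#include <iostream>
#include <string>

struct DistAttr { std::string placement; };
struct DistTensor { DistAttr dist_attr; };

// Toy SPMD-rule path for an inplace API, e.g. add_.
void add_with_spmd_rule(DistTensor& x) {
  // InferSPMD derives the output dist_attr from the rule.
  DistAttr inferred = x.dist_attr;  // elementwise: same as the input
  // SetKernelDistOutput must NOT set the output dist_attr here:
  // x serves as both input and output, so writing it early would make
  // the subsequent input reshard compare against the wrong dist_attr.
  // ... inputs resharded, kernel runs, local output shape is correct ...
  // Only at the very end is the output dist_attr set (no reshard):
  x.dist_attr = inferred;
}

int main() {
  DistTensor x{{"shard"}};
  add_with_spmd_rule(x);
  std::cout << x.dist_attr.placement << "\n";  // prints "shard"
}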