[NewComm] No.10 compatible upgrade for distributed_fused_lamb op #57424
Conversation
The unit test currently has an issue; switch to a different initialization method. When using the new communication library, use …
@@ -270,7 +270,10 @@ def setUpClass(cls):
         paddle.enable_static()
         paddle.set_flags({'FLAGS_cudnn_deterministic': True})
         _clip_by_global_norm_using_mp_type(True)
-        fleet.init(role_maker=get_role_maker())
+        if os.environ.get("FLAGS_dynamic_static_unified_comm") == "1":
+            fleet.init(role_maker=get_role_maker())
Isn't the condition reversed? When FLAGS_dynamic_static_unified_comm = 1 is set, initialization should go through paddle.distributed.collective._init_parallel_env("nccl") instead.
done
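A minimal sketch of the corrected branch (an illustration, not the PR's exact code: get_role_maker() is the helper defined in the test file being patched, and paddle.distributed.collective._init_parallel_env is the internal helper named in the review above):

import os

import paddle.distributed.fleet as fleet
from paddle.distributed import collective

def init_comm():
    # New communication library: initialize through the collective path,
    # as the reviewer suggests, when the unified-comm flag is set.
    if os.environ.get("FLAGS_dynamic_static_unified_comm") == "1":
        collective._init_parallel_env("nccl")
    else:
        # Legacy path: keep the original fleet-based initialization.
        # get_role_maker() comes from the test file.
        fleet.init(role_maker=get_role_maker())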
@@ -228,5 +231,19 @@ void NCCLCommContext::GroupStart() {
 }

 void NCCLCommContext::GroupEnd() { NCCL_CHECK(phi::dynload::ncclGroupEnd()); }

+#if NCCL_VERSION_CODE >= 21100
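+// (Review note, not code from the PR.) ncclGroupStart()/ncclGroupEnd()
+// fuse the NCCL calls issued between them into a single launch; calls made
+// inside the group return immediately and only block once ncclGroupEnd()
+// is reached. NCCL_VERSION_CODE 21100 corresponds to NCCL 2.11.0, so the
+// guarded block below (elided in this hunk) presumably wraps an API that
+// first shipped in that release; the reduction-op APIs are documented at
+// https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/ops.html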
Add some comment here explaining what this function does; it is hard to understand the functionality from the name alone. You could attach the link https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/ops.html and fold the explanation of the op's functionality from there into the comment.
done
Force-pushed from 4f2badd to 8a29c88.
LGTM
LGTM
…ddlePaddle#57424) * [NewComm] No.10 compatible upgrade for distributed_fused_lamb op * fix
PR types
Others

PR changes
APIs

Description
Compatible upgrade for the distributed_fused_lamb op (#57102).