
[Reshard] Implement reshard from s to r with same process_mesh #56039

Merged (1 commit) on Aug 10, 2023

Conversation

@LiYuRio (Contributor) commented Aug 7, 2023

PR types

New features

PR changes

Others

Description

Pcard-73145

Support the state transition from Shard to Replicate, with the following requirements:

The input and output process_mesh is one-dimensional;
The input and output do not cross meshes (the process_mesh is unchanged);
The input's shard state is an even partition: the sharded dimension of the tensor is divisible by the number of processes in the corresponding group.

Take 4 GPUs as an example. Both input and output use a one-dimensional process_mesh, [0, 1, 2, 3]. The output is a two-dimensional tensor in Replicate state, out_dims_mapping = [-1, -1]; the input is a two-dimensional tensor in Shard state with local shape in_tensor_shape = [4, 8].

  • Shard dimension 0 of the input with dimension 0 of the process_mesh, in_dims_mapping = [0, -1]: each process ends up with a physical tensor of shape [16, 8].
  • Shard dimension 1 of the input with dimension 0 of the process_mesh, in_dims_mapping = [-1, 0]: each process ends up with a physical tensor of shape [4, 32]. (Not yet implemented.)
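The dim-0 case above can be sketched as a plain-Python simulation (a hypothetical illustration; `all_gather` here is a stand-in helper, not Paddle's API): each rank holds a [4, 8] shard, an all_gather delivers all shards to every rank, and concatenating them along dim 0 yields the replicated [16, 8] tensor.

```python
# Hypothetical simulation (not Paddle's implementation) of resharding
# Shard -> Replicate on a 1-D process mesh of 4 ranks.
# Each rank holds a local [4, 8] shard of a logical [16, 8] tensor
# sharded along dim 0 (in_dims_mapping = [0, -1]).

NUM_RANKS = 4
LOCAL_ROWS, COLS = 4, 8

# Build each rank's local shard: row r of rank k holds the value k*4 + r.
local_shards = [
    [[rank * LOCAL_ROWS + r] * COLS for r in range(LOCAL_ROWS)]
    for rank in range(NUM_RANKS)
]

def all_gather(shards):
    """Simulate all_gather: every rank receives every shard, in rank order."""
    return [list(shards) for _ in range(NUM_RANKS)]

gathered = all_gather(local_shards)

# For dim-0 sharding, concatenating the gathered shards along dim 0
# directly reconstructs the replicated [16, 8] tensor on every rank.
replicated = [
    [row for shard in per_rank for row in shard] for per_rank in gathered
]

assert all(len(t) == 16 and len(t[0]) == 8 for t in replicated)
```

Because all_gather orders its output by rank, no further rearrangement is needed in this case; that is exactly why the dim-1 case (below) needs an extra split-and-concat.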

TODO:

  • Add a static check that the sharded dimension of the tensor is divisible by the number of processes in the corresponding group.
  • Support sharding on a non-0 dimension of the input, which requires a split and a concat after the all_gather.
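The split-and-concat in the second TODO item can be sketched as follows (a hypothetical plain-Python simulation, not the planned implementation): all_gather concatenates received shards along dim 0, so when the input was sharded along dim 1, the gathered result must be split back into per-rank chunks and re-concatenated along dim 1.

```python
# Hypothetical sketch of the split-and-concat fixup after all_gather,
# for an input sharded along dim 1 (in_dims_mapping = [-1, 0]).
# 4 ranks each hold a local [4, 8] shard of a logical [4, 32] tensor.

NUM_RANKS = 4
ROWS, LOCAL_COLS = 4, 8

# Rank k's shard holds columns [k*8, (k+1)*8) of the logical tensor.
shards = [
    [[k * LOCAL_COLS + c for c in range(LOCAL_COLS)] for _ in range(ROWS)]
    for k in range(NUM_RANKS)
]

# all_gather concatenates along dim 0, giving a [16, 8] tensor on each rank:
gathered = [row for shard in shards for row in shard]

# Split back into NUM_RANKS chunks along dim 0 ...
chunks = [gathered[k * ROWS:(k + 1) * ROWS] for k in range(NUM_RANKS)]

# ... and concatenate the chunks along dim 1 to recover the [4, 32] tensor.
replicated = [
    sum((chunks[k][r] for k in range(NUM_RANKS)), [])
    for r in range(ROWS)
]

assert len(replicated) == 4 and len(replicated[0]) == 32
```

Without the split-and-concat, the rank would be left with a [16, 8] tensor whose rows are in the wrong place rather than the logical [4, 32] tensor.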

@paddle-bot (bot) commented Aug 7, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@LiYuRio LiYuRio force-pushed the dev_reshard branch 3 times, most recently from cfa951a to f16495d on August 8, 2023 08:30
@LiYuRio LiYuRio changed the title Implement reshard from s to r with same process_mesh [Reshard]Implement reshard from s to r with same process_mesh Aug 8, 2023
@LiYuRio LiYuRio changed the title [Reshard]Implement reshard from s to r with same process_mesh [Reshard] Implement reshard from s to r with same process_mesh Aug 8, 2023
@@ -109,6 +114,21 @@ std::string GetMasterEndpoint() {
return master_endpoint;
}

std::string GenUniqueCommKey(const std::vector<int64_t>& process_ids) {
Member commented:

Does this support only a one-dimensional process mesh? If a high-dimensional and a low-dimensional mesh share the same process id numbering, will the keys be the same?

@LiYuRio (Contributor, Author) replied:

This function is only responsible for turning the given vector of process_ids into a unique comm_key. There are two cases:

  1. If the input and output process_mesh are the same, the collective communication op can be called directly; just pass in the flattened process_ids.
  2. If the input and output process_mesh are different, communication groups must be created case by case before calling this function; these are usually point-to-point groups.

Whether the mesh is high-dimensional or low-dimensional, as long as the processes participating in the communication are the same, the key is the same.
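The behavior described in the reply can be sketched roughly like this (an illustrative guess in Python; the real GenUniqueCommKey in reshard_utils.cc is C++ and may build the key differently): derive the key deterministically from the participating process ids, so any group with the same members in the same order maps to the same key.

```python
# Hypothetical sketch of a unique-comm-key generator; the actual
# GenUniqueCommKey implementation may format the key differently.
def gen_unique_comm_key(process_ids):
    # The key depends only on which processes participate, so the same
    # flattened list of ranks always yields the same key, regardless of
    # the dimensionality of the mesh they came from.
    return "comm_" + "_".join(str(pid) for pid in process_ids)

# Same participants -> same key; different participants -> different key.
assert gen_unique_comm_key([0, 1, 2, 3]) == gen_unique_comm_key([0, 1, 2, 3])
assert gen_unique_comm_key([0, 1]) != gen_unique_comm_key([2, 3])
```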

(An outdated review thread on paddle/phi/core/distributed/auto_parallel/reshard_utils.cc was resolved.)
@ForFishes (Member) left a comment

LGTM

@chenwhql (Contributor) left a comment

LGTM

@LiYuRio LiYuRio merged commit 4569ae1 into PaddlePaddle:develop Aug 10, 2023
3 participants