[Paddle-ASP] Support sharding training for Nvidia's ASP (2:4 sparsity) functionality #37725

Merged: 11 commits merged into PaddlePaddle:develop on Jan 6, 2022

Conversation

@minghaoBD (Contributor) commented on Nov 30, 2021

PR types

Bug fixes

PR changes

Others

Describe

Nvidia has implemented 2:4 sparsity support in PaddlePaddle, including fleet distributed training. However, when training with the sharding strategy (the model-parallel paradigm in PaddlePaddle), GPU:0 always runs out of memory (OOM) while the other GPUs appear normal.

After this fix, developers should pass the argument sharding=True when calling sparsity.prune_model() under the sharding strategy. Otherwise the APIs behave exactly as before.
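
For context, a minimal usage sketch under the sharding strategy. The network, optimizer, and sharding_configs values are illustrative placeholders; only the new prune_model(..., sharding=True) flag comes from this PR:

```python
import paddle
from paddle.distributed import fleet
from paddle.static import sparsity

paddle.enable_static()
fleet.init(is_collective=True)

main_prog = paddle.static.Program()
startup_prog = paddle.static.Program()
with paddle.static.program_guard(main_prog, startup_prog):
    x = paddle.static.data(name='x', shape=[None, 128], dtype='float32')
    y = paddle.static.data(name='y', shape=[None, 32], dtype='float32')
    out = paddle.static.nn.fc(x, size=32)
    loss = paddle.mean(paddle.square(out - y))

    strategy = fleet.DistributedStrategy()
    strategy.sharding = True  # the model-parallel paradigm this PR targets
    strategy.sharding_configs = {"sharding_degree": 2, "segment_broadcast_MB": 32}

    optimizer = paddle.optimizer.Momentum(learning_rate=0.01, momentum=0.9)
    optimizer = sparsity.decorate(optimizer)  # attach 2:4 masking to updates
    optimizer = fleet.distributed_optimizer(optimizer, strategy)
    optimizer.minimize(loss)

exe = paddle.static.Executor()
exe.run(startup_prog)

# The flag added by this PR: under sharding, each rank prunes on its own
# device instead of every process defaulting to GPU:0 (the OOM cause above).
sparsity.prune_model(main_prog, sharding=True)
```

Launched with python -m paddle.distributed.launch as usual; without sharding, the call stays sparsity.prune_model(main_prog) and nothing changes.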

@paddle-bot-old (bot) commented on Nov 30, 2021

✅ This PR's description meets the template requirements!
Please wait for other CI results.

@paddle-bot-old (bot) commented

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@qingqing01 (Contributor) left a comment

Please add UT

```diff
@@ -150,7 +155,8 @@ def prune_model(main_program=None,
                 n=2,
                 m=4,
                 mask_algo='mask_1d',
-                with_mask=True):
+                with_mask=True,
+                sharding=False):
```
Contributor (inline comment on the diff above):

Could we let users pass in a place directly? Would that be easier to understand?

Contributor Author (reply):

It is indeed easier to understand, but we would need to give users extra documentation. Moreover, even with that documentation, there is still a recurring risk of the place being set incorrectly, which leads to bugs. My take is that resolving the place inside prune_model makes the code more robust.
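
To illustrate the trade-off being discussed, a hypothetical sketch of resolving the device inside prune_model rather than taking a place argument (this is not the PR's actual code; FLAGS_selected_gpus is the env var set by paddle.distributed.launch):

```python
import os
import paddle

def _resolve_prune_place(sharding=False):
    # Hypothetical helper: derive the pruning device internally so callers
    # never pass a place. Under sharding, use the GPU the launcher assigned
    # to this rank; otherwise fall back to the current process's device.
    if sharding:
        gpu_id = int(os.environ.get("FLAGS_selected_gpus", "0"))
    else:
        gpu_id = paddle.distributed.ParallelEnv().dev_id
    return paddle.CUDAPlace(gpu_id)
```

Keeping this logic inside prune_model means a mis-set place cannot silently load every rank's masks onto GPU:0, which is the robustness argument above.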

@paddle-bot-old (bot) commented on Dec 9, 2021

Sorry to inform you that 13083b1's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@minghaoBD (Contributor Author) commented

> Please add UT

Added tests for optimizer compatibility and modified the prune_model API.
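
For reference, a minimal sketch of what an optimizer-compatibility test can look like (assumes a GPU build; the test name, network, and fetch logic are illustrative, not the PR's actual tests):

```python
import unittest
import numpy as np
import paddle
from paddle.static import sparsity

class TestASPOptimizerCompatibility(unittest.TestCase):
    def test_momentum(self):
        # Build a tiny static-graph model, decorate the optimizer with ASP,
        # prune, then run one step to confirm the masked update still works.
        paddle.enable_static()
        main, startup = paddle.static.Program(), paddle.static.Program()
        with paddle.static.program_guard(main, startup):
            x = paddle.static.data(name='x', shape=[None, 32], dtype='float32')
            y = paddle.static.data(name='y', shape=[None, 32], dtype='float32')
            out = paddle.static.nn.fc(x, size=32)
            loss = paddle.mean(paddle.square(out - y))
            opt = sparsity.decorate(
                paddle.optimizer.Momentum(learning_rate=0.01, momentum=0.9))
            opt.minimize(loss, startup_program=startup)
        exe = paddle.static.Executor(paddle.CUDAPlace(0))
        exe.run(startup)
        sparsity.prune_model(main)  # single-device path: sharding defaults to False
        feed = {'x': np.random.rand(4, 32).astype('float32'),
                'y': np.random.rand(4, 32).astype('float32')}
        exe.run(main, feed=feed, fetch_list=[loss])

if __name__ == '__main__':
    unittest.main()
```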

@wanghaoshuang (Contributor) left a comment

LGTM.

@wanghaoshuang (Contributor) commented

Please update the PR title to make it easier to search for and manage your work later.

@minghaoBD changed the title from "Asp sharding" to "[Paddle-ASP]Asp sharding" on Dec 31, 2021
@minghaoBD (Contributor Author) commented

> Please update the PR title to make it easier to search for and manage your work later.

Done, thanks

@JZ-LIANG (Contributor) left a comment

LGTM for sharding

@XiaoguangHu01 (Contributor) left a comment

LG API

@TCChenlong (Contributor) left a comment

LGTM

@wanghaoshuang merged commit aec6e8a into PaddlePaddle:develop on Jan 6, 2022
@minghaoBD changed the title from "[Paddle-ASP]Asp sharding" to "[Paddle-ASP] Support sharding training for Nvidia's ASP (2:4 sparsity) functionality" on Jan 6, 2022
@minghaoBD deleted the asp_sharding branch on Jan 6, 2022 at 03:30