Convert the model after a pp strategy change so that it can be warm-started #52927

Merged
merged 32 commits into from
Apr 26, 2023

Conversation

liuzhenhai93
Contributor

@liuzhenhai93 liuzhenhai93 commented Apr 14, 2023

PR types

Bug fixes

PR changes

Others

Description

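1. When a model is restored, the tensor names are not restored, so the association between parameters and the optimizer is lost. This requires a change to framework/io.py.
2. Convert the model so that it can be warm-started after the pp strategy is adjusted.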

@paddle-bot

paddle-bot bot commented Apr 14, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

Contributor

@FeixLiu FeixLiu left a comment

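A few comments and a few TODOs:

  • Please include some sample code in the PR, something like a "how to run".
  • In the PR description, explain why io.py was changed. A brief note on why the change is necessary is enough.
  • Add a two-GPU distributed unit test: a simple four-layer transformer, save the model with pp=2, load it with pp=vp=2, and make sure the whole flow works. Without a unit test, the CI coverage check will not pass anyway.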
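
To make the pp=2 to pp=vp=2 scenario concrete, the following is a minimal sketch of how four layers might be placed on two pipeline stages under each strategy. It assumes an interleaved (round-robin) assignment of virtual pipeline chunks to stages; `assign_layers` and `owner_of` are hypothetical helpers written for illustration and are not the converter added by this PR.

```python
def assign_layers(num_layers, pp, vp=1):
    """Map (stage, virtual_chunk) -> layer indices.

    Assumption: chunks are dealt to stages round-robin, as in an
    interleaved pipeline schedule. This is illustrative only.
    """
    num_chunks = pp * vp
    assert num_layers % num_chunks == 0, "layers must divide evenly"
    per_chunk = num_layers // num_chunks
    layout = {}
    for chunk_id in range(num_chunks):
        stage = chunk_id % pp        # which pipeline stage holds the chunk
        virtual = chunk_id // pp     # which virtual chunk on that stage
        start = chunk_id * per_chunk
        layout[(stage, virtual)] = list(range(start, start + per_chunk))
    return layout


def owner_of(layout):
    """Invert the layout: layer index -> (stage, virtual_chunk)."""
    return {layer: key for key, layers in layout.items() for layer in layers}


saved = owner_of(assign_layers(4, pp=2))         # strategy used when saving
loaded = owner_of(assign_layers(4, pp=2, vp=2))  # strategy used when loading

# Layers whose owning stage differs must be moved between checkpoints.
moved = [layer for layer in range(4) if saved[layer][0] != loaded[layer][0]]
print(moved)
```

Under these assumptions, the two middle layers change stage, which is why a checkpoint saved under one strategy cannot simply be loaded rank by rank under the other.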

@liuzhenhai93 liuzhenhai93 changed the title from "recover tensor name during model load" to "Convert the model after a pp strategy change so that it can be warm-started" Apr 18, 2023
@ZHUI
Collaborator

ZHUI commented Apr 20, 2023

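After this change, does it affect the saved names for all optimizers, or only PP?

By default, is everything saved under the dynamic-graph name? Will all of the momentum variables change?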

@ZHUI
Collaborator

ZHUI commented Apr 20, 2023

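@FeixLiu @YuanRisheng I would still suggest looking into tying the names stored by the optimizer to the dynamic-graph structure names, so the problem is solved at its root.

It would be worth surveying how competing frameworks handle this.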

@liuzhenhai93
Contributor Author

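After this change, does it affect the saved names for all optimizers, or only PP?

By default, is everything saved under the dynamic-graph name? Will all of the momentum variables change?

The issue is that the tensor's name is not restored on load, so the param -> opt association cannot be established.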
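
The failure mode described above can be sketched in plain Python. This is a toy model, not Paddle code: `FakeTensor`, `find_moment`, and the `_moment1_0` key suffix are assumptions made for illustration. It only shows why an optimizer state keyed by parameter name becomes unreachable when a loaded tensor receives a fresh auto-generated name.

```python
class FakeTensor:
    """Stand-in for a framework tensor that gets an auto-generated name."""

    _counter = 0

    def __init__(self, name=None):
        if name is None:
            # A freshly created tensor receives a new unique name.
            name = f"generated_tensor_{FakeTensor._counter}"
            FakeTensor._counter += 1
        self.name = name


def find_moment(param, opt_state):
    """Look up optimizer state by parameter name (hypothetical key format)."""
    return opt_state.get(f"{param.name}_moment1_0")


# State as it might appear in a saved checkpoint.
saved_param_name = "linear_0.w_0"
opt_state = {"linear_0.w_0_moment1_0": [0.1, 0.2]}

lost = FakeTensor()                           # loaded without restoring the name
restored = FakeTensor(name=saved_param_name)  # loaded with the saved name

print(find_moment(lost, opt_state))      # lookup misses: association is lost
print(find_moment(restored, opt_state))  # lookup hits: association is intact
```

In this sketch, restoring the saved name at load time is enough to make the lookup succeed again, which mirrors the motivation given for the io.py change.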

XieYunshen
XieYunshen previously approved these changes Apr 23, 2023
Contributor

@XieYunshen XieYunshen left a comment

LGTM
Approved for the unit test execution time setting.

sneaxiy
sneaxiy previously approved these changes Apr 23, 2023
FeixLiu
FeixLiu previously approved these changes Apr 24, 2023
Contributor

@FeixLiu FeixLiu left a comment

LGTM

XiaoguangHu01
XiaoguangHu01 previously approved these changes Apr 24, 2023
Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM

sneaxiy
sneaxiy previously approved these changes Apr 24, 2023
Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM

@sneaxiy sneaxiy merged commit 3650c4a into PaddlePaddle:develop Apr 26, 2023
@liuzhenhai93 liuzhenhai93 deleted the load_recover_tensor_name branch May 24, 2023 11:17
7 participants