Convert models after a PP strategy change so they can be warm-started #52927
Your PR was submitted successfully. Thank you for contributing to the open-source project!
… load_recover_tensor_name
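A few comments and a few TODOs:
- Please include some sample code in the PR, something like a how-to-run.
- In the PR description, explain why io.py was changed, with a brief note on why the change is necessary.
- Add a two-GPU distributed unit test: a simple four-layer transformer, saved with pp=2 and loaded with pp=vp=2, to make sure the flow works end to end. Without a unit test, CI coverage will not pass either.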
… load_recover_tensor_name
… load_recover_tensor_name
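After this change, does it affect the saved names for all optimizers, or only PP? By default, is everything saved under the dynamic-graph name? Do all of the momentum variables change?
@FeixLiu @YuanRisheng I would still suggest looking at how the names stored by the optimizer relate to the dynamic-graph structured names, so the problem is solved at the root. It would be worth surveying how competing frameworks handle this.
The name of the tensor was not restored, so the param -> opt association could not be established.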
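The failure mode described above can be illustrated with a small framework-free sketch. It is not Paddle code: the `Param` class, the `opt_state_key` naming scheme (`<tensor name>_moment1_0`), and the helper names are all assumptions made for illustration. The point is only that optimizer state keyed by the runtime tensor name cannot be found again unless the saved name is written back onto the rebuilt parameter.

```python
from dataclasses import dataclass


@dataclass
class Param:
    # Auto-generated runtime tensor name (hypothetical, e.g. "linear_0.w_0").
    name: str


def opt_state_key(param, slot="moment1"):
    # Assumed key scheme: optimizer state is stored under the tensor name.
    return f"{param.name}_{slot}_0"


def match_opt_state(params, opt_state):
    """Return {structured_name: state} for every param whose state is found."""
    matched = {}
    for structured_name, param in params.items():
        key = opt_state_key(param)
        if key in opt_state:
            matched[structured_name] = opt_state[key]
    return matched


def recover_names(params, saved_names):
    """Write the tensor names recorded at save time back onto the params."""
    for structured_name, param in params.items():
        if structured_name in saved_names:
            param.name = saved_names[structured_name]


# What was recorded at save time: structured name -> tensor name, plus the
# optimizer state keyed by tensor name.
saved_names = {"fc1.weight": "linear_0.w_0", "fc2.weight": "linear_1.w_0"}
opt_state = {"linear_0.w_0_moment1_0": 0.1, "linear_1.w_0_moment1_0": 0.2}

# The rebuilt model received different auto-generated tensor names.
params = {"fc1.weight": Param("linear_4.w_0"), "fc2.weight": Param("linear_5.w_0")}

before = match_opt_state(params, opt_state)  # nothing matches
recover_names(params, saved_names)
after = match_opt_state(params, opt_state)   # both params match again
```

Here `before` is empty because the regenerated names do not match any stored key, while `after` contains an entry for both parameters once the saved names are restored.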
… load_recover_tensor_name
… load_recover_tensor_name
… load_recover_tensor_name
… load_recover_tensor_name
LGTM
Unit test execution time setting
LGTM
LGTM
f2574f3
… load_recover_tensor_name
… load_recover_tensor_name
… load_recover_tensor_name
LGTM
PR types
Bug fixes
PR changes
Others
Description
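1. When a model is restored, the tensor names are not restored, so the association between parameters and the optimizer is lost. This requires a change to framework/io.py.
2. Convert the model so that it can be warm-started after the PP strategy is adjusted.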
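As a rough illustration of item 2, the sketch below remaps per-stage layer checkpoints when the pipeline layout changes, using the pp=2 to pp=vp=2 case mentioned in the review. It is a plain-Python sketch, not the PR's implementation: `layer_placement` and `convert_checkpoint` are hypothetical helpers, and the placement rule (layers split into pp * vp equal chunks, with chunk c assigned to stage c % pp) is an assumed interleaving scheme that may differ from the framework's actual one.

```python
def layer_placement(num_layers, pp, vp=1):
    """Map global layer index -> (pp_stage, local_index).

    Assumed rule: layers are split into pp * vp equal chunks and chunk c
    lives on stage c % pp.
    """
    chunks = pp * vp
    if num_layers % chunks != 0:
        raise ValueError("num_layers must be divisible by pp * vp")
    per_chunk = num_layers // chunks
    placement = {}
    next_local = [0] * pp  # next free local slot on each stage
    for layer in range(num_layers):
        stage = (layer // per_chunk) % pp
        placement[layer] = (stage, next_local[stage])
        next_local[stage] += 1
    return placement


def convert_checkpoint(stage_ckpts, num_layers, old, new):
    """Regroup per-stage checkpoints from layout old=(pp, vp) to new=(pp, vp)."""
    old_place = layer_placement(num_layers, *old)
    new_place = layer_placement(num_layers, *new)
    converted = [dict() for _ in range(new[0])]
    for layer in range(num_layers):
        old_stage, old_local = old_place[layer]
        new_stage, new_local = new_place[layer]
        converted[new_stage][new_local] = stage_ckpts[old_stage][old_local]
    return converted


# Four layers saved with pp=2: stage 0 holds layers 0-1, stage 1 holds 2-3.
saved = [{0: "layer0", 1: "layer1"}, {0: "layer2", 1: "layer3"}]

# Reload with pp=2, vp=2: under the assumed rule, stage 0 holds layers 0 and 2.
converted = convert_checkpoint(saved, num_layers=4, old=(2, 1), new=(2, 2))
```

Under these assumptions, `converted[0]` holds `layer0` and `layer2` and `converted[1]` holds `layer1` and `layer3`. A real conversion would also have to carry the optimizer state along with each layer, which is why the tensor names from item 1 need to be recoverable.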