Convert the model after a pp strategy change to enable warm start #52927

liuzhenhai93 · 2023-04-14T06:54:45Z

PR types

Bug fixes

PR changes

Others

Description

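1. Loading a model did not restore each tensor's name, so the association between parameters and optimizer state was lost; fixing this requires a change to framework/io.py.
2. Convert the model so that it can warm-start after the pp (pipeline parallel) strategy is adjusted.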
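
To make the first point concrete, here is a minimal pure-Python sketch of why a lost tensor name breaks the parameter-to-optimizer link. It assumes optimizer state is keyed by the parameter's tensor name plus a suffix such as `_moment1_0`; the example names, the suffix, and the helper `find_moments` are illustrative assumptions, not code from this PR.

```python
# Hypothetical illustration: plain dicts stand in for the saved checkpoints.

# Tensor name each parameter had when the checkpoint was saved.
saved_tensor_names = {
    "linear.weight": "linear_0.w_0",
    "linear.bias": "linear_0.b_0",
}

# Optimizer state saved alongside, keyed by tensor name + suffix (assumed layout).
saved_opt_state = {
    "linear_0.w_0_moment1_0": [0.1, 0.2],
    "linear_0.b_0_moment1_0": [0.3],
}


def find_moments(tensor_names, opt_state, suffix="_moment1_0"):
    """Map each structured parameter name to its optimizer entry, or None if missing."""
    return {
        structured: opt_state.get(tensor_name + suffix)
        for structured, tensor_name in tensor_names.items()
    }


# Names restored on load: every parameter finds its accumulator.
restored = find_moments(saved_tensor_names, saved_opt_state)

# Names not restored: the new process generated fresh names, so every lookup misses.
fresh_tensor_names = {
    "linear.weight": "linear_5.w_0",
    "linear.bias": "linear_5.b_0",
}
lost = find_moments(fresh_tensor_names, saved_opt_state)
```

In the second case the weights themselves load fine, but the optimizer would silently start from empty accumulators, which is why the name has to travel with the tensor.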

paddle-bot · 2023-04-14T06:54:49Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

… load_recover_tensor_name

FeixLiu

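A few comments and a few TODOs:

Please include some sample code in the PR, something like a "how to run".
Also explain in the PR description why io.py was changed, with a brief note on why the change is necessary.
Add a two-GPU distributed unit test: a simple four-layer transformer, save the model with pp=2, then load it with pp=vp=2, to make sure the whole flow works. Without a unit test, the CI coverage check will not pass anyway.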
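
The scenario in the requested test (four layers, saved with pp=2, loaded with pp=vp=2) can be sketched as a layer-placement mapping. This is a minimal sketch that assumes layers are split evenly into `pp * vp` chunks and that chunks are dealt round-robin to stages, as in a typical interleaved pipeline layout; the functions `layer_placement` and `conversion_plan` are hypothetical and the segmentation used by the actual adaptor may differ.

```python
def layer_placement(num_layers, pp, vp=1):
    """Return {global_layer_index: (stage, virtual_chunk, local_index)}.

    Assumption: layers split evenly into pp * vp chunks, and chunk c is
    placed on stage c % pp as that stage's virtual chunk number c // pp.
    """
    chunks = pp * vp
    if num_layers % chunks != 0:
        raise ValueError("num_layers must be divisible by pp * vp")
    per_chunk = num_layers // chunks
    placement = {}
    for layer in range(num_layers):
        chunk = layer // per_chunk
        placement[layer] = (chunk % pp, chunk // pp, layer % per_chunk)
    return placement


def conversion_plan(num_layers, src, dst):
    """For each layer, pair where it was saved with where it must be loaded."""
    saved = layer_placement(num_layers, *src)
    loaded = layer_placement(num_layers, *dst)
    return {layer: (saved[layer], loaded[layer]) for layer in range(num_layers)}


# Four layers, saved with pp=2 (no virtual stages), loaded with pp=2 and vp=2.
plan = conversion_plan(4, src=(2, 1), dst=(2, 2))
for layer, (old, new) in plan.items():
    print(f"layer {layer}: saved at {old} -> loaded at {new}")
```

Under these assumptions layers 1 and 2 change stage, so each rank needs weights (and the matching optimizer state) that another rank saved, which is the case a conversion step has to handle.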

python/paddle/distributed/fleet/utils/pp_parallel_adaptor.py

python/paddle/framework/io.py

… load_recover_tensor_name

ZHUI · 2023-04-20T11:39:55Z

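After this change, does it affect the saved names for all optimizers, or only PP?

By default, is everything saved under the dynamic-graph name? Will all of the momentum variables change?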

ZHUI · 2023-04-20T11:58:24Z

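@FeixLiu @YuanRisheng I would still suggest looking at how to tie the names stored by the optimizer to the dynamic-graph structure names, so the problem is solved at its root.

It would be worth surveying how competing frameworks handle this.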
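
One way to read this suggestion is to re-key optimizer state by structured names before saving, so a checkpoint no longer depends on generated tensor names. The sketch below is a pure-Python illustration of that idea only; it is not what this PR implements, and the function `rekey_by_structured_name`, the example keys, and the `_moment1_0` suffix are assumptions.

```python
def rekey_by_structured_name(opt_state, structured_to_tensor):
    """Rewrite keys from '<tensor_name><suffix>' to '<structured_name><suffix>'."""
    # Try longer tensor names first so a name that is a prefix of another
    # cannot claim the wrong key.
    pairs = sorted(structured_to_tensor.items(), key=lambda kv: -len(kv[1]))
    rekeyed = {}
    for key, value in opt_state.items():
        for structured, tensor_name in pairs:
            if key.startswith(tensor_name + "_"):
                rekeyed[structured + key[len(tensor_name):]] = value
                break
        else:
            # Entries not tied to a parameter are copied through unchanged.
            rekeyed[key] = value
    return rekeyed


structured_to_tensor = {
    "linear.weight": "linear_0.w_0",
    "linear.bias": "linear_0.b_0",
}
opt_state = {
    "linear_0.w_0_moment1_0": [0.1],
    "linear_0.b_0_moment1_0": [0.2],
    "LR_Scheduler": {"last_epoch": 3},
}
portable = rekey_by_structured_name(opt_state, structured_to_tensor)
```

On load, the inverse mapping (structured name to whatever tensor name the new process generated) would turn the keys back into the form the optimizer expects.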

liuzhenhai93 · 2023-04-20T12:09:17Z

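> After this change, does it affect the saved names for all optimizers, or only PP?

> By default, is everything saved under the dynamic-graph name? Will all of the momentum variables change?

The problem is that the name of each tensor was not restored on load, so the param -> opt association could not be established.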

… load_recover_tensor_name

XieYunshen

LGTM
for the unit-test execution time setting

FeixLiu

LGTM

XiaoguangHu01

LGTM

… load_recover_tensor_name

XiaoguangHu01

LGTM

polish

b368fb6

liuzhenhai93 added 2 commits April 17, 2023 14:18

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

afe4350

… load_recover_tensor_name

polish

3b61a5f

FeixLiu reviewed Apr 17, 2023


liuzhenhai93 added 3 commits April 18, 2023 08:21

polish

fbe735a

polish

2b99dbc

polish

1379b72

liuzhenhai93 changed the title ~~recover tensor name during model load~~ Convert the model after a pp strategy change to enable warm start Apr 18, 2023

polish

35cff08

liuzhenhai93 commented Apr 19, 2023

python/paddle/framework/io.py (resolved)

sneaxiy requested review from qingqing01 and YuanRisheng April 19, 2023 08:33

liuzhenhai93 added 7 commits April 19, 2023 16:37

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

f09475f

… load_recover_tensor_name

polish

8e2e048

polish

414c2f8

polish

e641d1f

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

e78e5a5

… load_recover_tensor_name

polish

abbdf55

polish

de295a8

liuzhenhai93 added 7 commits April 21, 2023 00:56

polish

144e90f

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

8d368bc

… load_recover_tensor_name

polish

c69c3ac

polish

546c2e6

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

e0ce22c

… load_recover_tensor_name

polish

28dbfc8

polish

28b8c04

liuzhenhai93 added 4 commits April 21, 2023 17:24

polish

fe206f4

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

7bd4bb5

… load_recover_tensor_name

polish

121fcc2

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

1c6550a

… load_recover_tensor_name

liuzhenhai93 requested a review from XieYunshen April 23, 2023 02:25

liuzhenhai93 added 2 commits April 23, 2023 10:42

polish

2366709

polish

982bfb6

XieYunshen previously approved these changes Apr 23, 2023


liuzhenhai93 requested a review from sneaxiy April 23, 2023 11:16

sneaxiy previously approved these changes Apr 23, 2023


FeixLiu previously approved these changes Apr 24, 2023


liuzhenhai93 requested a review from XiaoguangHu01 April 24, 2023 06:11

XiaoguangHu01 previously approved these changes Apr 24, 2023


Merge branch 'develop' into load_recover_tensor_name

f2574f3

liuzhenhai93 dismissed stale reviews from XiaoguangHu01, FeixLiu, sneaxiy, and XieYunshen via f2574f3 April 24, 2023 08:30

liuzhenhai93 added 2 commits April 24, 2023 18:16

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

644e1d7

… load_recover_tensor_name

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

a55972d

… load_recover_tensor_name

sneaxiy previously approved these changes Apr 24, 2023


liuzhenhai93 added 2 commits April 25, 2023 15:43

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

623e6b3

… load_recover_tensor_name

polish

11cd1da

liuzhenhai93 dismissed sneaxiy’s stale review via 11cd1da April 25, 2023 07:48

sneaxiy approved these changes Apr 26, 2023


XieYunshen approved these changes Apr 26, 2023


XiaoguangHu01 approved these changes Apr 26, 2023


sneaxiy merged commit 3650c4a into PaddlePaddle:develop Apr 26, 2023

liuzhenhai93 deleted the load_recover_tensor_name branch May 24, 2023 11:17
