[hybrid] optimizer sharding support optimize cast #35878
Conversation
Thanks for your contribution!
Force-pushed from 09537d3 to 044ec8b
LGTM
```diff
  startup_block.append_op(
      type='c_sync_comm_stream',
-     inputs={'X': broadcast_params},
-     outputs={'Out': broadcast_params},
+     inputs={'X': params_name},
```
If the broadcast is launched into the calc stream, there is no need to sync the calc stream at the end of the broadcasts.
Yes, I originally wanted to delete it in this PR, but there are too many unittests that would need to be changed, so I kept it for now... will remove it in a future PR.
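A minimal sketch of the pattern being discussed, assuming Paddle's `c_broadcast` op and its `use_calc_stream` attribute; the ring/root ids here are placeholders, not the PR's actual values:

```python
# If each c_broadcast is launched on the calc stream, the calc stream
# itself orders the broadcasts before any subsequent kernels, so the
# trailing c_sync_comm_stream becomes redundant.
for param in broadcast_params:
    startup_block.append_op(
        type='c_broadcast',
        inputs={'X': param},
        outputs={'Out': param},
        attrs={
            'ring_id': 0,             # placeholder ring id
            'root': 0,                # placeholder root rank
            'use_calc_stream': True,  # broadcast runs on the calc stream
        })

# With use_calc_stream=True above, this sync op (kept for now only to
# avoid churn in existing unittests) could be dropped:
startup_block.append_op(
    type='c_sync_comm_stream',
    inputs={'X': broadcast_params},
    outputs={'Out': broadcast_params},
    attrs={'ring_id': 0})
```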
```python
# param is only used by the cast op,
# which casts fp32_param to fp16_param
output_name = op.output_arg_names[0]
if 'cast_fp16' not in output_name:
```
Better to use a global variable to record the 'cast_fp16' naming rule; otherwise, if this pattern is changed in AMP, we would have to change it everywhere in sharding.
Got it, good idea.
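The suggestion could look like the hypothetical helper below; `AMP_CAST_FP16_SUFFIX` and `is_fp16_cast_output` are illustrative names introduced here, not existing Paddle APIs:

```python
# Centralize AMP's fp16-cast naming rule in one module-level constant,
# so the sharding pass never hard-codes the pattern in multiple places.
AMP_CAST_FP16_SUFFIX = 'cast_fp16'  # assumed to match AMP's current rule

def is_fp16_cast_output(op):
    """Return True if op's first output follows the AMP fp16-cast naming rule."""
    output_name = op.output_arg_names[0]
    return AMP_CAST_FP16_SUFFIX in output_name
```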
```diff
- offload_helper.cast_fp32param_in_optimize(main_block, startup_block)
+ offload_helper = OffloadHelper(ring_id=dp_ring_id)
+ if self._optimizer_sharding:
+     offload_helper.opt_sharding_cast_fp32param(
```
Great job! Not only does this reduce the number of cast ops from two per param to one per param, it also reduces the frequency of cast calls to 1/acc_step!
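A back-of-envelope sketch with hypothetical numbers of why the cast count drops: with gradient accumulation, the fp32→fp16 cast moves from every micro-batch into the optimizer update, so it runs once per `acc_step` micro-batches.

```python
num_params = 100   # assumed parameter count, for illustration only
acc_step = 4       # gradient accumulation steps per optimizer update

# Before: two casts per param, executed on every micro-batch.
casts_before = 2 * num_params * acc_step   # 800 per optimizer step

# After: one cast per param, executed only at the optimizer update,
# i.e. the per-micro-batch cast cost is divided by acc_step.
casts_after = 1 * num_params               # 100 per optimizer step

print(casts_before, casts_after)  # 800 100
```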
PR types
Performance optimization
PR changes
Others
Describe
Optimizer sharding now supports the optimize-cast optimization.
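A hypothetical usage sketch of enabling optimizer sharding with this optimization through fleet's DistributedStrategy; the config keys `"_dp_as_optimizer_sharding"` and `"optimize_cast"` are assumptions based on this PR's description, not verified API:

```python
import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
strategy.sharding = True
strategy.sharding_configs = {
    "mp_degree": 2,
    "pp_degree": 2,
    "dp_degree": 2,
    "_dp_as_optimizer_sharding": True,  # assumed flag: shard optimizer over dp
    "optimize_cast": True,              # assumed flag added by this PR
}
```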
Accuracy test
Ernie3.0, base model, single machine with 8 GPUs
baseline=2mp+2pp+2dp, optimize_cast=2mp+2pp+2opt_sharding+optimize_cast
Speed test