[AMP] Master grad in static graph #53362
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
❌ The PR was not created using the PR template. You can refer to this Demo.
Sorry to inform you that the CIs for 6659cca passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.
@unittest.skipIf(
    not core.supports_bfloat16(), "place does not support BF16 evaluation"
)
Review comment: This seems to only check whether the CPU place supports bf16. For GPU, you need to check with core.is_compiled_with_cuda() and core.is_bfloat16_supported(core.CUDAPlace(0)).
Reply: Switched to the GPU-checking interface.
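For illustration, a minimal sketch of the updated skip condition; the helper and class names are placeholders, and the import path (paddle.base, or paddle.fluid in older releases) is an assumption:

```python
import unittest

from paddle.base import core  # paddle.fluid.core in older releases


def gpu_bf16_unsupported():
    # Skip unless Paddle is built with CUDA and the GPU place supports bf16.
    return not (
        core.is_compiled_with_cuda()
        and core.is_bfloat16_supported(core.CUDAPlace(0))
    )


class TestMasterGradBF16(unittest.TestCase):
    @unittest.skipIf(
        gpu_bf16_unsupported(), "place does not support BF16 evaluation"
    )
    def test_bf16(self):
        ...
```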
python/paddle/optimizer/adamw.py
Outdated
# master gradients
self._already_create_master_grad = set()
self._master_grads = {}
self._master_grad = False
Review comment: Are these lines unnecessary? I see the base class already sets them up.
Reply: adamw.__init__() does not call super().__init__(). Is there some specific consideration behind not calling it?
test/amp/amp_base_models.py
Outdated
@@ -277,6 +395,6 @@ def run_program(
            feed={feed_vars[0].name: x_np},
            fetch_list=fetch_vars,
        )
-       print(f"-- [BF16 {level}] iter={iter_id}, loss={results[0]}")
+       # print(f"-- [BF16 {level}] iter={iter_id}, loss={results[0]}")
Review comment: Should this commented-out print be re-enabled?
Reply: This test was changed to compare whether the O1 and O2 loss results are equal, so can this print be removed?
Reply: Other tests still use it, so the print has been re-enabled.
# master gradients
self._already_create_master_grad = set()
self._master_grads = {}
self._master_grad = False
Review comment: Define a function for this, e.g. create_master_grad_states.
Reply: Done.
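A minimal sketch of the suggested helper on the optimizer; the name follows the review comment, and everything else is an assumption:

```python
def _create_master_grad_states(self):
    # master gradients
    self._master_grads = {}
    self._master_grad = False
    # (the separate _already_create_master_grad set was dropped later in the
    # review in favor of checking self._master_grads directly)
```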
if grad.name in self._master_grads:
    var = self._master_grads[grad.name]
else:
    var_name = grad.name + "_fp32_master"
Review comment: Should grad's data type be checked here? Or add an assert?
Reply: The grad's data type is already checked at the call site of this function. Does it need to be checked again here?
Reply: Add an assert.
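A hedged sketch of the lookup with the requested assert; helper names such as _is_dtype_fp16_or_bf16 and the create_var arguments are assumptions, not the exact code in the PR:

```python
def _create_master_grad(self, grad):
    # Guard requested in the review: only fp16/bf16 grads should reach here.
    assert self._is_dtype_fp16_or_bf16(grad.dtype), (
        "Expected a float16/bfloat16 gradient, got %s" % grad.dtype
    )
    if grad.name in self._master_grads:
        var = self._master_grads[grad.name]
    else:
        var_name = grad.name + "_fp32_master"
        var = grad.block.create_var(
            name=var_name,
            shape=grad.shape,
            dtype="float32",
            persistable=grad.persistable,
            stop_gradient=True,
        )
        self._master_grads[grad.name] = var
    return var
```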
python/paddle/optimizer/optimizer.py
Outdated
Add ops to cast gradient to master gradient

Args:
    param_grads(list(tuple(Tensor, Tensor))):
Review comment: Although docs are not auto-generated for this function, the format of the function and argument descriptions does not quite follow the usual conventions.
Reply: Updated; please check whether it is correct now.
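For reference, a conventionally formatted docstring for this helper might look like the following (the wording is illustrative, not the exact text that landed in the PR):

```python
def _append_cast_to_master_grad_op(self, param_grads):
    """
    Create fp32 master gradients and append cast ops that copy each
    fp16/bf16 gradient into its master gradient.

    Args:
        param_grads (list[tuple[Tensor, Tensor]]): A list of (parameter,
            gradient) pairs produced by backward.

    Returns:
        list[tuple[Tensor, Tensor]]: The (parameter, master_gradient) pairs,
            with every gradient cast to float32.
    """
```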
python/paddle/optimizer/optimizer.py
Outdated
assert isinstance(target_block, framework.Block)
# create
for p, g in param_grads:
    if g.name not in self._already_create_master_grad:
Review comment: Couldn't this be checked with if g.name not in self._master_grads.keys() as well? There is no need to keep a separate self._already_create_master_grad, is there?
Reply: Now checked with if g.name not in self._master_grads.keys().
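A tiny self-contained illustration of why the extra set is redundant: membership on a dict already tests its keys (the key name below is made up):

```python
master_grads = {"linear_0.w_0@GRAD": "linear_0.w_0@GRAD_fp32_master"}

# `in` on a dict checks its keys, so the two conditions are equivalent.
assert ("linear_0.w_0@GRAD" in master_grads) == (
    "linear_0.w_0@GRAD" in master_grads.keys()
)
```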
python/paddle/optimizer/optimizer.py
Outdated
@@ -1170,9 +1246,10 @@ def apply_gradients(self, params_grads):

        # 'optimizer(grad_clip)' or 'set_gradient_clip'
        if self._grad_clip is not None:
            # create master gradients
            params_grads = self._append_cast_to_master_grad_op(params_grads)
Review comment: If there is no _grad_clip, can master_grad still take effect? And are the params here master_weight, i.e. is the param used in the grad_clip computation the master_weight?
My understanding is that master_grad should not be used only inside grad_clip; every place after backward that needs the grad should use master_grad.
Reply:
"If there is no _grad_clip, can master_grad still take effect?" No, it cannot. This is written incorrectly; the cast should be moved outside the if self._grad_clip is not None check. Will fix shortly.
"Are the params here master_weight, i.e. is the param used in the grad_clip computation the master_weight?" params is not master_weight. grad_clip does not use the params argument; should it be changed to pass in tuples of master_weight and master_grad instead?
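A hedged sketch of the agreed fix: cast to master gradients before, and independently of, the grad-clip branch, so everything downstream of backward consumes the fp32 master grads. The guard on self._master_grad and the omitted steps are assumptions:

```python
def apply_gradients(self, params_grads):
    # Cast fp16/bf16 grads to fp32 master grads first, whether or not a
    # grad clip is configured, so clip/regularization/optimizer all use them.
    if self._master_grad:
        params_grads = self._append_cast_to_master_grad_op(params_grads)

    # 'optimizer(grad_clip)' or 'set_gradient_clip'
    if self._grad_clip is not None:
        params_grads = self._grad_clip(params_grads)

    # ... regularization and the optimizer ops follow (omitted in this sketch).
```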
@@ -791,6 +798,7 @@ def decorate(
    use_dynamic_loss_scaling=None,
    use_amp_guard=False,
    use_promote=False,
+   use_master_grad=False,
Review comment: Add it after line 792, in the form master_grad=False, and add the corresponding documentation for the parameter.
Reply: Done.
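Roughly, the tail of the decorate() signature and the new parameter's doc entry could look like this; earlier parameters are elided and the description text is an assumption:

```python
def decorate(
    optimizer,
    # ... other AMP arguments elided ...
    use_dynamic_loss_scaling=None,
    use_amp_guard=False,
    use_promote=False,
    master_grad=False,
):
    """
    Args:
        master_grad (bool, optional): Whether to keep fp32 master gradients
            for the fp16/bf16 gradients during O2 training. Default is False.
    """
```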
test/amp/amp_base_models.py
Outdated
@@ -42,14 +72,18 @@ def _build_optimizer(
    beta2=0.836,
    epsilon=1e-4,
    weight_decay=0.01,
+   multi_precision=True,
Review comment: Don't add the multi_precision argument here. decorate already supports setting master_weight, and O2 training sets it to True automatically.
Reply: Removed.
test/amp/amp_base_models.py
Outdated
use_promote=use_promote,
master_weight=True,
init_loss_scaling=1,
Review comment: There is no need to set init_loss_scaling either; bfloat16 training sets it to 1 automatically.
Reply: Removed.
f"The number of optimizers with multi_precison = True is expected to be {expected_num_mp}, but recieved {actual_num_mp}.", | ||
) | ||
|
||
def test_amp_fp16_o1(self): |
Review comment: If this unit test is meant to test the master_grad feature, the O1 check seems unnecessary?
Reply: Yes. Deleted.
test/amp/amp_base_models.py
Outdated
amp_dtype,
amp_level,
amp_lists,
True,
Review comment: A unit test with grad_clip set to False is needed.
Reply: Added.
…s unittest; 3.use a function to create master grad states
Review comment: LGTM. Please strengthen the PR description a bit: what the feature is, how it is implemented, and what effect it achieves.
return losses

dtype = "float16"
max_iters = 25
Review comment: It is probably not necessary to run this many iterations.
Reply: This unit test checks two things: 1. the O1 and O2 losses are equal when master grad is enabled; 2. the O1 and O2 losses are not equal when master grad is disabled. Both conditions are first met at step 24, so max_iters is set to 25.
seed = 0
paddle.seed(seed)
np.random.seed(seed)
random.seed(seed)
Review comment: The seed doesn't need to be set repeatedly, does it?
Reply: The intent is for the two startup program runs to produce the same results; writing it this way is simpler.
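For context, a minimal sketch of the seeding pattern the reply describes: re-seed all RNG sources before each startup program so both runs initialize identically (the helper name is illustrative):

```python
import random

import numpy as np
import paddle


def reset_all_seeds(seed=0):
    # Reset Paddle, NumPy and Python RNGs so repeated startup-program runs
    # produce identical parameter initialization.
    paddle.seed(seed)
    np.random.seed(seed)
    random.seed(seed)
```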
Added the PR description.
PR types
New features
PR changes
Others
Description
Pcard-70458
Enable master grad in static graph mode.
The background and functionality are the same as #52235; this PR implements the feature for the static graph.
Feature and effect
When training in AMP O2 mode, bf16/fp16 gradients can fall below the precision of bf16/fp16 or exceed their representable range. On the static graph, gradients are cast to fp32 before check_finite_and_unscale, grad clip, regularization and the optimizer, to preserve training accuracy.
Usage
Set master_grad=True. Once the setting takes effect, master_grad tensors are created in the OptimizerWithMixedPrecision.apply_gradients interface, and cast ops are inserted before _check_finite_and_unscale to convert the bf16/fp16 grads into fp32 master_grads. check_finite_and_unscale, grad clip, regularization and the optimizer all compute with the fp32 master grads.
Impact
After enabling, some cast ops are inserted into the program, and the gradients arguments of check_finite_and_unscale, grad clip, regularization and the optimizer become fp32, so a single step becomes slower.
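A hedged end-to-end usage sketch for the description above, assuming a bf16-capable GPU and a placeholder model; only the master_grad=True keyword is the feature added by this PR:

```python
import numpy as np
import paddle

paddle.enable_static()

main_prog, startup_prog = paddle.static.Program(), paddle.static.Program()
with paddle.static.program_guard(main_prog, startup_prog):
    x = paddle.static.data("x", shape=[None, 16], dtype="float32")
    loss = paddle.mean(paddle.nn.Linear(16, 10)(x))

    optimizer = paddle.optimizer.AdamW(learning_rate=1e-3)
    # O2 AMP with fp32 master gradients enabled (the feature of this PR).
    optimizer = paddle.static.amp.decorate(
        optimizer, level="O2", dtype="bfloat16", master_grad=True
    )
    optimizer.minimize(loss)

place = paddle.CUDAPlace(0)
exe = paddle.static.Executor(place)
exe.run(startup_prog)
optimizer.amp_init(place)  # cast fp32 parameters for O2 execution

x_np = np.random.rand(2, 16).astype("float32")
(loss_val,) = exe.run(main_prog, feed={"x": x_np}, fetch_list=[loss])
```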