【PaddlePaddle Hackathon 4 No.49】: Support the float16 data type for Paddle bce_loss #50930
Conversation
Your PR has been submitted successfully. Thank you for contributing to this open source project!
MT x_mt = static_cast<MT>(x);
MT term1 = max((static_cast<MT>(one) - x_mt) * x_mt, static_cast<MT>(eps));
return static_cast<T>(static_cast<MT>(dout) *
                      (x_mt - static_cast<MT>(label)) / term1);
The eps issue: on line 36, 1e-12 underflows to 0 when represented in fp16.
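For reference, a quick NumPy check (illustrative, not part of the PR) confirms the underflow the reviewer describes:

```python
import numpy as np

# 1e-12 is far below the smallest positive float16 value
# (subnormals bottom out around 6e-8), so the cast flushes it to zero.
eps = np.float16(1e-12)
print(eps)                         # 0.0
print(np.finfo(np.float16).tiny)   # 6.104e-05, smallest normal float16
```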
Adjusted; I'm not sure whether it can be written this way.
Could the code be simplified here? Make one and eps member variables initialized as type MT; the original constructor can then be removed.
@@ -279,6 +280,48 @@ def init_test_cast(self):
        self.shape = [2, 3, 20]


@unittest.skipIf(
    not core.is_compiled_with_cuda(), "core is not compiled with CUDA"
)
This doesn't need to be added.
Removed.
class TestBceLossOpFP16Case1(OpTest):
    def init_test_cast(self):
        self.shape = [20, 30, 40, 50]
This should inherit from TestBceLossOpFP16, and the typo cast -> case should be fixed. The same applies to the cases below.
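A minimal sketch of the corrected declaration (assuming the TestBceLossOpFP16 base class introduced in this PR):

```python
# Inherit the FP16 base class and fix the method name (cast -> case).
class TestBceLossOpFP16Case1(TestBceLossOpFP16):
    def init_test_case(self):
        self.shape = [20, 30, 40, 50]
```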
Fixed.
place = core.CUDAPlace(0)
if core.is_float16_supported(place):
    self.check_grad_with_place(
        place, ['X'], 'Out', max_relative_error=0.5
Also, the whole unit test should follow the low-precision op unit test guidelines: https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/dev_guides/amp_precision/amp_test_dev_guide_cn.html
- You can inherit from TestBceLossOp and make small modifications to it, which simplifies the code (see the sketch after this list).
- Is the relative error of the backward pass reasonable?
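A rough sketch of what the guideline-style FP16 test could look like (the hook and helper names are assumptions based on the snippets in this PR, not the final code):

```python
import numpy as np
from paddle.fluid import core


# Hypothetical sketch: reuse TestBceLossOp and only override what differs
# for float16; run the checks on a CUDA place when FP16 is supported.
class TestBceLossOpFP16(TestBceLossOp):
    def init_test_dtype(self):
        self.dtype = np.float16

    def test_check_output(self):
        place = core.CUDAPlace(0)
        if core.is_float16_supported(place):
            self.check_output_with_place(place)

    def test_check_grad(self):
        place = core.CUDAPlace(0)
        if core.is_float16_supported(place):
            self.check_grad_with_place(place, ['X'], 'Out')
```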
@zhangting2020 Could you please advise? I can't see what is written incorrectly; the backward relative error is always too large.
AssertionError: 0.42 not less than or equal to 0.001
AssertionError: 0.81 not less than or equal to 0.001
AssertionError: 0.81 not less than or equal to 0.001
python/paddle/nn/layer/loss.py (Outdated)
@@ -68,7 +68,7 @@ class BCEWithLogitsLoss(Layer):
    Args:
        weight (Tensor, optional): A manual rescaling weight given to the loss of each
            batch element. If given, it has to be a 1D Tensor whose size is `[N, ]`,
-           The data type is float32, float64. Default is ``'None'``.
+           The data type is float16, float32, float64. Default is ``'None'``.
Does this API correspond to bce_loss? Also, an API implemented as a class usually calls the corresponding API under functional in its implementation, so check the code to confirm.
- The documentation of both APIs needs to be updated in sync.
- The type check on the static graph branch needs to be updated.
- A static-graph fp16 unit test needs to be added: inherit from unittest and just call the API. See the static-graph unit test in #51168 (a sketch follows below).
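A hedged sketch of such a static-graph fp16 test (the shapes, tolerances, and the NumPy reference computation are illustrative; the test in #51168 may be organized differently):

```python
import unittest

import numpy as np
import paddle
import paddle.nn.functional as F
from paddle.fluid import core


class TestBceLossStaticFP16(unittest.TestCase):
    def test_fp16(self):
        if not core.is_compiled_with_cuda():
            return
        paddle.enable_static()
        with paddle.static.program_guard(paddle.static.Program()):
            x_np = np.random.uniform(0.1, 0.8, [10, 10]).astype('float16')
            y_np = np.random.randint(0, 2, [10, 10]).astype('float16')
            x = paddle.static.data(name='x', shape=[10, 10], dtype='float16')
            y = paddle.static.data(name='y', shape=[10, 10], dtype='float16')
            out = F.binary_cross_entropy(x, y, reduction='none')
            place = core.CUDAPlace(0)
            if core.is_float16_supported(place):
                exe = paddle.static.Executor(place)
                res = exe.run(feed={'x': x_np, 'y': y_np}, fetch_list=[out])[0]
                # NumPy reference computed in float32 to avoid the fp16
                # rounding issues discussed later in this thread.
                x32, y32 = x_np.astype('float32'), y_np.astype('float32')
                ref = -(y32 * np.log(x32) + (1 - y32) * np.log(1 - x32))
                np.testing.assert_allclose(res, ref, rtol=1e-3, atol=1e-3)
        paddle.disable_static()
```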
Fixed.
MT x_mt = static_cast<MT>(x);
MT term1 = max((static_cast<MT>(one) - x_mt) * x_mt, static_cast<MT>(eps));
return static_cast<T>(static_cast<MT>(dout) *
                      (x_mt - static_cast<MT>(label)) / term1);
Could the code be simplified here? Make one and eps member variables initialized as type MT; the original constructor can then be removed.
                   static_cast<MT>(neg_100));
return static_cast<T>(
    ((static_cast<MT>(label) - static_cast<MT>(one)) * term2) -
    (static_cast<MT>(label) * term1));
Same kind of issue here as above; I think the original implementation can be changed. one and neg_100 are already member variables, so they can simply be initialized as type MT.
Fixed.
class TestBceLossOpFP16Case2(TestBceLossOpFP16):
    def init_test_case(self):
        self.shape = [2, 3, 20]
The unit tests above can be simplified further. TestBceLossOpFP16 inherits from TestBceLossOp, so TestBceLossOp itself can be adjusted, for example so that dtype and shape can be set when a case is initialized; that removes a lot of redundant code (see the sketch below).
Why is max_relative_error so large?
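One way to read the simplification suggested above, sketched under the assumption that the test file keeps a NumPy reference helper named bce_loss (the init_test_dtype hook is hypothetical):

```python
class TestBceLossOp(OpTest):
    def setUp(self):
        self.init_test_case()    # sets self.shape
        self.init_test_dtype()   # sets self.dtype (hypothetical hook)
        self.op_type = "bce_loss"
        input_np = np.random.uniform(0.1, 0.8, self.shape).astype(self.dtype)
        label_np = np.random.randint(0, 2, self.shape).astype(self.dtype)
        self.inputs = {'X': input_np, 'Label': label_np}
        self.outputs = {'Out': bce_loss(input_np, label_np)}

    def init_test_case(self):
        self.shape = [10, 10]

    def init_test_dtype(self):
        self.dtype = np.float64
```

With hooks like these, the FP16 cases only need to override init_test_dtype and init_test_case instead of duplicating setUp.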
This is temporary, just to get CI to run. The backward relative error is very large and I haven't found the cause yet.
AssertionError: 0.42 not less than or equal to 0.001
AssertionError: 0.81 not less than or equal to 0.001
AssertionError: 0.81 not less than or equal to 0.001
        feed={'x': x_data, 'y': y_data}, fetch_list=[out]
    )[0]
    np.testing.assert_allclose(
        output_pd, output_np, rtol=1e-3, atol=1e-3
Does it pass with atol set to 0?
@zhangting2020 There is no problem here; it also passes with atol=1e-3.
        self.inputs = {'X': input_np, 'Label': label_np}
        self.outputs = {'Out': output_np}

    def test_check_output(self):
        self.check_output()
        self.check_output(check_eager=True)
@zhangting2020 Could you explain what check_eager means and what it is used for?
This is a parameter the unit test system added during the framework upgrade in order to test dynamic graph mode; it does not affect the test results.
You need to look into the backward computation precision issue: the unit test failure says the precision check does not pass.
@zhangting2020 I have been checking for a long time and can't figure out where the problem is. At the moment it looks like the numeric_grads computation deviates a lot from what is expected. For example, during the mean computation np.array([85.02881]).astype(np.float16) => 85.0, so although pos and neg differ in float, both take the value 85.0 in float16, and the resulting gradient is 0. If the inputs are cast to float32 when computing the mean, a gradient value is obtained and the error shrinks from 0.42 to 0.05. Could you please give further guidance?
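The rounding described above is easy to reproduce in NumPy (neg is a hypothetical nearby value used only to illustrate the collapse):

```python
import numpy as np

# 85.02881 is not representable in float16: near 85 the spacing between
# adjacent float16 values is 2**-4 = 0.0625, so it rounds to 85.0.
print(np.array([85.02881]).astype(np.float16))   # [85.]

pos = np.float16(85.02881)   # value from the comment above
neg = np.float16(85.01234)   # hypothetical nearby value
# Both round to 85.0, so a finite-difference gradient (pos - neg) / delta
# evaluates to exactly 0 even though the float32 values differ.
print(pos - neg)             # 0.0
```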
Which part of the unit test framework implementation do you mean when you say the numeric_grads computation deviates a lot from what is expected? Could you paste a link? The precision loss in the reference gradient value may be caused by the test framework itself; I will confirm.
@zhangting2020 Hello, after merging the latest code I found that the old op_test.py file no longer exists; that was the file where I printed the relevant outputs for numerical comparison. After I removed both atol and max_relative_error, the checks passed, as if no checking were being done at all. Was this part significantly reworked later?
(1) The unit test file layout was probably reorganized; it is now in this file: python/paddle/fluid/tests/unittests/eager_op_test.py
(2) Based on the behavior you describe, I am worried the test may fail randomly. I suggest first trying a few different shapes for the unit test in your own development environment, and running the test repeatedly, e.g. 100 times: ctest -R test_bce_loss --repeat-until-fail 100. If that passes, it should be fine. Running unit tests with ctest requires building with WITH_TESTING enabled, for example:
cmake .. -DPY_VERSION=3.7 -DWITH_GPU=ON -DWITH_TESTING=ON -DCMAKE_BUILD_TYPE=Release -DWITH_DISTRIBUTE=OFF
@zhangting2020 Ran ctest -R test_bce_loss --repeat-until-fail 100 with shapes [10, 10], [100, 100], [5000, 5000], and [20, 30, 40, 50]; all runs passed.
@thunder95 The ROCM pipeline needs to be fixed; looking at the history, it has been failing consistently.
@luotao1 Thanks, I only just noticed there is an issue here; I hadn't pulled the full log when reading it.
LGTM
…addlePaddle#50930) * untracked files * bce_loss_fp16 * remove unused files * back max_rel_erro still big * simplify code * upd * fix max_relative_error * restart ci * Update test_bce_loss.py * Update test_bce_loss.py * Update test_bce_loss.py * Update test_bce_loss.py * try to pass test * restore file * remove error value * fix bug --------- Co-authored-by: Zhang Ting <Douyaer2020@qq.com>
PR types
Performance optimization
PR changes
OPs
Describe
Add the float16 data type for bce_loss.
Test device: RTX 2070s
Current forward and backward performance results for bce_loss:
Chinese API documentation update for fp16 data type support: PaddlePaddle/docs#5704