-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CodeStyle][NPU] use np.testing.assert_allclose instead of self.assertTrue(np.allclose(...)) (part 1) #44988
Conversation
你的PR提交成功,感谢你对开源项目的贡献! |
7797308
to
cd0a2e6
Compare
可以贴一下哪些单测的静态图代码报错么? |
fb7670f
to
f5d67ab
Compare
b0ce5e5
to
4805da5
Compare
很抱歉,经过我们的反复讨论,你的PR暂未达到合入标准,请阅读飞桨原生算子开发规范,你可以重新提交新的PR,我们先将此PR关闭,感谢你的贡献。 |
ad8a787
to
179c9b6
Compare
179c9b6
to
8b3520a
Compare
7702627
to
70484b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
由于对于XPU相关修改比较多,和昆仑同学确认,请提交一条以“xxxx, test=kunlun” 结尾的的commit msg,会自动触发昆仑流水线进行单测检查。 |
CI passed for MLU and IPU. |
import PaddlePaddle#45079 changes, try fix 'PreconditionNotMetError: the memcpy_h2d op has no valid phi kernel.'
from paddle._C_ops import final_state_nearest_interp | ||
from paddle import _C_ops |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里是因为原来 NPU 下单测 test_nearest_interp_v2_op_npu
在从 test_nearest_interp_v2_op
import numpy 的参考实现(nearest_neighbor_interp_np
)时候会报错,报错如下:
2022-08-17 00:32:35 1/1 Test #93: test_nearest_interp_v2_op_npu ....***Failed 3.95 sec
2022-08-17 00:32:35 W0817 00:32:32.922722 813 init.cc:288] AVX is available, Please re-compile on local machine
2022-08-17 00:32:35 Hint: Your machine support AVX, but the installed paddlepaddle doesn't have avx core. Hence, no-avx core with worse preformance will be imported.
2022-08-17 00:32:35 If you like, you could reinstall paddlepaddle by 'python -m pip install --force-reinstall paddlepaddle-gpu[==version]' to get better performance.
2022-08-17 00:32:35 The original error is: cannot import name 'core_avx' from 'paddle.fluid' (/workspace/npu_dev/Paddle/build/python/paddle/fluid/__init__.py)
2022-08-17 00:32:35 /workspace/npu_dev/Paddle/build/python/paddle/fluid/framework.py:189: UserWarning: We will fallback into legacy dygraph on NPU/XPU/MLU/IPU/ROCM devices. Because we only support new eager dygraph mode on CPU/GPU currently.
2022-08-17 00:32:35 "We will fallback into legacy dygraph on NPU/XPU/MLU/IPU/ROCM devices. Because we only support new eager dygraph mode on CPU/GPU currently. "
2022-08-17 00:32:35 Traceback (most recent call last):
2022-08-17 00:32:35 File "/workspace/npu_dev/Paddle/tools/test_runner.py", line 69, in <module>
2022-08-17 00:32:35 main()
2022-08-17 00:32:35 File "/workspace/npu_dev/Paddle/tools/test_runner.py", line 51, in main
2022-08-17 00:32:35 module = importlib.import_module(module_name)
2022-08-17 00:32:35 File "/opt/conda/lib/python3.7/importlib/__init__.py", line 127, in import_module
2022-08-17 00:32:35 return _bootstrap._gcd_import(name[level:], package, level)
2022-08-17 00:32:35 File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
2022-08-17 00:32:35 File "<frozen importlib._bootstrap>", line 983, in _find_and_load
2022-08-17 00:32:35 File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
2022-08-17 00:32:35 File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
2022-08-17 00:32:35 File "<frozen importlib._bootstrap_external>", line 728, in exec_module
2022-08-17 00:32:35 File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
2022-08-17 00:32:35 File "/workspace/npu_dev/Paddle/build/python/paddle/fluid/tests/unittests/npu/test_nearest_interp_v2_op_npu.py", line 29, in <module>
2022-08-17 00:32:35 from test_nearest_interp_v2_op import nearest_neighbor_interp_np
2022-08-17 00:32:35 File "../test_nearest_interp_v2_op.py", line 25, in <module>
2022-08-17 00:32:35 from paddle._C_ops import final_state_nearest_interp
2022-08-17 00:32:35 ImportError: cannot import name 'final_state_nearest_interp' from 'paddle.fluid.core_noavx.ops' (/workspace/npu_dev/Paddle/build/python/paddle/_C_ops.py)
2022-08-17 00:32:35 Errors while running CTest
2022-08-17 00:32:35 0% tests passed, 1 tests failed out of 1
2022-08-17 00:32:35 Total Test time (real) = 3.96 sec
2022-08-17 00:32:35 The following tests FAILED:
2022-08-17 00:32:35 93 - test_nearest_interp_v2_op_npu (Failed)
看起来是刚刚 merge 的 #45148 新增的这一行(这里的 25 行)直接从 _C_ops import 了 final_state_nearest_interp
,但在 NPU 下貌似是访问不到的。为了避免这个问题,修改为该文件全局仅 import 到 _C_ops,修改后 NPU 测试全部通过。
最新一次 commit 包含了 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR types
Others
PR changes
Others
Describe
使用
np.testing.assert_allclose(...)
替换self.assertTrue(np.allclose(...))
,以优化单测报错信息Changes
self.assertTrue(np.allclose(...))
->np.testing.assert_allclose(...)
autofix(自动跳过因为 shape 问题失败的 case)self.assertEqual(np.allclose(...), True)
->np.testing.assert_allclose(...)
autofixself.assertTrue(numpy.allclose(...))
->np.testing.assert_allclose(...)
autofixself.assertTrue(np.allclose(...), True)
移除True
autofixself.assertTrue(np.isclose(...).all())
->np.testing.assert_allclose(...)
autofix1e-4
变成0.0001
了,明显1e-4
更好一些(包含比1e-4
小的,比1e-4
大的暂不替换)err_msg
,部分err_msg
只是打印了 a 和 b,np.testing.allclose
本就会打印它们,因此可以清理rtol=0.5
的情况下 CI 仍然不过时,手动对这些测试增加atol=1e-8
(仍然是保持默认行为),未设过1e-8
之外的数值[0]
,如果暂时不确定是不是静态图返回值问题则暂时回退修改,留到下个 PR 处理目前直接替换后发现大量测试无法通过(375),这基本是没有办法手动一个个修改的……问题主要在以下几个方面:
精度误差(本 PR 将集中解决本问题)
这是由于两者默认值不同
np.allclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
np.testing.assert_allclose(actual, desired, rtol=1e-07, atol=0, equal_nan=True, err_msg='', verbose=True)
明显
np.testing.assert_allclose
精度要求更高,所以这里暂时np.allclose
无rtol
(即设为默认值)时将值设为1e-5
,使其与原来行为一致,atol
暂时不改看看情况修改后测试失败大大降低(375 -> 195),基本降了一半,余量已经很少了,经检查,余量大多都是些误差非常小的(1e-10 以下),针对这些手动加上
atol=1e-8
应可解决即可解决shape 不对齐
本 PR 暂时不关注本问题,将在下一个 PR 详细描述~
这是由于
np.allclose
在比较时会自动 broadcast,而np.testing.allclose
不会,因此需要手动对这些数据进行检查及修改修改精度问题后,本问题占了 90% 以上,也就是将近 200 个需要考虑本问题,逐个手动修复,需要进一步评估成本
目前发现的问题:
静态图返回是一个 list,但有的开发者直接将返回值进行比较,这会在比较时认为静态图的 shape 比预期值多一维度,如下面这个(找个问题好麻烦呀……),该问题占了很大一部分
Paddle/python/paddle/fluid/tests/unittests/test_softmax2d.py
Lines 39 to 41 in c91aace
误操作,这里的 expected_dx 不需要添加
[0]
Paddle/python/paddle/fluid/tests/unittests/test_imperative_double_grad.py
Lines 83 to 86 in c91aace
静态图代码报错(shape 不对齐问题导致测试不通过而无法执行后面的转为静态图模式的代码,进而后续静态图测试全部无法通过,本质上是问题二)
剩余大概 48 个文件是还有需要修改的(存在大量 shape 问题和可能存在少量精度问题和其他问题),将在下个 PR 进行修复。