
[auto parallel] Add expand_v2 spmd rules #59432

Closed
wants to merge 18 commits

Conversation

MarioLulab
Contributor

@MarioLulab MarioLulab commented Nov 28, 2023

PR types

New features

PR changes

Others

Description

add expand_v2 spmd rules
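For context, expand broadcasts size-1 axes of a tensor to a target shape, and its backward reduces the gradient back over those axes. A minimal NumPy sketch (shapes chosen purely for illustration, not taken from the tests):

```python
import numpy as np

# Forward: expand x from [2, 1, 4] to [2, 8, 4] (axis 1 is broadcast).
x = np.random.rand(2, 1, 4)
out = np.broadcast_to(x, (2, 8, 4))

# Backward: the gradient w.r.t. x sums grad_out over the broadcast axis,
# so with grad_out of all ones, every grad_x entry equals 8.
grad_out = np.ones_like(out)
grad_x = grad_out.sum(axis=1, keepdims=True)

assert grad_x.shape == x.shape
assert (grad_x == 8.0).all()
```

This reduce-over-broadcast-axes step is what the failing backward comparisons below exercise.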

Unit test failures:

  1. test_expand_shard_0 fails: the backward result is incorrect
Traceback (most recent call last):
  File "semi_auto_parallel_for_expand.py", line 100, in <module>
    TestExpandApiForSemiAutoParallel().run_test_case()
  File "semi_auto_parallel_for_expand.py", line 94, in run_test_case
    self.test_expand_shard_0()
  File "semi_auto_parallel_for_expand.py", line 39, in test_expand_shard_0
    _, output = self.runfunc_and_check(
  File "/home/aistudio/Paddle-gpu/test/auto_parallel/semi_auto_parallel_util.py", line 140, in runfunc_and_check
    self.check_tensor_eq(x.grad, dist_x.grad)
  File "/home/aistudio/Paddle-gpu/test/auto_parallel/semi_auto_parallel_util.py", line 37, in check_tensor_eq
    np.testing.assert_allclose(np1, np2, rtol=1e-05, verbose=True)
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 783, in assert_array_compare
    flagged = func_assert_same_pos(x, y, func=isnan, hasval='nan')
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 753, in func_assert_same_pos
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-05, atol=0

x and y nan location mismatch:
 x: array([[[8., 8., 8., 8., 8., 8., 8., 8.]],

       [[8., 8., 8., 8., 8., 8., 8., 8.]],...
 y: array([[[8.000018e+00, 8.000029e+00, 8.000034e+00, 8.000037e+00,
         8.000007e+00, 8.000067e+00, 8.000029e+00, 8.000051e+00]],

  2. test_expand_shard_on_0 fails: the backward result is incorrect
Traceback (most recent call last):
  File "semi_auto_parallel_for_expand.py", line 101, in <module>
    TestExpandApiForSemiAutoParallel().run_test_case()
  File "semi_auto_parallel_for_expand.py", line 96, in run_test_case
    self.test_expand_shard_on_0()
  File "semi_auto_parallel_for_expand.py", line 70, in test_expand_shard_on_0
    self.test_body(
  File "semi_auto_parallel_for_expand.py", line 66, in test_body
    self.check_tensor_eq(x.grad, dist_x.grad)
  File "/home/aistudio/Paddle-gpu/test/auto_parallel/semi_auto_parallel_util.py", line 37, in check_tensor_eq
    np.testing.assert_allclose(np1, np2, rtol=1e-05, verbose=True)
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 789, in assert_array_compare
    flagged |= func_assert_same_pos(x, y,
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 753, in func_assert_same_pos
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-05, atol=0

x and y -inf location mismatch:
 x: array([[[8., 8., 8., 8., 8., 8., 8., 8.]],

       [[8., 8., 8., 8., 8., 8., 8., 8.]],...
 y: array([[[ 2.450679e+21,  4.940652e+30,  8.000000e+00,  8.000000e+00,
          1.766651e+22,  1.083532e+24,  8.000000e+00,  8.000000e+00]],

  3. test_expand_shard_on_2 fails: the forward expand call itself crashes
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::eager_api_expand(_object*, _object*, _object*)
1   expand_ad_func(paddle::Tensor const&, paddle::experimental::IntArrayBase<paddle::Tensor>)
2   paddle::experimental::expand(paddle::Tensor const&, paddle::experimental::IntArrayBase<paddle::Tensor> const&)

----------------------
Error Message Summary:
----------------------
FatalError: `Erroneous arithmetic operation` is detected by the operating system.
  [TimeInfo: *** Aborted at 1701845505 (unix time) try "date -d @1701845505" if you are using GNU date ***]
  [SignalInfo: *** SIGFPE (@0x7fa47cf4ba9e) received by PID 114938 (TID 0x7fa4a9cd4740) from PID 2096413342 ***]

LAUNCH INFO 2023-12-06 06:51:46,082 Pod failed
LAUNCH ERROR 2023-12-06 06:51:46,082 Container failed !!!

paddle-bot bot commented Nov 28, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Nov 28, 2023
@MarioLulab MarioLulab requested a review from cxxly December 4, 2023 03:32
@cxxly
Contributor

cxxly commented Dec 5, 2023

Following the squeeze op as a reference, configure the yaml and add distributed test cases.


SpmdInfo ExpandInferSpmd(const DistMetaTensor& x,
                         const std::vector<int64_t>& shape);
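As a rough illustration of what such a rule computes, here is a hypothetical Python sketch (not the actual C++ implementation, which works on DistMetaTensor): axes carried over from the input keep its dims_mapping, while new or broadcast axes must be replicated, conventionally encoded as -1.

```python
# Hypothetical sketch of expand's forward SPMD propagation.
# -1 in a dims_mapping means "replicated on the process mesh".
def expand_infer_spmd(x_shape, x_dims_mapping, target_shape):
    ndim = len(target_shape)
    offset = ndim - len(x_shape)  # leading axes added by the expand
    out_dims_mapping = []
    for i in range(ndim):
        src = i - offset
        if src < 0 or (x_shape[src] == 1 and target_shape[i] != 1):
            out_dims_mapping.append(-1)  # new or broadcast axis: replicate
        else:
            out_dims_mapping.append(x_dims_mapping[src])  # carried over
    return out_dims_mapping

# e.g. x of shape [8, 6] sharded on mesh axis 0, expanded to [4, 8, 6]:
print(expand_infer_spmd([8, 6], [0, -1], [4, 8, 6]))  # -> [-1, 0, -1]
```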

Contributor

Keep this consistent with the ops.yaml signature: use IntArray.

Contributor Author

@MarioLulab MarioLulab Dec 6, 2023

My understanding is that the spmd_rules interface should not need to change to IntArray, because the code-generation script (https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/api/yaml/generator/dist_api_gen.py#L851-L852) checks whether the ops.yaml signature contains an IntArray; if it does, the generated api.cc and backward.cc automatically append GetData() to the const IntArray& argument to obtain the corresponding vector<int64_t> expected by the spmd_rules interface, as shown in the screenshot:
[screenshot of generated code]
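The generator behavior described above can be sketched roughly like this (a hypothetical, heavily simplified stand-in for dist_api_gen.py's actual logic; the function name is invented for illustration):

```python
# Hypothetical sketch: when an argument is declared as IntArray in
# ops.yaml, the generated C++ call site appends .GetData() so the spmd
# rule receives a std::vector<int64_t> instead of the IntArray wrapper.
def spmd_rule_arg(arg_name, arg_type):
    if arg_type == "const IntArray&":
        return f"{arg_name}.GetData()"
    return arg_name

print(spmd_rule_arg("shape", "const IntArray&"))  # -> shape.GetData()
print(spmd_rule_arg("x", "const Tensor&"))        # -> x
```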

Contributor

@GhostScreaming GhostScreaming Dec 6, 2023

This needs to be unified across ops.yaml, dist_api_gen.py, and the shape argument of InferSPMD: it should take either std::vector<int64_t> or IntArray. If we switch to IntArray, the shape of ReshapeInferSPMD also needs to change accordingly, to avoid dist_api_gen.py carrying two implementations.

Contributor Author

> This needs to be unified across ops.yaml, dist_api_gen.py, and the shape argument of InferSPMD: it should take either std::vector<int64_t> or IntArray. If we switch to IntArray, the shape of ReshapeInferSPMD also needs to change accordingly, to avoid dist_api_gen.py carrying two implementations.

Then shall I leave this unchanged for now and keep std::vector<int64_t>? This PR would first complete the expand spmd rule, and a follow-up PR would do the unification work.

SpmdInfo ExpandInferSpmdReverse(const DistMetaTensor& x,
                                const DistMetaTensor& out,
                                const std::vector<int64_t>& shape);

Contributor

Same as above.

@@ -558,6 +558,12 @@ def is_reshape_kernel(self):
and 'grad' not in self.kernel['func'][0]
)

def is_expand_kernel(self):
Contributor

I suggest merging this with is_reshape_kernel into a single function need_calculate_local_shape, driven by a whitelist ['reshape', 'expand']; only kernels on the whitelist need the special handling. Any kernel that later needs local_shape computed can simply be added to the whitelist.
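The suggested refactor might look roughly like this (a hypothetical sketch; the class and attribute layout are simplified stand-ins for the real generator code):

```python
# Hypothetical sketch of the suggested refactor: one whitelist-driven
# predicate instead of per-op methods like is_reshape_kernel.
LOCAL_SHAPE_KERNELS = ['reshape', 'expand']

class KernelInfo:
    def __init__(self, func_name):
        # Mirrors the self.kernel['func'][0] access in the diff above.
        self.kernel = {'func': [func_name]}

    def need_calculate_local_shape(self):
        # Only non-grad kernels on the whitelist need local-shape handling.
        func = self.kernel['func'][0]
        return func in LOCAL_SHAPE_KERNELS and 'grad' not in func

print(KernelInfo('expand').need_calculate_local_shape())       # True
print(KernelInfo('expand_grad').need_calculate_local_shape())  # False
```

Adding a future kernel then only means appending its name to LOCAL_SHAPE_KERNELS.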

Contributor Author

Got it, I will update this PR accordingly.


paddle-ci-bot bot commented Dec 29, 2023

Sorry to inform you that 33acb63's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@paddle-bot paddle-bot bot closed this Dec 31, 2024

paddle-bot bot commented Jan 6, 2025

Since you haven't replied for more than a year, we have closed this issue/pr.
If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.

Labels
contributor External developers
Projects
None yet
3 participants