
[auto parallel] Add expand_v2 spmd rules #59432

Closed
wants to merge 18 commits

Conversation

MarioLulab
Contributor

@MarioLulab MarioLulab commented Nov 28, 2023

PR types

New features

PR changes

Others

Description

add expand_v2 spmd rules
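For context, expand broadcasts size-1 axes of a tensor to a target shape, and its backward reduces the gradient back over those axes. A minimal NumPy sketch (shapes chosen purely for illustration, not taken from the tests):

```python
import numpy as np

# Forward: expand x from [2, 1, 4] to [2, 8, 4] (axis 1 is broadcast).
x = np.random.rand(2, 1, 4)
out = np.broadcast_to(x, (2, 8, 4))

# Backward: the gradient w.r.t. x sums grad_out over the broadcast axis,
# so with grad_out of all ones, every grad_x entry equals 8.
grad_out = np.ones_like(out)
grad_x = grad_out.sum(axis=1, keepdims=True)

assert grad_x.shape == x.shape
assert (grad_x == 8.0).all()
```

This reduce-over-broadcast-axes step is what the failing backward comparisons below exercise.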

Unit test failures:

  1. test_expand_shard_0 fails: the backward result is incorrect
Traceback (most recent call last):
  File "semi_auto_parallel_for_expand.py", line 100, in <module>
    TestExpandApiForSemiAutoParallel().run_test_case()
  File "semi_auto_parallel_for_expand.py", line 94, in run_test_case
    self.test_expand_shard_0()
  File "semi_auto_parallel_for_expand.py", line 39, in test_expand_shard_0
    _, output = self.runfunc_and_check(
  File "/home/aistudio/Paddle-gpu/test/auto_parallel/semi_auto_parallel_util.py", line 140, in runfunc_and_check
    self.check_tensor_eq(x.grad, dist_x.grad)
  File "/home/aistudio/Paddle-gpu/test/auto_parallel/semi_auto_parallel_util.py", line 37, in check_tensor_eq
    np.testing.assert_allclose(np1, np2, rtol=1e-05, verbose=True)
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 783, in assert_array_compare
    flagged = func_assert_same_pos(x, y, func=isnan, hasval='nan')
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 753, in func_assert_same_pos
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-05, atol=0

x and y nan location mismatch:
 x: array([[[8., 8., 8., 8., 8., 8., 8., 8.]],

       [[8., 8., 8., 8., 8., 8., 8., 8.]],...
 y: array([[[8.000018e+00, 8.000029e+00, 8.000034e+00, 8.000037e+00,
         8.000007e+00, 8.000067e+00, 8.000029e+00, 8.000051e+00]],

  2. test_expand_shard_on_0 fails: the backward result is incorrect
Traceback (most recent call last):
  File "semi_auto_parallel_for_expand.py", line 101, in <module>
    TestExpandApiForSemiAutoParallel().run_test_case()
  File "semi_auto_parallel_for_expand.py", line 96, in run_test_case
    self.test_expand_shard_on_0()
  File "semi_auto_parallel_for_expand.py", line 70, in test_expand_shard_on_0
    self.test_body(
  File "semi_auto_parallel_for_expand.py", line 66, in test_body
    self.check_tensor_eq(x.grad, dist_x.grad)
  File "/home/aistudio/Paddle-gpu/test/auto_parallel/semi_auto_parallel_util.py", line 37, in check_tensor_eq
    np.testing.assert_allclose(np1, np2, rtol=1e-05, verbose=True)
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 789, in assert_array_compare
    flagged |= func_assert_same_pos(x, y,
  File "/usr/local/lib/python3.8/dist-packages/numpy/testing/_private/utils.py", line 753, in func_assert_same_pos
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-05, atol=0

x and y -inf location mismatch:
 x: array([[[8., 8., 8., 8., 8., 8., 8., 8.]],

       [[8., 8., 8., 8., 8., 8., 8., 8.]],...
 y: array([[[ 2.450679e+21,  4.940652e+30,  8.000000e+00,  8.000000e+00,
          1.766651e+22,  1.083532e+24,  8.000000e+00,  8.000000e+00]],

  3. test_expand_shard_on_2 fails: the forward expand call itself crashes
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::eager_api_expand(_object*, _object*, _object*)
1   expand_ad_func(paddle::Tensor const&, paddle::experimental::IntArrayBase<paddle::Tensor>)
2   paddle::experimental::expand(paddle::Tensor const&, paddle::experimental::IntArrayBase<paddle::Tensor> const&)

----------------------
Error Message Summary:
----------------------
FatalError: `Erroneous arithmetic operation` is detected by the operating system.
  [TimeInfo: *** Aborted at 1701845505 (unix time) try "date -d @1701845505" if you are using GNU date ***]
  [SignalInfo: *** SIGFPE (@0x7fa47cf4ba9e) received by PID 114938 (TID 0x7fa4a9cd4740) from PID 2096413342 ***]

LAUNCH INFO 2023-12-06 06:51:46,082 Pod failed
LAUNCH ERROR 2023-12-06 06:51:46,082 Container failed !!!

paddle-bot bot commented Nov 28, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Nov 28, 2023
@MarioLulab MarioLulab requested a review from cxxly December 4, 2023 03:32
@cxxly
Contributor

cxxly commented Dec 5, 2023

Following the squeeze op as a reference, configure the yaml and add distributed test cases.


SpmdInfo ExpandInferSpmd(const DistMetaTensor& x,
                         const std::vector<int64_t>& shape);
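As a rough illustration of what such a rule computes, here is a hypothetical Python sketch (not the actual C++ implementation, which works on DistMetaTensor): axes carried over from the input keep its dims_mapping, while new or broadcast axes must be replicated, conventionally encoded as -1.

```python
# Hypothetical sketch of expand's forward SPMD propagation.
# -1 in a dims_mapping means "replicated on the process mesh".
def expand_infer_spmd(x_shape, x_dims_mapping, target_shape):
    ndim = len(target_shape)
    offset = ndim - len(x_shape)  # leading axes added by the expand
    out_dims_mapping = []
    for i in range(ndim):
        src = i - offset
        if src < 0 or (x_shape[src] == 1 and target_shape[i] != 1):
            out_dims_mapping.append(-1)  # new or broadcast axis: replicate
        else:
            out_dims_mapping.append(x_dims_mapping[src])  # carried over
    return out_dims_mapping

# e.g. x of shape [8, 6] sharded on mesh axis 0, expanded to [4, 8, 6]:
print(expand_infer_spmd([8, 6], [0, -1], [4, 8, 6]))  # -> [-1, 0, -1]
```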

Contributor

Keep this consistent with the ops.yaml signature: use IntArray.

Contributor Author

@MarioLulab MarioLulab Dec 6, 2023

My understanding is that the spmd_rules interface should not need to change to IntArray, because the code-generation script (https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/phi/api/yaml/generator/dist_api_gen.py#L851-L852) checks whether the ops.yaml signature contains an IntArray; if it does, the generated api.cc and backward.cc automatically append GetData() to the const IntArray& argument to obtain the corresponding vector<int64_t> expected by the spmd_rules interface, as shown in the screenshot:
[screenshot of generated code]
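The generator behavior described above can be sketched roughly like this (a hypothetical, heavily simplified stand-in for dist_api_gen.py's actual logic; the function name is invented for illustration):

```python
# Hypothetical sketch: when an argument is declared as IntArray in
# ops.yaml, the generated C++ call site appends .GetData() so the spmd
# rule receives a std::vector<int64_t> instead of the IntArray wrapper.
def spmd_rule_arg(arg_name, arg_type):
    if arg_type == "const IntArray&":
        return f"{arg_name}.GetData()"
    return arg_name

print(spmd_rule_arg("shape", "const IntArray&"))  # -> shape.GetData()
print(spmd_rule_arg("x", "const Tensor&"))        # -> x
```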

Contributor

@GhostScreaming GhostScreaming Dec 6, 2023

This needs to be unified across ops.yaml, dist_api_gen.py, and the shape argument of InferSPMD: it should take either std::vector<int64_t> or IntArray. If we switch to IntArray, the shape of ReshapeInferSPMD also needs to change accordingly, to avoid dist_api_gen.py carrying two implementations.

Contributor Author

> This needs to be unified across ops.yaml, dist_api_gen.py, and the shape argument of InferSPMD: it should take either std::vector<int64_t> or IntArray. If we switch to IntArray, the shape of ReshapeInferSPMD also needs to change accordingly, to avoid dist_api_gen.py carrying two implementations.

Then shall I leave this unchanged for now and keep std::vector<int64_t>? This PR would first complete the expand spmd rule, and a follow-up PR would do the unification work.

SpmdInfo ExpandInferSpmdReverse(const DistMetaTensor& x,
                                const DistMetaTensor& out,
                                const std::vector<int64_t>& shape);

Contributor

Same as above.

@@ -558,6 +558,12 @@ def is_reshape_kernel(self):
and 'grad' not in self.kernel['func'][0]
)

def is_expand_kernel(self):
Contributor

I suggest merging this with is_reshape_kernel into a single function need_calculate_local_shape, driven by a whitelist ['reshape', 'expand']; only kernels on the whitelist need the special handling. Any kernel that later needs local_shape computed can simply be added to the whitelist.
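The suggested refactor might look roughly like this (a hypothetical sketch; the class and attribute layout are simplified stand-ins for the real generator code):

```python
# Hypothetical sketch of the suggested refactor: one whitelist-driven
# predicate instead of per-op methods like is_reshape_kernel.
LOCAL_SHAPE_KERNELS = ['reshape', 'expand']

class KernelInfo:
    def __init__(self, func_name):
        # Mirrors the self.kernel['func'][0] access in the diff above.
        self.kernel = {'func': [func_name]}

    def need_calculate_local_shape(self):
        # Only non-grad kernels on the whitelist need local-shape handling.
        func = self.kernel['func'][0]
        return func in LOCAL_SHAPE_KERNELS and 'grad' not in func

print(KernelInfo('expand').need_calculate_local_shape())       # True
print(KernelInfo('expand_grad').need_calculate_local_shape())  # False
```

Adding a future kernel then only means appending its name to LOCAL_SHAPE_KERNELS.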

Contributor Author

Got it, I will update this PR accordingly.


paddle-ci-bot bot commented Dec 29, 2023

Sorry to inform you that 33acb63's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@paddle-bot paddle-bot bot closed this Dec 31, 2024

paddle-bot bot commented Jan 6, 2025

Since you haven't replied for more than a year, we have closed this issue/pr.
If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.

Labels
contributor External developers
Projects
None yet
3 participants