force sync batch norm grad sequential #52268
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
❌ The PR is not created using the PR template. You can refer to this Demo.
some comments
  // Node Construction
  auto grad_node =
      std::shared_ptr<SyncBatchNormGradNode>(new SyncBatchNormGradNode(6, 5));
  egr::Controller::Instance().PushBackForceSequentialNodes(grad_node.get());
AddForceSequentialNodes?
I think PushBack is still better; it conveys that the nodes are ordered.
      force_sequential_nodes_.pop();
    }
  }
  void PushBackForceSequentialNodes(GradNodeBase* node) {
AddForceSequentialNodes
@@ -111,6 +111,22 @@ std::vector<paddle::Tensor> RunBackward(
    const std::vector<paddle::Tensor>& no_grad_vars = {}) {
  VLOG(3) << "Start Backward";

  std::queue<GradNodeBase*> force_sequential_nodes_forward_queue =
Seal this into one function
Let's handle this in the next PR.
    }
  };

  if (force_sequential_nodes_set.count(next_node)) {
How about adding a quick path for an empty force_sequential_nodes_set?
Let's handle this in the next PR.
      } else {
        queue.push_back(std::move(next_node));
        ready_force_sequential_nodes.insert(next_node);
        continue;
no need
Let's handle this in the next PR.
LGTM, and fix the problem in the next PR ASAP.
LGTM
PR types
Bug fixes
PR changes
Others
Describe
This PR fixes hangs and incorrect accuracy caused by a mismatch between the forward execution order of SyncBatchNorm and its backward execution order.
Because of control flow, each card may execute different logic when running on multiple cards, so the number of operators is certainly not identical either.
The execution order of forward operators, including that of SyncBN, is controlled by the Python logic.
The execution order of backward-graph operators, however, is a topological order of the DAG: the framework determines it by a breadth-first search, scheduling a node once all of its inputs are ready. The execution order of each SyncBNGrad is therefore determined by its "distance" from the Loss.
Because of the control flow in the forward pass, the SyncBNGrad order can differ across cards, which may cause hangs or incorrect accuracy.
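For illustration only, below is a minimal, self-contained C++ sketch of the gating idea behind this fix: grad nodes registered during the forward pass are released in that same registration order during the backward traversal, so every card launches its SyncBatchNormGrad kernels in an identical sequence. All names here (GradNode, ForceSequentialGate, RunBackwardSketch) are hypothetical simplifications, not the actual Paddle implementation.

#include <deque>
#include <queue>
#include <unordered_set>

// Hypothetical stand-in for egr::GradNodeBase.
struct GradNode {
  int id;
};

// Releases tracked grad nodes only in their forward registration order.
class ForceSequentialGate {
 public:
  // Called while building the forward graph, e.g. once per SyncBatchNorm.
  void Register(GradNode* node) {
    order_.push(node);
    tracked_.insert(node);
  }

  bool IsTracked(GradNode* node) const { return tracked_.count(node) > 0; }

  // A tracked node may run only when it is the next one in forward order.
  bool MayRun(GradNode* node) const {
    return !order_.empty() && order_.front() == node;
  }

  void MarkDone(GradNode* node) {
    if (!order_.empty() && order_.front() == node) order_.pop();
  }

 private:
  std::queue<GradNode*> order_;
  std::unordered_set<GradNode*> tracked_;
};

// Simplified backward loop: ready nodes run FIFO, but tracked nodes that
// become ready out of order are parked until their forward-order turn comes.
void RunBackwardSketch(std::deque<GradNode*> ready, ForceSequentialGate* gate) {
  std::deque<GradNode*> parked;
  while (!ready.empty() || !parked.empty()) {
    // Release a parked node whose turn has come.
    for (auto it = parked.begin(); it != parked.end(); ++it) {
      if (gate->MayRun(*it)) {
        ready.push_back(*it);
        parked.erase(it);
        break;
      }
    }
    if (ready.empty()) break;  // nothing runnable right now
    GradNode* node = ready.front();
    ready.pop_front();
    if (gate->IsTracked(node) && !gate->MayRun(node)) {
      parked.push_back(node);  // defer until its forward-order turn
      continue;
    }
    // ... execute the node's grad kernel here ...
    gate->MarkDone(node);
  }
}

In the actual PR, the Controller plays a role similar to this gate: SyncBatchNorm registers its grad node via PushBackForceSequentialNodes while the forward graph is built, and RunBackward consumes that queue so all cards agree on the SyncBatchNormGrad execution order.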