force sync batch norm grad sequential #52268
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
❌ The PR is not created using the PR template. You can refer to this Demo.
some comments
  // Node Construction
  auto grad_node =
      std::shared_ptr<SyncBatchNormGradNode>(new SyncBatchNormGradNode(6, 5));
  egr::Controller::Instance().PushBackForceSequentialNodes(grad_node.get());
AddForceSequentialNodes?
I think PushBack is still better; it conveys that the nodes are ordered.
      force_sequential_nodes_.pop();
    }
  }
  void PushBackForceSequentialNodes(GradNodeBase* node) {
AddForceSequentialNodes
@@ -111,6 +111,22 @@ std::vector<paddle::Tensor> RunBackward(
    const std::vector<paddle::Tensor>& no_grad_vars = {}) {
  VLOG(3) << "Start Backward";

  std::queue<GradNodeBase*> force_sequential_nodes_forward_queue =
Seal this into one function
Let's handle this in the next PR.
    }
  };

  if (force_sequential_nodes_set.count(next_node)) {
How about adding a quick path for an empty force_sequential_nodes_set?
Let's handle this in the next PR.
      } else {
        queue.push_back(std::move(next_node));
        ready_force_sequential_nodes.insert(next_node);
        continue;
no need
Let's handle this in the next PR.
LGTM, and fix the problem in the next PR ASAP.
LGTM
PR types
Bug fixes
PR changes
Others
Describe
This PR fixes hangs and incorrect accuracy caused by a mismatch between the forward execution order of SyncBatchNorm and its backward execution order.
Because of control flow, each card may execute different logic when running on multiple cards, so the number of operators is certainly not identical either.
The execution order of forward operators, including that of SyncBN, is controlled by the Python logic.
The execution order of backward-graph operators, however, is a topological order of the DAG: the framework determines it by a breadth-first search, scheduling a node once all of its inputs are ready. The execution order of each SyncBNGrad is therefore determined by its "distance" from the Loss.
Because of the control flow in the forward pass, the SyncBNGrad order can differ across cards, which may cause hangs or incorrect accuracy.
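For illustration only, below is a minimal, self-contained C++ sketch of the gating idea behind this fix: grad nodes registered during the forward pass are released in that same registration order during the backward traversal, so every card launches its SyncBatchNormGrad kernels in an identical sequence. All names here (GradNode, ForceSequentialGate, RunBackwardSketch) are hypothetical simplifications, not the actual Paddle implementation.

#include <deque>
#include <queue>
#include <unordered_set>

// Hypothetical stand-in for egr::GradNodeBase.
struct GradNode {
  int id;
};

// Releases tracked grad nodes only in their forward registration order.
class ForceSequentialGate {
 public:
  // Called while building the forward graph, e.g. once per SyncBatchNorm.
  void Register(GradNode* node) {
    order_.push(node);
    tracked_.insert(node);
  }

  bool IsTracked(GradNode* node) const { return tracked_.count(node) > 0; }

  // A tracked node may run only when it is the next one in forward order.
  bool MayRun(GradNode* node) const {
    return !order_.empty() && order_.front() == node;
  }

  void MarkDone(GradNode* node) {
    if (!order_.empty() && order_.front() == node) order_.pop();
  }

 private:
  std::queue<GradNode*> order_;
  std::unordered_set<GradNode*> tracked_;
};

// Simplified backward loop: ready nodes run FIFO, but tracked nodes that
// become ready out of order are parked until their forward-order turn comes.
void RunBackwardSketch(std::deque<GradNode*> ready, ForceSequentialGate* gate) {
  std::deque<GradNode*> parked;
  while (!ready.empty() || !parked.empty()) {
    // Release a parked node whose turn has come.
    for (auto it = parked.begin(); it != parked.end(); ++it) {
      if (gate->MayRun(*it)) {
        ready.push_back(*it);
        parked.erase(it);
        break;
      }
    }
    if (ready.empty()) break;  // nothing runnable right now
    GradNode* node = ready.front();
    ready.pop_front();
    if (gate->IsTracked(node) && !gate->MayRun(node)) {
      parked.push_back(node);  // defer until its forward-order turn
      continue;
    }
    // ... execute the node's grad kernel here ...
    gate->MarkDone(node);
  }
}

In the actual PR, the Controller plays a role similar to this gate: SyncBatchNorm registers its grad node via PushBackForceSequentialNodes while the forward graph is built, and RunBackward consumes that queue so all cards agree on the SyncBatchNormGrad execution order.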