-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33906][WEBUI] Fix the bug of UI Executor page stuck due to undefined peakMemoryMetrics #30920
Conversation
Kubernetes integration test starting |
Hi, @baohe-zhang . Is this caused by #30186 ? |
Thank you for your contribution, @baohe-zhang . Although this sounds like a serious regression in terms of UI, the code review might take a longer time because it's a holiday season. Happy Christmas and New Year! |
Thank you, happy Christmas! |
Kubernetes integration test status failure |
Test build #133360 has finished for PR 30920 at commit
|
Retest this please. |
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #133363 has finished for PR 30920 at commit
|
Hi @baohe-zhang , could you check this PR? |
We can reproduce this issue with |
Oh, my bad, seem I referenced this part code when making my pr, but change my part, didn't notice these code have problem too. A good catch. |
Hi @sarutak that PR fixed the same issue in stages tab, and the current PR fix issue in executors tab. Should I handle the case of peakMemoryMetrics undefined similar to that PR, like return '0.0 B / 0.0 B' instead of empty string? |
@baohe-zhang |
@@ -414,6 +414,9 @@ $(document).ready(function () { | |||
}, | |||
{ | |||
data: function (row, type) { | |||
if (typeof row.peakMemoryMetrics == "undefined") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to use ===
and !==
rather than ==
and !=
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@sarutak I updated it to show N/A when undefined. |
formatBytes(peakMemoryMetrics.JVMOffHeapMemory, type)); | ||
} else { | ||
if (type !== 'display') { | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to return ""
rather than 0
like done here.
Could you update the similar logic in the previous PR too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually in https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/stagepage.js#L501, it also returns 0.
If the column is invisible, it seems fine to return anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that returns 0 but I wonder it's impossible for us to tell that "the metrics is not updated yet" from "that the metrics is actually 0".
Kubernetes integration test starting |
Kubernetes integration test status failure |
Sorry I suggested |
Test build #133374 has finished for PR 30920 at commit
|
cc @gengliangwang as well. |
if (type !== 'display') { | ||
return 0; | ||
} else { | ||
return 'N/A'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about returning '0.0 B / 0.0 B'
like
https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/stagepage.js#L503
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Replaced 'N/A' with '0.0 B / 0.0 B'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix!
Test build #133455 has finished for PR 30920 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a concern mentioned here but it's better to focus to resolve the stuck issue for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you all. I agree with you. Let's fix this stuck issue first
…efined peakMemoryMetrics ### What changes were proposed in this pull request? Check if the executorSummary.peakMemoryMetrics is defined before accessing it. Without checking, the UI has risked being stuck at the Executors page. ### Why are the changes needed? App live UI may stuck at Executors page without this fix. Steps to reproduce (with master branch): In mac OS standalone mode, open a spark-shell $SPARK_HOME/bin/spark-shell --master spark://localhost:7077 val x = sc.makeRDD(1 to 100000, 5) x.count() Then open the app UI in the browser, and click the Executors page, will get stuck at this page: ![image](https://user-images.githubusercontent.com/26694233/103105677-ca1a7380-45f4-11eb-9245-c69f4a4e816b.png) Also, the return JSON from API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors miss "peakMemoryMetrics" for executor objects. I attached the full json text in https://issues.apache.org/jira/browse/SPARK-33906. I debugged it and observed that ExecutorMetricsPoller .getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The possible reason for returning the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before the reportHeartbeat is called. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test, rerun the steps of bug reproduce and see the bug is gone. Closes #30920 from baohe-zhang/SPARK-33906. Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 45df6db) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Since this is caused by SPARK-23432 (#30186), this is backported to branch-3.1 to unblock Apache Spark RC1. |
Late +1. Thanks @baohe-zhang for the fix! |
Thanks all. I am going to start working on cutting RC1 today. |
What changes were proposed in this pull request?
Check if the executorSummary.peakMemoryMetrics is defined before accessing it. Without checking, the UI has risked being stuck at the Executors page.
Why are the changes needed?
App live UI may stuck at Executors page without this fix.
Steps to reproduce (with master branch):
In mac OS standalone mode, open a spark-shell
$SPARK_HOME/bin/spark-shell --master spark://localhost:7077
val x = sc.makeRDD(1 to 100000, 5)
x.count()
Then open the app UI in the browser, and click the Executors page, will get stuck at this page:
Also, the return JSON from API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors miss "peakMemoryMetrics" for executor objects. I attached the full json text in https://issues.apache.org/jira/browse/SPARK-33906.
I debugged it and observed that ExecutorMetricsPoller
.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The possible reason for returning the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before the reportHeartbeat is called.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manual test, rerun the steps of bug reproduce and see the bug is gone.