Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33906][WEBUI] Fix the bug of UI Executor page stuck due to undefined peakMemoryMetrics #30920

Closed
wants to merge 3 commits into from

Conversation

baohe-zhang
Copy link

@baohe-zhang baohe-zhang commented Dec 24, 2020

What changes were proposed in this pull request?

Check if the executorSummary.peakMemoryMetrics is defined before accessing it. Without checking, the UI has risked being stuck at the Executors page.

Why are the changes needed?

App live UI may stuck at Executors page without this fix.
Steps to reproduce (with master branch):
In mac OS standalone mode, open a spark-shell
$SPARK_HOME/bin/spark-shell --master spark://localhost:7077

val x = sc.makeRDD(1 to 100000, 5)
x.count()

Then open the app UI in the browser, and click the Executors page, will get stuck at this page:
image

Also, the return JSON from API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors miss "peakMemoryMetrics" for executor objects. I attached the full json text in https://issues.apache.org/jira/browse/SPARK-33906.

I debugged it and observed that ExecutorMetricsPoller
.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The possible reason for returning the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before the reportHeartbeat is called.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test, rerun the steps of bug reproduce and see the bug is gone.

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37951/

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Dec 24, 2020

Hi, @baohe-zhang . Is this caused by #30186 ?

@dongjoon-hyun
Copy link
Member

Thank you for your contribution, @baohe-zhang . Although this sounds like a serious regression in terms of UI, the code review might take a longer time because it's a holiday season.

Happy Christmas and New Year!

@baohe-zhang
Copy link
Author

Thank you, happy Christmas!

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37951/

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Test build #133360 has finished for PR 30920 at commit c3ab7a1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Dec 24, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37954/

@SparkQA
Copy link

SparkQA commented Dec 25, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37954/

@SparkQA
Copy link

SparkQA commented Dec 25, 2020

Test build #133363 has finished for PR 30920 at commit c3ab7a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak
Copy link
Member

sarutak commented Dec 25, 2020

Hi @baohe-zhang , could you check this PR?
That PR solves similar issue.

@sarutak
Copy link
Member

sarutak commented Dec 25, 2020

We can reproduce this issue with --master local-cluster[1,1,1024] more easily.

@AngersZhuuuu
Copy link
Contributor

Hi @baohe-zhang , could you check this PR?
That PR solves similar issue.

Oh, my bad, seem I referenced this part code when making my pr, but change my part, didn't notice these code have problem too. A good catch.

@baohe-zhang
Copy link
Author

Hi @sarutak that PR fixed the same issue in stages tab, and the current PR fix issue in executors tab. Should I handle the case of peakMemoryMetrics undefined similar to that PR, like return '0.0 B / 0.0 B' instead of empty string?

@sarutak
Copy link
Member

sarutak commented Dec 25, 2020

@baohe-zhang
I think it's better to do with the same way as that PR.
But I wonder it's more proper to show N/A rather than 0.0B / 0.0B to intend it's not updated yet when peakMemoryMetrics is undefined.

@@ -414,6 +414,9 @@ $(document).ready(function () {
},
{
data: function (row, type) {
if (typeof row.peakMemoryMetrics == "undefined") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use === and !== rather than == and !=.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@baohe-zhang
Copy link
Author

@sarutak I updated it to show N/A when undefined.

formatBytes(peakMemoryMetrics.JVMOffHeapMemory, type));
} else {
if (type !== 'display') {
return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to return "" rather than 0 like done here.

Could you update the similar logic in the previous PR too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually in https://github.com/apache/spark/blob/master/core/src/main/resources/org/apache/spark/ui/static/stagepage.js#L501, it also returns 0.
If the column is invisible, it seems fine to return anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that returns 0 but I wonder it's impossible for us to tell that "the metrics is not updated yet" from "that the metrics is actually 0".

@SparkQA
Copy link

SparkQA commented Dec 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37965/

@SparkQA
Copy link

SparkQA commented Dec 25, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37965/

@sarutak
Copy link
Member

sarutak commented Dec 25, 2020

Sorry I suggested N/A but "" seems to be preferred in other places for now so let's go with "" for now for consistency.

@SparkQA
Copy link

SparkQA commented Dec 25, 2020

Test build #133374 has finished for PR 30920 at commit e7154fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

cc @gengliangwang as well.

if (type !== 'display') {
return 0;
} else {
return 'N/A';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Replaced 'N/A' with '0.0 B / 0.0 B'

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

@SparkQA
Copy link

SparkQA commented Dec 28, 2020

Test build #133455 has finished for PR 30920 at commit 2f7b9be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@sarutak sarutak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a concern mentioned here but it's better to focus to resolve the stuck issue for now.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you all. I agree with you. Let's fix this stuck issue first

dongjoon-hyun pushed a commit that referenced this pull request Dec 31, 2020
…efined peakMemoryMetrics

### What changes were proposed in this pull request?
Check if the executorSummary.peakMemoryMetrics is defined before accessing it. Without checking, the UI has risked being stuck at the Executors page.

### Why are the changes needed?
App live UI may stuck at Executors page without this fix.
Steps to reproduce (with master branch):
In mac OS standalone mode, open a spark-shell
$SPARK_HOME/bin/spark-shell --master spark://localhost:7077

val x = sc.makeRDD(1 to 100000, 5)
x.count()

Then open the app UI in the browser, and click the Executors page, will get stuck at this page:
![image](https://user-images.githubusercontent.com/26694233/103105677-ca1a7380-45f4-11eb-9245-c69f4a4e816b.png)

Also, the return JSON from API endpoint http://localhost:4040/api/v1/applications/app-20201224134418-0003/executors miss "peakMemoryMetrics" for executor objects. I attached the full json text in https://issues.apache.org/jira/browse/SPARK-33906.

I debugged it and observed that ExecutorMetricsPoller
.getExecutorUpdates returns an empty map, which causes peakExecutorMetrics to None in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L345. The possible reason for returning the empty map is that the stage completion time is shorter than the heartbeat interval, so the stage entry in stageTCMP has already been removed before the reportHeartbeat is called.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test, rerun the steps of bug reproduce and see the bug is gone.

Closes #30920 from baohe-zhang/SPARK-33906.

Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 45df6db)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member

Since this is caused by SPARK-23432 (#30186), this is backported to branch-3.1 to unblock Apache Spark RC1.

@HeartSaVioR
Copy link
Contributor

Late +1. Thanks @baohe-zhang for the fix!

@HyukjinKwon
Copy link
Member

Thanks all. I am going to start working on cutting RC1 today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants