
[SPARK-28638][WebUI] Task summary should only contain successful tasks' metrics #25369

Closed
wants to merge 5 commits into apache:master from gengliangwang:fixStagePage

Conversation

@gengliangwang (Member) commented Aug 6, 2019

What changes were proposed in this pull request?

Currently, on requesting summary metrics, cached data are returned if the current number of "SUCCESS" tasks is the same as the value in cached data.
However, the number of "SUCCESS" tasks is wrong when there are running tasks: in AppStatusStore, the KVStore is an ElementTrackingStore rather than an InMemoryStore, so the count is actually the number of "SUCCESS" tasks plus "RUNNING" tasks. As a result, even after the running tasks finish, out-of-date cached data is returned.

This PR fixes the code that computes the number of "SUCCESS" tasks.
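The cache-invalidation problem above can be illustrated with a minimal, self-contained sketch (hypothetical names, not Spark's actual AppStatusStore code): if the count used for the cache check includes RUNNING tasks, a stale summary survives even after those tasks succeed.

```scala
// Hypothetical model of the caching bug; all names are illustrative.
object CachedSummarySketch {
  sealed trait Status
  case object Success extends Status
  case object Running extends Status

  // The "summary" just records how many successful tasks it was computed from.
  final case class Summary(computedFrom: Int)
  final case class Cached(taskCount: Int, summary: Summary)

  // Runs the cache check twice: once with 3 finished + 2 running tasks, then
  // again after all 5 tasks succeed. Returns what the second lookup reports.
  def run(count: Seq[Status] => Int): Int = {
    var cache: Option[Cached] = None
    def summary(tasks: Seq[Status]): Summary = {
      val n = count(tasks)
      cache match {
        case Some(c) if c.taskCount == n => c.summary // cache hit: maybe stale
        case _ =>
          val fresh = Summary(tasks.count(_ == Success))
          cache = Some(Cached(n, fresh))
          fresh
      }
    }
    summary(Seq(Success, Success, Success, Running, Running)) // populate cache
    summary(Seq.fill(5)(Success)).computedFrom                // after tasks finish
  }

  def main(args: Array[String]): Unit = {
    // Buggy count (SUCCESS + RUNNING): both lookups see 5 tasks, so the stale
    // summary computed from only 3 successes is returned.
    println(run(_.size)) // 3
    // Fixed count (SUCCESS only): 3 then 5, so the summary is recomputed.
    println(run(_.count(_ == Success))) // 5
  }
}
```

With the buggy count, the second lookup hits the cache and reports a summary computed from only 3 successful tasks; with the fixed count, the mismatch forces a recomputation.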

How was this patch tested?

Tested manually: run

sc.parallelize(1 to 160, 40).map(i => Thread.sleep(i*100)).collect()

and keep refreshing the stage page; the task summary metrics are wrong before the fix.

Before fix:

![image](https://user-images.githubusercontent.com/1097932/62560343-6a141780-b8af-11e9-8942-d88540659a93.png)

After fix:

![image](https://user-images.githubusercontent.com/1097932/62560355-7009f880-b8af-11e9-8ba8-10c083a48d7b.png)

@gengliangwang (Member, Author)

cc @vanzin


SparkQA commented Aug 6, 2019

Test build #108727 has finished for PR 25369 at commit d0dbded.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) left a comment

The test failures look legitimate.

```
@@ -156,7 +156,8 @@ private[spark] class AppStatusStore(
       // cheaper for disk stores (avoids deserialization).
       val count = {
         Utils.tryWithResource(
-          if (store.isInstanceOf[InMemoryStore]) {
+          if (store.isInstanceOf[ElementTrackingStore]) {
```
@vanzin (Contributor)

I think a new method `private def isLiveUI: Boolean` would make the intent here clearer. It could be this instance check, or just a check of whether `listener` is defined.

@gengliangwang (Member, Author)

@vanzin Thanks for the suggestion. I have updated the code.


SparkQA commented Aug 7, 2019

Test build #108751 has finished for PR 25369 at commit 18bde1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -156,7 +162,8 @@ private[spark] class AppStatusStore(
       // cheaper for disk stores (avoids deserialization).
       val count = {
         Utils.tryWithResource(
-          if (store.isInstanceOf[InMemoryStore]) {
+          if (isInMemoryStore) {
```
@gengliangwang (Member, Author)

After reading the comment in https://github.com/apache/spark/pull/23088/files#diff-3bd1667878f7bda9a56f95e93a80b475R233, I think it is intentional to count only the "SUCCESS" tasks when using InMemoryStore.
cc @shahidki31

@shahidki31 (Contributor)

Thanks @gengliangwang. The history server uses the InMemory store by default. For the disk store case, this isn't an optimal way of finding successful tasks. I have yet to raise a PR to support the disk store case.

I think you just need to add:

```
isInMemoryStore: Boolean = listener.isDefined || store.isInstanceOf[InMemoryStore]
```

I will test your code with all the scenarios. In any case, this would be temporary, as we need to support the disk store as well.

@gengliangwang (Member, Author)

@shahidki31 Thanks for the suggestion.

I am aware that #23088 is meant to follow the behavior of previous versions of Spark. But I wonder if we can simply show the summary metrics for all the tasks instead of only the "SUCCESS" ones, since all the tasks are listed in the task table anyway. That should also make sense to users, the implementation would be simpler, and we wouldn't have to worry about the performance of the disk store.
Also cc @vanzin @srowen

Member

I don't know enough to evaluate this one, sorry. The code change itself looks plausible.

@shahidki31 (Contributor) commented Aug 7, 2019

@gengliangwang If I understand correctly, #23088 actually fixes this issue, right? (At least in the history server case.) That is because count always filters out the running tasks, as ExecutorRunTime is defined only for finished tasks, while scanning the tasks includes all of them, running tasks included.

```
.index(TaskIndexNames.EXEC_RUN_TIME)
.parent(stageKey)
.index(index)
.first(0L)
```

So, reverting #23088, and hence this PR, would not fix the issue? Please correct me if I am wrong.
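A minimal sketch of the point above (hypothetical simplification, not Spark's actual KVStore API): a metric such as executor run time is only defined once a task finishes, so counting via that field naturally excludes RUNNING tasks, while a plain scan sees every task.

```scala
// Hypothetical simplification (not Spark's KVStore API): filtering on a metric
// that is only defined for finished tasks excludes RUNNING tasks, while a
// plain scan counts every task, running ones included.
object RunTimeFilterSketch {
  final case class Task(status: String, executorRunTime: Option[Long])

  val tasks = Seq(
    Task("SUCCESS", Some(120L)),
    Task("SUCCESS", Some(80L)),
    Task("RUNNING", None) // run time not yet defined for a running task
  )

  // Plain scan: sees all tasks.
  def scanCount: Int = tasks.size

  // "Index on run time": only tasks with a defined run time are counted.
  def indexedCount: Int = tasks.count(_.executorRunTime.isDefined)

  def main(args: Array[String]): Unit = {
    println(scanCount)    // 3
    println(indexedCount) // 2
  }
}
```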

@vanzin (Contributor)

So it sounds like the current behavior is consistent with how Spark has behaved in the past, and this change is proposing a different approach.

A quick look at the history shows the current behavior has been like that forever...

Also, that pointed me at another thing that can be seen in the screenshots: the table header says "Summary Metrics for X Completed Tasks". So this change would make that wrong...

@vanzin (Contributor)

Ok, I need to backtrack here a little. My bad, I think I know why I'm confused.

I've been starting from the PR title, and it confused me a little bit. Could you change it so it describes what the change is doing? e.g. "Ignore non-succeeded tasks when calculating metric summaries."

Let me re-review trying to ignore that and focusing just on the code.

@gengliangwang (Member, Author)

@vanzin Got it. I have updated the title.
The PR itself is about the mismatched KVStore check, which leads to a stale cache for the task summary.

@shahidki31 (Contributor)

Maybe we need to add a test in AppStatusStoreSuite? I think everything there is currently tested against the InMemory store only.

@gengliangwang (Member, Author)

@shahidki31 sure, I have added test cases.


SparkQA commented Aug 7, 2019

Test build #108755 has finished for PR 25369 at commit d19c232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 7, 2019

Test build #108777 has finished for PR 25369 at commit a23f1f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -136,6 +136,8 @@ private[spark] class AppStatusStore(
     store.read(classOf[StageDataWrapper], Array(stageId, stageAttemptId)).locality
   }

+  private def isInMemoryStore: Boolean = store.isInstanceOf[InMemoryStore] || listener.isDefined
```
@vanzin (Contributor)

This check isn't whether it's an in-memory store, but whether you're running a live UI (vs. in-memory store in the SHS). Is that your intent? (The method name should match what you intend to check.)

@gengliangwang (Member, Author) commented Aug 9, 2019

I checked whether it is the live UI in the first commit.
As per the comment in https://github.com/apache/spark/pull/23088/files#diff-3bd1667878f7bda9a56f95e93a80b475R233 and #25369 (comment), I changed it to check whether it is InMemoryStore/live UI.
But later on, I didn't like the idea that the live UI is inconsistent with the SHS, so I raised the question.

@vanzin (Contributor)

Ok, I looked at this again ignoring the PR title and this makes sense. Sorry for flip-flopping here, but could you put this in a local variable in taskSummary with a comment explaining it? e.g.

```
// SPARK-26119: we only want to consider successful tasks when calculating the metrics summary,
// but currently this is very expensive when using a disk store. So we only trigger the slower code
// path when we know we have all data in memory. This check is an approximation of when we know
// that the data will be in memory.
val isInMemoryStore = store.isInstanceOf[InMemoryStore] || listener.isDefined
```

@gengliangwang (Member, Author)

Thanks, I have added the comment.

```
@@ -156,7 +162,8 @@ private[spark] class AppStatusStore(
       // cheaper for disk stores (avoids deserialization).
       val count = {
         Utils.tryWithResource(
-          if (store.isInstanceOf[InMemoryStore]) {
+          if (isInMemoryStore) {
```
@vanzin (Contributor)

I guess the underlying question you guys are asking is: should the metrics count only successful tasks, or all non-failed tasks?

I don't remember the behavior in 2.2 (which is what #23088 was trying to emulate?), which should have the correct answer.

@gengliangwang gengliangwang changed the title [SPARK-28638][WebUI] Task summary metrics are wrong when there are running tasks [SPARK-28638][WebUI] Task summary should only contain successful tasks' metrics Aug 9, 2019

SparkQA commented Aug 11, 2019

Test build #108933 has finished for PR 25369 at commit 08eca6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


vanzin commented Aug 12, 2019

Ok; the new comment is not 100% accurate but it's close enough.

Merging to master / 2.4.

@vanzin vanzin closed this in 48d04f7 Aug 12, 2019
vanzin pushed a commit that referenced this pull request Aug 12, 2019
[SPARK-28638][WebUI] Task summary should only contain successful tasks' metrics

## What changes were proposed in this pull request?

Currently, on requesting summary metrics, cached data are returned if the current number of "SUCCESS" tasks is the same as the value in cached data.
However, the number of "SUCCESS" tasks is wrong when there are running tasks: in `AppStatusStore`, the KVStore is an `ElementTrackingStore` rather than an `InMemoryStore`, so the count is actually the number of "SUCCESS" tasks plus "RUNNING" tasks. As a result, even after the running tasks finish, out-of-date cached data is returned.

This PR fixes the code that computes the number of "SUCCESS" tasks.

## How was this patch tested?

Test manually, run
```
sc.parallelize(1 to 160, 40).map(i => Thread.sleep(i*100)).collect()
```
and keep refreshing the stage page; the task summary metrics are wrong before the fix.

### Before fix:
![image](https://user-images.githubusercontent.com/1097932/62560343-6a141780-b8af-11e9-8942-d88540659a93.png)

### After fix:
![image](https://user-images.githubusercontent.com/1097932/62560355-7009f880-b8af-11e9-8ba8-10c083a48d7b.png)

Closes #25369 from gengliangwang/fixStagePage.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 48d04f7)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019