
[SPARK-28638][WebUI] Task summary should only contain successful tasks' metrics #25369

Closed
wants to merge 5 commits into apache:master from gengliangwang:fixStagePage

Conversation

@gengliangwang (Member) commented Aug 6, 2019

What changes were proposed in this pull request?

Currently, on requesting summary metrics, cached data are returned if the current number of "SUCCESS" tasks is the same as the value in cached data.
However, the number of "SUCCESS" tasks is wrong when there are running tasks: in AppStatusStore, the KVStore is an ElementTrackingStore rather than an InMemoryStore, so the count is actually the number of "SUCCESS" tasks plus "RUNNING" tasks. As a result, even after the running tasks finish, out-of-date cached data is returned.

This PR fixes the code that computes the number of "SUCCESS" tasks.
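The cache-invalidation problem above can be illustrated with a minimal, self-contained sketch (hypothetical names, not Spark's actual AppStatusStore code): if the count used for the cache check includes RUNNING tasks, a stale summary survives even after those tasks succeed.

```scala
// Hypothetical model of the caching bug; all names are illustrative.
object CachedSummarySketch {
  sealed trait Status
  case object Success extends Status
  case object Running extends Status

  // The "summary" just records how many successful tasks it was computed from.
  final case class Summary(computedFrom: Int)
  final case class Cached(taskCount: Int, summary: Summary)

  // Runs the cache check twice: once with 3 finished + 2 running tasks, then
  // again after all 5 tasks succeed. Returns what the second lookup reports.
  def run(count: Seq[Status] => Int): Int = {
    var cache: Option[Cached] = None
    def summary(tasks: Seq[Status]): Summary = {
      val n = count(tasks)
      cache match {
        case Some(c) if c.taskCount == n => c.summary // cache hit: maybe stale
        case _ =>
          val fresh = Summary(tasks.count(_ == Success))
          cache = Some(Cached(n, fresh))
          fresh
      }
    }
    summary(Seq(Success, Success, Success, Running, Running)) // populate cache
    summary(Seq.fill(5)(Success)).computedFrom                // after tasks finish
  }

  def main(args: Array[String]): Unit = {
    // Buggy count (SUCCESS + RUNNING): both lookups see 5 tasks, so the stale
    // summary computed from only 3 successes is returned.
    println(run(_.size)) // 3
    // Fixed count (SUCCESS only): 3 then 5, so the summary is recomputed.
    println(run(_.count(_ == Success))) // 5
  }
}
```

With the buggy count, the second lookup hits the cache and reports a summary computed from only 3 successful tasks; with the fixed count, the mismatch forces a recomputation.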

How was this patch tested?

Tested manually: run

sc.parallelize(1 to 160, 40).map(i => Thread.sleep(i*100)).collect()

and keep refreshing the stage page; the task summary metrics are wrong before the fix.

Before fix:

![image](https://user-images.githubusercontent.com/1097932/62560343-6a141780-b8af-11e9-8942-d88540659a93.png)

After fix:

![image](https://user-images.githubusercontent.com/1097932/62560355-7009f880-b8af-11e9-8ba8-10c083a48d7b.png)

@gengliangwang (Member, Author)

cc @vanzin


SparkQA commented Aug 6, 2019

Test build #108727 has finished for PR 25369 at commit d0dbded.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) left a comment

The test failures look legitimate.

```
@@ -156,7 +156,8 @@ private[spark] class AppStatusStore(
       // cheaper for disk stores (avoids deserialization).
       val count = {
         Utils.tryWithResource(
-          if (store.isInstanceOf[InMemoryStore]) {
+          if (store.isInstanceOf[ElementTrackingStore]) {
```
@vanzin (Contributor)

I think a new method `private def isLiveUI: Boolean` would make the intent here clearer. It could be this instance check, or just a check of whether `listener` is defined.

@gengliangwang (Member, Author)

@vanzin Thanks for the suggestion. I have updated the code.


SparkQA commented Aug 7, 2019

Test build #108751 has finished for PR 25369 at commit 18bde1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -156,7 +162,8 @@ private[spark] class AppStatusStore(
       // cheaper for disk stores (avoids deserialization).
       val count = {
         Utils.tryWithResource(
-          if (store.isInstanceOf[InMemoryStore]) {
+          if (isInMemoryStore) {
```
@gengliangwang (Member, Author)

After reading the comment in https://github.com/apache/spark/pull/23088/files#diff-3bd1667878f7bda9a56f95e93a80b475R233, I think it is intentional to count only the "SUCCESS" tasks when using InMemoryStore.
cc @shahidki31

@shahidki31 (Contributor)

Thanks @gengliangwang. The history server uses the InMemory store by default. For the disk store case, this isn't an optimal way of finding successful tasks. I have yet to raise a PR to support the disk store case.

I think you just need to add:

```
isInMemoryStore: Boolean = listener.isDefined || store.isInstanceOf[InMemoryStore]
```

I will test your code with all the scenarios. In any case, this would be temporary, as we need to support the disk store as well.

@gengliangwang (Member, Author)

@shahidki31 Thanks for the suggestion.

I am aware that #23088 is meant to follow the behavior of previous versions of Spark. But I wonder if we can simply show the summary metrics for all the tasks instead of only the "SUCCESS" ones, since all the tasks are listed in the task table anyway. That should also make sense to users, the implementation would be simpler, and we wouldn't have to worry about the performance of the disk store.
Also cc @vanzin @srowen

Member

I don't know enough to evaluate this one, sorry. The code change itself looks plausible.

@shahidki31 (Contributor) commented Aug 7, 2019

@gengliangwang If I understand correctly, #23088 actually fixes this issue, right? (At least in the history server case.) That is because count always filters out the running tasks, as ExecutorRunTime is defined only for finished tasks, while scanning the tasks includes all of them, running tasks included.

```
.index(TaskIndexNames.EXEC_RUN_TIME)
.parent(stageKey)
.index(index)
.first(0L)
```

So, reverting #23088, and hence this PR, would not fix the issue? Please correct me if I am wrong.
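A minimal sketch of the point above (hypothetical simplification, not Spark's actual KVStore API): a metric such as executor run time is only defined once a task finishes, so counting via that field naturally excludes RUNNING tasks, while a plain scan sees every task.

```scala
// Hypothetical simplification (not Spark's KVStore API): filtering on a metric
// that is only defined for finished tasks excludes RUNNING tasks, while a
// plain scan counts every task, running ones included.
object RunTimeFilterSketch {
  final case class Task(status: String, executorRunTime: Option[Long])

  val tasks = Seq(
    Task("SUCCESS", Some(120L)),
    Task("SUCCESS", Some(80L)),
    Task("RUNNING", None) // run time not yet defined for a running task
  )

  // Plain scan: sees all tasks.
  def scanCount: Int = tasks.size

  // "Index on run time": only tasks with a defined run time are counted.
  def indexedCount: Int = tasks.count(_.executorRunTime.isDefined)

  def main(args: Array[String]): Unit = {
    println(scanCount)    // 3
    println(indexedCount) // 2
  }
}
```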

@vanzin (Contributor)

So it sounds like the current behavior is consistent with how Spark has behaved in the past, and this change is proposing a different approach.

A quick look at the history shows the current behavior has been like that forever...

Also, that pointed me at another thing that can be seen in the screenshots: the table header says "Summary Metrics for X Completed Tasks". So this change would make that wrong...

@vanzin (Contributor)

Ok, I need to backtrack here a little. My bad, I think I know why I'm confused.

I've been starting from the PR title, and it confused me a little bit. Could you change it so it describes what the change is doing? e.g. "Ignore non-succeeded tasks when calculating metric summaries."

Let me re-review trying to ignore that and focusing just on the code.

@gengliangwang (Member, Author)

@vanzin Got it. I have updated the title.
The PR itself is about the mismatched KVStore check, which leads to a stale cache for the task summary.

@shahidki31 (Contributor)

Maybe we need to add a test in AppStatusStoreSuite? I think everything there is currently tested against the InMemory store only.

@gengliangwang (Member, Author)

@shahidki31 sure, I have added test cases.


SparkQA commented Aug 7, 2019

Test build #108755 has finished for PR 25369 at commit d19c232.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 7, 2019

Test build #108777 has finished for PR 25369 at commit a23f1f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -136,6 +136,8 @@ private[spark] class AppStatusStore(
     store.read(classOf[StageDataWrapper], Array(stageId, stageAttemptId)).locality
   }

+  private def isInMemoryStore: Boolean = store.isInstanceOf[InMemoryStore] || listener.isDefined
```
@vanzin (Contributor)

This check isn't whether it's an in-memory store, but whether you're running a live UI (vs. in-memory store in the SHS). Is that your intent? (The method name should match what you intend to check.)

@gengliangwang (Member, Author) commented Aug 9, 2019

I checked whether it is the live UI in the first commit.
As per the comment in https://github.com/apache/spark/pull/23088/files#diff-3bd1667878f7bda9a56f95e93a80b475R233 and #25369 (comment), I changed it to check whether it is InMemoryStore/live UI.
But later on, I didn't like the idea that the live UI is inconsistent with the SHS, so I raised the question.

@vanzin (Contributor)

Ok, I looked at this again ignoring the PR title and this makes sense. Sorry for flip-flopping here, but could you put this in a local variable in taskSummary with a comment explaining it? e.g.

```
// SPARK-26119: we only want to consider successful tasks when calculating the metrics summary,
// but currently this is very expensive when using a disk store. So we only trigger the slower code
// path when we know we have all data in memory. This check is an approximation of when we know
// that the data will be in memory.
val isInMemoryStore = store.isInstanceOf[InMemoryStore] || listener.isDefined
```

@gengliangwang (Member, Author)

Thanks, I have added the comment.

```
@@ -156,7 +162,8 @@ private[spark] class AppStatusStore(
       // cheaper for disk stores (avoids deserialization).
       val count = {
         Utils.tryWithResource(
-          if (store.isInstanceOf[InMemoryStore]) {
+          if (isInMemoryStore) {
```
@vanzin (Contributor)

I guess the underlying question you guys are asking is: should the metrics count only successful tasks, or all non-failed tasks?

I don't remember the behavior in 2.2 (which is what #23088 was trying to emulate?), which should have the correct answer.

@gengliangwang gengliangwang changed the title [SPARK-28638][WebUI] Task summary metrics are wrong when there are running tasks [SPARK-28638][WebUI] Task summary should only contain successful tasks' metrics Aug 9, 2019

SparkQA commented Aug 11, 2019

Test build #108933 has finished for PR 25369 at commit 08eca6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


vanzin commented Aug 12, 2019

Ok; the new comment is not 100% accurate but it's close enough.

Merging to master / 2.4.

@vanzin vanzin closed this in 48d04f7 Aug 12, 2019
vanzin pushed a commit that referenced this pull request Aug 12, 2019
[SPARK-28638][WebUI] Task summary should only contain successful tasks' metrics

## What changes were proposed in this pull request?

Currently, on requesting summary metrics, cached data are returned if the current number of "SUCCESS" tasks is the same as the value in cached data.
However, the number of "SUCCESS" tasks is wrong when there are running tasks: in `AppStatusStore`, the KVStore is an `ElementTrackingStore` rather than an `InMemoryStore`, so the count is actually the number of "SUCCESS" tasks plus "RUNNING" tasks. As a result, even after the running tasks finish, out-of-date cached data is returned.

This PR fixes the code that computes the number of "SUCCESS" tasks.

## How was this patch tested?

Test manually, run
```
sc.parallelize(1 to 160, 40).map(i => Thread.sleep(i*100)).collect()
```
and keep refreshing the stage page; the task summary metrics are wrong before the fix.

### Before fix:
![image](https://user-images.githubusercontent.com/1097932/62560343-6a141780-b8af-11e9-8942-d88540659a93.png)

### After fix:
![image](https://user-images.githubusercontent.com/1097932/62560355-7009f880-b8af-11e9-8ba8-10c083a48d7b.png)

Closes #25369 from gengliangwang/fixStagePage.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 48d04f7)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019