[SPARK-28638][WebUI] Task summary should only contain successful tasks' metrics #25369
Conversation
cc @vanzin

Test build #108727 has finished for PR 25369 at commit
The test failures look legitimate.
@@ -156,7 +156,8 @@ private[spark] class AppStatusStore(
       // cheaper for disk stores (avoids deserialization).
       val count = {
         Utils.tryWithResource(
-          if (store.isInstanceOf[InMemoryStore]) {
+          if (store.isInstanceOf[ElementTrackingStore]) {
I think a new method `private def isLiveUI: Boolean` would make the intent here clearer. It can be this instance check, or just checking whether `listener` is defined.
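As a standalone sketch of that suggestion (the class and field here are simplified stand-ins, not Spark's actual types): in `AppStatusStore`, the `listener` field is defined only when serving a live UI, so checking it expresses the intent directly.

```scala
// Hypothetical, simplified model of the reviewer's suggestion. In the real
// AppStatusStore, `listener` is an Option[AppStatusListener] that is defined
// only for a live UI (as opposed to the history server).
class AppStatusStoreModel(listener: Option[AnyRef]) {
  // A live UI always has an attached listener, so this reads more clearly
  // than an instance-of check on the underlying KVStore.
  def isLiveUI: Boolean = listener.isDefined
}
```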
@vanzin Thanks for the suggestion. I have updated the code.
Test build #108751 has finished for PR 25369 at commit
@@ -156,7 +162,8 @@ private[spark] class AppStatusStore(
       // cheaper for disk stores (avoids deserialization).
       val count = {
         Utils.tryWithResource(
-          if (store.isInstanceOf[InMemoryStore]) {
+          if (isInMemoryStore) {
After reading the comment in https://github.com/apache/spark/pull/23088/files#diff-3bd1667878f7bda9a56f95e93a80b475R233, I think it is intentional that only "SUCCESS" tasks are counted when using InMemoryStore.
cc @shahidki31
Thanks @gengliangwang. The history server uses the in-memory store by default. For the disk store case, this isn't an optimal way of finding successful tasks; I have yet to raise a PR to support the disk store case.
I think you just need to add:
isInMemoryStore: Boolean = listener.isDefined || store.isInstanceOf[InMemoryStore]
I will test your code with all the scenarios.
In any case, this would be temporary, as we need to support the disk store as well.
@shahidki31 Thanks for the suggestion.
I am aware that #23088 follows the behavior of previous versions of Spark. But I wonder if we can simply show the summary metrics for all the tasks instead of only the "SUCCESS" ones, since all the tasks are listed in the task table. Doing so should also make sense to users, the implementation will be simpler, and we won't have to worry about the performance of the disk store.
Also cc @vanzin @srowen
I don't know enough to evaluate this one, sorry. The code change itself looks plausible.
@gengliangwang If I understand correctly, #23088 actually fixes this issue, right? (At least in the history server case.) That's because `count` always filters out the running tasks, as `ExecutorRunTime` is defined only for finished tasks, whereas a scan of `tasks` contains all the tasks, including running ones.
.index(TaskIndexNames.EXEC_RUN_TIME)

spark/core/src/main/scala/org/apache/spark/status/AppStatusStore.scala, lines 263 to 265 in a59fdc4:
    .parent(stageKey)
    .index(index)
    .first(0L)
So, reverting #23088, and hence this PR, would not fix the issue? Please correct me if I am wrong.
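The point about the run-time index can be sketched in a small self-contained example (the case class and values are illustrative, not Spark's actual storage types): a count over a field that is only populated for finished tasks implicitly excludes running tasks, while a full scan sees every task.

```scala
// Illustrative model: the executor run time is only populated once a task
// finishes, so counting over that index skips RUNNING tasks, whereas
// scanning all stored tasks includes them.
case class TaskModel(status: String, execRunTime: Option[Long])

val tasks = Seq(
  TaskModel("SUCCESS", Some(120L)),
  TaskModel("SUCCESS", Some(95L)),
  TaskModel("FAILED", Some(30L)), // finished, but not successful
  TaskModel("RUNNING", None)      // still running: no run time yet
)

// Count over the "run time" index: only finished tasks appear
// (note this still includes FAILED tasks, not just SUCCESS ones).
val countViaRunTimeIndex = tasks.count(_.execRunTime.isDefined)

// Full scan: every task, including the running one.
val countViaScan = tasks.size
```

This also illustrates the open question in the thread: the index-based count excludes running tasks but still includes failed ones.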
So sounds like the current behavior is consistent with how Spark has behaved in the past, and this change is proposing a different approach.
A quick look at the history shows the current behavior has been that way forever...
Also, that pointed me at another thing that can be seen in the screenshots. The table header says "Summary Metrics for X Completed Tasks". So this change would be making that wrong...
Ok, I need to backtrack here a little. My bad, I think I know why I'm confused.
I've been starting from the PR title, and it confused me a little bit. Could you change it so it describes what the change is doing? e.g. "Ignore non-succeeded tasks when calculating metric summaries."
Let me re-review, trying to ignore that and focus just on the code.
@vanzin Got it. I have updated the title.
The PR itself is about the mismatched KVStore check, which leads to a stale cache for the task summary.
Maybe we need to add a test in AppStatusStoreSuite? There, all the stores are tested against the in-memory store only, I think.
@shahidki31 sure, I have added test cases.
Test build #108755 has finished for PR 25369 at commit
Test build #108777 has finished for PR 25369 at commit
@@ -136,6 +136,8 @@ private[spark] class AppStatusStore(
       store.read(classOf[StageDataWrapper], Array(stageId, stageAttemptId)).locality
     }

+   private def isInMemoryStore: Boolean = store.isInstanceOf[InMemoryStore] || listener.isDefined
This check isn't whether it's an in-memory store, but whether you're running a live UI (vs. in-memory store in the SHS). Is that your intent? (The method name should match what you intend to check.)
I checked whether it is a live UI in the first commit.
As per the comment in https://github.com/apache/spark/pull/23088/files#diff-3bd1667878f7bda9a56f95e93a80b475R233 and #25369 (comment), I changed it to check whether it is an InMemoryStore / live UI.
But later on, I didn't like the idea that the live UI is inconsistent with the SHS, so I raised the question.
Ok, I looked at this again ignoring the PR title and this makes sense. Sorry for flip-flopping here, but could you put this in a local variable in taskSummary with a comment explaining it? e.g.
// SPARK-26119: we only want to consider successful tasks when calculating the metrics summary,
// but currently this is very expensive when using a disk store. So we only trigger the slower code
// path when we know we have all data in memory. This check is an approximation of when we know
// that the data will be in memory.
val isInMemoryStore = store.isInstanceOf[InMemoryStore] || listener.isDefined
Thanks, I have added comments
@@ -156,7 +162,8 @@ private[spark] class AppStatusStore(
       // cheaper for disk stores (avoids deserialization).
       val count = {
         Utils.tryWithResource(
-          if (store.isInstanceOf[InMemoryStore]) {
+          if (isInMemoryStore) {
I guess the underlying question you guys are asking is: should the metrics count only successful tasks, or all non-failed tasks?
I don't remember the behavior in 2.2 (which is what #23088 was trying to emulate?), which should have the correct answer.
Test build #108933 has finished for PR 25369 at commit
Ok; the new comment is not 100% accurate but it's close enough. Merging to master / 2.4.
[SPARK-28638][WebUI] Task summary should only contain successful tasks' metrics

## What changes were proposed in this pull request?

Currently, on requesting summary metrics, cached data are returned if the current number of "SUCCESS" tasks is the same as the value in the cached data.

However, the number of "SUCCESS" tasks is wrong when there are running tasks. In `AppStatusStore`, the KVStore is `ElementTrackingStore`, instead of `InMemoryStore`. The value count is always the number of "SUCCESS" tasks plus "RUNNING" tasks. Thus, even when the running tasks are finished, the out-of-date cached data is returned.

This PR fixes the code that gets the number of "SUCCESS" tasks.

## How was this patch tested?

Tested manually: run

```
sc.parallelize(1 to 160, 40).map(i => Thread.sleep(i*100)).collect()
```

and keep refreshing the stage page; we can see the task summary metrics are wrong.

Closes #25369 from gengliangwang/fixStagePage.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
(cherry picked from commit 48d04f7)
What changes were proposed in this pull request?

Currently, on requesting summary metrics, cached data are returned if the current number of "SUCCESS" tasks is the same as the value in the cached data.

However, the number of "SUCCESS" tasks is wrong when there are running tasks. In `AppStatusStore`, the KVStore is `ElementTrackingStore`, instead of `InMemoryStore`. The value count is always the number of "SUCCESS" tasks plus "RUNNING" tasks. Thus, even when the running tasks are finished, the out-of-date cached data is returned.

This PR fixes the code that gets the number of "SUCCESS" tasks.

How was this patch tested?

Tested manually: run `sc.parallelize(1 to 160, 40).map(i => Thread.sleep(i*100)).collect()` and keep refreshing the stage page; we can see the task summary metrics are wrong.

Before fix: (screenshot)

After fix: (screenshot)
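The caching problem described above can be sketched with a toy model (the class and field names here are invented for illustration; the real cache lives inside Spark's `AppStatusStore`): a cached summary is reused whenever the "task count" matches, so computing that count over SUCCESS plus RUNNING tasks can accidentally match a stale entry.

```scala
// Toy model of the stale-cache bug. The cached summary is keyed by the number
// of tasks it was computed over; if the "current count" mistakenly includes
// running tasks, it can equal the stale key and serve outdated data.
case class CachedSummary(taskCount: Int, quantiles: Seq[Double])

class SummaryCache {
  private var cached: Option[CachedSummary] = None
  def put(summary: CachedSummary): Unit = cached = Some(summary)
  // Reuse the cached summary only when it covers the same number of tasks.
  def get(currentCount: Int): Option[CachedSummary] =
    cached.filter(_.taskCount == currentCount)
}

val cache = new SummaryCache
// A summary computed earlier over 5 successful tasks.
cache.put(CachedSummary(taskCount = 5, quantiles = Seq(1.0, 2.0, 3.0)))

val successTasks = 3
val runningTasks = 2

// Bug: counting SUCCESS + RUNNING (3 + 2 = 5) matches the stale entry,
// so outdated quantiles are served while tasks are still running.
val buggyHit = cache.get(successTasks + runningTasks)

// Fix: counting only SUCCESS tasks (3) misses, forcing a fresh computation.
val fixedHit = cache.get(successTasks)
```

The fix in the PR amounts to making the "current count" on the lookup side count only "SUCCESS" tasks, so the cache key and the lookup agree.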