
[SPARK-25299] Simpler scheduler integration #555

Closed
wants to merge 5 commits

Conversation

yifeih commented May 22, 2019

No description provided.

yifeih commented May 22, 2019

This is a much simpler, less invasive version of #548, but with limited functionality (and I'm not entirely sure it works... see below for my reasoning).

To better support async and individual file server implementations, it would be helpful to be able to retrigger map tasks when a fetch failure happens (if we don't support retriggering map tasks, a fetch failure will always result in the entire job failing). This change allows retriggering at the simplest level: it only invalidates the mapper associated with the FetchFailure, and doesn't attempt to remove other MapStatuses on the same host or execId.
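To make "only invalidates the mapper associated with the FetchFailure" concrete, here's a toy model of the driver-side bookkeeping (toy classes only, not the real MapOutputTrackerMaster and not the exact diff in this PR):

```scala
// Toy model of the driver-side map-output bookkeeping. Real Spark keeps
// MapStatus objects in MapOutputTrackerMaster; the point here is just the
// difference between invalidating one map output vs. a whole host's outputs.
case class BlockManagerId(execId: String, host: String)

class ToyMapOutputTracker(numMaps: Int) {
  // mapIndex -> where that map task's output currently lives (None = must rerun)
  private val statuses = Array.fill[Option[BlockManagerId]](numMaps)(None)

  def registerMapOutput(mapIndex: Int, loc: BlockManagerId): Unit =
    statuses(mapIndex) = Some(loc)

  // What this PR does on FetchFailed: drop only the one MapStatus that the
  // failed fetch was reading from.
  def unregisterMapOutput(mapIndex: Int, loc: BlockManagerId): Unit =
    if (statuses(mapIndex).contains(loc)) statuses(mapIndex) = None

  // What it deliberately does NOT do: drop every output on the failed host;
  // those have to be invalidated by other reducers' own FetchFailed exceptions.
  def removeOutputsOnHost(host: String): Unit =
    statuses.indices.foreach { i =>
      if (statuses(i).exists(_.host == host)) statuses(i) = None
    }

  def missingPartitions: Seq[Int] = statuses.indices.filter(statuses(_).isEmpty)
}
```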

In this scenario, we'd expect other MapStatuses on the same hosts/execIds to be removed by other FetchFailed exceptions from other reducers. However, I'm not entirely sure it currently works like that. I noticed that a FetchFailed exception is ignored if the attemptId it carries isn't the current stage attempt (https://github.com/palantir/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1519). And based on the code in submitMissingTasks() (https://github.com/palantir/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1110), I think it looks at whatever MapStatuses are missing in the MapOutputTracker and only resubmits those.

So if we're waiting on other executors to report FetchFailed from the same source, and those FetchFailed errors don't come back quickly enough, a resubmission of missing tasks between every FetchFailure could end up maxing out the stage attempt count before we've marked all the MapStatuses for the same host/execId as needing a retry.
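As a toy illustration of that race (made-up numbers, not real scheduler code, just the shape of the problem):

```scala
// Toy timeline of the worry above: five map outputs sit on one dead host, but
// each FetchFailed only invalidates the single map it was reading, and each
// round of resubmission bumps the stage attempt counter. With Spark's default
// cap of 4 consecutive failed attempts, the job can abort before all five
// stale MapStatuses have been cleared.
object FetchFailureRace extends App {
  val maxConsecutiveStageAttempts = 4            // Spark's default retry cap
  val staleMapsOnDeadHost = scala.collection.mutable.Set(0, 1, 2, 3, 4)
  var failedAttempts = 0

  while (staleMapsOnDeadHost.nonEmpty && failedAttempts < maxConsecutiveStageAttempts) {
    // one reducer hits the dead host and reports FetchFailed for exactly one map
    val failedMap = staleMapsOnDeadHost.head
    failedAttempts += 1
    staleMapsOnDeadHost -= failedMap             // only that MapStatus is unregistered
    println(s"attempt $failedAttempts: rerun map $failedMap, " +
      s"stale maps still registered: $staleMapsOnDeadHost")
  }
  if (staleMapsOnDeadHost.nonEmpty)
    println(s"aborted after $failedAttempts failed attempts; never cleared: $staleMapsOnDeadHost")
}
```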

Hopefully that made sense? At least, I think that's how it works, in which case we'd probably need another type of solution...

@squito @mccheah for comments?

squito commented May 24, 2019

yes, I think your description of the problem is correct -- but I just commented on your google doc that I think you have the same problem with the use of unregisterOtherMapStatusesOnFetchFailure in the async case.

I'm wondering if the driver really needs to handle a new message, UpdatedShuffleBlockLocation, which would let async replicas, rebalancers, etc. tell the driver where shuffle data has moved. (But I'm not sure we need to do that now either.)
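Very roughly (hypothetical only -- none of this exists today, and the names are made up), something like:

```scala
// Hypothetical sketch of such a message: an async replicator or rebalancer
// could send this to the driver when it moves shuffle data, instead of the
// driver only finding out via FetchFailed.
case class UpdatedShuffleBlockLocation(
    shuffleId: Int,
    mapId: Int,
    reduceId: Int,        // per-block granularity; could be dropped if we track per map output
    newHost: String,      // or a BlockManagerId / file-server URI in a real version
    newExecId: String)

// and, conceptually, a driver-side handler (pseudocode; updateMapOutputLocation
// is a made-up helper on the map output tracker):
//
//   case UpdatedShuffleBlockLocation(shuffleId, mapId, reduceId, host, execId) =>
//     mapOutputTracker.updateMapOutputLocation(shuffleId, mapId, reduceId, host, execId)
```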

I'm still a little stuck, on this and #548, on why we need to allow each shuffle block to go to a different location, instead of sending the entire output of one map task to the same destination. Obviously that would be more flexible, but not doing that still seems to allow a lot of the designs we've been considering, while reducing the complexity a lot, so it seems like the right step for incremental improvement.
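To make that trade-off concrete (toy types only, nothing from either PR):

```scala
object LocationGranularity {
  case class Location(host: String, port: Int)

  // One location per map task's entire output: the simpler option. The driver
  // tracks O(numMaps) entries per shuffle, and a lost location invalidates
  // exactly one map task.
  type PerMapOutput = Map[(Int, Int), Location]        // (shuffleId, mapId)

  // One location per individual block: more flexible, but driver-side
  // bookkeeping and failure handling grow to O(numMaps * numReduces) entries
  // per shuffle, and invalidation/retry logic has to reason per block.
  type PerBlock = Map[(Int, Int, Int), Location]       // (shuffleId, mapId, reduceId)
}
```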

yifeih closed this on Aug 6, 2019