[SPARK-12583][Mesos] Mesos shuffle service: Don't delete shuffle files before application has stopped #11272

bbossy · 2016-02-19T14:20:23Z

Problem description:

The Mesos shuffle service has been completely unusable since Spark 1.6.0. The problem seems to have been introduced by the move from akka to netty in the networking layer. Until then, a connection from the driver to each shuffle service was used as a signal for the shuffle service to determine whether the driver is still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) because it is idle. The shuffle service interprets this as a signal that the driver has stopped, even though the driver is still alive. Thus, shuffle files are deleted before the application has stopped.

Context and analysis:

spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159

This is a follow-up on #11207.

What changes were proposed in this pull request?

This PR adds a heartbeat signal from the Driver (in MesosExternalShuffleClient) to all registered external mesos shuffle service instances. In MesosExternalShuffleBlockHandler, a thread periodically checks whether a driver has timed out and cleans an application's shuffle files if this is the case.
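To make the mechanism concrete, here is a minimal, self-contained sketch of heartbeat tracking with a periodic expiry pass. It is not the code from this patch: `HeartbeatRegistry`, `AppState`, the `cleanup` callback and the explicit `nowMs` parameters are all hypothetical names chosen for illustration, and no Spark classes are used.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical per-application state kept by the shuffle service.
final class AppState(val heartbeatTimeoutMs: Long, initialHeartbeatMs: Long) {
  // Written by the network thread, read by the cleaner thread.
  @volatile var lastHeartbeatMs: Long = initialHeartbeatMs
}

// Hypothetical registry; `cleanup` stands in for removing an application's shuffle files.
final class HeartbeatRegistry(cleanup: String => Unit) {
  private val apps = new ConcurrentHashMap[String, AppState]()

  // Called when a driver registers with the shuffle service.
  def register(appId: String, heartbeatTimeoutMs: Long, nowMs: Long): Unit =
    apps.put(appId, new AppState(heartbeatTimeoutMs, nowMs))

  // Called for every heartbeat message; unknown applications are ignored.
  def heartbeat(appId: String, nowMs: Long): Unit = {
    val state = apps.get(appId)
    if (state != null) state.lastHeartbeatMs = nowMs
  }

  // One pass of the periodic cleaner: expire applications whose last heartbeat is too old.
  def expire(nowMs: Long): Unit = {
    val it = apps.entrySet().iterator()
    while (it.hasNext) {
      val entry = it.next()
      val state = entry.getValue
      if (nowMs - state.lastHeartbeatMs > state.heartbeatTimeoutMs) {
        it.remove()
        cleanup(entry.getKey)
      }
    }
  }
}
```

In a running service, `expire` would be invoked from a scheduled thread (for example via `java.util.concurrent.ScheduledExecutorService.scheduleAtFixedRate`) with the current time. Passing the time in explicitly keeps the sketch deterministic and easy to test.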

How was this patch tested?

This patch has been tested on a small mesos test cluster using the spark-shell. Log output from mesos shuffle service:

16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs

Note: there are 2 executors running on this slave.

bbossy · 2016-02-19T14:23:10Z

@dragos @tnachen : Let me know if you had something else in mind (or if I should wire it up differently)

dragos · 2016-02-19T16:17:38Z

I'll have to come back to this on Monday. Judging by the description, it looks good.

andrewor14 · 2016-02-19T17:49:52Z

ok to test

SparkQA · 2016-02-19T18:06:28Z

Test build #51562 has finished for PR 11272 at commit bd30655.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class ShuffleServiceHeartbeat extends BlockTransferMessage

bbossy · 2016-02-19T18:18:09Z

Oops. Sorry about that..

SparkQA · 2016-02-19T20:32:31Z

Test build #51567 has finished for PR 11272 at commit 9a3d625.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dragos · 2016-02-22T14:12:20Z

core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala

+    def unapply(h: ShuffleServiceHeartbeat): Option[String] = Some(h.getAppId)
+  }
+
+  private case class AppState(heartbeatTimeout: Long, var lastHeartbeat: Long)


lastHeartbeat is used from different threads, so it should be @volatile

good catch!
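A small sketch of why the annotation matters, assuming a simplified state class (`AppStateSketch` and `VolatileDemo` are hypothetical names, not part of the patch): one thread writes the field while another spins until it observes the write. With `@volatile`, the Java memory model guarantees the reader eventually sees the update; without it, the loop is not guaranteed to terminate.

```scala
// Hypothetical, simplified version of the per-application state under review.
final class AppStateSketch(val heartbeatTimeout: Long) {
  // Marked volatile because it is written and read by different threads.
  @volatile var lastHeartbeat: Long = 0L
}

object VolatileDemo {
  def run(): Long = {
    val state = new AppStateSketch(120000L)

    // Writer thread: plays the role of the handler receiving a heartbeat.
    val writer = new Thread(new Runnable {
      def run(): Unit = state.lastHeartbeat = 42L
    })
    writer.start()

    // Reader: plays the role of the cleaner thread polling the timestamp.
    while (state.lastHeartbeat != 42L) {}

    writer.join()
    state.lastHeartbeat
  }
}
```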

dragos · 2016-02-22T14:56:01Z

I'm done reviewing, I only have a couple of small observations.

bbossy · 2016-02-22T18:27:49Z

@dragos Thanks for the review! Let me know if you find anything else.

andrewor14 · 2016-02-22T18:57:45Z

I would like to backport a variant of this patch into 1.6.1. This was clearly working in 1.5 but then became unusable in 1.6. This means people who upgrade to 1.6 will no longer be able to use dynamic allocation in Spark, a huge regression.

As with all backports there is some risk that we might introduce a regression between 1.6.0 and 1.6.1. It's especially bad if we introduce a regression between maintenance releases, however, because then users will lose confidence in upgrading.

Therefore, it would be best to keep the changed surface area minimal, i.e. to ensure that this patch only affects users of Spark on Mesos with external shuffle service. I haven't looked at this patch in great detail but @bbossy @dragos it would be great if you could keep that in mind when reviewing / addressing review comments.

Thanks for fixing this critical issue!

andrewor14 · 2016-02-22T18:59:25Z

core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala


 /**
 * An RPC endpoint that receives registration requests from Spark drivers running on Mesos.
 * It detects driver termination and calls the cleanup callback to [[ExternalShuffleService]].
 */
-private[mesos] class MesosExternalShuffleBlockHandler(transportConf: TransportConf)
+private[mesos] class MesosExternalShuffleBlockHandler(
+  transportConf: TransportConf, cleanerIntervalS: Long)


style:

    private[mesos] class MesosExternalShuffleBlockHandler(
        transportConf: TransportConf,
        cleanerIntervalSeconds: Long)
      extends ...

andrewor14 · 2016-02-22T19:21:55Z

(Update: this will have to go into 1.6.2 since the 1.6.1 RC is being cut today already)

SparkQA · 2016-02-23T00:16:50Z

Test build #51670 has finished for PR 11272 at commit d4d1ad7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-23T09:10:45Z

Test build #51739 has finished for PR 11272 at commit 0a2d4cb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

andrewor14 · 2016-02-29T19:50:19Z

@dragos any other comments?

andrewor14 · 2016-02-29T19:50:36Z

Also cc @tnachen and @mgummelt for another pair of eyes

tnachen · 2016-03-01T00:48:40Z

core/src/main/scala/org/apache/spark/deploy/mesos/MesosExternalShuffleService.scala

@@ -93,7 +113,8 @@ private[mesos] class MesosExternalShuffleService(conf: SparkConf, securityManage

  protected override def newShuffleBlockHandler(
      conf: TransportConf): ExternalShuffleBlockHandler = {
-    new MesosExternalShuffleBlockHandler(conf)
+    val cleanerIntervalS = this.conf.getTimeAsSeconds("spark.shuffle.cleanerInterval", "30s")


Is it a Spark convention to add an "S" at the end to tag it as seconds? I'm seeing more examples in the code base that have it spelled out. I'd say we should be consistent and use cleanerIntervalSeconds.
Also, this is only used in the Mesos shuffle service, so I think we should namespace the configuration and be consistent with other interval setting names. How about "spark.mesos.shuffle.cleaner.interval"?

tnachen · 2016-03-01T01:02:54Z

This looks like a great candidate to add integration tests for in the mesos-spark-integration-tests suite. Ideally we have something long running that we can run. @dragos perhaps either you guys or we can add it. Ideally longer term we can let community contributors like @bbossy to also add tests there and verify everything runs e2e, but I think currently we don't really have docs to show how to add a test :(

dragos · 2016-03-01T12:45:56Z

Yes, we could have an integration test for this, shouldn't be too hard to add. The basic idea is to decrease the network timeout and have a job that exceeds that but still succeeds. But I don't think it should block this PR

tnachen · 2016-03-01T17:59:41Z

@dragos Yes it shouldn't block the PR, just mentioning it as I see the need for it. I have two comments on this, otherwise it LGTM too.

andrewor14 · 2016-03-11T19:20:39Z

What's the status on this patch? Have we at least tested it manually? At the very least we should merge it into 2.0.

@bbossy would you mind rebasing this?

tnachen · 2016-03-11T22:22:35Z

@andrewor14 I'll try to create a test to verify this. When is the 2.0 closing date?

andrewor14 · 2016-03-11T23:36:41Z

end of month I think?

dragos · 2016-03-12T12:14:06Z

AFAIK this could go in. I did test it manually and things worked well.

bbossy · 2016-03-12T13:14:00Z

Sorry about the long silence. Rebased the PR. I haven't had the time yet to test it after the rebase. I should be able to do this by the beginning of next week.

SparkQA · 2016-03-12T15:22:33Z

Test build #53004 has finished for PR 11272 at commit 3a9914b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

bbossy · 2016-03-13T12:57:03Z

I tested the rebased version manually. It is still working after the rebase.

andrewor14 · 2016-03-14T19:22:07Z

OK, I'm going to merge this into master. This could potentially go into 1.6 but it's a little scary to merge it there. I'm inclined to leave it out of 1.6 to be conservative. If we can work out a less invasive change later for 1.6 then we could consider that.

SPARK-12583: Heartbeat from MesosExternalShuffleClient to shuffle ser…

1118c8f

…vice

Bertrand Bossy added 4 commits March 12, 2016 13:56

SPARK-12583: Add missing license header

b33c227

SPARK-12583: Addressing comments

c632c6c

SPARK-12583: Address comment: style

2f6c0e6

SPARK-12583: Address comments & rebase

3a9914b

asfgit closed this in 310981d Mar 14, 2016

mgummelt mentioned this pull request Mar 31, 2016

[SPARK-11327] [MESOS] Dispatcher does not respect all args from the Submit request #10370

Closed

bbossy deleted the SPARK-12583-mesos-shuffle-service-heartbeat branch April 4, 2016 09:09

corruptmemory mentioned this pull request Apr 26, 2016

Backport: Spark 12583 lightbend/spark#30

Closed

corruptmemory added a commit to lightbend/spark that referenced this pull request May 2, 2016

Initial backport of apache#11272

ea7419a

* No new test failures introduced. * Provisional backport complete

corruptmemory mentioned this pull request May 24, 2016

[SPARK-12583][MESOS] BACKPORT to 1.6.x - Mesos shuffle service: Don't delete shuffle file… #13279

Closed

IgorBerman mentioned this pull request Jan 31, 2018

[SPARK-12583][Mesos] Mesos shuffle service: Don't delete shuffle files before application has stopped #11207

Closed
