SPARK-4136. Under dynamic allocation, cancel outstanding executor requests when no longer needed #4168
Conversation
Test build #25993 has started for PR 4168 at commit
Test build #25993 has finished for PR 4168 at commit
Test FAILed.
    val now = clock.getTimeMillis
    if (addTime != NOT_SET && now >= addTime) {
      addExecutors()
    if (maxNumNeededExecutors < numExecutorsPending + executorIds.size) {
Do we need to exclude executorsPendingToRemove.size? Because YarnAllocator has already killed the executors pending removal, but ExecutorAllocationManager may not have received the onExecutorRemoved message yet, so at that point executorIds still contains removed executors.
Good point, I think you are right.
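The guard in the snippet above compares demand against executors that are already live or requested. A minimal sketch of that bound, with illustrative names (`tasksPerExecutor`, `shouldAdd` are not the PR's exact fields), assuming demand is ceil((running + pending tasks) / tasks per executor):

```java
// Sketch of the demand bound discussed in this thread; names are
// illustrative, not the PR's exact fields.
public class AllocationBound {
    // Maximum executors the current load could keep busy:
    // ceil((running + pending tasks) / tasks per executor).
    static int maxNumNeededExecutors(int runningTasks, int pendingTasks,
                                     int tasksPerExecutor) {
        int totalTasks = runningTasks + pendingTasks;
        return (totalTasks + tasksPerExecutor - 1) / tasksPerExecutor;
    }

    // Add executors only while demand still meets or exceeds what is
    // already live or requested (mirroring the guard in the snippet above).
    static boolean shouldAdd(int runningTasks, int pendingTasks,
                             int tasksPerExecutor,
                             int numExecutorsPending, int numLiveExecutors) {
        return maxNumNeededExecutors(runningTasks, pendingTasks, tasksPerExecutor)
                >= numExecutorsPending + numLiveExecutors;
    }
}
```

Per the review exchange above, a fuller sketch would also subtract executors pending removal from the live count.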
Force-pushed from 79763d9 to 9ba0e01.
Test build #26168 has started for PR 4168 at commit
Test build #26168 has finished for PR 4168 at commit
Test FAILed.
@sryza looks like the test failures are legit?
@@ -438,6 +444,7 @@ private[spark] class ExecutorAllocationManager(
  }

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    numRunningTasks += 1
I think we should make access to numRunningTasks synchronized, because the schedule thread and the listener thread are two different threads.
Good point. It doesn't need synchronization because only one thread is writing to it, but we should make it volatile.
I take that back on further inspection. It should be synchronized.
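The hazard under discussion is that `numRunningTasks += 1` is a read-modify-write, so the listener thread's updates can race with the schedule thread's accesses. A minimal sketch of the synchronized form the review settles on (class and method names are illustrative, not the PR's code):

```java
// Illustrative counter guarded the way the review suggests: every
// access goes through one lock, since `+= 1` is not atomic and two
// threads (listener and schedule) touch the field.
public class RunningTaskCounter {
    private final Object lock = new Object();
    private int numRunningTasks = 0;

    // Called from the listener thread (task start/end events).
    void taskStarted() { synchronized (lock) { numRunningTasks++; } }
    void taskEnded()   { synchronized (lock) { numRunningTasks--; } }

    // Called from the schedule thread when computing demand.
    int current() { synchronized (lock) { return numRunningTasks; } }
}
```

Marking the field volatile alone would make reads fresh but would not make the increment atomic, which is why synchronization wins here.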
@sryza is this still WIP? Are we aiming for this to go into 1.3?
@andrewor14 it's no longer a WIP, and I am aiming for it for 1.3. I just updated the title - sorry for the confusion.
retest this please
Test build #27073 has started for PR 4168 at commit
Force-pushed from 9ba0e01 to 16db9f4.
Test build #27077 has started for PR 4168 at commit
/**
 * If the add time has expired, request new executors and refresh the add time.
 * If the remove time for an existing executor has expired, kill the executor.
 * This is called at a fixed interval to relegate the number of pending executor requests
I don't think relegate here is the right word. Did you mean regulate?
Oops definitely
Test build #27073 has finished for PR 4168 at commit
Force-pushed from f80b7ec to 37ce77d.
Test build #27170 has started for PR 4168 at commit
Test build #27170 has finished for PR 4168 at commit
Test FAILed.
 * Request an additional number of executors from the cluster manager.
 * Return whether the request is acknowledged by the cluster manager.
 * This is currently only supported in YARN mode. Return whether the request is received.
If we want to add in the javadocs that it's only supported in YARN mode, we should do it for all the methods here or just move this to the class javadocs. I prefer the latter.
Hey @sryza thanks for iterating quickly on the reviews. I left 1 question but other than that this looks pretty close.
Test build #27168 has finished for PR 4168 at commit
Test PASSed.
(I only looked at the public APIs, but those look fine to me now - there are none!)
Hey @sryza thanks a lot for fixing this. I will merge this into master and 1.3 after fixing the last batch of comments I pointed out, when I merge.
 * result in canceling pending requests or filing additional requests.
 * This is currently only supported in Yarn mode. Return whether the request is received.
 */
@DeveloperApi
not developer api if it's private[spark]
…uests when no longer needed

This takes advantage of the changes made in SPARK-4337 to cancel pending requests to YARN when they are no longer needed.

Each time the timer in `ExecutorAllocationManager` strikes, we compute `maxNumNeededExecutors`, the maximum number of executors we could fill with the current load. This is calculated as the total number of running and pending tasks divided by the number of cores per executor. If `maxNumNeededExecutors` is below the total number of running and pending executors, we call `requestTotalExecutors(maxNumNeededExecutors)` to let the cluster manager know that it should cancel any pending requests above this amount. If not, `maxNumNeededExecutors` is just used as a bound alongside the configured `maxExecutors` to limit the number of new requests.

The patch modifies the API exposed by `ExecutorAllocationClient` for requesting additional executors by moving from `requestExecutors` to `requestTotalExecutors`. This makes the communication between the `ExecutorAllocationManager` and the `YarnAllocator` easier to reason about and removes some state that needed to be kept in the `CoarseGrainedSchedulerBackend`. I think an argument can be made that this makes for a less attractive user-facing API in `SparkContext`, but I'm having trouble envisioning situations where a user would want to use either of these APIs.

This will likely break some tests, but I wanted to get feedback on the approach before adding tests and polishing.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #4168 from sryza/sandy-spark-4136 and squashes the following commits:

37ce77d [Sandy Ryza] Warn on negative number
cd3b2ff [Sandy Ryza] SPARK-4136

(cherry picked from commit 69bc3bb)
Signed-off-by: Andrew Or <andrew@databricks.com>
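The core behavioral change described in the commit message is that the manager now communicates an absolute target (`requestTotalExecutors`) rather than a delta (`requestExecutors`), so the cluster manager can cancel surplus pending requests itself. A minimal sketch of that reconciliation (class and field names are illustrative, not Spark's code):

```java
// Sketch of the delta-vs-total request semantics described above.
// An allocator given an absolute target can cancel surplus pending
// requests on its own; one given only deltas cannot.
public class SketchAllocator {
    int liveExecutors;
    int pendingRequests = 0;

    SketchAllocator(int liveExecutors) { this.liveExecutors = liveExecutors; }

    // The caller states the total it wants; the allocator reconciles,
    // canceling any outstanding requests above the target.
    void requestTotalExecutors(int total) {
        pendingRequests = Math.max(total - liveExecutors, 0);
    }
}
```

With only delta-style requests, the caller would have to track the cluster manager's outstanding request count itself, which is exactly the `CoarseGrainedSchedulerBackend` state the patch removes.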
Test build #27220 has started for PR 4168 at commit
Test build #27220 has finished for PR 4168 at commit
Test FAILed.