Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5529][CORE]Add expireDeadHosts in HeartbeatReceiver #4363

Closed
wants to merge 20 commits into from
Closed

[SPARK-5529][CORE]Add expireDeadHosts in HeartbeatReceiver #4363

wants to merge 20 commits into from

Conversation

shenh062326
Copy link
Contributor

If a blockManager has not send heartBeat more than 120s, BlockManagerMasterActor will remove it. But coarseGrainedSchedulerBackend can only remove executor after an DisassociatedEvent. We should expireDeadHosts at HeartbeatReceiver.

@shenh062326
Copy link
Contributor Author

add [SPARK-5529]

@shenh062326 shenh062326 changed the title Add expireDeadHosts in HeartbeatReceiver [SPARK-5529][CORE]Add expireDeadHosts in HeartbeatReceiver Feb 4, 2015
@SparkQA
Copy link

SparkQA commented Feb 4, 2015

Test build #26744 has started for PR 4363 at commit c922cb0.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 4, 2015

Test build #26744 has finished for PR 4363 at commit c922cb0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26744/
Test FAILed.

@sryza
Copy link
Contributor

sryza commented Feb 4, 2015

Hi @shenh062326 , can you remove the dead host expiry in BlockManagerMasterActor so we don't have it on two places?

val msg = "Removing Executor " + executorId + " with no recent heart beats: "
+(now - lastSeenMs) + "ms exceeds " + slaveTimeout + "ms"
logWarning(msg)
if (scheduler.isInstanceOf[org.apache.spark.scheduler.TaskSchedulerImpl]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to add a method to TaskScheduler for this.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26800 has started for PR 4363 at commit b5c0441.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26800 has finished for PR 4363 at commit b5c0441.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26800/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26802 has started for PR 4363 at commit fb5df97.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26802 has finished for PR 4363 at commit fb5df97.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26802/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26804 has started for PR 4363 at commit e197e20.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26804 has finished for PR 4363 at commit e197e20.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26804/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26808 has started for PR 4363 at commit c1dfda1.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26808 has finished for PR 4363 at commit c1dfda1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26808/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26810 has started for PR 4363 at commit bccd515.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 5, 2015

Test build #26810 has finished for PR 4363 at commit bccd515.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.util.ActorLogReceive

/**
* A heartbeat from executors to the driver. This is a shared message used by several internal
* components to convey liveness or execution information for in-progress tasks.
* components to convey liveness or execution information for in-progress tasks. It will also
* expire the hosts that have not heartbeated for more than spark.driver.executorTimeoutMs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'll need to update this comment if you rename the configs

@andrewor14
Copy link
Contributor

@shenh062326 Thanks for addressing the feedback. I think the current solution works, and the comments I left inline are mostly minor. I will merge this once you address all of them.

@shenh062326
Copy link
Contributor Author

Sorry for late, I will change it.

@SparkQA
Copy link

SparkQA commented Feb 26, 2015

Test build #27993 has started for PR 4363 at commit 2dc456e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 26, 2015

Test build #27993 has finished for PR 4363 at commit 2dc456e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27993/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 26, 2015

Test build #27997 has started for PR 4363 at commit 1a042ff.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 26, 2015

Test build #27997 has finished for PR 4363 at commit 1a042ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27997/
Test PASSed.

// executor ID -> timestamp of when the last heartbeat from this executor was received
private val executorLastSeen = new mutable.HashMap[String, Long]

private val executorTimeout = sc.conf.getLong("spark.network.timeoutMs",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use the existing config spark.network.timeout without the Ms. The reason why I suggested using this config is to avoid introducing new ones.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28030 has started for PR 4363 at commit 2c9a46a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 27, 2015

Test build #28030 has finished for PR 4363 at commit 2c9a46a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28030/
Test PASSed.

@andrewor14
Copy link
Contributor

Great, LGTM thanks for your work @shenh062326 I'm merging this into master.

@asfgit asfgit closed this in 18f2098 Feb 27, 2015
alexrovner pushed a commit to alexrovner/spark that referenced this pull request Apr 29, 2015
If a blockManager has not send heartBeat more than 120s, BlockManagerMasterActor will remove it. But coarseGrainedSchedulerBackend can only remove executor after an DisassociatedEvent.  We should expireDeadHosts at HeartbeatReceiver.

Author: Hong Shen <hongshen@tencent.com>

Closes apache#4363 from shenh062326/my_change3 and squashes the following commits:

2c9a46a [Hong Shen] Change some code style.
1a042ff [Hong Shen] Change some code style.
2dc456e [Hong Shen] Change some code style.
d221493 [Hong Shen] Fix test failed
7448ac6 [Hong Shen] A minor change in sparkContext and heartbeatReceiver
b904aed [Hong Shen] Fix failed test
52725af [Hong Shen] Remove assert in SparkContext.killExecutors
5bedcb8 [Hong Shen] Remove assert in SparkContext.killExecutors
a858fb5 [Hong Shen] A minor change in HeartbeatReceiver
3e221d9 [Hong Shen] A minor change in HeartbeatReceiver
6bab7aa [Hong Shen] Change a code style.
07952f3 [Hong Shen] Change configs name and code style.
ce9257e [Hong Shen] Fix test failed
bccd515 [Hong Shen] Fix test failed
8e77408 [Hong Shen] Fix test failed
c1dfda1 [Hong Shen] Fix test failed
e197e20 [Hong Shen] Fix test failed
fb5df97 [Hong Shen] Remove ExpireDeadHosts in BlockManagerMessages
b5c0441 [Hong Shen] Remove expireDeadHosts in BlockManagerMasterActor
c922cb0 [Hong Shen] Add expireDeadHosts in HeartbeatReceiver
asfgit pushed a commit that referenced this pull request Apr 30, 2015
If a blockManager has not send heartBeat more than 120s, BlockManagerMasterActor will remove it. But coarseGrainedSchedulerBackend can only remove executor after an DisassociatedEvent.  We should expireDeadHosts at HeartbeatReceiver.

Author: Hong Shen <hongshentencent.com>

Closes #4363 from shenh062326/my_change3 and squashes the following commits:

2c9a46a [Hong Shen] Change some code style.
1a042ff [Hong Shen] Change some code style.
2dc456e [Hong Shen] Change some code style.
d221493 [Hong Shen] Fix test failed
7448ac6 [Hong Shen] A minor change in sparkContext and heartbeatReceiver
b904aed [Hong Shen] Fix failed test
52725af [Hong Shen] Remove assert in SparkContext.killExecutors
5bedcb8 [Hong Shen] Remove assert in SparkContext.killExecutors
a858fb5 [Hong Shen] A minor change in HeartbeatReceiver
3e221d9 [Hong Shen] A minor change in HeartbeatReceiver
6bab7aa [Hong Shen] Change a code style.
07952f3 [Hong Shen] Change configs name and code style.
ce9257e [Hong Shen] Fix test failed
bccd515 [Hong Shen] Fix test failed
8e77408 [Hong Shen] Fix test failed
c1dfda1 [Hong Shen] Fix test failed
e197e20 [Hong Shen] Fix test failed
fb5df97 [Hong Shen] Remove ExpireDeadHosts in BlockManagerMessages
b5c0441 [Hong Shen] Remove expireDeadHosts in BlockManagerMasterActor
c922cb0 [Hong Shen] Add expireDeadHosts in HeartbeatReceiver

Author: Hong Shen <hongshen@tencent.com>

Closes #5793 from alexrovner/SPARK-5529-backport-1.3-v2 and squashes the following commits:

f238f94 [Hong Shen]  [SPARK-5529][CORE]Add expireDeadHosts in HeartbeatReceiver
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants