[SPARK-5529][CORE]Add expireDeadHosts in HeartbeatReceiver #4363
add [SPARK-5529]
Test build #26744 has started for PR 4363 at commit
Test build #26744 has finished for PR 4363 at commit
Test FAILed.
Hi @shenh062326, can you remove the dead host expiry in
val msg = "Removing Executor " + executorId + " with no recent heart beats: "
  +(now - lastSeenMs) + "ms exceeds " + slaveTimeout + "ms"
logWarning(msg)
if (scheduler.isInstanceOf[org.apache.spark.scheduler.TaskSchedulerImpl]) {
I think it would be better to add a method to TaskScheduler for this.
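To illustrate the suggestion, here is a minimal sketch of replacing the `isInstanceOf` check with a method on the scheduler interface. The trait below is a simplified stand-in, not Spark's real `TaskScheduler`; the `executorLost` signature, `RecordingScheduler`, and `SchedulerHookDemo` are assumptions made for the example.

```scala
import scala.collection.mutable

// Simplified stand-in for the scheduler interface (assumption, not Spark's API).
trait TaskScheduler {
  // Hypothetical hook: called when an executor is considered lost.
  def executorLost(executorId: String, reason: String): Unit
}

// Hypothetical implementation that only records what it was told.
class RecordingScheduler extends TaskScheduler {
  val lost = mutable.ArrayBuffer.empty[(String, String)]
  override def executorLost(executorId: String, reason: String): Unit =
    lost += ((executorId, reason))
}

object SchedulerHookDemo {
  // The caller depends only on the trait, so no isInstanceOf check is needed.
  def expire(scheduler: TaskScheduler, executorId: String,
             idleMs: Long, timeoutMs: Long): Unit =
    if (idleMs > timeoutMs) {
      scheduler.executorLost(
        executorId, s"no heartbeat for $idleMs ms (timeout $timeoutMs ms)")
    }
}
```

With the hook on the trait, any scheduler implementation can decide how to react, and the caller never needs to know the concrete class.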
Test build #26800 has started for PR 4363 at commit
Test build #26800 has finished for PR 4363 at commit
Test FAILed.
Test build #26802 has started for PR 4363 at commit
Test build #26802 has finished for PR 4363 at commit
Test FAILed.
Test build #26804 has started for PR 4363 at commit
Test build #26804 has finished for PR 4363 at commit
Test FAILed.
Test build #26808 has started for PR 4363 at commit
Test build #26808 has finished for PR 4363 at commit
Test FAILed.
Test build #26810 has started for PR 4363 at commit
Test build #26810 has finished for PR 4363 at commit
  import org.apache.spark.util.ActorLogReceive

  /**
   * A heartbeat from executors to the driver. This is a shared message used by several internal
-  * components to convey liveness or execution information for in-progress tasks.
+  * components to convey liveness or execution information for in-progress tasks. It will also
+  * expire the hosts that have not heartbeated for more than spark.driver.executorTimeoutMs.
You'll need to update this comment if you rename the configs.
@shenh062326 Thanks for addressing the feedback. I think the current solution works, and the comments I left inline are mostly minor. I will merge this once you address all of them.
Sorry for the delay, I will change it.
Test build #27993 has started for PR 4363 at commit
Test build #27993 has finished for PR 4363 at commit
Test FAILed.
Test build #27997 has started for PR 4363 at commit
Test build #27997 has finished for PR 4363 at commit
Test PASSed.
// executor ID -> timestamp of when the last heartbeat from this executor was received
private val executorLastSeen = new mutable.HashMap[String, Long]

private val executorTimeout = sc.conf.getLong("spark.network.timeoutMs",
Please use the existing config spark.network.timeout, without the Ms. The reason why I suggested using this config is to avoid introducing new ones.
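As a rough illustration of reusing the existing key, the sketch below reads `spark.network.timeout` from a plain `Map` and converts it to milliseconds. It assumes the value is a bare number of seconds and uses a 120-second default to match the 120s figure in the PR description. `TimeoutConfigSketch` and `executorTimeoutMs` are names made up for the example, and the real code goes through `SparkConf` rather than a `Map`.

```scala
object TimeoutConfigSketch {
  // Assumption: the value is a plain number of seconds (no unit suffix).
  // The 120-second default mirrors the 120s mentioned in the PR description.
  def executorTimeoutMs(conf: Map[String, String]): Long = {
    val seconds = conf.get("spark.network.timeout").map(_.trim.toLong).getOrElse(120L)
    seconds * 1000L
  }
}
```

Keeping the conversion in one place means the rest of the code can work in milliseconds without a second, Ms-suffixed config key.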
Test build #28030 has started for PR 4363 at commit
Test build #28030 has finished for PR 4363 at commit
Test PASSed.
Great, LGTM. Thanks for your work, @shenh062326. I'm merging this into master.
If a blockManager has not send heartBeat more than 120s, BlockManagerMasterActor will remove it. But coarseGrainedSchedulerBackend can only remove executor after an DisassociatedEvent. We should expireDeadHosts at HeartbeatReceiver.

Author: Hong Shen <hongshen@tencent.com>

Closes apache#4363 from shenh062326/my_change3 and squashes the following commits:

2c9a46a [Hong Shen] Change some code style.
1a042ff [Hong Shen] Change some code style.
2dc456e [Hong Shen] Change some code style.
d221493 [Hong Shen] Fix test failed
7448ac6 [Hong Shen] A minor change in sparkContext and heartbeatReceiver
b904aed [Hong Shen] Fix failed test
52725af [Hong Shen] Remove assert in SparkContext.killExecutors
5bedcb8 [Hong Shen] Remove assert in SparkContext.killExecutors
a858fb5 [Hong Shen] A minor change in HeartbeatReceiver
3e221d9 [Hong Shen] A minor change in HeartbeatReceiver
6bab7aa [Hong Shen] Change a code style.
07952f3 [Hong Shen] Change configs name and code style.
ce9257e [Hong Shen] Fix test failed
bccd515 [Hong Shen] Fix test failed
8e77408 [Hong Shen] Fix test failed
c1dfda1 [Hong Shen] Fix test failed
e197e20 [Hong Shen] Fix test failed
fb5df97 [Hong Shen] Remove ExpireDeadHosts in BlockManagerMessages
b5c0441 [Hong Shen] Remove expireDeadHosts in BlockManagerMasterActor
c922cb0 [Hong Shen] Add expireDeadHosts in HeartbeatReceiver
If a blockManager has not send heartBeat more than 120s, BlockManagerMasterActor will remove it. But coarseGrainedSchedulerBackend can only remove executor after an DisassociatedEvent. We should expireDeadHosts at HeartbeatReceiver.

Author: Hong Shen <hongshen@tencent.com>

Closes #4363 from shenh062326/my_change3 and squashes the following commits:

2c9a46a [Hong Shen] Change some code style.
1a042ff [Hong Shen] Change some code style.
2dc456e [Hong Shen] Change some code style.
d221493 [Hong Shen] Fix test failed
7448ac6 [Hong Shen] A minor change in sparkContext and heartbeatReceiver
b904aed [Hong Shen] Fix failed test
52725af [Hong Shen] Remove assert in SparkContext.killExecutors
5bedcb8 [Hong Shen] Remove assert in SparkContext.killExecutors
a858fb5 [Hong Shen] A minor change in HeartbeatReceiver
3e221d9 [Hong Shen] A minor change in HeartbeatReceiver
6bab7aa [Hong Shen] Change a code style.
07952f3 [Hong Shen] Change configs name and code style.
ce9257e [Hong Shen] Fix test failed
bccd515 [Hong Shen] Fix test failed
8e77408 [Hong Shen] Fix test failed
c1dfda1 [Hong Shen] Fix test failed
e197e20 [Hong Shen] Fix test failed
fb5df97 [Hong Shen] Remove ExpireDeadHosts in BlockManagerMessages
b5c0441 [Hong Shen] Remove expireDeadHosts in BlockManagerMasterActor
c922cb0 [Hong Shen] Add expireDeadHosts in HeartbeatReceiver

Author: Hong Shen <hongshen@tencent.com>

Closes #5793 from alexrovner/SPARK-5529-backport-1.3-v2 and squashes the following commits:

f238f94 [Hong Shen] [SPARK-5529][CORE]Add expireDeadHosts in HeartbeatReceiver
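The expiry idea in the commit description can be sketched roughly as follows. This is a simplified, self-contained model and not the code merged in this PR: `HeartbeatTracker`, `heartbeat`, and `liveExecutors` are hypothetical names, timestamps are passed in explicitly instead of read from a clock, and the periodic scheduling of the check is left out.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of heartbeat tracking; not Spark's HeartbeatReceiver.
class HeartbeatTracker(timeoutMs: Long) {
  // executor ID -> timestamp (ms) of the last heartbeat received
  private val executorLastSeen = new mutable.HashMap[String, Long]

  // Record a heartbeat from an executor at the given time.
  def heartbeat(executorId: String, nowMs: Long): Unit =
    executorLastSeen(executorId) = nowMs

  // Remove and return (sorted) the executors whose last heartbeat
  // is older than timeoutMs relative to nowMs.
  def expireDeadHosts(nowMs: Long): List[String] = {
    val dead = executorLastSeen.collect {
      case (id, lastSeenMs) if nowMs - lastSeenMs > timeoutMs => id
    }.toList.sorted
    dead.foreach(id => executorLastSeen.remove(id))
    dead
  }

  def liveExecutors: Set[String] = executorLastSeen.keySet.toSet
}
```

In a real driver the returned IDs would be handed to the scheduler and backend so the executors are actually removed; here they are only returned to the caller.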