do not lock up addprocs on worker setup errors #32290

Closed
wants to merge 17 commits

Conversation

tanmaykm
Member

`create_worker` (invoked during `addprocs`) waits on a message from the worker to indicate success. If the worker process terminates before sending this response, `create_worker` (and therefore `addprocs`) will remain locked up.

Usually the master process does become aware of a terminated worker: the communication channel between them breaks when the worker exits, and the message-processing loop exits as a result. But `create_worker` keeps waiting regardless.

This commit introduces an additional task (a timer) that monitors the message-processing loop while the master is waiting for a `JoinCompleteMsg` response from a worker. It makes `create_worker` return both when setup is successful (the master receives a `JoinCompleteMsg`) and when the worker is terminated. The return value of `create_worker` is 0 when worker setup fails, instead of the worker id when it is successful. The return value of `addprocs` contains only workers that were successfully launched and connected to. Added tests.
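For illustration, a minimal sketch of the monitoring idea, not the PR's actual code; `join_ntfy` and `process_messages_task` are placeholder names:

```julia
# Sketch only: a Channel stands in for the join notification, and a Task
# stands in for the per-worker message-processing loop.
function wait_for_join(join_ntfy::Channel{Symbol}, process_messages_task::Task)
    @async begin
        # If the message-processing loop exits (e.g. the worker died before
        # sending JoinCompleteMsg), unblock the waiter with a failure flag.
        try
            wait(process_messages_task)
        catch
        finally
            isready(join_ntfy) || put!(join_ntfy, :ERROR)
        end
    end
    # The handler for JoinCompleteMsg is assumed to put! :OK here.
    return take!(join_ntfy) === :OK
end
```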

tanmaykm added a commit to tanmaykm/julia that referenced this pull request Jun 12, 2019
This is a corollary to the previous commit in JuliaLang#32290 and implements suggestions from it.

It restricts the master to wait for a worker to respond within `Distributed.worker_timeout()` seconds. Beyond that it releases the lock on `rr_ntfy_join` with a special flag `:TIMEDOUT`. The flag is set to `:ERROR` in case of any errors during worker setup, and to `:OK` when the master has received a `JoinCompleteMsg` from the worker indicating setup completion.

`addprocs` includes a worker id in the list of workers it added only if it has received a `JoinCompleteMsg`, that is, only when `rr_ntfy_join` contains `:OK`. Note that the worker process may not be dead yet, and it may still be listed in `workers()` until it actually goes down.
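For illustration, a sketch of the timed wait, assuming a `Channel{Symbol}` in place of `rr_ntfy_join`; the helper name is hypothetical:

```julia
using Distributed

# Sketch: wait at most Distributed.worker_timeout() seconds for the worker's
# join notification; whichever of the timer or the handshake finishes first
# records the flag.
function wait_for_worker_setup(join_ntfy::Channel{Symbol})
    timer = Timer(Distributed.worker_timeout()) do _
        isready(join_ntfy) || put!(join_ntfy, :TIMEDOUT)
    end
    flag = take!(join_ntfy)   # :OK, :ERROR, or :TIMEDOUT
    close(timer)
    return flag
end

# addprocs-style filtering would then keep only workers whose flag is :OK, e.g.
# successful = [pid for (pid, flag) in results if flag === :OK]
```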
stdlib/Distributed/src/cluster.jl (outdated review comments, resolved)
stdlib/Distributed/test/distributed_exec.jl (outdated review comments, resolved)
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Jun 12, 2019
phyatt-corp added a commit to Conning/julia that referenced this pull request Jun 12, 2019
@tanmaykm tanmaykm changed the title from "do not lock up addproc on worker setup errors" to "do not lock up addprocs on worker setup errors" Jun 12, 2019
@amitmurthy
Contributor

I just realized that with this PR there is now a difference in the way we handle errors during the initial master-worker connection setup versus errors during the handshake or later (while `addprocs` has still not returned).

Should we just

  1. print warnings for each failed worker startup (under any circumstances), and
  2. document that `addprocs` may return fewer than the requested number of workers?

In any case, the fact that `addprocs` may return fewer than the requested number of workers needs to be documented. While that is OK going forward, it may not be a good idea to backport it?

@amitmurthy
Contributor

Needs docs and further discussions before merging.

@tanmaykm
Member Author

Agreed. I will wait a while for further opinions/discussions to conclude before updating the docs.

s2maki added a commit to Conning/julia that referenced this pull request Jun 15, 2019
@tanmaykm tanmaykm added the domain:parallelism Parallel or distributed computation label Jul 2, 2019
s2maki pushed a commit to Conning/julia that referenced this pull request Jul 17, 2019
s2maki pushed a commit to Conning/julia that referenced this pull request Jul 17, 2019
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Aug 12, 2019
@tanmaykm
Member Author

Not sure why CI is failing, though tests pass on my local machine. Will investigate this.

@amitmurthy
Contributor

Putting this up for discussion: shouldn't worker connect errors, or errors reading the host/port, be converted to warnings and ignored too? That is, when we cannot connect to some newly launched workers, `addprocs` would return only the workers actually launched and print warnings, instead of throwing errors as it currently does.

For example, the errors thrown here and here, and tested here.
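For illustration, a sketch of the warn-and-continue behaviour being discussed (not the PR's code); `create_worker_somehow` is a hypothetical stand-in, not an actual Distributed function:

```julia
# Sketch: instead of letting one bad worker abort addprocs, log a warning
# and return whichever workers did come up.
function setup_workers(configs)
    pids = Int[]
    for config in configs
        try
            push!(pids, create_worker_somehow(config))  # hypothetical helper
        catch err
            @warn "worker setup failed; skipping" exception = (err, catch_backtrace())
        end
    end
    return pids  # may contain fewer entries than length(configs)
end
```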

@tanmaykm
Member Author

tanmaykm commented Aug 14, 2019

It looks like connecting to the non-routable IP 203.0.113.0, used to simulate a connect timeout, fails immediately in the CI environment instead of blocking (it works fine locally though). See `CONNECT FAILED in 0.05166292190551758 secs` in the build logs here.

I do not see any way to simulate that condition reliably, so for now I'll move that test under the `JULIA_TESTFULL=1` condition instead.

@tanmaykm
Member Author

And yes, it seems to me that we can now turn some or all of the worker connect errors into warnings.

@tanmaykm
Member Author

tanmaykm commented Aug 14, 2019

CI has passed except for one failure in buildbot/package_win32, which seems unrelated. Can someone trigger that again? (Triggered it again.)

@tanmaykm
Member Author

CI is passing now. We can take up and discuss #32290 (comment) as a separate PR.

Does this look okay to merge?

@tanmaykm
Member Author

Discovered that `addprocs` can also lock up if the connect to the worker freezes here. Will add more changes to fix that in a bit.

@tanmaykm
Member Author

Also, the master should not exit with an error if it fails to issue a remote kill to a deregistered worker. Will make changes for that too.
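For illustration, a sketch of tolerating a failed remote kill, assuming `remote_do(exit, pid)` as the kill mechanism and a hypothetical `try_remote_exit` helper (the PR additionally distinguishes whether the exception refers to the same process):

```julia
using Distributed

# Sketch: tolerate a failure when asking a (possibly already dead) worker
# to exit during deregistration, instead of turning it into a fatal error.
function try_remote_exit(pid)
    try
        remote_do(exit, pid)
    catch err
        # A ProcessExitedException here usually just means the worker is
        # already gone, so stay quiet; otherwise warn and show the exception.
        if !(err isa ProcessExitedException)
            @warn "could not issue remote exit to worker $pid" exception = err
        end
    end
end
```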

end
@test length(npids) == 0
@test nprocs() == 1
@test Distributed.worker_timeout() < t < 360.0
Sponsor Member

Adding up to 6 minutes to the test is probably unacceptable. We should use the existing environment variable to set this timeout to something very small.
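For illustration, a sketch of that suggestion, assuming the existing `JULIA_WORKER_TIMEOUT` environment variable that `Distributed.worker_timeout()` reads:

```julia
using Distributed, Test

# Shrink the worker timeout only for this test so the failure path is
# exercised in a few seconds instead of the multi-minute worst case.
withenv("JULIA_WORKER_TIMEOUT" => "5") do
    @test Distributed.worker_timeout() == 5.0
    # ... run the failing-worker addprocs scenario here; it should now
    # give up after roughly 5 seconds ...
end
```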

Member Author

Agreed. Pushing a fix in a bit.

Member Author

Done in c399338.

tanmaykm added a commit to tanmaykm/julia that referenced this pull request May 13, 2020
@vtjnash
Sponsor Member

vtjnash commented Apr 13, 2021

Bump. We fixed most CI issues, so could you push a rebase? We can try to review this soon.

Do not throw a fatal error if we could not issue a remote exit to kill the worker while deregistering.

It is possible for a `connect` call from master to worker during worker setup to hang indefinitely. This adds a timeout to handle that, so that the master does not lock up as a result. It simply deregisters and terminates the worker and carries on.

- show the exception along with the message when the kill fails
- but do not warn if the error is a ProcessExitedException for the same process

Also, the additional async timeout task introduced in JuliaLang#34502 will not be required, because this PR handles that already and also differentiates between a timeout and an error.
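For illustration, a sketch of bounding a blocking connect with a timeout; `connect_with_timeout` is a hypothetical helper, not the code in this PR:

```julia
using Sockets

# Sketch: race the blocking connect against a timer; the first result wins.
function connect_with_timeout(host, port, timeout)
    result = Channel{Any}(2)   # capacity 2 so neither task blocks on put!
    @async try
        put!(result, connect(host, port))
    catch err
        put!(result, err)
    end
    @async begin
        sleep(timeout)
        isready(result) || put!(result, ErrorException("connect timed out"))
    end
    r = take!(result)
    r isa Exception && throw(r)
    return r
end
```

On timeout this sketch simply abandons the dangling connect task; the actual change would also deregister and terminate the half-set-up worker, as described above.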
@tanmaykm
Member Author

@vtjnash It is now rebased. Couldn't figure out why the Windows and macOS tests failed, though.

@vtjnash
Sponsor Member

vtjnash commented Apr 14, 2021

It looks like you changed the setup code for that test to remove all workers instead of adding them?

@vtjnash vtjnash requested a review from vchuravy April 14, 2021 19:08
@ViralBShah
Member

ViralBShah commented Feb 25, 2022

@tanmaykm Bumping on this old thread. Is it still ok to merge?

@vchuravy Can you review?

@ViralBShah
Member

@tanmaykm Any thoughts here? It's an old PR, but if it is still good to go, should we put some effort into getting it merged?

@vtjnash
Sponsor Member

vtjnash commented Feb 11, 2024

Moved to JuliaLang/Distributed.jl#61

@vtjnash vtjnash closed this Feb 11, 2024