do not lock up addprocs on worker setup errors #32290

Closed
wants to merge 17 commits

Conversation

tanmaykm
Member

`create_worker` (invoked during `addprocs`) waits on a message from the worker to indicate success. If the worker process terminates before sending this response, `create_worker` (and therefore `addprocs`) will remain locked up.

Usually the master process does become aware of a terminated worker: the communication channel between them breaks when the worker exits, and the message-processing loop exits as a result. But `create_worker` keeps waiting regardless.

This commit introduces an additional task (a timer) that monitors the message-processing loop while the master is waiting for a `JoinCompleteMsg` response from a worker. It makes `create_worker` return both when setup is successful (the master receives a `JoinCompleteMsg`) and when the worker is terminated. The return value of `create_worker` is 0 when worker setup fails, instead of the worker id when it is successful. The return value of `addprocs` contains only workers that were successfully launched and connected to. Added tests.
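For illustration, a minimal sketch of the monitoring idea, not the PR's actual code; `join_ntfy` and `process_messages_task` are placeholder names:

```julia
# Sketch only: a Channel stands in for the join notification, and a Task
# stands in for the per-worker message-processing loop.
function wait_for_join(join_ntfy::Channel{Symbol}, process_messages_task::Task)
    @async begin
        # If the message-processing loop exits (e.g. the worker died before
        # sending JoinCompleteMsg), unblock the waiter with a failure flag.
        try
            wait(process_messages_task)
        catch
        finally
            isready(join_ntfy) || put!(join_ntfy, :ERROR)
        end
    end
    # The handler for JoinCompleteMsg is assumed to put! :OK here.
    return take!(join_ntfy) === :OK
end
```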

tanmaykm added a commit to tanmaykm/julia that referenced this pull request Jun 12, 2019
This is a corollary to the previous commit in JuliaLang#32290 and implements suggestions from it.

It restricts the master to wait for a worker to respond within `Distributed.worker_timeout()` seconds. Beyond that it releases the lock on `rr_ntfy_join` with a special flag `:TIMEDOUT`. The flag is set to `:ERROR` in case of any errors during worker setup, and to `:OK` when the master has received a `JoinCompleteMsg` from the worker indicating setup completion.

`addprocs` includes a worker id in the list of workers it added only if it has received a `JoinCompleteMsg`, that is, only when `rr_ntfy_join` contains `:OK`. Note that the worker process may not be dead yet, and it may still be listed in `workers()` until it actually goes down.
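For illustration, a sketch of the timed wait, assuming a `Channel{Symbol}` in place of `rr_ntfy_join`; the helper name is hypothetical:

```julia
using Distributed

# Sketch: wait at most Distributed.worker_timeout() seconds for the worker's
# join notification; whichever of the timer or the handshake finishes first
# records the flag.
function wait_for_worker_setup(join_ntfy::Channel{Symbol})
    timer = Timer(Distributed.worker_timeout()) do _
        isready(join_ntfy) || put!(join_ntfy, :TIMEDOUT)
    end
    flag = take!(join_ntfy)   # :OK, :ERROR, or :TIMEDOUT
    close(timer)
    return flag
end

# addprocs-style filtering would then keep only workers whose flag is :OK, e.g.
# successful = [pid for (pid, flag) in results if flag === :OK]
```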
stdlib/Distributed/src/cluster.jl (outdated review comments, resolved)
stdlib/Distributed/test/distributed_exec.jl (outdated review comments, resolved)
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Jun 12, 2019
phyatt-corp added a commit to Conning/julia that referenced this pull request Jun 12, 2019
@tanmaykm tanmaykm changed the title from "do not lock up addproc on worker setup errors" to "do not lock up addprocs on worker setup errors" Jun 12, 2019
@amitmurthy
Contributor

I just realized that with this PR there is now a difference in the way we handle errors during the initial master-worker connection setup versus errors during the handshake or later (while `addprocs` has still not returned).

Should we just

  1. print warnings for each failed worker startup (under any circumstances), and
  2. document that `addprocs` may return fewer than the requested number of workers?

In any case, the fact that `addprocs` may return fewer than the requested number of workers needs to be documented. While that is OK going forward, it may not be a good idea to backport it?

@amitmurthy
Contributor

Needs docs and further discussions before merging.

@tanmaykm
Member Author

Agreed. I will wait a while for further opinions/discussions to conclude before updating the docs.

s2maki added a commit to Conning/julia that referenced this pull request Jun 15, 2019
@tanmaykm tanmaykm added the domain:parallelism Parallel or distributed computation label Jul 2, 2019
s2maki pushed a commit to Conning/julia that referenced this pull request Jul 17, 2019
s2maki pushed a commit to Conning/julia that referenced this pull request Jul 17, 2019
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Aug 12, 2019
@tanmaykm
Member Author

Not sure why CI is failing, though tests pass on my local machine. Will investigate this.

@amitmurthy
Contributor

Putting this up for discussion: shouldn't worker connect errors, or errors reading the host/port, be converted to warnings and ignored too? That is, when we cannot connect to some newly launched workers, `addprocs` would return only the workers actually launched and print warnings, instead of throwing errors as it currently does.

For example, the errors thrown here and here, and tested here.
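For illustration, a sketch of the warn-and-continue behaviour being discussed (not the PR's code); `create_worker_somehow` is a hypothetical stand-in, not an actual Distributed function:

```julia
# Sketch: instead of letting one bad worker abort addprocs, log a warning
# and return whichever workers did come up.
function setup_workers(configs)
    pids = Int[]
    for config in configs
        try
            push!(pids, create_worker_somehow(config))  # hypothetical helper
        catch err
            @warn "worker setup failed; skipping" exception = (err, catch_backtrace())
        end
    end
    return pids  # may contain fewer entries than length(configs)
end
```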

@tanmaykm
Member Author

tanmaykm commented Aug 14, 2019

It looks like connecting to the non-routable IP 203.0.113.0, used to simulate a connect timeout, fails immediately in the CI environment instead of blocking (it works fine locally though). See `CONNECT FAILED in 0.05166292190551758 secs` in the build logs here.

I do not see any way to simulate that condition reliably, so for now I'll move that test under the `JULIA_TESTFULL=1` condition instead.

@tanmaykm
Member Author

And yes, it seems to me that we can now turn some or all of the worker connect errors into warnings.

@tanmaykm
Member Author

tanmaykm commented Aug 14, 2019

CI has passed except for one failure in buildbot/package_win32, which seems unrelated. Can someone trigger that again? (Triggered it again.)

@tanmaykm
Member Author

CI is passing now. We can take up and discuss #32290 (comment) as a separate PR.

Does this look okay to merge?

@tanmaykm
Member Author

Discovered that `addprocs` can also lock up if the connect to the worker freezes here. Will add more changes to fix that in a bit.

@tanmaykm
Member Author

Also, the master should not exit with an error if it fails to issue a remote kill to a deregistered worker. Will make changes for that too.
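For illustration, a sketch of tolerating a failed remote kill, assuming `remote_do(exit, pid)` as the kill mechanism and a hypothetical `try_remote_exit` helper (the PR additionally distinguishes whether the exception refers to the same process):

```julia
using Distributed

# Sketch: tolerate a failure when asking a (possibly already dead) worker
# to exit during deregistration, instead of turning it into a fatal error.
function try_remote_exit(pid)
    try
        remote_do(exit, pid)
    catch err
        # A ProcessExitedException here usually just means the worker is
        # already gone, so stay quiet; otherwise warn and show the exception.
        if !(err isa ProcessExitedException)
            @warn "could not issue remote exit to worker $pid" exception = err
        end
    end
end
```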

end
@test length(npids) == 0
@test nprocs() == 1
@test Distributed.worker_timeout() < t < 360.0
Sponsor Member

Adding up to 6 minutes to the test is probably unacceptable. We should use the existing environment variable to set this timeout to something very small.
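For illustration, a sketch of that suggestion, assuming the existing `JULIA_WORKER_TIMEOUT` environment variable that `Distributed.worker_timeout()` reads:

```julia
using Distributed, Test

# Shrink the worker timeout only for this test so the failure path is
# exercised in a few seconds instead of the multi-minute worst case.
withenv("JULIA_WORKER_TIMEOUT" => "5") do
    @test Distributed.worker_timeout() == 5.0
    # ... run the failing-worker addprocs scenario here; it should now
    # give up after roughly 5 seconds ...
end
```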

Member Author

Agreed. Pushing a fix in a bit.

Member Author

Done in c399338.

tanmaykm added a commit to tanmaykm/julia that referenced this pull request May 13, 2020
@vtjnash
Sponsor Member

vtjnash commented Apr 13, 2021

Bump. We fixed most CI issues, so could you push a rebase? We can try to review this soon.

Do not throw a fatal error if we could not issue a remote exit to kill the worker while deregistering.

It is possible for a `connect` call from master to worker during worker setup to hang indefinitely. This adds a timeout to handle that, so that the master does not lock up as a result. It simply deregisters and terminates the worker and carries on.

- show the exception along with the message when the kill fails
- but do not warn if the error is a ProcessExitedException for the same process

Also, the additional async timeout task introduced in JuliaLang#34502 will not be required, because this PR handles that already and also differentiates between a timeout and an error.
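For illustration, a sketch of bounding a blocking connect with a timeout; `connect_with_timeout` is a hypothetical helper, not the code in this PR:

```julia
using Sockets

# Sketch: race the blocking connect against a timer; the first result wins.
function connect_with_timeout(host, port, timeout)
    result = Channel{Any}(2)   # capacity 2 so neither task blocks on put!
    @async try
        put!(result, connect(host, port))
    catch err
        put!(result, err)
    end
    @async begin
        sleep(timeout)
        isready(result) || put!(result, ErrorException("connect timed out"))
    end
    r = take!(result)
    r isa Exception && throw(r)
    return r
end
```

On timeout this sketch simply abandons the dangling connect task; the actual change would also deregister and terminate the half-set-up worker, as described above.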
@tanmaykm
Member Author

@vtjnash It is now rebased. Couldn't figure out why the Windows and macOS tests failed, though.

@vtjnash
Sponsor Member

vtjnash commented Apr 14, 2021

It looks like you changed the setup code for that test to remove all workers instead of adding them?

@vtjnash vtjnash requested a review from vchuravy April 14, 2021 19:08
@ViralBShah
Member

ViralBShah commented Feb 25, 2022

@tanmaykm Bumping on this old thread. Is it still ok to merge?

@vchuravy Can you review?

@ViralBShah
Member

@tanmaykm Any thoughts here? It's an old PR, but if it is still good to go, should we put some effort into getting it merged?

@vtjnash
Sponsor Member

vtjnash commented Feb 11, 2024

Moved to JuliaLang/Distributed.jl#61

@vtjnash vtjnash closed this Feb 11, 2024