Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster manager fixes #30172

Merged
merged 3 commits into from
Dec 29, 2018
Merged

cluster manager fixes #30172

merged 3 commits into from
Dec 29, 2018

Conversation

bjarthur
Copy link
Contributor

fixes #30031

the two commit messages say it all.

i'll happily add a third commit to fix the REPL being blocked during worker startup if someone gives me a hint as to what to change. this is a problem when the remote worker ends up in the pending queue on a busy cluster. in julia 0.6 the remote workers were launched in the background. i've tried making a few changed to the various @asyncs that are sprinked through the relevant code, but nothing has worked so far.

@bjarthur
Copy link
Contributor Author

@amitmurthy could you please review? thanks.

@vchuravy
Copy link
Sponsor Member

This looks okay, but it would be good to have a test

@kshyatt kshyatt added the domain:parallelism Parallel or distributed computation label Nov 28, 2018
@ViralBShah
Copy link
Member

@amitmurthy may not be watching.

@amitmurthy
Copy link
Contributor

Why don't we just remove "failed" from

println("\tFrom failed worker startup:\t", line)
and leave everything else as is? It will be good to print all output from the workers prior to reading the host/port info - even the informational stuff.

@bjarthur bjarthur force-pushed the bja/launchworkers branch 2 times, most recently from 7592ab8 to ef52319 Compare December 16, 2018 21:17
@bjarthur
Copy link
Contributor Author

bjarthur commented Dec 18, 2018

lastly, i've added documentation as to how to asynchronously launch workers.

it'd be great if this PR could be reviewed again. thanks for the previous comments. much better now.

Copy link
Sponsor Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

@bjarthur bjarthur force-pushed the bja/launchworkers branch 2 times, most recently from b8aa79f to 271e501 Compare December 27, 2018 20:33
@amitmurthy
Copy link
Contributor

LGTM

@ViralBShah ViralBShah merged commit 121e814 into JuliaLang:master Dec 29, 2018
@bjarthur bjarthur deleted the bja/launchworkers branch December 29, 2018 14:11
staticfloat pushed a commit that referenced this pull request Dec 30, 2018
* kill workers which don't launch properly

* don't emit spurious error messages

* document how to asynchronously launch workers
@KristofferC KristofferC mentioned this pull request Dec 30, 2018
53 tasks
KristofferC pushed a commit that referenced this pull request Dec 30, 2018
* kill workers which don't launch properly

* don't emit spurious error messages

* document how to asynchronously launch workers

(cherry picked from commit 121e814)
@bjarthur
Copy link
Contributor Author

thanks! @amitmurthy could you please review the related JuliaParallel/ClusterManagers.jl#74 too?

staticfloat pushed a commit that referenced this pull request Jan 4, 2019
* kill workers which don't launch properly

* don't emit spurious error messages

* document how to asynchronously launch workers
@StefanKarpinski StefanKarpinski added status:triage This should be discussed on a triage call backport 1.0 and removed status:triage This should be discussed on a triage call labels Jan 31, 2019
@JeffBezanson JeffBezanson removed backport 1.0 status:triage This should be discussed on a triage call labels Jan 31, 2019
@JeffBezanson
Copy link
Sponsor Member

We can backport this to 1.0 if it's critical for somebody's cluster environment, but for now we won't plan to.

Keno pushed a commit that referenced this pull request Jun 5, 2024
* kill workers which don't launch properly

* don't emit spurious error messages

* document how to asynchronously launch workers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:parallelism Parallel or distributed computation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

failed worker startup
7 participants