
Override _new_worker_name to make using Job Arrays easier #480

Merged: 3 commits into dask:master on Jan 20, 2021

Conversation

@jmuchovej (Contributor)

This is based on what I've found to work on a SLURM cluster (https://openmind.mit.edu), reported in dask/dask#7070 (comment), as an attempt to address the errors that arise when submitting Job Arrays, using a variant of #196 (comment).
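For illustration, here is a minimal sketch of the kind of setup this enables (my own example, not code from this PR; the cluster parameters are placeholders):

```python
# With _new_worker_name overridden, environment variables in `name` are
# left intact in the job script, so each SLURM array task expands
# $SLURM_ARRAY_TASK_ID itself and gets a unique Dask worker name.
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=4,
    memory="8GB",
    # Expanded by the shell inside each array task at run time, e.g.
    # "dask-worker-0", "dask-worker-1", ...
    name="dask-worker-$SLURM_ARRAY_TASK_ID",
    # --array is a standard sbatch directive; job_extra adds it to the
    # submission script, so one submission becomes four array tasks.
    job_extra=["--array=0-3"],
)
cluster.scale(jobs=1)  # submit a single job script, expanded into 4 tasks
```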

@lesteve (Member) commented Jan 19, 2021

That's an interesting idea indeed, to make it slightly less convoluted to use job arrays with Dask-Jobqueue. Doing this at the SpecCluster level (in distributed) was considered too invasive at one point, but I think it's probably fine to do it at the JobQueueCluster level (in dask-jobqueue).

Could you add some tests to make sure that the cluster name reaches the worker name? I think something close to the job array use case would be:

```python
cluster = SLURMCluster(name="test-$MY_ENV_VARIABLE", env_extra=["MY_ENV_VARIABLE=my-env-variable-value"], ...)
```

(note that env_extra takes a list of lines to add to the job script)

You can then get the worker names:

```python
[w["id"] for w in client.scheduler_info()["workers"].values()]
```

and check that they start with "my-env-variable-value-".
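Putting those pieces together, a rough sketch of what such a test could look like (placeholder resources and a made-up test name; the test actually committed may differ):

```python
# Needs a real SLURM cluster, since it launches workers via .scale().
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

def test_cluster_name_reaches_worker_name():
    with SLURMCluster(
        cores=1,
        memory="2GB",
        name="test-$MY_ENV_VARIABLE",
        env_extra=["MY_ENV_VARIABLE=my-env-variable-value"],
    ) as cluster:
        cluster.scale(jobs=2)  # actually submit jobs so workers start
        with Client(cluster) as client:
            client.wait_for_workers(2)
            worker_names = [
                w["id"] for w in client.scheduler_info()["workers"].values()
            ]
            # The env variable was expanded inside the job, so it shows
            # up in every worker name.
            assert all(
                name.startswith("my-env-variable-value-")
                for name in worker_names
            )
```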

@lesteve (Member) commented Jan 19, 2021

By the way, about the CI failures:

  • the SGE timeout is known; it has been like this for some months now, and I haven't found the time to investigate
  • the CI / build (none) failure is likely because of black issues

@jmuchovej (Contributor, Author) commented Jan 19, 2021

CI:

  • the CI / build (none) is likely because of black issues

I see this now. Based on the CI output, I'll have to fix:

```
dask_jobqueue/core.py:592:1: W293 blank line contains whitespace
dask_jobqueue/core.py:595:1: W293 blank line contains whitespace
dask_jobqueue/core.py:602:1: W293 blank line contains whitespace
```

Testing:

As for the tests: okay, I wasn't too sure where to put them or what to test. Thanks for the ideas!

Just to be sure, testing code should be placed in test_jobqueue_core.py, right? I'll have to look at this over the coming weekend.

@lesteve (Member) commented Jan 19, 2021

I ended up trying to add a test, remembering that the testing setup is not that trivial to get up and running (a docker and docker-compose setup to mimic SLURM, SGE, etc. clusters), and eventually pushed it into your branch.

Because this test needs Dask workers, it needs a real cluster (i.e. you need to .scale, in other words launch some jobs), whereas test_jobqueue_core.py is more for tests that don't need a real cluster (for example, ones that only check the submission script via cluster.job_script()). It was simpler to add it only in test_slurm.py.
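For contrast, an illustrative sketch (not an actual test from the repo) of the kind of cluster-free check that fits in test_jobqueue_core.py:

```python
# Only inspects the generated submission script, so no jobs are
# submitted and no SLURM installation is needed.
from dask_jobqueue import SLURMCluster

def test_worker_name_in_job_script():
    with SLURMCluster(
        cores=1, memory="2GB", name="test-$MY_ENV_VARIABLE"
    ) as cluster:
        job_script = cluster.job_script()
        # The name should be written verbatim, leaving $MY_ENV_VARIABLE
        # for the shell to expand inside the job at run time.
        assert "test-$MY_ENV_VARIABLE" in job_script
```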

Let's see what the CI has to say about my last commit 🤞

@lesteve lesteve changed the title Add Worker name to create uniqueness, allowing for Job Arrays Override _new_worker_name to make using Job Arrays easier Jan 20, 2021
@lesteve lesteve merged commit ea67a99 into dask:master Jan 20, 2021
@lesteve (Member) commented Jan 20, 2021

Merging, thanks a lot for this!

@jmuchovej (Contributor, Author)

A bit late, but it may be worth keeping the original "NOTE" in the docs? Unless the expectation is that folks will search the issues before using job arrays.

@lesteve (Member) commented Jan 20, 2021

I think people will find it in the github issue / github discussion by googling with more context.

I felt like it was a bit too much info hidden in a docstring of an obscure method ...

@guillaumeeb (Member)

If you think this could be helpful to other people, you might consider documenting the use of job arrays with SLURM in the docs!
