-
-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SLURMRunner from jacobtomlinson/dask-hpc-runners #659
Conversation
I'm going to mark this as ready to review as all the tehnical work is done and the tests are passing (ignore the failed SGE test #653). It still needs docs but I'd love to get a review, even just a high level design review, before I dive into writing up the documentation. @guillaumeeb @lesteve @kmpaul if any of you have time to take a look I'd really appreciate it! |
I've gone over the documentation and given is an overhaul to make space for the new features.
|
Other maintainers of this repo expressed to me via email they they have limited capacity for reviews at the moment. To avoid being blocked on review any longer I'm going to self-merge this. But if anyone wants to revisit this work with a review down the line then I would encourage you to open an issue where we can discuss things. |
Towards #638
As we decided that the code from jacobtomlinson/dask-hpc-runners would be best to live in this repo this PR adds the
SLURMRunner
along with the base class and other core code.Design
The core assumption in the design of the runner model is that the same script will be executed many times by a job scheduler.
Within the script the runner class is created.
This will result in multiple processes runnning on an HPC that are all instantiating the runner class.
The processes need to coordinate to decide which process should run the Dask Scheduler, which should be Dask Workers and which should continue running the rest of the client code within the script. This coordination happens during the
__init__()
of the runner class.The Scheduler and Worker processes exit after they complete to avoid running the client code multiple times. This means that only one of the processes will continue past the
__init__()
of the runner class, the rest will exit at that point after the work is done.Base class
In the new file
runner.py
contains theBaseRunner
class that can be used for implementing other runners. It also includes anAsyncRunner
class that is used for testing.The minimum required to implement a new runner is the following methods.
The
BaseRunner
class handles starting up Dask once these methods have been implemented. It also provides many stubbed out hooks to allow you to write code that runs before/after each component is created. E.gBaseRunner.before_scheduler_start()
,BaseRunner.before_worker_start()
andBaseRunner.before_client_start()
.The runner must know the address of the scheduler so that it can coordinate the clean shutdown of all processes when we reach the end of the code (either via
__exit__()
or a finalizer). This communication happens independently of any clients that may be created.Slurm implementation
This PR also adds a Slurm implementation to
slurm.py
.In the
get_role()
method I use theSLURM_PROCID
environment variable to infer the role.I also add a default scheduler option to set the
scheduler_file="scheduler-{job_id}.json"
and I look up the Job ID from theSLURM_JOB_ID
environment variable to ensource uniqueness. This effectively allows us to broadcast the scheduler address via the shared filesystem.Then in the
get_scheduler_address()
method I wait for the scheduler file to exist and then open and read the address from the scheduler file in the same way thedask.distributed.Client
does.Example
Then I can submit this script via
srun
or ansbatch
script.TODO