Replies: 1 comment
@doutriaux1 @jwhite242 @koning I would love to start a discussion and get the ball rolling on this. Please let me know your thoughts. Apologies for the length of the post; I tried to encapsulate most of what Luc and I had discussed via Teams on this subject.
Problem Description
Users often struggle with the setup of workers and how they are distributed across an allocation. It would be nice if we had a more user-friendly way to describe worker launch instructions.
Current Setup
Currently we rely on users to create:
1. A worker launch script that calls merlin run-workers
2. Worker definitions (in the spec file) where:
   a. each worker must define the total number of nodes to run on (defaults to 'all'), and
   b. each worker must also define the number of workers to start on each node (the concurrency value).
Example
A user creates the following spec file:
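The spec file itself didn't survive this extract. As a rough reconstruction only, assuming a Slurm machine with placeholder commands, bank, and queue names, it could have looked something like this (the nodes and concurrency values match the numbers discussed below):

```yaml
description:
  name: worker_example
  description: Example workflow for this worker discussion

batch:
  type: slurm
  bank: baasic        # placeholder bank
  queue: pbatch       # placeholder queue

study:
  - name: step_1
    description: Run step 1
    run:
      cmd: echo "step 1"
      nodes: 2                  # 2 nodes requested per task of step_1
      procs: 2
      task_queue: step_1_queue

  - name: step_2
    description: Run step 2
    run:
      cmd: echo "step 2"
      nodes: 1                  # 1 node requested per task of step_2
      procs: 1
      task_queue: step_2_queue

merlin:
  resources:
    workers:
      step_1_workers:
        args: -l INFO --concurrency 4   # 4 workers started on each node
        nodes: 1                        # run these workers on 1 node
        steps: [step_1]
      step_2_workers:
        args: -l INFO --concurrency 1   # 1 worker started on each node
        nodes: 2                        # spread these workers across 2 nodes
        steps: [step_2]
```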
They also create the following worker launch script:
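Likewise, the launch script isn't in this extract. A sketch based loosely on the Slurm worker-launch examples in the Merlin docs, with placeholder account and partition names:

```bash
#!/bin/bash
#SBATCH -N 8                     # 8 nodes for the entire allocation
#SBATCH -J merlin_workers
#SBATCH -t 00:30:00
#SBATCH -p pbatch                # placeholder partition
#SBATCH -A baasic                # placeholder account
#SBATCH -o merlin_workers_%j.out

YAML=my_spec.yaml

# Activate the virtual environment that merlin is installed in
source /path/to/merlin_venv/bin/activate

# Echo the celery commands that will be used to start the workers
merlin run-workers ${YAML} --echo

# Start the workers defined in the spec
merlin run-workers ${YAML}

# Keep the allocation alive until the workers are done
merlin monitor ${YAML}
```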
Here we see that 8 nodes are requested for the entire allocation (from the worker launch script header), 1 node is requested by step_1_workers, 2 nodes are requested by step_2_workers, 2 nodes are requested per task of step_1, and 1 node is requested per task of step_2... What a mess!!

This example is getting us an 8 node allocation. It is then starting 4 step_1_workers on one node (in other words, 4 cores of this node will be used). Simultaneously, it is starting 2 step_2_workers across 2 nodes (in other words, one core on one node will host a step_2_worker and so will one core on another node). In total that gives us 6 cores being used across 3 nodes for our workers.

On top of this, users also have to consider how the scheduler settings in each step will behave with their allocation. As you can see, for a beginning Merlin user this can be very confusing (hell, it even gets confusing for me!).
Proposed Solution
This solution has 3 parts that I can think of:
1. Generating/launching the worker-launch script for the user
2. Generalizing the worker settings
3. (Optional) Determining the allocation size of the workflow

I'll discuss each in more detail below.
Generating/Launching Worker-Launch Script
It may be easier for users if we handle the generation and launching of a batch script for them. This will help them keep the definition of their entire workflow (workers included) to just the spec file.
The way I see it, there are 2 options for this:
1. Add this functionality to the existing merlin run-workers command
2. Create a new command, e.g. merlin create-batch

My only concern with creating a new command is that it just adds one more command required for users to execute their workflow.
Either way here, behind the scenes the command would generate the header of the batch file from the batch block of the spec file. For example, consider the following batch block:
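The batch block from the original post isn't shown in this extract; a minimal sketch, assuming Slurm with placeholder bank, queue, and walltime values:

```yaml
batch:
  type: slurm
  bank: baasic          # placeholder bank
  queue: pbatch         # placeholder queue
  walltime: "01:00:00"
  nodes: 8
```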
On a non-flux native machine, this would generate a workers.sbatch file whose header is built from those batch settings (a sketch of that header is shown below). This could then grab the current path to merlin and add it to the PATH variable. Additionally, this would establish the correct Celery launch commands to put in the batch script (more in the next section).
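Roughly, the generated header could translate that batch block into scheduler directives like the following; the job name and output file are assumptions:

```bash
#!/bin/bash
#SBATCH -N 8                        # from batch.nodes
#SBATCH -p pbatch                   # from batch.queue
#SBATCH -A baasic                   # from batch.bank
#SBATCH -t 01:00:00                 # from batch.walltime
#SBATCH -J merlin_workers           # assumed job name
#SBATCH -o merlin_workers_%j.out    # assumed output file

# Put the merlin that generated this script on the PATH
export PATH="/path/to/merlin_venv/bin:${PATH}"
```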
Generalizing the Worker Settings
Right now, determining how many workers you have requires you to specify both nodes and concurrency for each worker. It would be easier if we were instead able to say something like "I want worker 1 to have 50 workers and worker 2 to have 10". In this scenario, the user would not know how many nodes the workers run on; they would only know that they're getting the correct number of workers.

The implementation of this could look something like:
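The mock-up from the original post isn't in this extract. As a rough illustration only, the idea could replace the per-worker nodes/concurrency pair with a single hypothetical count field, here called num_workers:

```yaml
merlin:
  resources:
    workers:
      step_1_workers:
        num_workers: 50    # hypothetical key: "I want 50 of these workers"
        steps: [step_1]
      step_2_workers:
        num_workers: 10    # hypothetical key: "I want 10 of these workers"
        steps: [step_2]
```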
With a setup like this, Merlin could then spread the workers out as necessary across cores of the allocation.
Combining this with the previous section, we could generate the correct scheduler launch command (flux run, srun, etc.) to spread workers across the allocation as necessary. These scheduler launch commands could be written to the worker-launch script to replace where the merlin run-workers command typically goes.
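As an illustration only, using the numbers from the example above, the generated lines in workers.sbatch could look something like this; the celery arguments and queue names are assumptions:

```bash
# 4 step_1_workers packed onto one node (one celery worker with concurrency 4)
srun -N 1 -n 1 celery -A merlin worker -l INFO --concurrency 4 -Q step_1_queue &

# 2 step_2_workers spread across 2 nodes, one per node
srun -N 2 -n 2 --ntasks-per-node=1 celery -A merlin worker -l INFO --concurrency 1 -Q step_2_queue &

# Wait for all workers to come down
wait
```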
(Optional) Determine Allocation Size of Workflow
This part of the solution is more of an optional piece. Instead of grabbing the number of nodes from the batch block of the spec, we could instead calculate the number of nodes/cores in an allocation for the user based on the worker settings and the resources each step requests. Since each worker is assigned to certain steps of the workflow, we can look at each step it's assigned to and determine the max number of nodes/cores that each worker will need. We can use these values to determine the total size of the allocation for users rather than relying on them to get it right.
Special thanks to @lucpeterson for starting this long discussion with me.