Changes to Worker Launch Process #485

bgunnar5 · 2024-06-05T23:15:45Z

bgunnar5
Jun 5, 2024
Maintainer

Problem Description

Users often struggle with the setup of workers and how they are distributed across an allocation. It would be nice if we had a more user-friendly way to describe worker launch instructions.

Current Setup

Currently we rely on users to create:

A batch script that defines the total number of nodes/cores in an allocation and launches the workers with merlin run-workers
A workers block in their spec file that defines multiple workers
a. Each worker must define the total number of nodes to run on (defaults to 'all')
b. Each worker must also define the number of workers to start on each node (the concurrency value)

Example

A user creates the following spec file:

description:
    name: example_wf
    description: an example wf to show complexity of worker launch

batch:
    type: flux

study:
- name: step_1
  description: step 1
  run:
    cmd: $(LAUNCHER) echo "do something"
    nodes: 2
    cores: 8
- name: step_2
  description: step 2
  run:
    cmd: $(LAUNCHER) echo "do something else"
    nodes: 1
    gpus: 4 

merlin:
    resources:
        workers:
            step_1_workers:
                args: -l INFO --concurrency 4 --prefetch-multiplier 1 -Ofair
                steps: [step_1]
                nodes: 1
            step_2_workers:
                args: -l INFO --concurrency 1 --prefetch-multiplier 1 -Ofair
                steps: [step_2]
                nodes: 2

They also create the following worker launch script:

#!/bin/bash
#SBATCH -N 8
#SBATCH -J Merlin
#SBATCH -t 00:20:00
#SBATCH -p pdebug
#SBATCH -A wbronze
#SBATCH -o merlin_workers_%j.out

YAML=example_wf.yaml 

VENV=/path/to/merlin_venv

# Activate the virtual environment
source ${VENV}/bin/activate

#########################################
#          Running the Workers          #
#########################################

# Show the workers command
merlin run-workers ${YAML} --echo

# Start workers to run the tasks in the broker
merlin run-workers ${YAML}

# Keep the allocation alive until all workers stop
merlin monitor ${YAML}

Here we see that 8 nodes are requested for the entire allocation (from the worker launch script header), 1 node is requested by step_1_workers, 2 nodes are requested by step_2_workers, 2 nodes are requested per task of step_1, and 1 node is requested per task of step_2... What a mess!!

This example is getting us an 8 node allocation. It is then starting 4 step_1_workers on one node (in other words, 4 cores of this node will be used). Simultaneously, it is starting 2 step_2_workers across 2 nodes (in other words, one core on one node will host a step_2_worker and so will one core on another node). In total that gives us 6 cores being used across 3 nodes for our workers.

On top of this, users also have to consider how the scheduler settings in each step will behave with their allocation. As you can see, for a beginning Merlin user this can be very confusing (hell it even gets confusing for me!).

Proposed Solution

This solution has 3 parts that I can think of:

Generating and launching the batch script for users
Create more generalized settings for specifying the number of workers you want
(Optional) Determine allocation size for our users based on worker settings and scheduler specific settings within steps

I'll discuss each in more detail below.

Generating/Launching Worker-Launch Script

It may be easier for users if we handle the generation and launching of a batch script for them. This will help them keep the definition of their entire workflow (workers included) to just the spec file.

The way I see it, there are 2 options for this:

Expanding existing functionality of the merlin run-workers command
Creating a new command entirely; maybe something like merlin create-batch

My only concern with creating a new command is that just adds one more command required for users to execute their workflow.

Either way here, behind the scenes the command would generate the header of the batch file from the batch block of the spec file. For example, consider the following batch block:

batch:
    type: flux
    nodes: 8
    walltime: "00:20:00"
    queue: pdebug
    bank: wbronze

On a non-flux native machine, this would generate a workers.sbatch file with the following header:

#!/bin/bash
#SBATCH -N 8
#SBATCH -t 00:20:00
#SBATCH -p pdebug
#SBATCH -A wbronze
#SBATCH -J Merlin  -> generated by default
#SBATCH -o merlin_workers_%j.out  -> generated by default

This could then grab the current path to merlin and add it to the PATH variable. Additionally this would establish the correct celery launch commands to put in the batch script (more in the next section).

Generalizing the Worker Settings

Right now, determining how many workers you have requires you to specify both nodes and concurrency for each worker. It would be easier if we were instead able to say something like "I want worker 1 to have 50 workers and worker 2 to have 10". In this scenario, the user would not know how many nodes the worker is run on; they would only know that they're getting the correct amount of workers.

The implementation of this could look something like:

merlin:
    resources:
        workers:
            simworkers:
                number: 50
                steps: [step_1]
            nonsimworkers:
                number: 10
                steps: [step_2, step_3]

With a setup like this, Merlin could then spread the workers out as necessary across cores of the allocation.

Combining this with the previous section, we could generate the correct scheduler launch command (flux run, srun, etc.) to spread workers across the allocation as necessary. These scheduler launch commands could be written to the worker-launch script to replace where the merlin run-workers command typically goes.

(Optional) Determine Allocation Size of Workflow

This part of the solution is more of an optional piece. Instead of grabbing the number of nodes from the batch block of the spec, we could instead calculate the number of nodes/cores in an allocation for the user based on:

The number of workers
The scheduler settings in each step

Since each worker is assigned to certain steps of the workflow, then we can look at each step they're assigned to and determine the max number of nodes/cores that each worker will need. We can use these values to determine the total size of the allocation for users rather than relying on them to get it right.

Special thanks to @lucpeterson for starting this long discussion with me.

bgunnar5 · 2024-06-05T23:27:13Z

bgunnar5
Jun 5, 2024
Maintainer Author

@doutriaux1 @jwhite242 @koning I would love to start a discussion and get the ball rolling on this. Please let me know your thoughts. Apologies for the length of the post, I tried to encapsulate most of what Luc and I had discussed via teams on this subject.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changes to Worker Launch Process #485

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Changes to Worker Launch Process #485

bgunnar5 Jun 5, 2024 Maintainer

Problem Description

Current Setup

Example

Proposed Solution

Generating/Launching Worker-Launch Script

Generalizing the Worker Settings

(Optional) Determine Allocation Size of Workflow

Replies: 1 comment

bgunnar5 Jun 5, 2024 Maintainer Author

bgunnar5
Jun 5, 2024
Maintainer

bgunnar5
Jun 5, 2024
Maintainer Author