Quackpack, scheduling utils

Installing

Install the repo from source

git clone git@github.com:puigde/quackpack.git
pip install .

Using

Basic Scheduling:

The main function in quackpack is schedule_cmd_gpus, its main arguments are:

def schedule_cmd_gpus
    command_list: List[GpuJobInfo], # List of commands to schedule
    timeout_s: int = 60 * 60 * 24 * 7, # Timeout for successful scheduling of commands
    sleep_s: int = 3, # Waiting time between successful scheduling of commands

GpuJobInfo contains a string with the job cmd and an integer representing the gpu memory capacity of a job.

GpuJobInfo = namedtuple("GpuJobInfo", ["cmd", "total_mbs_gpus"])

Quackpack will try to fit the job memory in homogeneous chunks in the minimum amount of gpus possible and to the ones that are more free.

Example: scheduling ten experiments with 10k mbs each and a waiting time of 30 seconds between schedules with the default timeout (1w)

import quackpack as qp
cmd_list = ["python experiment_{i}.py" for i in range(1,11)]
cmd_list = [qp.GpuJobInfo(it_cmd, 10*1000) for it_cmd in command_list]
qp.schedule_cmd_gpus(cmd_list, sleep_s=30)

Other arguments:

def schedule_cmd_gpus
    ...
    command_apply_functions: List[Callable] = None, # list of functions that get cmd string
    # and an environment dict and return a modified cmd string for launchtime
    debug_on_crash: bool = False, # will open a python debugger if crashed

We can dynamically adapt our commands with command_apply_functions, for instance, when quackpack schedules into multi-gpu we might want to update the number of processes per node in our torch.distributed.launch command accordingly. This is done by looking at the CUDA_VISIBLE_DEVICES env var which quackpack sets when scheduling.

def fresh_n_proc_per_node(cmd: str, env):
    return insert_string(
        cmd,
        "torch.distributed.launch",
        f"--nproc_per_node={len(env['CUDA_VISIBLE_DEVICES'].split(','))}",
    )

where insert string is a simple utility function that puts a string after another in the command

def insert_string(total_str, previous_str, insert_str):
    total_str = total_str.split()
    total_str.insert(total_str.index(previous_str) + 1, insert_str)
    return " ".join(total_str)

Both functions are found in quackpack and can be called directly with qp.<function> other examples:

fresh_port_mod_fn: adds '--master_port={port}' where port is a free port after 'torch.distributed.launch'

Example: scheduling a torch.distributed.launch with 100Gbs budget with dynamic number of processes and nodes.

import quackpack as qp
cmd = "python -m torch.distributed.launch experiment_1.py --some_arg"
qp.schedule_cmd_gpus(
    command_list=[qp.GpuJobInfo(cmd, 100*1000)],
    command_apply_functions=[qp.fresh_port_mod_fn, qp.fresh_n_proc_per_node],
)

Development

Install dev dependencies

pip install pre-commit black

Install pre-commit hooks

pre-commit install

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
quackpack		quackpack
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
quackpack.jpg		quackpack.jpg
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Quackpack, scheduling utils

Installing

Using

Basic Scheduling:

Development

About

Languages

License

puigde/quackpack

Folders and files

Latest commit

History

Repository files navigation

Quackpack, scheduling utils

Installing

Using

Basic Scheduling:

Development

About

Topics

Resources

License

Stars

Watchers

Forks

Languages