This tool is loosely inspired by the concepts used in Ansible. It is written in Python 3 and uses the paramiko library for SSH commands.
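For illustration, here is a minimal paramiko sketch of how a single command can be run over SSH. This is a simplified example, not the tool's internal code; the host and username are placeholders.

```python
import paramiko

# Connect with key-based authentication (no passphrase), as the tool expects.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("192.168.123.1", username="username")

# Run one command and collect its output, as a cmd task would.
stdin, stdout, stderr = client.exec_command("echo hello $HOSTNAME")
print(stdout.read().decode(), stderr.read().decode())
client.close()
```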
It relies on 3 types of YAML files:
- hosts files, describing the hosts to connect to
- configuration files, describing the configuration for each job type
- playbooks, describing a list of tasks to run on specific hosts
Hosts are remote servers the tool connects to over SSH to run tasks. Each host must be accessible through SSH without a passphrase.
```yaml
raspi:
  user: username
  host: 192.168.123.1
  cpus_per_node: 8
  platform: linux
  gpu: False
```
To account for the complexity of Slurm clusters, you can add a path to save logs to, and the `init` field allows you to load modules and activate Python environments.
```yaml
calmip:
  user: username
  host: olympe.calmip.univ-toulouse.fr
  cpus_per_node: 36
  platform: slurm
  wandb: offline
  gpu: True
  save_path: "/tmpdir/username/slurm_logs"
  init: |
    export OMP_NUM_THREADS=1
    module purge
    module load intel/18.2
    module load intelmpi/18.2
    module load python/3.6.8
    source activate /tmpdir/username/envs/torchenv
    cd
```
`localhost` can be defined as a host to run tasks locally; commands are then run directly in the local shell, without SSH.
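As a rough sketch (not the tool's actual code), local execution can be handled with `subprocess` while remote hosts go through SSH:

```python
import subprocess

def run_local(cmd):
    # Run the command in the local shell, mirroring what an SSH call
    # would do on a remote host.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout, result.stderr

out, err = run_local("echo hello $HOSTNAME")
print(out, err)
```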
Each configuration can describe how many CPUs to allocate for the job depending on the platform, how long the job can run, which partition to use, and whether a GPU is required. Any parameter left unspecified falls back to its default value.
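That fallback can be pictured as merging the explicit values over a set of defaults; the field names and default values below are illustrative assumptions, not the tool's actual defaults.

```python
# Hypothetical defaults, for illustration only.
DEFAULTS = {"cpus": 1, "time": "01:00:00", "partition": None, "gpu": False}

def with_defaults(config):
    merged = dict(DEFAULTS)
    merged.update(config)  # explicit values override the defaults
    return merged

print(with_defaults({"cpus": 36, "gpu": True}))
# {'cpus': 36, 'time': '01:00:00', 'partition': None, 'gpu': True}
```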
Variables can be set in a playbook; at run time, every occurrence of the key string is replaced by its value. This makes it easy to run a whole playbook on a specific set of hosts without editing each task, or to reuse a parameter across multiple tasks.
```yaml
variables:
  "£hosts":
    - localhost
    - cluster

tasks:
  - name: "Run echo hostname"
    hosts: "£hosts"
    cmd:
      - "echo hello $HOSTNAME"
```
A playbook is a set of tasks to run on specific hosts. It starts with an `inventory` field specifying the paths of the hosts and configuration files; all tasks are then run sequentially, in the order described.
A `cmd` task simply runs its commands on the remote server through SSH.
- name: "Run echo hostname"
hosts:
- cluster
cmd:
- "echo hello $HOSTNAME"
- "pwd"
The `pip` field is a list of packages to install on the hosts with pip.
- name: "Test pip"
hosts:
- localhost
pip:
action: install
packages:
- numpy
- pandas
The `git` field allows you to push to or pull from a directory on the remote host.
Its `action` subfield can be set to `"push"` or `"pull"`; the default is `"pull"`.
If GitHub is not reachable, the code retries the connection up to 25 times with a 5s delay. This was implemented to work around proxy settings that were annoying the author and about which the IT service never answered, forcing a more robust solution.
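The retry loop can be pictured as below; the connectivity check over SSH port 22 is one plausible way to implement it, not necessarily what the tool does.

```python
import socket
import time

def wait_for_github(attempts=25, delay=5):
    # Retry until github.com accepts a TCP connection on the SSH port.
    for _ in range(attempts):
        try:
            socket.create_connection(("github.com", 22), timeout=5).close()
            return
        except OSError:
            time.sleep(delay)
    raise ConnectionError(f"GitHub unreachable after {attempts} attempts")
```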
- name: "Git pull"
hosts:
- localhost
git:
action: pull
path: "~/Documents/Doctorat/Dev/BERL"
It can also be used to clone a repository.
- name: "Git clone"
hosts:
- localhost
git:
action: clone
path: "~/Documents/Doctorat/Dev/test_compute"
repo: git@github.com:TemplierPaul/BERL.git
install: False
The `slurm` field allows you to run a job on a specific partition on the remote host with Slurm, and to do hyperparameter sweeps easily.
It can take the following parameters:
- cmd: the command to run on the remote host
- config: the configuration file to use
- seeds: the list of seeds to use; each job runs one experiment per seed, sequentially
- sweep: a dictionary {parameter: list of values} to sweep over (see the expansion sketch after the example below)
- args: a list of arguments to pass to the command; one job is started for each argument in the list
- scatter: if True, scatters the configurations to run between the hosts; otherwise runs all configs on each host
All generated Slurm files are saved to the local `archive` directory and named after their Slurm job ID, and commands are also saved in the `job.csv` file for archiving.
- name: "Test slurm"
hosts:
- cluster
slurm:
job_name: "Testing"
scatter: True
parallel: True
cmd: "python myscript.py"
config: test
seeds: 0, 1, 2, 3
sweep:
"--preset": ["canonical", "openai"]
"--n_evaluations": [1, 5, 8]
args:
- "run_1"
- "run_2"
- "run_3"
The `tmux` field allows you to run a job through tmux on a host that does not use Slurm. Setting the `parallel` field to True runs multiple jobs in parallel; otherwise jobs run one at a time.
As with the `slurm` field, it can take multiple `args`.
The `cd` field can be used to start the tmux instance from a specific directory.
- name: "Test tmux"
hosts:
- cluster
tmux:
job_name: "Testing"
scatter: True
parallel: True
cd: "~/path/to/cd"
cmd: "python myscript.py"
seeds: 0
args:
- "run_1"
- "run_2"
- "run_3"
The `copy` field allows you to copy a file or directory between the remote host and localhost with scp.
To copy from localhost to the remote host, set the `mode` field to `"push"`; set it to `"pull"` to fetch the file from the remote host.
- name: "Push file to a remote machine"
hosts:
- cluster
copy:
mode: "push"
src: "/path/to/local/file/filename.txt" # File
dest: "/path/to/save/to/on/cluster" # Directory
- name: "Copy a file from a remote machine"
hosts:
- cluster
copy:
mode: "pull"
src: "/path/to/remote/file/filename.txt" # File
dest: "/path/to/save/to/locally" # Directory
To set up this tool:
- Clone the repository with `git clone`
- Install the required libraries (namely `paramiko`) with pip
- Set up a `hosts.yaml` file with the hosts you want to connect to
- Make sure all hosts are accessible through SSH without a passphrase
- Set up a `configs.yaml` file with the configurations you can use
- Create your playbooks in the `playbooks` directory
To start a playbook, just run:

```bash
python3 compute.py playbooks/<playbook-name>.yaml
```
To make the command global and simply type `run <playbook-name>` from anywhere, add this to your `.bashrc`:

```bash
run () {
    cd <path-to-Compute>
    python compute.py playbooks/$1.yaml
    cd -
}
```
To add features, you can create new classes in Python:

```python
@register("cmd")
class CMD(Task):
    def __init__(self, name, cfg, servers):
        super().__init__(name, cfg, servers)

    def run(self):
        # Run the parsed task: accept a single command or a list of commands.
        cmd = self.cfg
        if isinstance(cmd, str):
            cmd = [cmd]
        # Run each command on each target server and print its output.
        for server in self.servers:
            for c in cmd:
                print(f"> {c}")
                out, err = server.run(c)
                print(out, err)
```
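The `@register` decorator itself is not shown in the snippet above. A minimal sketch of how such a registry might work, assuming a global dictionary mapping task field names to classes (the actual implementation may differ):

```python
TASKS = {}

def register(name):
    # Map a playbook field name (e.g. "cmd") to the class handling it.
    def wrapper(cls):
        TASKS[name] = cls
        return cls
    return wrapper
```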