slurm_simulation_toolkit

By Guillaume Perrault-Archambault

Table of Contents

Disclaimer
Introduction
Requirements
Currently Supported Clusters
Install Instructions
Regression Setup Instructions
Regression Control File Syntax
Example: Launching a Regression
Example: Launching a Batch of Jobs
Description of each script
Features

Disclaimer

The toolkit has been tested only on Compute Canada and Beihang clusters, and almost exclusively on GPU nodes.

Please open an issue if you find a bug or notice that the toolkit does not behave as intended.

Introduction

This toolkit provides an automated command-line workflow for launching SLURM job regression tests (regression.sh), monitoring these regressions (regression_status.sh), relaunching failed or incomplete jobs (relaunch_failed.sh), and post-processing regression logs to summarize results (via a custom hook in regression_status.sh).

The term 'regression' is short for 'regression test', a software industry term for a suite of tests used to verify that code still behaves as intended (see Wikipedia's Regression Testing page).

Requirements

The scripts were originally designed and tested using bash 4.3.48 and SLURM 17.11.12. These and newer versions of bash and SLURM are supported.

Older versions of bash/SLURM will likely work, but are not officially supported.

Currently Supported Clusters

  • Graham
  • Cedar
  • Beluga
  • Niagara
  • Beihang Dell cluster (referred to as "Beihang" in the code)

The user can easily add support for a new cluster by modifying the script pointed to by SLURM_SIMULATION_TOOLKIT_GET_CLUSTER (see Regression Setup Instructions).
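As a rough illustration of what such a script might look like, the sketch below maps hostname patterns to cluster names. This is a hypothetical example, not the toolkit's actual script; the hostname patterns and the exact names printed are assumptions and should be adapted to your site.

```shell
#!/usr/bin/env bash
# Hypothetical cluster-detection sketch (NOT the toolkit's actual script):
# maps hostname patterns to cluster names and prints the result.
get_local_cluster() {
  local host="${1:-$(hostname -f 2>/dev/null || hostname)}"
  case "$host" in
    *graham*)  echo "graham"  ;;
    *cedar*)   echo "cedar"   ;;
    *beluga*)  echo "beluga"  ;;
    *niagara*) echo "niagara" ;;
    *)         echo "beihang" ;;  # fallback assumption; adjust for your cluster
  esac
}

get_local_cluster "$@"
```

The only contract the toolkit relies on is that the script prints the cluster name on stdout.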

Install Instructions

git clone https://github.com/gobbedy/slurm_simulation_toolkit <PATH_TO_TOOLKIT>

Regression Setup Instructions

Every time you open a new shell, set and export the following environment variables:

  • SLURM_SIMULATION_TOOLKIT_HOME should be set to <PATH_TO_TOOLKIT> (the path to the installed toolkit).
  • SLURM_SIMULATION_TOOLKIT_SOURCE_ROOT_DIR is the path to the root source directory. See the Regression Control File Syntax section for more details.
  • SLURM_SIMULATION_TOOLKIT_REGRESS_DIR is the base directory beneath which simulation output directories and regression summary directories will be autogenerated. The default value in examples/example_master.rc is likely correct for most Compute Canada and Beihang users.
  • SLURM_SIMULATION_TOOLKIT_JOB_RC_PATH is the path to an RC file which contains default SLURM job parameters. Since these parameters can be overridden from the command-line, creating your own RC is not required. In other words, the default value in examples/example_master.rc does not need to be modified.
  • SLURM_SIMULATION_TOOLKIT_GET_CLUSTER points to a script that outputs the name of the local cluster. The default script pointed to in example_master.rc should be correct for Beihang and Compute Canada users.
  • SLURM_SIMULATION_TOOLKIT_SBATCH_SCRIPT_PATH is the path to the .sbatch file passed to the sbatch command. This file wraps the user's base script. The default script pointed to in example_master.rc is intended to be correct for most users, but will likely not fit all usage models.
  • SLURM_SIMULATION_TOOLKIT_RESULTS_PROCESSING_FUNCTIONS is the path to a file containing two functions called by regression_status.sh for processing regression results. This can optionally be left unset, in which case the functions will not be called. An example can be found at examples/example_results_processing_functions.sh.

You may set these variables by sourcing a master rc file in your shell.

An example master rc file setting all the above variables can be found here: <PATH_TO_TOOLKIT>/examples/example_master.rc

You may copy this file to any desired location <PATH_TO_RC> and modify its contents as desired.

Then simply run: source <PATH_TO_RC>
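For orientation, a master rc file along the lines described above might look like the following sketch. All concrete paths and filenames below (example_job.rc, scripts/get_local_cluster.sh, simulation.sbatch) are illustrative assumptions, not the toolkit's actual file layout; consult examples/example_master.rc for the real defaults.

```shell
# Hypothetical master rc sketch; every path below is an illustrative
# placeholder and should be adjusted to your actual installation.
export SLURM_SIMULATION_TOOLKIT_HOME="$HOME/slurm_simulation_toolkit"
export SLURM_SIMULATION_TOOLKIT_SOURCE_ROOT_DIR="$HOME/src"
export SLURM_SIMULATION_TOOLKIT_REGRESS_DIR="$HOME/regress"
export SLURM_SIMULATION_TOOLKIT_JOB_RC_PATH="$SLURM_SIMULATION_TOOLKIT_HOME/examples/example_job.rc"
export SLURM_SIMULATION_TOOLKIT_GET_CLUSTER="$SLURM_SIMULATION_TOOLKIT_HOME/scripts/get_local_cluster.sh"
export SLURM_SIMULATION_TOOLKIT_SBATCH_SCRIPT_PATH="$SLURM_SIMULATION_TOOLKIT_HOME/scripts/simulation.sbatch"
# Optional; leave unset to skip custom results processing.
export SLURM_SIMULATION_TOOLKIT_RESULTS_PROCESSING_FUNCTIONS="$SLURM_SIMULATION_TOOLKIT_HOME/examples/example_results_processing_functions.sh"
```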

WARNING: please do NOT store large amounts of data in the parent directory of your base script (including in any of its subdirectories), since this directory will be copied to the output directory for snapshotting.

For the same reason, please do NOT set SLURM_SIMULATION_TOOLKIT_REGRESS_DIR to any path under one of your source directories, otherwise it will constantly get copied into your output directories.

Regression Control File Syntax

Any line whose first non-whitespace character is # is treated as a comment. Note that in-line comments are NOT supported (i.e., a # that is not the first non-whitespace character does not start a comment).

Any line whose first non-whitespace character is @ is treated as a loop variable line. The syntax for loop variable lines is described in the Loop Variable Syntax section.

Any line containing only whitespace characters is ignored.

All other lines are batch control lines. The syntax for these lines is described in the Batch Control Syntax section.
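The classification rules above can be sketched as a small bash function. This is an illustrative parser, not the toolkit's own code:

```shell
# Classify one control-file line per the rules above (illustrative sketch).
classify_line() {
  local line="$1"
  # Drop leading whitespace before inspecting the first character.
  local stripped="${line#"${line%%[![:space:]]*}"}"
  case "$stripped" in
    "")  echo "blank"   ;;  # whitespace-only lines are ignored
    \#*) echo "comment" ;;  # in-line comments are NOT supported
    @*)  echo "loop"    ;;  # loop variable line
    *)   echo "batch"   ;;  # batch control line
  esac
}
```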

Batch Control Syntax

A batch control line has the following syntax: <RELATIVE PATH TO BASE SCRIPT> <BATCH OPTIONS> -- <BASE SCRIPT OPTIONS>

<RELATIVE PATH TO BASE SCRIPT> is the path of the user's script relative to $SLURM_SIMULATION_TOOLKIT_SOURCE_ROOT_DIR. In other words, the absolute path to the user's script is $SLURM_SIMULATION_TOOLKIT_SOURCE_ROOT_DIR/<RELATIVE PATH TO BASE SCRIPT>

<BATCH OPTIONS> are the options passed to the simulation_batch.sh script, such as --num_simulations. Run simulation_batch.sh --help for a description of all options.

You may include most options supported by simulation_batch.sh in <BATCH OPTIONS>, with the notable exceptions of:

  • --regress_dir: this is handled automatically by the regression.sh script.
  • --max_jobs_in_parallel: when launching a regression via regression.sh, this must be passed to regression.sh itself (rather than to simulation_batch.sh).
  • --preserve_order: this must also be passed to regression.sh.

Other exceptions are --hold and --singleton, which strictly speaking can be provided, but are unlikely to be useful in the context of launching a regression via regression.sh.

<BASE SCRIPT OPTIONS> are options passed down to the user's script.
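The three-part structure of a batch control line can be illustrated by splitting at the first " -- " separator. This is a sketch of the line format, not the toolkit's actual parsing code:

```shell
# Split a batch control line into its three parts (illustrative sketch):
#   <RELATIVE PATH TO BASE SCRIPT> <BATCH OPTIONS> -- <BASE SCRIPT OPTIONS>
split_batch_line() {
  local line="$1"
  local left="${line%% -- *}"      # everything before the first " -- "
  local base_opts="${line#* -- }"  # <BASE SCRIPT OPTIONS>
  local script="${left%% *}"       # first token: relative path to base script
  local batch_opts="${left#* }"    # remaining tokens: <BATCH OPTIONS>
  printf '%s\n%s\n%s\n' "$script" "$batch_opts" "$base_opts"
}
```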

Loop Variable Syntax

A loop variable line has the following syntax @<VAR1>[<VALUES1>],<VAR2>[<VALUES2>],<VAR3>[<VALUES3>],...,<VARN>[<VALUESN>]

where <VAR1>, <VAR2>, <VAR3>,...,<VARN> are loop variable names.

<VALUES1>, <VALUES2>, <VALUES3>,...,<VALUESN> are arrays of values. All <VALUES> arrays on the same line must have the same size.

<VALUES> arrays can be specified in two different ways:

  1. <start>:<increment>:<stop> where <start> is the first value, <increment> is the increment value, and <stop> is the last value. Note that if <start> is higher than <stop>, <increment> must be explicitly specified as negative.
  2. <val1>,<val2>,...,<valM> where <val1>,<val2>,...,<valM> are unique values, with no ordering constraints.

All variables on the same line are looped simultaneously.

If multiple loop variable lines are specified before a batch control line, variables on earlier lines are nested within those on later lines (i.e., variables on earlier lines form the inner loops).
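The two <VALUES> forms above can be sketched as a small expansion function. This is illustrative only (the toolkit's actual expansion code may differ), and the range form assumes GNU seq is available:

```shell
# Expand a <VALUES> spec into a space-separated list (illustrative sketch).
expand_values() {
  case "$1" in
    *:*:*)
      # <start>:<increment>:<stop> form; relies on GNU seq.
      local start inc stop
      IFS=: read -r start inc stop <<< "$1"
      # Unquoted substitution intentionally joins seq's lines with spaces.
      echo $(seq "$start" "$inc" "$stop")
      ;;
    *)
      # <val1>,<val2>,...,<valM> form: replace commas with spaces.
      echo "${1//,/ }"
      ;;
  esac
}
```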

Example Regression Control File

File: my_example.ctrl

## EXAMPLE CONTROL FILE, FILENAME: my_example.ctrl

# simple batch (no loop variables)
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.3 1.3

# batch loop using <start>:<increment>:<stop>
@alpha[1.3:0.2:1.9]
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters alpha alpha

# equivalent using <val1>,<val2>,...,<valM>
@alpha[1.3,1.5,1.7,1.9]
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters alpha alpha

# equivalent unrolled loop
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.3 1.3
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.5 1.5
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.7 1.7
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.9 1.9

# example of simultaneous loop
@alpha[1.3:0.2:1.9],beta[0.5,1.5,5.0,25.0]
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters alpha beta

# equivalent unrolled loop
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.3 0.5
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.5 1.5
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.7 5.0
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.9 25.0

# example of nested loop
@beta[0.5,1.5,5.0]
@alpha[1.3,1.5]
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters alpha beta

# equivalent unrolled loop
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.3 0.5
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.3 1.5
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.3 5.0
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.5 0.5
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.5 1.5
mixup_fun/train.py --num_simulations 6 -- --dat_transform --dat_parameters 1.5 5.0

The above control file assumes that the user's base script is located here: $SLURM_SIMULATION_TOOLKIT_SOURCE_ROOT_DIR/mixup_fun/train.py

The user's script, in this example, accepts two arguments: --dat_transform, and --dat_parameters <val1> <val2>

Each setting is run 6 times. Assuming --num_proc_per_gpu 2 (see Example: Launching a Regression), this would result in 3 SLURM jobs for each batch.
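The job-count arithmetic above is a ceiling division: simulations are packed num_proc_per_gpu at a time, so (assuming one GPU per job, which is an assumption of this sketch) the number of SLURM jobs per batch is ceil(num_simulations / num_proc_per_gpu):

```shell
# Jobs per batch = ceil(num_simulations / num_proc_per_gpu),
# assuming one GPU per SLURM job (illustrative sketch).
jobs_for_batch() {
  local num_simulations=$1 num_proc_per_gpu=$2
  echo $(( (num_simulations + num_proc_per_gpu - 1) / num_proc_per_gpu ))
}
```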

Note that the above control file is heavily redundant: it launches identical batches purely to illustrate the different ways of specifying them.

Example: Launching a Regression

regression.sh --max_jobs_in_parallel 8 --num_proc_per_gpu 2 --regresn_ctrl my_example.ctrl

This example launches the batches specified in the my_example.ctrl control file.

In this example, the number of running jobs at any given time is capped at 8, and the number of processes per GPU is set to 2. As a result, at most 16 simulations are run in parallel.

The --max_jobs_in_parallel option can affect the time a job waits in the PENDING state on some systems (run regression.sh --help for details).

Sample output:

RUNNING:
regression.sh --max_jobs_in_parallel 8 --preserve_order --regresn_ctrl cifar_ce_sweep_best.ctrl

LAUNCHING REGRESSION:
Pre-processing control file...
Pre-processing took 0 seconds
Launching jobs...
Launching took 9 seconds
Releasing jobs...
Releasing took 0 seconds

SUMMARY FILES:
BATCH SCRIPT OUTPUT LOGFILE: /home/LAB/some_user/regress/Jul22_173044/regression_summary/batch_outputs.log
BATCH COMMAND MANIFEST: /home/LAB/some_user/regress/Jul22_173044/regression_summary/batch_command_manifest.txt
REGRESSION CANCELLATION SCRIPT: /home/LAB/some_user/regress/Jul22_173044/regression_summary/cancel_regression.sh
REGRESSION COMMAND FILE: /home/LAB/some_user/regress/Jul22_173044/regression_summary/regression_command.txt
REGRESSION CONTROL FILE (COPY): /home/LAB/some_user/regress/Jul22_173044/regression_summary/cifar_ce_sweep_best.ctrl
SIMULATION MANIFESTS: /home/LAB/some_user/regress/Jul22_173044/regression_summary/simulations_manifests.txt
HASH REFERENCES TO BATCH RUNS: /home/LAB/some_user/regress/Jul22_173044/regression_summary/hash_manifest.txt

ABOVE SUMMARY: /home/LAB/some_user/regress/Jul22_173044/regression_summary/summary.log

Run regression.sh --help for usage of the regression.sh script.

Example: Launching a Batch of Jobs

simulation_batch.sh --base_script /home/LAB/some_user/mixup_fun/train.py --regress_dir /home/LAB/some_user/regress/batch_test --job_name batch_test --num_simulations 12 --num_proc_per_gpu 2 --max_jobs_in_parallel 8 -- --batch_size 128 --epoch 200

The above assumes that the user's base script (located at /home/LAB/some_user/mixup_fun/train.py) accepts an --epoch <NUM_EPOCHS> option and a --batch_size <BATCH_SIZE> option.

Note that toolkit parameters (here --base_script, --regress_dir, --job_name, --num_simulations, --num_proc_per_gpu, and --max_jobs_in_parallel) are separated from base script parameters (here --batch_size and --epoch) with --.

Sample output:

JOB IDs FILE IN: /home/LAB/some_user/regress/batch_test/batch_test_4f66b261/batch_summary/job_manifest.txt
SIMULATION SCRIPT OUTPUT LOGFILE: /home/LAB/some_user/regress/batch_test/batch_test_4f66b261/batch_summary/simulation_output.log
BATCH CANCELLATION SCRIPT: /home/LAB/some_user/regress/batch_test/batch_test_4f66b261/batch_summary/cancel_batch.sh
BATCH COMMAND FILE: /home/LAB/some_user/regress/batch_test/batch_test_4f66b261/batch_summary/batch_command.txt
SLURM LOGFILES MANIFEST: /home/LAB/some_user/regress/batch_test/batch_test_4f66b261/batch_summary/slurm_log_manifest.txt
SIMULATION LOGS MANIFEST: /home/LAB/some_user/regress/batch_test/batch_test_4f66b261/batch_summary/log_manifest.txt
HASH REFERENCE FILE: /home/LAB/some_user/regress/batch_test/batch_test_4f66b261/batch_summary/hash_reference.txt
HASH REFERENCE: beihang@9c0ad17d

Run simulation_batch.sh --help for more details on usage.

Description of each script

slurm.sh

Wraps the SLURM sbatch command. Also supports srun and salloc in theory, but only sbatch is thoroughly tested.

Handles low-level SLURM switches and parameters that do not need to be exposed to the user.

Run slurm.sh --help for usage.

simulation.sh

Wraps slurm.sh. Generates the simulation output directory and copies the source code into it. The simulation is run from within the output directory.

Run simulation.sh --help for usage.

simulation_batch.sh

Wraps simulation.sh. Handles launching multiple simulations in parallel. Will generate a regression summary directory containing job ID manifest, logfile manifest, slurm logfile manifest, batch cancellation script, batch command, hash reference file, hash reference, and a file containing the output of all the calls to simulation.sh.

Run simulation_batch.sh --help for usage.

regression.sh

Wraps simulation_batch.sh. Handles launching batches in parallel, each with potentially different base scripts. Will generate a regression summary directory containing a listing of batch log manifests, regression cancellation script, regression command, batch command manifest, a copy of the regression control file, and a file containing the output of all the calls to simulation_batch.sh.

Run regression.sh --help for usage.

relaunch_failed.sh

Handles relaunching any failed simulations in a completed (or still running) batch or regression. Also handles relaunching from a checkpoint if the failure was caused by reaching a SLURM time limit, and provides the same seed as the failing simulation. Note that checkpoint relaunch and seeding require a compatible user script; otherwise they are ignored.

The original failed simulations will not be deleted, but they will no longer be tracked via regression_status.sh. The user should locate (via regression_status.sh or batch_status.sh) and analyze failed simulations before using this script.

Run relaunch_failed.sh --help for usage.

batch_status.sh

Uses the summary results generated by simulation_batch.sh to determine whether the batch has passed, failed, or is still running. Breaks down by jobs that are pending, running, successful, and failed.

For each job that has succeeded, will call the user's custom process_logfile function (if it exists).

If the batch has completed (all jobs completed, aka no jobs pending or running), will call the user's custom generate_summary function.

Sample output:

$ batch_status.sh -f /home/LAB/some_user/regress/Jul08_134202/dat_5c69aa5e/batch_summary/log_manifest.txt
Batch Summary Directory:
/home/LAB/some_user/regress/Jul08_134202/dat_5c69aa5e/batch_summary

REGRESSION FAILED: 2 simulations have errors.
Pending: 0
Running: 0
Successful: 98
Failed: 2

Failed jobs slurm logs: /home/LAB/some_user/regress/Jul08_134202/dat_5c69aa5e/batch_summary/error_manifest_slurm.txt
Failed simulation logs: /home/LAB/some_user/regress/Jul08_134202/dat_5c69aa5e/batch_summary/error_manifest.txt

Run batch_status.sh --help for usage.

regression_status.sh

Wraps batch_status.sh. Uses the summary results generated by regression.sh to break down batches into pending, running, passed, failed, and 'result failed', where 'result failed' signifies that the user's post-processing has failed. Also breaks down individual simulations into the same categories.

Sample output:

$ regression_status.sh -f /home/LAB/some_user/regress/Jul18_174725/regression_summary/simulations_manifests.txt
Processing arguments and preparing files...
Processing Regression...
REGRESSION SUMMARY DIR: /home/LAB/some_user/regress/Jul18_174725/regression_summary
----------------------------------------------------------------------------------------------
BATCHES:
PENDING: 5
RUNNING: 0
PASSED: 2
FAILED: 4
RESULT FAILED: 0
MANIFESTS OF PASSED BATCHES: /home/LAB/some_user/regress/Jul18_174725/regression_summary/passed_manifest_list.txt
RESULTS OF PASSED/FAILED BATCHES: /home/LAB/some_user/regress/Jul18_174725/regression_summary/results.txt
MANIFESTS OF FAILED BATCHES: /home/LAB/some_user/regress/Jul18_174725/regression_summary/failed_manifest_list.txt
----------------------------------------------------------------------------------------------
SIMULATIONS:
PENDING: 536
RUNNING: 16
PASSED: 506
FAILED: 42
RESULT FAILED (double counts passed/failed): 
MANIFEST OF RUNNING SIMS: /home/LAB/some_user/regress/Jul18_174725/regression_summary/running_sim_manifest.txt
MANIFEST OF PASSED SIMS: /home/LAB/some_user/regress/Jul18_174725/regression_summary/passing_sim_manifest.txt
MANIFEST OF FAILED SIMS: /home/LAB/some_user/regress/Jul18_174725/regression_summary/failing_sim_manifest.txt
----------------------------------------------------------------------------------------------

Run regression_status.sh --help for usage.

Features

  • Parallel job launching: regression.sh can launch hundreds of jobs in parallel within seconds.
  • Regression relaunch: relaunch_failed.sh can be used to relaunch incomplete or interrupted regressions. Failed simulations will be restarted, and interrupted simulations will pick up where they left off. Successful simulations are kept and not relaunched.
  • Sandboxed simulations: each simulation runs in a separate autogenerated directory.
    • Sandboxed simulations do not interfere with each other (e.g., scripts may safely write to files with the same name in their respective output directories).
    • Snapshotting source code: The user's source code directory is copied to the autogenerated output, and it is this copied version which is executed. This flow ensures that users can continue editing their source code without affecting pending and running jobs.
    • Reproducible simulations: Snapshotting further ensures that simulations are fully reproducible, since all source code is snapshotted at time of the regression launch. The regression command, slurm commands and simulation output are all logged, allowing the user to retrieve any arguments and parameters used in a given simulation.
  • Regression monitoring: regression_status.sh automatically reports the status of a regression (running, completed, failed) with a breakdown of each job.
  • Results processing: the user can add functions called by regression_status.sh to process their regression results.
  • Argument cascading: arguments following -- are cascaded down to the user's base script, ensuring that the user does not need to modify the toolkit itself to pass down arguments.
  • Automatic generation of a regression cancellation script: this autogenerated script kills the appropriate SLURM jobs if and when the user decides to cancel their regression. This saves the user time tracking down running jobs to cancel, and helps free compute resources for other users.
  • Option to enforce a maximum number of jobs in parallel for the current user. This is useful for SLURM systems that don't use a fairshare system (e.g., during the beta testing phase of a new cluster).
  • Option to run multiple simulations per GPU: this helps maximize use of compute resources when GPU memory exceeds the model's needs (e.g., I found that ResNet with a batch size of 128 uses fewer GPU hours when running 2 processes per GPU on a 32GB GPU).
  • Configurability: users can override default job parameters by supplying their own default job parameters, and use their own get_local_cluster.sh. See SLURM_SIMULATION_TOOLKIT_JOB_RC_PATH and SLURM_SIMULATION_TOOLKIT_GET_CLUSTER in the setup instructions.
