Repo to demonstrate how to launch parallelized array job in Slurm

lionelvoirol/demo_array_job_slurm

Launching an array job on a Slurm HPC cluster

Consider a simulation study that repeats, n_simu times, a simulation involving a stochastic process and saves a vector of results theta for each repetition. Such a study can be parallelized efficiently using array jobs on a Slurm cluster. The Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management, or simply Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.

A note on High Performance Computing and parallelism

An introduction to High Performance Computing (HPC) with a Baobab Hello World, as well as an introduction to parallel computing on Baobab, can be found on the Data Analytics Lab's blog page. Also find the various resources:

Notes

  • This demo assumes that the user wants to parallelize the execution of a simulation study written in R.
  • This demo assumes that the user is working on one of the University of Geneva's clusters, Baobab or Yggdrasil.
  • This demo assumes that the user has a University of Geneva email address.
  • All commands are assumed to be run on a Linux command line with Slurm installed.

Creating file tree

Locate your $HOME directory with:

echo $HOME

Create the following file tree in the $HOME directory.

├── my_simu
│   ├── data_temp
│   ├── report
│   ├── outfile

bash commands

mkdir my_simu
cd my_simu
mkdir data_temp
mkdir report
mkdir outfile
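
The same tree can also be created in a single step from any directory. The following is a minimal sketch: SIMU_ROOT is a hypothetical variable introduced here for illustration (it is not part of the original demo) and defaults to $HOME, where the rest of this demo expects the tree.

```shell
#!/bin/bash
# SIMU_ROOT is a hypothetical variable, not part of the original demo;
# it defaults to $HOME, where the demo expects the file tree.
SIMU_ROOT="${SIMU_ROOT:-$HOME}"

# -p creates missing parent directories and does not fail
# if a directory already exists
mkdir -p "$SIMU_ROOT/my_simu/data_temp" \
         "$SIMU_ROOT/my_simu/report" \
         "$SIMU_ROOT/my_simu/outfile"

# list the directories that now exist under my_simu
ls -d "$SIMU_ROOT"/my_simu/*/
```

Because of -p, the script can be rerun safely before each new simulation study.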

The simulation

We consider a simulation study that generates n_simu samples of size $n$ = sample_size, with $X_i \sim \mathcal{N}(0,1)$ for $i = 1, \ldots, n$, and computes for each sample the maximum likelihood estimator of the mean $$\hat{\mu} = \bar{x}=\frac{1}{n} \sum_{i=1}^n X_i$$

and the sample standard deviation (using Bessel's bias correction)

$$\hat{\sigma}=\sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i-\bar{x})^2}$$

Organising simulation by array

Suppose you want to run n_simu simulations spread over n_array array tasks. To make the results reproducible, each simulation sets its own seed with set.seed(i_seed). For a study of n_simu simulations, you can generate n_simu seeds and store them in a matrix with n_array rows, so that each row holds the seeds handled by one array task. You create the matrix of simulation seeds with:

# define number of simulations and arrays
n_simu <- 100000
n_array <- 100

# create matrix of indices (row k holds the seeds of array task k)
ind_mat <- matrix(1:n_simu, nrow = n_array, byrow = TRUE)
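
Because the matrix is filled by row, row k of ind_mat holds the consecutive seeds from (k - 1) * n_simu / n_array + 1 to k * n_simu / n_array. The shell sketch below reproduces that arithmetic for a single task, which can help when checking which seeds a given array task covered. The names task_id, per_task, first_seed and last_seed are illustrative and do not appear in the original scripts, and the default task id of 3 is only an assumption for running the sketch outside Slurm.

```shell
#!/bin/bash
# same values as in the R code
n_simu=100000
n_array=100

# inside a Slurm array job SLURM_ARRAY_TASK_ID is set by the scheduler;
# the fallback value 3 is a hypothetical default for trying this locally
task_id=${SLURM_ARRAY_TASK_ID:-3}

# number of simulations (columns of ind_mat) handled by each task
per_task=$(( n_simu / n_array ))

# first and last seed stored in row task_id of ind_mat
first_seed=$(( (task_id - 1) * per_task + 1 ))
last_seed=$(( task_id * per_task ))

echo "task ${task_id} handles seeds ${first_seed} to ${last_seed}"
```

For task 3, this gives seeds 2001 to 3000, which matches ind_mat[3, ] in R.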

.R file

Save this file in your $HOME directory as my_simu.R

# define number of simulations and arrays
n_simu <- 100000
n_array <- 100

# create matrix of indices (row k holds the seeds of array task k)
ind_mat <- matrix(1:n_simu, nrow = n_array, byrow = TRUE)

# get slurm array id and convert to numeric
id_slurm <- Sys.getenv("SLURM_ARRAY_TASK_ID")
id_slurm <- as.numeric(id_slurm)

# define id of simu to be run on array
id_simu <- ind_mat[id_slurm, ]

# meta parameters
sample_size = 100000
verbose = TRUE

# define matrix of results (one row per simulation: mean, standard deviation)
mat_results = matrix(NA, ncol = 2, nrow = length(id_simu))

for(simu_index in seq_along(id_simu)){
  # get seed 
  i_seed = id_simu[simu_index]
  # set seed
  set.seed(i_seed)
  # generate data (named x to avoid masking base::sample)
  x = rnorm(n = sample_size)
  # compute theta
  theta = c(mean(x),
            sd(x))
  # save in matrix
  mat_results[simu_index, ] = theta
  # print status every second simulation if verbose
  if(verbose && simu_index %% 2 == 0){
    print(simu_index)
  }
}

# define file name (relative to $HOME, from where the job is submitted)
name_file <- paste0("my_simu/data_temp/", "mat_results", "_id_", id_slurm, ".rda")

# print file name
print(name_file)

# save
save(mat_results, file = name_file)

.sh file

We define the file that will launch my_simu.R. Save this file in your $HOME directory as launch_my_simu.sh

#!/bin/bash
#SBATCH --job-name=my_simu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --partition=public-cpu,public-bigmem,public-longrun-cpu,shared-cpu,shared-bigmem
#SBATCH --mail-user=lionel.voirol@unige.ch
#SBATCH --mail-type=ALL
#SBATCH --output my_simu/outfile/outfile_%a.out
module load GCC/10.2.0 OpenMPI/4.0.5 R/4.0.4
INFILE=my_simu.R
OUTFILE=my_simu/report/report_my_simu_${SLURM_ARRAY_TASK_ID}.Rout
srun R CMD BATCH $INFILE $OUTFILE

The recombination and cleaning script

The recombination and cleaning script performs the following tasks:

  • recombine the results from all array tasks and save the complete matrix of results under my_simu/mat_result_simulation_date_time.rda
  • check whether the resulting .rda file is missing for some array tasks and store a log under my_simu/failed_array_date_time.txt
  • clean the my_simu/data_temp directory
  • clean the my_simu/outfile directory
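
The missing-file check can also be run by hand from the command line before recombining, for instance to decide whether some array tasks should be resubmitted. The following is a minimal sketch, assuming the file naming used in my_simu.R (mat_results_id_<task id>.rda) and 100 array tasks; the function name list_missing_tasks is hypothetical.

```shell
#!/bin/bash
# Print the id of every array task whose result file is absent.
# $1: directory holding the per-task .rda files
# $2: number of array tasks that were submitted
# (list_missing_tasks is an illustrative name, not part of the original demo)
list_missing_tasks() {
  dir=$1
  n=$2
  i=1
  while [ "$i" -le "$n" ]; do
    # my_simu.R saves its results as mat_results_id_<task id>.rda
    [ -f "${dir}/mat_results_id_${i}.rda" ] || printf '%s\n' "$i"
    i=$(( i + 1 ))
  done
}

# run from $HOME; prints nothing when all 100 files are present
list_missing_tasks my_simu/data_temp 100
```

Run this before the recombination script, since that script empties my_simu/data_temp once it has finished.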

.R file

Save this file in your $HOME directory as recombine_and_clean_folders_my_simu.R

# recombine all array jobs
all_files = list.files(path = "my_simu/data_temp")
mat_result_simulation = matrix(ncol=2)
for(file_i in all_files){
  file_name = paste0("my_simu/data_temp/",file_i)
  load(file_name)
  mat_result_simulation = rbind(mat_result_simulation, mat_results)
}
mat_result_simulation = mat_result_simulation[-1,]

# print dimension of matrix of results
dim(mat_result_simulation)

# save matrix of results
time = Sys.time()
time_2 = gsub(" ", "_", time)
time_3 = gsub(":", "-", time_2)
file_name_to_save = paste0(paste("my_simu/mat_result_simulation", time_3, sep="_"),
                           ".rda")
print(file_name_to_save)
save(mat_result_simulation, file=file_name_to_save)

# define function to check which files were not computed and save
check_which_file_computed = function(directory, range, file_name, extension = ".rda"){
  all_present_file = list.files(directory)
  all_suposed_file = paste0(file_name, "_",range, extension)
  not_found_file = all_suposed_file[which(!all_suposed_file %in% all_present_file)]
  time = Sys.time()
  time_2 = gsub(" ", "_", time)
  time_3 = gsub(":", "-", time_2)
  file_name = paste0(paste("my_simu/failed_array", time_3, sep="_"),
                     ".txt")
  write.table(x = not_found_file, file = file_name, sep="\t")
}

# check which files were not computed and save
check_which_file_computed(directory="my_simu/data_temp", 
                          range=1:100, file_name = "mat_results_id")

# delete all rda files and all outfiles
unlink("my_simu/data_temp/*", recursive = T, force = T)
unlink("my_simu/outfile/*", recursive = T, force = T)

.sh file

Save this file in your $HOME directory as launch_recombine_my_simu.sh

#!/bin/bash
#SBATCH --job-name=recombine_my_simu
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --partition=public-cpu,public-bigmem,public-longrun-cpu,shared-cpu,shared-bigmem
#SBATCH --mail-user=lionel.voirol@unige.ch
#SBATCH --mail-type=ALL
#SBATCH --output my_simu/outfile/outfile_recombine.out
module load GCC/9.3.0 OpenMPI/4.0.3 R/4.0.0
INFILE=recombine_and_clean_folders_my_simu.R
OUTFILE=my_simu/report/recombine_and_clean_folders.Rout
srun R CMD BATCH $INFILE $OUTFILE

Launching the simulation and recombination and cleaning script

Make sure to have the following file tree before launching the simulation:

my_simu.R
launch_my_simu.sh
recombine_and_clean_folders_my_simu.R
launch_recombine_my_simu.sh
├── my_simu
│   ├── data_temp
│   ├── report
│   ├── outfile

Launch the array job with

sbatch --array=1-100 launch_my_simu.sh

Slurm will then return something like:

Submitted batch job job_id

You then submit the recombination and cleaning script with

sbatch --dependency=afterany:job_id launch_recombine_my_simu.sh 

⚠️ make sure to change the corresponding job_id.
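
To avoid copying the job id by hand, the two submissions can be chained in a small script. This is a sketch under the assumption that it is run from $HOME on a machine where sbatch is available. The --parsable option makes sbatch print only the job id, followed by ";cluster name" on multi-cluster setups, which the helper strips. The helper name parse_job_id is hypothetical.

```shell
#!/bin/bash
# Keep only the job id from the output of `sbatch --parsable`,
# which is either "jobid" or "jobid;clustername".
# (parse_job_id is an illustrative helper, not part of the original demo)
parse_job_id() {
  printf '%s\n' "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
  # submit the array job and capture its id
  raw=$(sbatch --parsable --array=1-100 launch_my_simu.sh) || exit 1
  job_id=$(parse_job_id "$raw")
  echo "array job submitted with id ${job_id}"

  # submit the recombination script; it starts once all array tasks have ended
  sbatch --dependency=afterany:"${job_id}" launch_recombine_my_simu.sh
else
  echo "sbatch not found: run this on the cluster login node" >&2
fi
```

With afterany, the recombination job starts once every array task has terminated, whether or not it succeeded, so the log of failed arrays is still produced when some tasks crash.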

You can check if the array task is launched with:

squeue -u username

⚠️ make sure to change the corresponding username.

You should see something like the following, where the number after the underscore in the first column is the array task id:

       8602501_995 shared-cp  my_simu username  R       0:05      1 cpu117
       8602501_996 shared-cp  my_simu username  R       0:05      1 cpu117
       8602501_997 shared-cp  my_simu username  R       0:05      1 cpu117
       8602501_998 shared-cp  my_simu username  R       0:05      1 cpu117
       8602501_999 shared-cp  my_simu username  R       0:05      1 cpu117
      8602501_1000 shared-cp  my_simu username  R       0:05      1 cpu117

That's it! Once all array tasks of the simulation have terminated, the recombination and cleaning script is launched. You will then find your results under my_simu. More specifically,

  • The matrix of all results under my_simu/mat_result_simulation_date_time.rda
  • The .Rout files of all array tasks under my_simu/report
  • The list of array tasks, if any, whose results were not found under my_simu/failed_array_date_time.txt

Well done! 🤓 😎
