Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slurm readLog() Error - Option to change fs.latency & scheduler.latency from batchtools_slurm or future::tweak #73

Closed
stuvet opened this issue May 26, 2021 · 10 comments

Comments

@stuvet
Copy link

stuvet commented May 26, 2021

After a lot of troubleshooting I've submiited a related bug report/feature request here at batchtools.

Long story short, jobs submitted via future.batchtools are timing out in readLog, even though the jobs do exist. This is resolved by altering scheduler.latency & fs.latency.

I'm setting up my futures like this:

slurm<-future::tweak(batchtools_slurm, 
                                   template = 'batchtools.slurm.tmpl', 
                                    workers = 4,
                                    resources = list(
                                       walltime = 7200,
                                       memory = 2048,
                                       ncpus = 4,
                                       ntasks = 4,
                                       partition = 'ph8' # GCE 'n2d-highcpu-8'
                           )
future::plan(slurm)

(unless I've done something stupid...) for future.batchtools to work reliably in my environment I need to be able to set fs.latency & scheduler.latency from future::tweak, or somewhere else. As far as I can see, these don't currently get passed through to batchtools::makeClusterFunctionsSlurm.

I'm currently getting around this problem by overwriting the default batchtools::makeClusterFunctionsSlurm with assignInNamespace. Setting 70 seconds for scheduler.latency and 10 seconds for fs.latency solves my problem and makes future.batchtools run jobs reliably desipite the provisioning of machines. Unfortunately this increases the delay for batchtools to recognise that the job has finished. No big deal for long-running jobs, but I've made a feature request at batchtools for the scheduler.latency option to be split, with a new one to cover the initial sleep.

Thanks

@HenrikBengtsson
Copy link
Collaborator

What is makeClusterFutureSlurm()?

@stuvet
Copy link
Author

stuvet commented May 27, 2021

Sorry, batchtools::makeClusterFunctionsSlurm

My apologies - it's very late/early here!

Corrected in the original post too.

@stuvet
Copy link
Author

stuvet commented Jun 5, 2021

After submitting a pull request for a bugfix in batchtools::waitForFile here the main issue mentioned above & here has been solved, while leaving the batchtools::makeClusterFunctionSlurm scheduler.latency at 1, but I definitely need to increase fs.latency (currently at 80) despite changing the nfs settings to sync in the slurm cluster as hinted at here.

I'm currently doing this by overwriting the default batchtools::makeClusterFunctionsSlurm with assignInNamespace as above.

It'd be great if future.batchtools had a feature to allow me to pass an argument for fs.latency through. Thanks for considering it.

@kkmann
Copy link

kkmann commented Jan 27, 2022

Hey,

I think I just ran into exactly the same problem. Would be great to have that "tweakable"

Thx

@stuvet
Copy link
Author

stuvet commented Jan 27, 2022

I had a terrible time trying to work this one out, but I summarised my problems & solutions here. I didn't end up needing the proposed fs.latency trick - it ended up being a workaround for this.

Though the solutions were posted in the ropensci/targets repo, all the errors were propagating from batchtools via future.batchtools.

Hopefully they'll apply to your situation too.

@kkmann
Copy link

kkmann commented Jan 27, 2022

thx, will look at it right away!

The problem seems to be that "..." cannot be used to pass additional arguments to the cluster function since it's called as

https://github.com/HenrikBengtsson/future.batchtools/blob/17adb2d383de2a909c4da2703b866f82d6b8f232/R/batchtools_template.R#L181

we would need an additional argument like .cluster_func_args or so...

@kkmann
Copy link

kkmann commented Jan 27, 2022

Uff, that must have been a journey... I am still struggling to see what the switch to basename did exactly. You just sped up the file lookup by 10x?

Would still be good to be able to adjust latency parameters to be on the safe side...

@stuvet
Copy link
Author

stuvet commented Jan 28, 2022

The benchmark figures were just to make sure I'd chosen the right option.

There real problem was a mismatch between fn passed to batchtools::waitForFile and the list of files to check in path. At least during startup fn was a full file path, while list.files(path, ...) only gave basenames without the full.names=TRUE argument. Though less than pretty I aimed for the PR to be robust to either full path or basename being passed to waitForFile.

Either way this mismatch led to the function waitForFile always failing to find the log file (at least when used during startup). If I remember correctly, this induced a race condition between the log file location being registered in the registry & the batchtools::readLog function being called - if no log file was registered before readLog was called it'd trigger waitForFile, which would always fail with the 'Log file .. for job with id .. not available', even though it was present on disk.

Before tracking down this issue I spent a while tweaking slurm fllesystem mount latencies with no success. I suspect those people running on Slurm clusters with plenty of idle workers would never see the error. I was trying to do Slurm on the cheap - provisioning physical nodes as needed - the provisioning delay meant it was quite likely I hit waitForFile when submitting hundreds of jobs to a Slurm partition which had 0 provisioned workers.

That PR (particularly when run on an underprovisioned Slurm partition) revealed another issue with the Slurm status codes not all being mapped in the default makeClusterFunctionsSlurm, but after both had been updated my targets pipelines have been rock solid for ~6mo when running via future.batchtools.

HenrikBengtsson added a commit that referenced this issue Dec 11, 2022
batchtools_template() and BatchtoolsSSHFuture() passes matching arguments in '...' to the underlying makeClusterFunctionsNnn() function, and the remaining to BatchtoolsFuture() [#73]
@HenrikBengtsson
Copy link
Collaborator

Better late than never ... This has been implemented for the next release, e.g.

> library(future.batchtools)
> plan(batchtools_sge, fs.latency = 42, scheduler.latency = 3.141)

> f <- future(42)
> str(f$config$cluster.functions)
List of 11
 $ name                : chr "SGE"
 $ submitJob           :function (reg, jc)  
 $ killJob             :function (reg, batch.id)  
 $ listJobsQueued      :function (reg)  
 $ listJobsRunning     :function (reg)  
 $ array.var           : chr NA
 $ store.job.collection: logi TRUE
 $ store.job.files     : logi FALSE
 $ scheduler.latency   : num 3.14
 $ fs.latency          : num 42
 $ hooks               : list()
 - attr(*, "class")= chr "ClusterFunctions"
 - attr(*, "template")= 'fs_path' chr "~/.batchtools.sge.tmpl"
> 

Until future.batchtools 0.11.0 is on CRAN, use:

remotes::install_github("HenrikBengtsson/future.batchtools", ref = "develop")

@HenrikBengtsson
Copy link
Collaborator

future.batchtools 0.11.0 fixing this is now on CRAN.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants