Slurm readLog() Error - Option to change fs.latency & scheduler.latency from batchtools_slurm or future::tweak #73
Comments
What is …?
Sorry, my apologies - it's very late/early here! Corrected in the original post too.
After submitting a pull request for a bugfix in batchtools, I still need to adjust these latencies. I'm currently doing this by overwriting the default batchtools::makeClusterFunctionsSlurm. It'd be great if fs.latency and scheduler.latency could be set from future.batchtools directly.
Hey, I think I just ran into exactly the same problem. Would be great to have that "tweakable". Thx
I had a terrible time trying to work this one out, but I summarised my problems & solutions here. I didn't end up needing the proposed latency options. Though the solutions were posted over in the batchtools issue, hopefully they'll apply to your situation too.
thx, will look at it right away! The problem seems to be that "..." cannot be used to pass additional arguments to the cluster function, since it's called internally with a fixed set of arguments; we would need an additional argument to carry them through.
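To make that concrete, here is a rough sketch of the kind of argument splitting that would be needed (an assumed approach, not the package's actual code): anything in "..." that matches the cluster-function constructor's formals gets forwarded to it.

```r
## Rough sketch only -- not future.batchtools' real implementation.
## Forward the subset of "..." that matches the constructor's formals
## to batchtools::makeClusterFunctionsSlurm(); the rest would go elsewhere.
make_slurm_cf <- function(template, ...) {
  extras <- list(...)
  cf_fun <- batchtools::makeClusterFunctionsSlurm
  keep   <- intersect(names(extras), names(formals(cf_fun)))
  do.call(cf_fun, c(list(template = template), extras[keep]))
}

## e.g. make_slurm_cf("batchtools.slurm.tmpl",
##                    scheduler.latency = 70, fs.latency = 10)
```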
Uff, that must have been a journey... I am still struggling to see what the switch to basename did exactly. You just sped up the file lookup by 10x? Would still be good to be able to adjust latency parameters to be on the safe side...
The benchmark figures were just to make sure I'd chosen the right option. The real problem was a mismatch between the paths being compared (hence the switch to basename); either way, this mismatch led to readLog() timing out.

Before tracking down this issue I spent a while tweaking Slurm filesystem mount latencies with no success. I suspect people running on Slurm clusters with plenty of idle workers would never see the error. I was trying to do Slurm on the cheap - provisioning physical nodes as needed - and the provisioning delay meant it was quite likely I hit the readLog() timeout.

That PR (particularly when run on an underprovisioned Slurm partition) revealed another issue: not all of the Slurm status codes are mapped in the default makeClusterFunctionsSlurm.
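For anyone who hit this before the fix below, one workaround that avoids patching batchtools is to build the Slurm cluster functions yourself with more forgiving latencies and hand them to a custom backend. This assumes your version of future.batchtools exposes batchtools_custom(cluster.functions = ...); the template file name is a placeholder and the 70 s / 10 s values are the ones from the original post.

```r
library(future.batchtools)

## Slurm cluster functions with more forgiving latencies
## (latency values from the original post; template name is an example).
cf <- batchtools::makeClusterFunctionsSlurm(
  template          = "batchtools.slurm.tmpl",
  scheduler.latency = 70,   # give the scheduler/nodes time to come up
  fs.latency        = 10    # allow for shared-filesystem propagation delays
)

plan(batchtools_custom, cluster.functions = cf)
```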
batchtools_template() and BatchtoolsSSHFuture() pass matching arguments in '...' to the underlying makeClusterFunctionsNnn() function, and the remaining ones to BatchtoolsFuture() [#73]
Better late than never ... This has been implemented for the next release, e.g.

```r
> library(future.batchtools)
> plan(batchtools_sge, fs.latency = 42, scheduler.latency = 3.141)
> f <- future(42)
> str(f$config$cluster.functions)
List of 11
 $ name                : chr "SGE"
 $ submitJob           :function (reg, jc)
 $ killJob             :function (reg, batch.id)
 $ listJobsQueued      :function (reg)
 $ listJobsRunning     :function (reg)
 $ array.var           : chr NA
 $ store.job.collection: logi TRUE
 $ store.job.files     : logi FALSE
 $ scheduler.latency   : num 3.14
 $ fs.latency          : num 42
 $ hooks               : list()
 - attr(*, "class")= chr "ClusterFunctions"
 - attr(*, "template")= 'fs_path' chr "~/.batchtools.sge.tmpl"
```

Until future.batchtools 0.11.0 is on CRAN, use:

```r
remotes::install_github("HenrikBengtsson/future.batchtools", ref = "develop")
```
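By analogy with the SGE example above, the Slurm equivalent (using the 70 s / 10 s values from the original post) should simply be:

```r
library(future.batchtools)
## The extra arguments are now passed through to
## batchtools::makeClusterFunctionsSlurm().
plan(batchtools_slurm, scheduler.latency = 70, fs.latency = 10)
f <- future(Sys.info()[["nodename"]])
value(f)
```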
future.batchtools 0.11.0, which fixes this, is now on CRAN.
Original issue description:

After a lot of troubleshooting I've submitted a related bug report/feature request here at batchtools.
Long story short, jobs submitted via future.batchtools are timing out in readLog, even though the jobs do exist. This is resolved by altering scheduler.latency & fs.latency.
I'm setting up my futures like this:
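(A minimal illustration of the kind of setup meant here; not the exact snippet from the post, and the template file name and resources are placeholders:)

```r
## Illustrative only -- template file name and resources are placeholders.
library(future.batchtools)
plan(batchtools_slurm,
     template  = "batchtools.slurm.tmpl",
     resources = list(walltime = 3600, memory = "4GB", ncpus = 1))

f <- future({
  Sys.info()[["nodename"]]   # evaluated on a Slurm worker
})
value(f)
```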
(Unless I've done something stupid...) for future.batchtools to work reliably in my environment I need to be able to set fs.latency & scheduler.latency from future::tweak, or somewhere else. As far as I can see, these don't currently get passed through to batchtools::makeClusterFunctionsSlurm.

I'm currently getting around this problem by overwriting the default batchtools::makeClusterFunctionsSlurm with assignInNamespace. Setting 70 seconds for scheduler.latency and 10 seconds for fs.latency solves my problem and makes future.batchtools run jobs reliably despite the provisioning of machines. Unfortunately this increases the delay for batchtools to recognise that the job has finished. No big deal for long-running jobs, but I've made a feature request at batchtools for the scheduler.latency option to be split, with a new one to cover the initial sleep.

Thanks
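For completeness, a minimal sketch of the assignInNamespace workaround described above. The wrapper body is an assumption; only the 70 s scheduler.latency and 10 s fs.latency values come from the post.

```r
## Sketch: replace batchtools' default Slurm cluster functions with a
## wrapper that hard-codes more forgiving latencies.
## (Wrapper written for illustration; only the latency values are from the post.)
local({
  original <- batchtools::makeClusterFunctionsSlurm
  patched  <- function(template = "slurm", ...,
                       scheduler.latency = 70, fs.latency = 10) {
    original(template, ...,
             scheduler.latency = scheduler.latency,
             fs.latency        = fs.latency)
  }
  utils::assignInNamespace("makeClusterFunctionsSlurm", patched,
                           ns = "batchtools")
})
```

As the post notes, the trade-off is that batchtools then waits longer before noticing a job has finished.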