How to speed up construction of a (very) large plan? #366
Gosh, I had no idea. For reference, `drake_plan(whatever = file_out("file.txt"))` gives:

```r
drake_plan(whatever = file_out("file.txt"))
#> # A tibble: 1 x 2
#>   target         command
#>   <chr>          <chr>
#> 1 "\"file.txt\"" file_out("file.txt")
```

Digging deeper, it appears that `command_dependencies()` accounts for much of the time:
```r
devtools::load_all("drake")
commands <- outplan$command # the commands from your plan
exprs <- lapply(commands, function(cmd) parse(text = cmd)) # really fast
system.time(tmp <- lapply(commands, command_dependencies))
#>    user  system elapsed
#>   6.348   0.023   6.371
system.time(tmp <- lapply(exprs, code_dependencies))
#>    user  system elapsed
#>   3.378   0.003   3.387
```

Some of the slowness of `command_dependencies()` goes away with a different `strings_in_dots` setting:

```r
pkgconfig::set_config("drake::strings_in_dots" = "literals")
system.time(tmp <- lapply(commands, command_dependencies))
#>    user  system elapsed
#>   3.809   0.023   3.833
```
All this will become moot if I can follow through with #350 or find some cleaner way to solve #283. But that's a difficult and messy problem, and it will be a while yet before I have enough uninterrupted time to attack it again.

By the way, you may not need to put all those output files in the plan. I would say you should only declare a file target if (1) the file is an important result in its own right, or (2) there is a real chance the file could change or get corrupted outside of drake and you want drake to notice. For (2), the workaround you mentioned unfortunately does not work, and manual post-processing is not a good substitute. So in your case, maybe you won't get much mileage out of so many file targets. If they aren't meaningful to the results of your research, you could probably just leave them alone as side effects.
P.S. I am kicking myself for having to say this, but you could probably just memoize the construction of your plan.
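A minimal sketch of that memoization, assuming the memoise package and a hypothetical `build_plan()` wrapper around the expensive `drake_plan()`/wildcard expansion (all names here are placeholders):

```r
library(drake)
library(memoise)

# Hypothetical stand-in for the real, expensive plan construction.
build_plan <- function(n_replicates) {
  drake_plan(analysis = run_everything(n_replicates)) # placeholder plan
}

# Disk-backed memoization: repeated calls reuse the stored plan,
# even across R sessions, as long as the arguments do not change.
build_plan_cached <- memoise(build_plan, cache = cache_filesystem(".plan_cache"))

plan <- build_plan_cached(100)
```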
Also, I have to mention @kendonB here because of how massive his projects became. Last fall, it sounded like he really pushed the limits of drake, so if you do end up keeping a plan this large, his experience is likely to be relevant.
I never had trouble with speed when creating plans, only when running `make()`. My only suggestion in the short term would be to consider whether you have file targets that you can reasonably turn into regular targets. Unless you're transferring data for use in other software, I can't immediately think of a use case for file targets for intermediate data.
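As a rough illustration of that suggestion, here is the same step written both ways (`summarize_replicate()` is a placeholder function, and the exact `file_out()` behavior depends on your drake version):

```r
library(drake)

# File target: drake tracks summary1.csv on disk and reruns the
# command if that file changes or disappears.
plan_file_target <- drake_plan(
  summary_file = write.csv(summarize_replicate(1), file_out("summary1.csv"))
)

# Regular target: drake stores the data frame itself in its cache,
# so there is no output file to declare or track.
plan_regular_target <- drake_plan(
  summary = summarize_replicate(1)
)
```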
Interesting, I didn't realize one shouldn't always use file targets when files are involved - I just assumed that if there were files, they should be in the plan! Is there a 'best practice' here that we can add to the best practices document? I'd be interested in a discussion of when to make things regular targets vs. file targets as part of that (you allude to this a bit below).
I think neither of @wlandau's conditions exactly applies, but I'm not sure. These are important results - in particular, one of the four output files is an MCMC log that goes forward to get combined across replicates, produce plots, etc. So in that sense they're important, but in another sense, at no point should these files be getting edited, so the chances of them getting mangled seem pretty low, particularly since I am my only collaborator on this analysis.
The output files all come out at once, so no worries there! I've parallelized the 100 jobs, and each of the 100 jobs will output 4 text files from a single command, so that's why it gets so voluminous. Do you think it might be better to not have these files connected to the plan at all?

@wlandau I see your point about the intentional choices involved. I think that even without an easy end-to-end solution, clearer guidance would go a long way.
Good point, all this definitely deserves explicit attention in the best practices guide.
I agree with @kendonB on this, and it deserves to be part of that discussion. Things just get tricky when we have no choice but to work with files generated by other software. We should probably mention other workarounds. In some cases, you might be able to combine your files:

```r
drake_plan({
  processx::run("sbatch", c("--wait", file_in("scripts/analysis1.slurm"))) # Generate those logs.
  processx::run("sed", paste("r var1.log var2.log var3.log >", file_out("combined1.log")))
})
#> # A tibble: 1 x 2
#>   target              command
#>   <chr>               <chr>
#> 1 "\"combined1.log\"" "{\n system2(\"sbatch\", c(\"--wait\", file_in(\"s…
```

Even better, for your case you could skip `file_out()` for the intermediate logs entirely and read them into a single regular target:
```r
drake_plan(
  results1 = {
    processx::run("sbatch", c("--wait", file_in("scripts/analysis1.slurm")))
    list(
      estimates = read_table("parameter_estimates1.log"), # no file_in() anywhere
      mcmc_samples = read_table("mcmc1.log"),             # table of MCMC parameters
      other = readLines("other1.log")                     # just plain text
    )
  }
)
```

I wonder if this would work for @tiernanmartin's spatial data project in #257. Probably a long shot; I am not sure if those spatial files can or should be read back into R. I am wondering if there exists a general way to cache an arbitrary file system as a serialized R object. Assuming performance isn't too terrible, a tool like that could really help:

```r
plan <- drake_plan(
  results1 = {
    processx::run("sbatch", c("--wait", file_in("scripts/analysis1.slurm")))
    cache_files("parameter_estimates1.log", "mcmc1.log", "other1.log")
  }
)
make(plan)
unlink("*.log")                 # Doesn't matter.
uncache_files(readd(results1))  # Recovers the *1.log files.
```
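For what it's worth, a rough sketch of what those hypothetical `cache_files()`/`uncache_files()` helpers could look like (they are not part of drake; this just stores each file's raw bytes in a named list so the target's value round-trips through the cache):

```r
# Read each file into a raw vector, named by its path.
cache_files <- function(...) {
  paths <- c(...)
  contents <- lapply(paths, function(p) readBin(p, what = "raw", n = file.size(p)))
  names(contents) <- paths
  contents
}

# Write the raw vectors back out to their original paths.
uncache_files <- function(cached) {
  for (path in names(cached)) {
    writeBin(cached[[path]], path)
  }
  invisible(names(cached))
}
```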
My suggestion is to trim down the logs where you can, especially dead-end files. Alternatively, you could have two back-to-back drake projects, each with its own plan and cache.
So going forward with this issue: I say we expand the best practices guide to clarify that not all files need to be targets and to provide some more workarounds. I think I will have enough time by next week. After that, I am not sure how much more I can speed up plan construction itself.
I think having back-to-back projects, as you suggest, would work well for my case.
Yes, a master script that runs the two projects in sequence should tie it all together.
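A minimal sketch of such a master script, assuming each project lives in its own directory with a `plan.R` that defines `plan` (directory and file names are illustrative):

```r
library(drake)

run_project <- function(dir) {
  old_wd <- setwd(dir)
  on.exit(setwd(old_wd))
  source("plan.R") # defines `plan` and whatever functions it needs
  make(plan)       # uses that project's own .drake/ cache
}

run_project("project1") # e.g. the heavy slurm analyses
run_project("project2") # e.g. downstream summaries and plots
```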
Absolutely. In fact, I think there is enough here to fill an entire vignette.
I see - I think I misunderstood. So I want to assign a separate cache to each project?
You can definitely name them, and I realize now that I glossed over a lot.

```r
my_cache <- new_cache(path = "my_cache")
list.files("my_cache") # internal storr files
```

Each cache is a storr object, so you can work with it directly and pass it to drake functions:

```r
my_cache$list()           # empty
make(plan, cache = my_cache)
my_cache$list()           # names of targets and imports
my_cache$get("my_target") # just like readd()
config <- drake_config(plan, cache = my_cache)
vis_drake_graph(config)
file.exists(".drake")     # All this time, we have been using my_cache instead of the default folder.
my_cache$destroy()        # Erase your results.
```

All this probably belongs at the beginning of the storage vignette.
This makes sense! So you can assign `my_cache <- new_cache(path = "my_cache")` and then refer to that cache object in later calls. One last question (I think) - how do you refer to the 'default' cache that got created when no cache was specified? Can I simply leave the cache unspecified in any function that should use the default?
Glad you asked. You can get the default cache a few different ways.
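For instance (a sketch: `my_target` is a placeholder, and the helper is `drake_cache()` in current drake, while versions from around the time of this thread exposed `get_cache()` instead):

```r
library(drake)

default_cache <- drake_cache()          # finds the default .drake/ cache
default_cache$list()                    # same kind of storr methods as my_cache
readd(my_target, cache = default_cache) # equivalent to a plain readd(my_target)
```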
And yes, you can use the default cache for one project and a custom cache for another:

```r
cache2 <- new_cache("cache2")
config1 <- drake_config(plan1)                 # uses the default .drake/ cache
config2 <- drake_config(plan2, cache = cache2)
vis_drake_graph(config1)
vis_drake_graph(config2)
```
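Presumably the corresponding `make()` calls in that sketch would be:

```r
make(plan1)                 # builds into the default .drake/ cache
make(plan2, cache = cache2) # builds into cache2/
```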
Oops. Sorry, I think I just gave you bad advice about the default cache.
The names are not obvious because I was going for different things at different times. Early on, I wanted to search up the directory tree for the cache. Anyway, I changed my original response, and I updated the storage vignette with improved guidance in the first section. I hope it helps.
The best practices guide now has detailed guidance on output file targets. I think I covered most of it, and I will eagerly review any pull requests with suggestions.
Looks great!
I'm tracking an analysis where a single run (of `analysis.slurm`) leads to ~300 output files (`dataset_NUMBER_VARIABLE`, where `NUMBER` is 1-100 replicates and `VARIABLE` is 4 different results files). They're just text files, but there are a lot of them when you multiply over 8 different datasets.

To handle the connection, I'm going about it as suggested in the FAQ about multiple output files, using wildcard templating to expand as follows:
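A sketch of the kind of wildcard templating being described (target, file, and wildcard names are placeholders; `evaluate_plan()` was the templating helper in the drake API of that era):

```r
library(drake)

# Template row with literal wildcard placeholders in the command.
template <- drake_plan(
  results = file_out("dataset_NUMBER_VARIABLE.log")
)

# evaluate_plan() substitutes every value for each wildcard,
# multiplying the number of rows (here 1 x 100 x 4 = 400 per dataset).
plan <- evaluate_plan(template, wildcard = "NUMBER", values = 1:100)
plan <- evaluate_plan(plan, wildcard = "VARIABLE", values = paste0("var", 1:4))
```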
As you can imagine, this expansion takes quite a while, as it leads to a plan tibble that's 2800 lines long. I think this plan evaluation has to happen every time I run the project, so it's not an insignificant time cost.
Are there ways to speed this up? A less structured way of doing this could be to simply link each `analysis.slurm` file to only a few output files (say, only files from replicate 1, or only `var1.log`), but that seems not ideal for reproducibility.