How to speed up construction of a (very) large plan? #366

Closed
bmchorse opened this issue Apr 23, 2018 · 16 comments

@bmchorse
Contributor

I'm tracking an analysis where a single run (of analysis.slurm) leads to ~300 output files (dataset_NUMBER_VARIABLE, where NUMBER runs over the 1-100 replicates and VARIABLE is one of 4 different results files). They're just text files, but there are a lot of them once you multiply over 8 different datasets.

To handle the connection, I'm going about it as suggested in the FAQ about multiple output files, using wildcard templating to expand as follows:

# Note that I'm not using drake's HPC abilities to submit to slurm; 
# that's a project for another day.  Just trying to connect the in and 
# out files so that drake knows there's some dependency happening.

outfile_plan <- drake_plan(
  c(file_out("results/dataset1_NUMBER_VARIABLE"),
    file_in("scripts/analysis1.slurm")),
  c(file_out("results/dataset2_NUMBER_VARIABLE"),
    file_in("scripts/analysis2.slurm"))
)  # etc. for the 8 datasets

outplan <- evaluate_plan(
  outfile_plan,
  rules = list(
    NUMBER = 1:100,
    VARIABLE = c("var1.log", "var2.log", "var3.log", "summary.txt")
  )
)

As you can imagine, this expansion takes quite a while, since it produces a plan tibble that is 2800 rows long. I think this plan evaluation has to happen every time I run the project, so it's not an insignificant time cost.

Are there ways to speed this up? A less structured approach would be to link each analysis.slurm file to only a few output files (say, only files from replicate 1, or only var1.log), but that seems less than ideal for reproducibility.

@wlandau
Member

wlandau commented Apr 24, 2018

Gosh, I had no idea evaluate_plan() could be so slow! The bottleneck is inside a little utility function called file_outs_to_targets(), which looks for output file names and turns them into target names. So fortunately, workflow plan generation appears to be slow only if you have a ton of file_out()s.

drake_plan(whatever = file_out("file.txt"))
#> # A tibble: 1 x 2
#>   target         command             
#>   <chr>          <chr>               
#> 1 "\"file.txt\"" file_out("file.txt")

Digging deeper, it appears that command_dependencies() and code_dependencies() are both rather slow. Both analyze code and functions to find the dependencies of targets. The former processes text commands from workflow plans, and the latter deals with expressions and functions in general.

devtools::load_all("drake")
commands <- outplan$command # yours
exprs <- lapply(commands, function(cmd) parse(text = cmd)) # really fast
system.time(tmp <- lapply(commands, command_dependencies))
#>   user  system elapsed 
#>  6.348   0.023   6.371 
system.time(tmp <- lapply(exprs, code_dependencies))
#>   user  system elapsed 
#>  3.378   0.003   3.387 

The slowness of command_dependencies() relative to code_dependencies() appears mostly to be due to the old file API, which I am trying to phase out. A call to pkgconfig::set_config() should drop the old file API and give you a little speedup for your R session.

pkgconfig::set_config("drake::strings_in_dots" = "literals")
system.time(tmp <- lapply(commands, command_dependencies))
#>   user  system elapsed 
#>  3.809   0.023   3.833 

All this will become moot if I can follow through with #350 or find some cleaner way to solve #283. But that's a difficult and messy problem, and it will be a while yet before I have enough uninterrupted time to attack it again.

By the way, you may not need to put all those output files in the plan. I would say you should declare a file_out() if the file

  1. Is a dependency of another target, or
  2. Is important enough that you want to regenerate a fresh clean copy if you mangle it by accident (e.g. with a text editor).

For (2), the workaround you mentioned unfortunately does not work. Manual post-make() changes to spatial_data.shx do not trigger the underlying command that actually produced it. For that, we need either one file per command or a solid solution to #283.

So in your case, maybe you won't get much mileage out of so many file targets. If they aren't meaningful to the results of your research, you could probably just leave them alone as side effects.

Drake tries to enhance reproducibility, but it's not a catch-all. #6 and #333 are examples where it falls short because of intentional deep-rooted design choices. This issue is another one. Drake is as R-focused as I could make it, so it mostly expects targets to be R objects. External files were really an afterthought, and playing catch-up is hard when thousands of lines of internals are already established. But I am trying. I really do want to solve #283.

P.S. I am kicking myself for having to say this, but you could probably just memoize drake_plan(). Either that or create a drake mini-project to create the plan, which could help if you create multiple sub-plans and row-bind them together at the end.
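
For the file-based flavor of that idea, here is a minimal sketch (not a drake feature, just base R): build the plan once, save it to an RDS file of your choosing, and reload it on later runs. The file name "outplan.rds" is arbitrary, and build_outplan() just wraps the expensive evaluate_plan() call from the original post.

build_outplan <- function() {
  evaluate_plan(
    outfile_plan,
    rules = list(
      NUMBER = 1:100,
      VARIABLE = c("var1.log", "var2.log", "var3.log", "summary.txt")
    )
  )
}

if (file.exists("outplan.rds")) {
  outplan <- readRDS("outplan.rds")  # reuse the cached plan
} else {
  outplan <- build_outplan()         # pay the cost once
  saveRDS(outplan, "outplan.rds")
}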

@wlandau
Member

wlandau commented Apr 24, 2018

Also, I have to mention @kendonB here because of how massive his projects became. Last fall, it sounded like he really pushed the limits of both drake and his university's SLURM cluster. There were a bunch of things that helped: holding onto a config object with the graph, using different triggers, etc.

Another thing: if you do end up using drake's built-in HPC capabilities, you may want to take special steps to avoid submitting SLURM jobs just to fingerprint extra output files. Maybe run make(plan, parallelism = "future", jobs = 16, targets = real_hpc_targets) and then make(plan, parallelism = "mclapply", jobs = 8, targets = extra_output_files).
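
To make both suggestions concrete, here is a rough sketch; real_hpc_targets and extra_output_files stand in for character vectors of target names from your plan.

library(drake)

config <- drake_config(plan)  # build the dependency graph once and hold onto it
outdated(config)              # reuse it for cheap status checks
vis_drake_graph(config)

# Heavy jobs go to the cluster; fingerprinting the extra output files stays local.
make(plan, parallelism = "future", jobs = 16, targets = real_hpc_targets)
make(plan, parallelism = "mclapply", jobs = 8, targets = extra_output_files)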

@kendonB
Contributor

kendonB commented Apr 24, 2018

I never had trouble with speed when creating plans, only when making. My only suggestion in the short term would be to consider whether you have file targets that you can reasonably turn into regular targets. Unless you're transferring data for use in other software, I can't immediately think of a use case for using file targets for intermediate data.
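
As a rough illustration of that point (clean() and fit_model() are made-up stand-ins), compare routing intermediate data through a file with keeping it as an ordinary cached target:

# File target: the intermediate data lives on disk and drake tracks the file.
plan_with_files <- drake_plan(
  cleaned_file = write.csv(clean(read.csv(file_in("raw.csv"))),
                           file_out("cleaned.csv"), row.names = FALSE),
  model = fit_model(read.csv(file_in("cleaned.csv")))
)

# Regular target: the intermediate data lives in drake's cache as an R object.
plan_with_objects <- drake_plan(
  cleaned = clean(read.csv(file_in("raw.csv"))),
  model = fit_model(cleaned)
)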

@bmchorse
Contributor Author

Interesting, I didn't realize one shouldn't always set file targets if there were files involved - I just assumed that if there were files, they should be in the plan! Is there a 'best practice' here that we can add to the best practices document? I'd be interested in discussion of when to make things regular targets vs. file targets as part of that (you allude to this a bit below).

By the way, you may not need to put all those output files in the plan. I would say you should declare a file_out() if the file

  1. Is a dependency of another target, or
  2. Is important enough that you want to regenerate a fresh clean copy if you mangle it by accident (e.g. with a text editor).

&

Unless you're transferring data for use in other software, I can't immediately think of a use case for using file targets for intermediate data.

I think neither of @wlandau's conditions exactly applies, but I'm not sure. These are important results - in particular, one of the four output files is an MCMC log that goes forward to get combined across replicates, produce plots, etc. So in that sense they're important, but in another sense, at no point should these files be getting edited, so the chances of them getting mangled seem pretty low. Particularly since I am my only collaborator on this analysis.

Another thing: if you do end up using drake's built-in HPC capabilities, you may want to take special steps to avoid submitting SLURM jobs just to fingerprint extra output files.

The output files all come out at once, so no worries there! I've parallelized the 100 jobs, and each job of the 100 will output 4 text files from a single command, so that's why it gets so voluminous.

Do you think it might be better to not have these files connected to analysis.slurm at all, then? Let it dead-end there, and then either have the text files as file_in() targets to their future analysis or just as arguments to a function, to sort of 'skip' the step that has 2800 files? (To be fair, only ~800 of them will go forward to future analysis.)

@wlandau I see your point about the intentional choices for drake internals! This doesn't bother me too much as I'm already 'cheating' by not using drake for the HPC part; I'm interested in extremely robust drake usage for data cleaning (and later for plotting), and I've accepted that - for this project! - the actual generation and processing of results files will get a little messy because it's not R-based. It seems from what you're saying that perhaps the unfortunately enormous file outputs will just have to be part of the messiness, which is fine with me if that is the best way forward.

I think that even without an easy end-to-end drake::make() to reconstruct the entire project from scratch, the reproducibility on this will be pretty high by virtue of the entire thing being in a github repo, more or less laid out as a drake project, and hopefully having a solid README.

@wlandau
Member

wlandau commented Apr 24, 2018

Interesting, I didn't realize one shouldn't always set file targets if there were files involved - I just assumed that if there were files, they should be in the plan! Is there a 'best practice' here that we can add to the best practices document?

Good point, all this definitely deserves explicit attention in the best practices guide.

I'd be interested in discussion of when to make things regular targets vs. file targets as part of that (you allude to this a bit below).

I agree with @kendonB on this, and it deserves to be part of that discussion. Things just get tricky when we have no choice but to work with files generated by other software. We should probably mention other workarounds. In some cases, you might be able to combine your files.

drake_plan({
  processx::run("sbatch", c("--wait", file_in("scripts/analysis1.slurm"))) # Generate those logs.
  # Combine the logs into one declared output file (a shell is needed for the redirect).
  processx::run("sh", c("-c", paste("cat var1.log var2.log var3.log >", file_out("combined1.log"))))
})
#> # A tibble: 1 x 2
#>   target              command
#>   <chr>               <chr>
#> 1 "\"combined1.log\"" "{\n    processx::run(\"sbatch\", c(\"--wait\", file_in(\"s…

Even better for drake projects, you could end your command by reading all the important logs back into R.

drake_plan(
  results1 = {
    processx::run("sbatch", c("--wait", file_in("scripts/analysis1.slurm")))
    list(
      estimates = read.table("parameter_estimates1.log"), # no file_in() anywhere
      mcmc_samples = read.table("mcmc1.log"), # table of MCMC parameters
      other = readLines("other1.log") # just plain text
    )
  }
)

I wonder if this would work for @tiernanmartin's spatial data project in #257. It's probably a long shot; I am not sure whether those spatial files can or should be read back into R.

I am wondering if there exists a general way to cache an arbitrary file system as a serialized R object. Assuming performance isn't too terrible, a tool like that could really help drake.

plan <- drake_plan(
  results1 = {
    processx::run("sbatch", c("--wait", file_in("scripts/analysis1.slurm")))
    cache_files("parameter_estimates1.log", "mcmc1.log", "other1.log")
  }
)
make(plan)
unlink("*.log") # Doesn't matter.
uncache_files(readd(results1)) # Recovers the *1.log files.
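
To be clear, cache_files() and uncache_files() do not exist in drake; the snippet above is speculative. A minimal sketch of what such helpers could look like, storing each file as a named raw vector:

# Hypothetical helpers, not part of drake's API.
cache_files <- function(...) {
  paths <- c(...)
  stats::setNames(
    lapply(paths, function(p) readBin(p, what = "raw", n = file.size(p))),
    paths
  )
}

uncache_files <- function(cached) {
  for (path in names(cached)) {
    writeBin(cached[[path]], path)  # restore each file to its original path
  }
  invisible(names(cached))
}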

Do you think it might be better to not have these files connected to analysis.slurm at all, then? Let it dead-end there, and then either have the text files as file_in() targets to their future analysis or just as arguments to a function, to sort of 'skip' the step that has 2800 files? (To be fair, only ~800 of them will go forward to future analysis.)

My suggestion is to trim down the logs where you can, especially dead-end files. Alternatively, you could have two back-to-back drake pipelines, the second beginning with the output files of the first. Here, you may want to use a different cache for each to avoid colliding target names (see new_cache() and storr::storr_rds()). For projects like yours that combine so many different technologies, drake may only be part of the pipeline, and that is perfectly okay. In fact, it is part of the design.

Drake's HPC capabilities are diverse, but parallel efficiency still needs work (re: #285). It is one of the biggest mountains I am climbing right now.

@wlandau
Member

wlandau commented Apr 24, 2018

So going forward with this issue: I say we expand the best practices guide to clarify that not all files need to be targets and to provide some more workarounds. I think I will have enough time by next week. After that, I am not sure I can speed up code_dependencies() enough to solve the original problem, so I think we should defer to #283.

@bmchorse
Contributor Author

I think having back-to-back drake pipelines might make the most sense. In my mind, the project has three parts:

  1. Setup, preprocessing, cleaning. (All in `drake`.)
  2. Plotting and analyzing results logfiles, and processing for the next analysis step. (Can be made to fit `drake`.)
  ~~~~ cluster magic ~~~~
  3. Plotting and analyzing the final results files. (Can also fit in `drake`.)

In this case, perhaps three back-to-back pipelines is ideal. Downsides: it's a little complicated, and it breaks the elegant, complete dependency management that `drake` is wonderful for. Upsides: the role of `drake` is clearly defined. And, as a human, I understand that when the end of one pipeline is out of date, downstream pipelines are also out of date, so I can rerun the cluster magic and force rebuilding on the downstream pipeline.

I will look into the methods you mentioned for avoiding cache collisions.  Would I want three separate R projects, or could I have a single R project? I imagine I could have one make.R file that calls `drake::make()` on three separate plans (might keep those plan files separate for each pipeline stage) and clears the cache in between?

> I say we expand the best practices guide to clarify that not all files need to be targets and to provide some more workarounds.

I think this is a good plan. Let me know if there's anything I can do to help! I think it would be good to point out some common use cases (when to set files as targets vs. not, what to do if your analysis interfaces with non-R pieces, what to do when your non-R analysis generates a huge number of files...) with some suggested workarounds. 

@wlandau
Member

wlandau commented Apr 26, 2018

Would I want three separate R projects, or could I have a single R project? I imagine I could have one make.R file that calls drake::make() on three separate plans (might keep those plan files separate for each pipeline stage) and clears the cache in between?

Yes, a master make.R file sounds like a good idea. And with three separate caches and three separate plans, I think you can work within a single directory/project. The next best alternative is to have three different file systems, but I think that would get messy. Just be sure to supply the appropriate caches to any drake_config() objects you create for visualization, etc. Whatever you choose, I strongly advise you to not clear the cache(s) between stages unless you want to trigger the same cluster magic all over again for a future runthrough.
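
A rough sketch of what that master make.R could look like; the plan objects and cache folder names below are placeholders, and the cluster steps run outside drake.

library(drake)

# new_cache() sets up the caches on the first run; later runs could reload
# the existing caches instead.
cache_prep    <- new_cache("cache_prep")
cache_results <- new_cache("cache_results")
cache_final   <- new_cache("cache_final")

make(plan_prep, cache = cache_prep)        # 1. setup, preprocessing, cleaning
make(plan_results, cache = cache_results)  # 2. plot/analyze result logs, prep next step
# ~~~~ cluster magic runs here, outside drake ~~~~
make(plan_final, cache = cache_final)      # 3. plot/analyze the final results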

I think it would be good to point out some common use cases (when to set files as targets vs. not, what to do if your analysis interfaces with non-R pieces, what to do when your non-R analysis generates a huge number of files...) with some suggested workarounds.

Absolutely. In fact, I think there is enough here to fill an entire vignette.

@bmchorse
Contributor Author

Just be sure to supply the appropriate caches to any drake_config() objects you create for visualization, etc. Whatever you choose, I strongly advise you to not clear the cache(s) between stages unless you want to trigger the same cluster magic all over again for a future runthrough.

I see - I think I misunderstood. So I want to assign a separate cache to each drake config object, not clear the caches! How do I tell drake to use a separate (presumably recoverable) cache for each one? I read the storage/caching page you linked, but it's unclear to me how I'd 'assign' a cache to a project. Can I name them?

@wlandau
Member

wlandau commented Apr 26, 2018

You can definitely name them, and I realize now that I glossed over a lot. Drake uses storr for caching (specifically, the RDS driver for thread safety). You can create a cache with either storr::storr_rds(mangle_key = TRUE) or drake::new_cache().

my_cache <- new_cache(path = "my_cache")
list.files("my_cache") # internal storr files

Each storr cache has a file system (or environment in the case of storr_environment()) and an R6 object with the cache's API.

my_cache$list() # empty
make(plan, cache = my_cache)
my_cache$list() # names of targets and imports
my_cache$get("my_target") # just like readd()
config <- drake_config(plan, cache = my_cache)
vis_drake_graph(config)
file.exists(".drake") # All this time, we have been using `my_cache` as the cache instead of the default folder.
my_cache$destroy() # Erase your results.

All this probably belongs at the beginning of the storage vignette.

@bmchorse
Contributor Author

This makes sense! So you can assign

my_cache <- new_cache(path = "my_cache")

and then refer to my_cache in the future.

One last question (I think) - how do you refer to the 'default' cache that got created when no cache was specified? Can I simply leave it unspecified in any drake_config() and vis_drake_graph() calls for which I want it to access the original cache, and then specify cache = my_cache for the other pipeline?

@wlandau
Member

wlandau commented Apr 27, 2018

Glad you asked. You can get the default cache a few different ways.

  • drake::get_cache()
  • drake::get_cache(path = "my_project_root"). Here, the cache is a .drake folder in my_project_root or one of its ancestors. get_cache() searches up through the directory tree starting at my_project_root until it finds a .drake folder.
  • drake::this_cache(path = ".drake"). this_cache() uses the literal path you give it without assuming a .drake folder or searching for one.
  • config$cache, assuming you created config with drake_config() and left the cache argument unspecified.
  • storr::storr_rds(path = ".drake", mangle_key = TRUE)

And yes, you can use the default cache for one drake_config() list and a custom one for another.

cache2 <- new_cache("cache2")
config1 <- drake_config(plan1) # .drake/
config2 <- drake_config(plan2, cache = cache2)
vis_drake_graph(config1)
vis_drake_graph(config2)

@wlandau
Member

wlandau commented Apr 27, 2018

Oops. Sorry, I think I just gave you bad advice about get_cache(). Drake has functions get_cache() and this_cache(), and they are a bit different.

  • get_cache(path = "my_path") assumes my_path is a project root containing a .drake folder. If it does not find a .drake folder in my_path, it searches up through the ancestors of my_path until it finds one.
  • this_cache(path = "my_path") literally assumes my_path is the path to the cache, .drake folder or not.
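
Side by side (illustration only; "analysis/subdir" is a made-up path):

# Both calls end up at the same .drake/ cache if one exists at the project root.
cache_a <- drake::get_cache(path = "analysis/subdir")  # searches upward for a .drake/ folder
cache_b <- drake::this_cache(path = ".drake")          # uses the given path literally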

The names are not obvious because I was going for different things at different times. Early on, I wanted to search up the directory tree for the .drake/ folder to make sure loadd() and readd() worked from subdirectories. Only later did it occur to me to let users choose how to name the cache folders.

Anyway, I changed my original response, and I updated the storage vignette with improved guidance in the first section. I hope it helps.

@wlandau wlandau changed the title How to speed up evaluation of a (very) large plan? How to speed up construction of a (very) large plan? May 2, 2018
@wlandau wlandau closed this as completed in 119fe7d May 2, 2018
@wlandau
Member

wlandau commented May 2, 2018

The best practices guide now has detailed guidance on output file targets. I think I covered most of it, and I will eagerly review any pull requests with suggestions.

@bmchorse
Contributor Author

bmchorse commented May 2, 2018

Looks great!

@wlandau
Member

wlandau commented Jul 15, 2018

I almost forgot about this thread. Just solved #283 via #469 yesterday. Plans with lots of files should be much faster to create now. Will release to CRAN sometime this month.

@wlandau wlandau removed the type: faq label Dec 21, 2018