Gather without loading all dependencies at the same time #325

bart1 · 2018-03-15T12:30:06Z

I'm encountering the following issue. I run simulations using drake. At the end of each simulation I get quite a big R6 object that I store with drake. Afterwards I want to generate exploratory plots for each simulation in a single pdf. Currently I gather all simulations in a list with gather_plan() and then plot from this list. This is not very scalable, because loading all simulations at the same time runs into memory limits. It also makes me store every simulation twice in the cache: once in the list and once individually.

This is some example code:

minExampleSims <- drake_plan(minExampleSims = simFun()) %>%
  expand_plan(paste0('rep', 1:5))
minSimsGather <- gather_plan(minExampleSims, 'minSims', 'list')
resF <- 'minRes.pdf'
makePdf <- drake_plan(minRes.pdf = {
  pdf(resF)
  for (sim_i in minSims) {
    sim_i$visualizePlot()
    print(sim_i$visualizePlotMetricsWithRealData())
  }
  dev.off()
}, file_targets = TRUE)

I guess the same problem would occur when one fits a lot of models (e.g. the gsp example) and wants to use the default plot function on each model.

I wondered if I'm missing something or if this is not possible. An alternative approach would be to first reduce all the simulations (for example, make plots using ggplot per simulation or extract the necessary data), gather these, and then plot them. This is not always an easy option when preexisting plot functions are used.

I wonder if it would be useful/possible to have some kind of recursive version of gather that loads cached targets one by one.
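
For illustration, the "load one by one" idea can be sketched in base R without drake. This is only a sketch under assumptions: sim_fun() is a hypothetical stand-in for the real simulation, and per-simulation .rds files in a temporary directory stand in for individual cache entries. Each simulation is read, plotted to its own page of a single pdf, and released before the next one is loaded.

```r
# Hypothetical stand-in for the real (large) simulation; not from the thread.
sim_fun <- function(rep) {
  data.frame(x = rnorm(25), y = rnorm(25))
}

# Store each simulation in its own file.
# Assumption: these files play the role of per-target cache entries.
sim_dir <- file.path(tempdir(), "sims")
dir.create(sim_dir, showWarnings = FALSE)
reps <- paste0("rep", 1:5)
sim_files <- file.path(sim_dir, paste0("sim_", reps, ".rds"))
for (i in seq_along(reps)) {
  saveRDS(sim_fun(reps[i]), sim_files[i])
}

# Write one multi-page pdf while holding a single simulation in memory.
out_file <- file.path(tempdir(), "minRes.pdf")
pdf(out_file)
for (f in sim_files) {
  sim <- readRDS(f)                        # load exactly one simulation
  plot(y ~ x, data = sim, main = basename(f))
  rm(sim)                                  # release it before the next one
}
invisible(dev.off())
```

Note that a loop like this runs outside drake's dependency tracking, so it shows the memory pattern only, not a drop-in replacement for a plan.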

wlandau-lilly · 2018-03-15T13:14:14Z

Thanks for bringing this up, @bart1. It can be tricky to set up workflows when you cannot load everything into memory. In your case, I would recommend that you not gather these massive simulation objects. Even if drake someday has a more memory-efficient way of creating the gathered list, we cannot avoid the fact that the list will be too big to load into memory anyway. Below, I have sketched a modified workflow with no gathering. If you do want to gather things, I recommend you gather small summaries of the simulations rather than the simulations themselves.

library(drake)
library(magrittr)

sim_fun <- function(rep, ...) {
  data.frame(x = rnorm(25), y = rnorm(25))
}

save_plot <- function(data, file) {
  pdf(file)
  plot(y ~ x, data = data)
  dev.off()
}

plan <- drake_plan(
  sim = sim_fun("REP"),
  save_plot(sim_REP, file_out("sim_REP.pdf")),
  strings_in_dots = "literals"
) %>%
  evaluate_plan(wildcard = "REP", values = paste0("rep", 1:5)) %>%
  print
#> # A tibble: 10 x 2
#>    target             command                                          
#>    <chr>              <chr>                                            
#>  1 sim_rep1           "sim_fun(\"rep1\")"                              
#>  2 sim_rep2           "sim_fun(\"rep2\")"                              
#>  3 sim_rep3           "sim_fun(\"rep3\")"                              
#>  4 sim_rep4           "sim_fun(\"rep4\")"                              
#>  5 sim_rep5           "sim_fun(\"rep5\")"                              
#>  6 "\"sim_rep1.pdf\"" "save_plot(sim_rep1, file_out(\"sim_rep1.pdf\"))"
#>  7 "\"sim_rep2.pdf\"" "save_plot(sim_rep2, file_out(\"sim_rep2.pdf\"))"
#>  8 "\"sim_rep3.pdf\"" "save_plot(sim_rep3, file_out(\"sim_rep3.pdf\"))"
#>  9 "\"sim_rep4.pdf\"" "save_plot(sim_rep4, file_out(\"sim_rep4.pdf\"))"
#> 10 "\"sim_rep5.pdf\"" "save_plot(sim_rep5, file_out(\"sim_rep5.pdf\"))"

vis_drake_graph(drake_config(plan))

make(plan)
#> target sim_rep1
#> target sim_rep2
#> target sim_rep3
#> target sim_rep4
#> target sim_rep5
#> target file "sim_rep1.pdf"
#> target file "sim_rep2.pdf"
#> target file "sim_rep3.pdf"
#> target file "sim_rep4.pdf"
#> target file "sim_rep5.pdf"
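
The advice above to gather small summaries rather than the simulations themselves can be sketched in base R, independent of drake. The helper summarize_sim() and the chosen summary columns are assumptions made for illustration; sim_fun() is the toy function from the workflow above, repeated so the block runs on its own.

```r
# Toy simulation, repeated from the workflow above so this block is self-contained.
sim_fun <- function(rep, ...) {
  data.frame(x = rnorm(25), y = rnorm(25))
}

# Hypothetical helper: reduce one simulation to a single-row summary.
# The full simulation object goes out of scope when the function returns.
summarize_sim <- function(rep) {
  sim <- sim_fun(rep)
  data.frame(
    rep = rep,
    n = nrow(sim),
    mean_x = mean(sim$x),
    mean_y = mean(sim$y),
    stringsAsFactors = FALSE
  )
}

# Gather only the small summaries, one row per simulation.
reps <- paste0("rep", 1:5)
summaries <- do.call(rbind, lapply(reps, summarize_sim))
```

The gathered object grows by one small row per simulation, so it stays cheap to load no matter how large each simulation is.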

bart1 · 2018-03-15T14:17:54Z

Just as a small addition: with the merge_pdfs function from tabulizer you can add one command that merges everything into a single pdf. Here is an example of how this can all be combined into one pdf:

require(tabulizer)
pdfs <- paste('file_in("sim_rep', 1:5, '.pdf")', sep = '', collapse = ',')
gatherCmd <- data.frame(target = "'comb.pdf'",
  command = paste('merge_pdfs(c(', pdfs, "), file_out('comb.pdf'))"))
plan <- rbind(plan, gatherCmd)

wlandau · 2018-03-15T14:20:05Z

Nice! Have you heard of patchwork? I have never used it myself, but it sounds like another way to do this if you can use ggplot2.

bart1 · 2018-03-15T14:45:03Z

That looks like a nice package. I have used some alternatives to it before. Just a quick example:

library(drake)
library(magrittr)
library(ggplot2)
require(patchwork)
sim_fun <- function(rep, ...) {
  data.frame(x = rnorm(25), y = rnorm(25))
}

sims <- drake_plan(sim = sim_fun("REP"), strings_in_dots = "literals") %>%
  evaluate_plan(wildcard = "REP", values = paste0("rep", 1:5)) %>%
  print
plots <- drake_plan(plt = ggplot(data = dataset__, aes(x = x, y = y)) + geom_point()) %>%
  plan_analyses(sims)
comb <- data.frame(
  target = "'plot.pdf'",
  command = paste('ggsave(', paste(plots$target, collapse = ' + '), ", file=file_out('plot.pdf'))")
)
plan <- rbind(sims, plots, comb)
vis_drake_graph(drake_config(plan))
make(plan)

Would there be an alternative to generating the command with paste? Or is that the most efficient approach for the time being?

wlandau · 2018-03-15T14:51:28Z

Maybe a new function like reduce_plan()? gather_plan() assumes you have a function like list() or rbind() and the targets you gather are named arguments separated by commas. A reduce_plan() function might look something like this:

plots
## # A tibble: 5 x 2
##   target       command                                                  
##   <chr>        <chr>                                                    
## 1 plt_sim_rep1 ggplot(data = sim_rep1, aes(x = x, y = y)) + geom_point()
## 2 plt_sim_rep2 ggplot(data = sim_rep2, aes(x = x, y = y)) + geom_point()
## 3 plt_sim_rep3 ggplot(data = sim_rep3, aes(x = x, y = y)) + geom_point()
## 4 plt_sim_rep4 ggplot(data = sim_rep4, aes(x = x, y = y)) + geom_point()
## 5 plt_sim_rep5 ggplot(data = sim_rep5, aes(x = x, y = y)) + geom_point()

reduce_plan(plots, op = "+", target = "reduced_target")
## # A tibble: 1 x 2
##   target         command
##   <chr>          <chr>
## 1 reduced_target plt_sim_rep1 + plt_sim_rep2 + plt_sim_rep3 + plt_sim_rep4 + plt_sim_rep5

I would definitely welcome a pull request with the implementation.
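
As a rough base R sketch of the idea (not drake's implementation), the reduction could be built by pasting the target names together with the operator. The name reduce_plan_sketch() and its argument defaults are hypothetical, and a plain data frame stands in for the plan tibble.

```r
# Hypothetical helper illustrating the proposed reduce_plan() behaviour.
# It returns a one-row plan whose command chains all targets with `op`.
reduce_plan_sketch <- function(plan, target = "target", op = "+") {
  data.frame(
    target = target,
    command = paste(plan$target, collapse = paste0(" ", op, " ")),
    stringsAsFactors = FALSE
  )
}

# Stand-in for the `plots` plan shown above (plain data frame, not a tibble).
plots <- data.frame(
  target = paste0("plt_sim_rep", 1:5),
  command = paste0(
    "ggplot(data = sim_rep", 1:5, ", aes(x = x, y = y)) + geom_point()"
  ),
  stringsAsFactors = FALSE
)

reduced <- reduce_plan_sketch(plots, target = "reduced_target", op = "+")
```

The resulting row can be appended to the rest of the plan with rbind(), in the same way as the hand-built paste() commands earlier in the thread.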

wlandau · 2019-01-26T18:00:10Z

Update: development drake now has a much friendlier (experimental) API. It is now easier to gather by specific groups. Details: https://ropenscilabs.github.io/drake-manual/plans.html#create-large-plans-the-easy-way. combine(.by = ...) is like dplyr::group_by().

wlandau · 2019-01-31T15:48:21Z

Edit: the link changed to https://ropenscilabs.github.io/drake-manual/plans.html#large-plans.

wlandau-lilly added topic: performance type: faq labels Mar 15, 2018

wlandau-lilly closed this as completed Mar 15, 2018

wlandau mentioned this issue Mar 15, 2018

New function reduce_plan() #326

Closed

wlandau added topic: api type: use case and removed type: faq topic: performance labels Jan 26, 2019

