Gather without loading all dependencies at the same time #325

Closed
bart1 opened this issue Mar 15, 2018 · 7 comments

bart1 commented Mar 15, 2018

I have run into the following issue when running simulations with drake. At the end of each simulation I get quite a big R6 object, which I store with drake. Afterwards I want to generate a single PDF with exploratory plots for each simulation. Currently I gather all simulations into a list with gather_plan() and then plot from this list. This does not scale well, because loading all simulations at the same time runs into memory limits. It also makes me store every simulation twice in the cache: once in the list and once individually.

This is some example code:

minExampleSims <- drake_plan(minExampleSims = simFun()) %>%
  expand_plan(paste0('rep', 1:5))
minSimsGather <- gather_plan(minExampleSims, 'minSims', 'list')
resF <- 'minRes.pdf'
makePdf <- drake_plan(minRes.pdf = {
  pdf(resF)
  for (sim_i in minSims) {
    sim_i$visualizePlot()
    print(sim_i$visualizePlotMetricsWithRealData())
  }
  dev.off()
}, file_targets = TRUE)

I guess the same problem would occur when one fits a lot of models (e.g. the gsp example) and wants to use the default plot function on each model.

I wondered whether I'm missing something or whether this is simply not possible. An alternative approach would be to first reduce all the simulations (for example, make plots using ggplot per simulation, or extract the necessary data), gather these, and then plot them. This is not always an easy option when preexisting plot functions are used.

I wonder if it would be useful/possible to have some kind of recursive version of gather that loads the cached objects one by one.
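To make the "one by one" idea concrete, here is a minimal base R sketch, not drake code. It assumes each simulation has already been saved to its own .rds file (a stand-in for the drake cache; inside drake, readd() or loadd() per target would presumably play the role of readRDS()). The names sim_fun, plot_all, and the file layout are made up for illustration, and a plain data frame stands in for the R6 object.

```r
# Hypothetical stand-in for a simulation result (the real objects are R6).
sim_fun <- function(rep) data.frame(x = rnorm(25), y = rnorm(25))

# Assumption: every simulation lives in its own file on disk.
out_dir <- tempfile("sims_")
dir.create(out_dir)
sim_files <- file.path(out_dir, paste0("sim_rep", 1:5, ".rds"))
for (f in sim_files) saveRDS(sim_fun(f), f)

# Write all plots to one PDF while holding only one simulation in memory.
plot_all <- function(files, pdf_file) {
  pdf(pdf_file)
  on.exit(dev.off())            # close the device even if a plot fails
  for (f in files) {
    sim <- readRDS(f)           # load a single simulation
    plot(y ~ x, data = sim, main = basename(f))
    rm(sim)                     # drop it before loading the next one
  }
  invisible(pdf_file)
}

pdf_file <- plot_all(sim_files, file.path(out_dir, "all_sims.pdf"))
```

Each iteration adds a page to the same PDF device, so nothing is gathered into a list and nothing is stored twice.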

@wlandau-lilly
Collaborator

Thanks for bringing this up, @bart1. It can be tricky to set up workflows when you cannot load everything into memory. In your case, I would recommend that you not gather these massive simulation objects. Even if drake someday has a more memory-efficient way of creating the gathered list, we cannot avoid the fact that the list will be too big to load into memory anyway. Below, I have sketched a modified workflow with no gathering. If you do want to gather things, I recommend you gather small summaries of the simulations rather than the simulations themselves.

library(drake)
library(magrittr)

sim_fun <- function(rep, ...) {
  data.frame(x = rnorm(25), y = rnorm(25))
}

save_plot <- function(data, file) {
  pdf(file)
  plot(y ~ x, data = data)
  dev.off()
}

plan <- drake_plan(
  sim = sim_fun("REP"),
  save_plot(sim_REP, file_out("sim_REP.pdf")),
  strings_in_dots = "literals"
) %>%
  evaluate_plan(wildcard = "REP", values = paste0("rep", 1:5)) %>%
  print
#> # A tibble: 10 x 2
#>    target             command                                          
#>    <chr>              <chr>                                            
#>  1 sim_rep1           "sim_fun(\"rep1\")"                              
#>  2 sim_rep2           "sim_fun(\"rep2\")"                              
#>  3 sim_rep3           "sim_fun(\"rep3\")"                              
#>  4 sim_rep4           "sim_fun(\"rep4\")"                              
#>  5 sim_rep5           "sim_fun(\"rep5\")"                              
#>  6 "\"sim_rep1.pdf\"" "save_plot(sim_rep1, file_out(\"sim_rep1.pdf\"))"
#>  7 "\"sim_rep2.pdf\"" "save_plot(sim_rep2, file_out(\"sim_rep2.pdf\"))"
#>  8 "\"sim_rep3.pdf\"" "save_plot(sim_rep3, file_out(\"sim_rep3.pdf\"))"
#>  9 "\"sim_rep4.pdf\"" "save_plot(sim_rep4, file_out(\"sim_rep4.pdf\"))"
#> 10 "\"sim_rep5.pdf\"" "save_plot(sim_rep5, file_out(\"sim_rep5.pdf\"))"

vis_drake_graph(drake_config(plan))

make(plan)
#> target sim_rep1
#> target sim_rep2
#> target sim_rep3
#> target sim_rep4
#> target sim_rep5
#> target file "sim_rep1.pdf"
#> target file "sim_rep2.pdf"
#> target file "sim_rep3.pdf"
#> target file "sim_rep4.pdf"
#> target file "sim_rep5.pdf"
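The other suggestion above, gathering small summaries instead of the simulations themselves, might look roughly like the following base R sketch. It deliberately leaves drake out; summarize_sim and the chosen summary columns are hypothetical, and in a real plan each summary would be its own target that is gathered afterwards.

```r
# Same toy simulation as in the workflow above.
sim_fun <- function(rep, ...) {
  data.frame(x = rnorm(25), y = rnorm(25))
}

# Hypothetical reducer: one small row per simulation.
summarize_sim <- function(sim, id) {
  data.frame(
    id = id,
    mean_x = mean(sim$x),
    mean_y = mean(sim$y),
    cor_xy = cor(sim$x, sim$y),
    stringsAsFactors = FALSE
  )
}

set.seed(1)
ids <- paste0("rep", 1:5)

# Each simulation exists only inside the function call; only its summary is kept.
summaries <- lapply(ids, function(id) summarize_sim(sim_fun(id), id))

# Gathering five one-row data frames is cheap compared to gathering the simulations.
gathered <- do.call(rbind, summaries)
```

The gathered object has one row per simulation, so its size grows with the number of replicates rather than with the size of each simulation.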

bart1 commented Mar 15, 2018

Just as a small addition: with the merge_pdfs function from tabulizer you can add one command that produces a single PDF. Here is an example of how everything can be combined into one PDF:

require(tabulizer)
gatherCmd<-data.frame(target="'comb.pdf'",command=paste('merge_pdfs(c(',paste('file_in("sim_rep',1:5,'.pdf")',sep='', collapse=','),"), file_out('comb.pdf'))"))
plan<-rbind(plan, gatherCmd)

wlandau commented Mar 15, 2018

Nice! Have you heard of patchwork? I have never used it myself, but it sounds like another way to do this if you can use ggplot2.

bart1 commented Mar 15, 2018

That looks like a nice package; I have used some alternatives to it before. Just a quick example:

library(drake)
library(magrittr)
library(ggplot2)
require(patchwork)
sim_fun <- function(rep, ...) {
  data.frame(x = rnorm(25), y = rnorm(25))
}


sims <- drake_plan(sim = sim_fun("REP"), strings_in_dots = "literals") %>%
  evaluate_plan(wildcard = "REP", values = paste0("rep", 1:5)) %>%
  print
plots <- drake_plan(plt = ggplot(data = dataset__, aes(x = x, y = y)) + geom_point()) %>%
  plan_analyses(sims)
comb <- data.frame(
  target = "'plot.pdf'",
  command = paste(
    'ggsave(', paste(plots$target, collapse = ' + '),
    ", file=file_out('plot.pdf'))"
  )
)
plan <- rbind(sims, plots, comb)
vis_drake_graph(drake_config(plan))
make(plan)

Would there be an alternative to generating the command with paste? Or is that, for the time being, the most efficient way?

wlandau commented Mar 15, 2018

Maybe a new function like reduce_plan()? gather_plan() assumes you have a function like list() or rbind() and the targets you gather are named arguments separated by commas. A reduce_plan() function might look something like this:

plots
## # A tibble: 5 x 2
##   target       command                                                  
##   <chr>        <chr>                                                    
## 1 plt_sim_rep1 ggplot(data = sim_rep1, aes(x = x, y = y)) + geom_point()
## 2 plt_sim_rep2 ggplot(data = sim_rep2, aes(x = x, y = y)) + geom_point()
## 3 plt_sim_rep3 ggplot(data = sim_rep3, aes(x = x, y = y)) + geom_point()
## 4 plt_sim_rep4 ggplot(data = sim_rep4, aes(x = x, y = y)) + geom_point()
## 5 plt_sim_rep5 ggplot(data = sim_rep5, aes(x = x, y = y)) + geom_point()

reduce_plan(plots, op = "+", target = "reduced_target")
## # A tibble: 1 x 2
##   target         command
##   <chr>          <chr>
## 1 reduced_target plt_sim_rep1 + plt_sim_rep2 + plt_sim_rep3 + plt_sim_rep4 + plt_sim_rep5

I would definitely welcome a pull request with the implementation.
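For anyone considering that pull request, the core of the proposal can be sketched in a few lines of base R. This is only an illustration of the idea: reduce_plan_sketch is a hypothetical name, it works on a plain data frame rather than a drake plan, and it is not drake's implementation.

```r
# Hypothetical helper: collapse all targets of a plan into one command
# by joining them with a binary operator.
reduce_plan_sketch <- function(plan, target = "target", op = "+") {
  stopifnot(is.data.frame(plan), "target" %in% names(plan), nrow(plan) > 0)
  command <- paste(plan$target, collapse = paste0(" ", op, " "))
  data.frame(target = target, command = command, stringsAsFactors = FALSE)
}

# Stand-in for the `plots` plan shown above.
plots <- data.frame(
  target = paste0("plt_sim_rep", 1:5),
  command = paste0(
    "ggplot(data = sim_rep", 1:5, ", aes(x = x, y = y)) + geom_point()"
  ),
  stringsAsFactors = FALSE
)

reduced <- reduce_plan_sketch(plots, target = "reduced_target", op = "+")
```

The resulting one-row data frame could be combined with the other plans via rbind(), in the same way as the hand-built comb row earlier in the thread.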

wlandau commented Jan 26, 2019

Update: development drake now has a much friendlier (experimental) API. It is now easier to gather by specific groups. Details: https://ropenscilabs.github.io/drake-manual/plans.html#create-large-plans-the-easy-way. combine(.by = ...) is like dplyr::group_by().

wlandau commented Jan 31, 2019
