New function reduce_plan() #326

Closed
wlandau opened this issue Mar 15, 2018 · 6 comments

Comments

@wlandau
Member

wlandau commented Mar 15, 2018

From #325 (comment). Thanks to @bart1 for the idea.

@bart1

bart1 commented Mar 15, 2018

I like the idea. I was wondering whether reduce is the right name for such a function, since to me reduce implies gathering iteratively, which I guess is what I was originally looking for in #325. This expectation is probably also based on base R's Reduce, where ?base::Reduce states

 ‘Reduce’ uses a binary function to successively combine the elements of a given vector and a possibly given initial value.

Would another synonym of gather not be a better name, e.g. aggregate or collect?
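
For reference, the behaviour quoted from ?base::Reduce can be illustrated with a minimal base R sketch (no drake involved). The binary function is applied successively from left to right, and accumulate = TRUE exposes the intermediate results. The variable names are only illustrative.

```r
# Base R only: Reduce() folds a vector from left to right with a binary function.
total <- Reduce(`+`, 1:8)                       # ((((1 + 2) + 3) + ...) + 8)
running <- Reduce(`+`, 1:8, accumulate = TRUE)  # every intermediate sum
with_init <- Reduce(`+`, 1:8, 100)              # same fold, starting from 100

print(total)
print(running)
print(with_init)
```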

@wlandau-lilly
Collaborator

wlandau-lilly commented Mar 15, 2018

For reduce_plan(), I am actually thinking of a couple different user-side options. One is to combine everything in one command.

x_plan
## # A tibble: 8 x 2
##   target command
##   <chr>  <chr>  
## 1 x_1    1      
## 2 x_2    2      
## 3 x_3    3      
## 4 x_4    4      
## 5 x_5    5      
## 6 x_6    6      
## 7 x_7    7      
## 8 x_8    8

reduce_plan(x_plan, target = "x_sum")
## # A tibble: 1 x 2
##   target command                                      
##   <chr>  <chr>                                        
## 1 x_sum  x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8

Another is to do a pairwise reduction, which could be parallelized with the jobs argument to make(). I think we should consider similar functionality for #233 (cc @krlmlr).

reduce_plan(x_plan, target = "x_sum", pairwise = TRUE)
## # A tibble: 7 x 2
##   target  command          
##   <chr>   <chr>            
## 1 x_sum_1 x_1 + x_2        
## 2 x_sum_2 x_3 + x_4        
## 3 x_sum_3 x_5 + x_6        
## 4 x_sum_4 x_7 + x_8        
## 5 x_sum_5 x_sum_1 + x_sum_2
## 6 x_sum_6 x_sum_3 + x_sum_4
## 7 x_sum   x_sum_5 + x_sum_6

I am not sure a pairwise gather_plan() is appropriate all the time. It could be useful for gather_plan(gather = "c"), but gather_plan(gather = "list") would return a (nearly) binary tree instead of a flat list.
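
The difference between the two gather modes can be seen in a small base R sketch that combines the same four values pairwise by hand. It does not call gather_plan(); the values and variable names are placeholders.

```r
# Base R only: the same four values combined pairwise with c() and with list().
flat <- c(c(1, 2), c(3, 4))             # c() flattens: a plain vector of length 4
nested <- list(list(1, 2), list(3, 4))  # list() nests: a two-level binary tree

print(length(flat))    # number of elements in the flat vector
print(length(nested))  # number of top-level branches in the nested list
str(nested)
```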

@wlandau-lilly
Collaborator

wlandau-lilly commented Mar 15, 2018

Update: I implemented a new reduce_plan() function in the new i326 branch. It should allow you to do naive and pairwise reductions with binary operators and arbitrary functions that take at least two arguments. It seems to work for both even and odd numbers of targets, but I will need to add some tests before I do a PR and merge it. reduce_plan() is convenient and small enough that I think we can roll it into the CRAN release of 5.1.0 next week.

I really like this feature. It generalizes gather_plan() and helps you avoid memory issues and super long commands.

library(drake)
x_plan <- evaluate_plan(drake_plan(x = VALUE), wildcard = "VALUE", values = 1:9)
x_plan
#> # A tibble: 9 x 2
#>   target command
#>   <chr>  <chr>  
#> 1 x_1    1      
#> 2 x_2    2      
#> 3 x_3    3      
#> 4 x_4    4      
#> 5 x_5    5      
#> 6 x_6    6      
#> 7 x_7    7      
#> 8 x_8    8      
#> 9 x_9    9
reduce_plan(x_plan, target = "x_sum", begin = "", end = "")
#> # A tibble: 1 x 2
#>   target command                                                          
#>   <chr>  <chr>                                                            
#> 1 x_sum  x_1  +  x_2  +  x_3  +  x_4  +  x_5  +  x_6  +  x_7  +  x_8  +  …
reduce_plan(x_plan, target = "x_sum", pairwise = TRUE)
#> # A tibble: 8 x 2
#>   target  command            
#>   <chr>   <chr>              
#> 1 x_sum_1 (x_1 + x_2)        
#> 2 x_sum_2 (x_3 + x_4)        
#> 3 x_sum_3 (x_5 + x_6)        
#> 4 x_sum_4 (x_7 + x_8)        
#> 5 x_sum_5 (x_9 + x_sum_1)    
#> 6 x_sum_6 (x_sum_2 + x_sum_3)
#> 7 x_sum_7 (x_sum_4 + x_sum_5)
#> 8 x_sum   (x_sum_6 + x_sum_7)
reduce_plan(
  x_plan,
  target = "x_sum",
  pairwise = TRUE,
  begin = "fun(", op = ", ", 
  end = ")"
)
#> # A tibble: 8 x 2
#>   target  command              
#>   <chr>   <chr>                
#> 1 x_sum_1 fun(x_1, x_2)        
#> 2 x_sum_2 fun(x_3, x_4)        
#> 3 x_sum_3 fun(x_5, x_6)        
#> 4 x_sum_4 fun(x_7, x_8)        
#> 5 x_sum_5 fun(x_9, x_sum_1)    
#> 6 x_sum_6 fun(x_sum_2, x_sum_3)
#> 7 x_sum_7 fun(x_sum_4, x_sum_5)
#> 8 x_sum   fun(x_sum_6, x_sum_7)
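
The pairing order in the output above is consistent with a simple queue: take the first two names, emit a command for them, and append the new target name to the back. The following is a rough base R sketch of that idea, not drake's actual implementation. pairwise_commands() is a hypothetical helper, and its argument names merely mirror those of reduce_plan().

```r
# Hypothetical helper, NOT drake's implementation: build a pairwise reduction
# plan as a plain data frame of target names and command strings.
pairwise_commands <- function(targets, target = "x_sum",
                              begin = "(", op = " + ", end = ")") {
  stopifnot(length(targets) >= 2)
  queue <- targets  # names still waiting to be combined
  out_target <- character(0)
  out_command <- character(0)
  i <- 0
  while (length(queue) > 1) {
    i <- i + 1
    # The last remaining pair produces the final target; earlier pairs get
    # numbered intermediate names.
    name <- if (length(queue) == 2) target else paste0(target, "_", i)
    out_target <- c(out_target, name)
    out_command <- c(out_command, paste0(begin, queue[1], op, queue[2], end))
    # Drop the two inputs from the front and append the new target at the back.
    queue <- c(queue[-(1:2)], name)
  }
  data.frame(target = out_target, command = out_command,
             stringsAsFactors = FALSE)
}

plan <- pairwise_commands(paste0("x_", 1:9))
print(plan)
```

With an odd number of inputs, the leftover name (x_9 here) is simply paired with the first intermediate target, so no special case is needed.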

@krlmlr
Collaborator

krlmlr commented Mar 15, 2018

How is this different from pack() in #304?

@bart1

bart1 commented Mar 15, 2018

That looks really nice. A quick example shows it also works well for, for example, combining ggplots:

require(drake)
require(patchwork)
require(ggplot2)
require(magrittr)
plots <- drake_plan(
  plot = ggplot(data = data.frame(x = rnorm(10), y = rnorm(10))) +
    geom_point(aes(x = x, y = y))
) %>%
  expand_plan(c("rep1", "rep2", "rep3", "rep4"))
plotsGathered <- reduce_plan(plots, target = "fullPlot", begin = "", end = "")
plan <- rbind(
  plots,
  plotsGathered,
  drake_plan(ggsave(filename = file_out('test.pdf'), fullPlot))
)
drake_graph(drake_config(plan))
make(plan)

One thing I wonder about is that it currently applies begin and end a lot (once for each pair):

> x_plan <- evaluate_plan(drake_plan(x = VALUE), wildcard = "VALUE", values = 1:9)
> reduce_plan(x_plan, target = "x_sum")
# A tibble: 1 x 2
  target command                                                            
  <chr>  <chr>                                                              
1 x_sum  ((((((((x_1 + x_2) + x_3) + x_4) + x_5) + x_6) + x_7) + x_8) + x_9)

I was thinking it might be better to apply them only once when pairwise = FALSE, since operators work pairwise anyway. Nesting within a single command is also not going to give any significant memory reduction, because x_1 through x_9 are all loaded at the same time before the command runs, if I'm right (so to save memory one should use pairwise = TRUE)?

Something like this hypothetical example:

> reduce_plan(x_plan, target = "x_sum", start='ggsave(', end=',filename=file_out("test.pdf")')
# A tibble: 1 x 2
  target command                                                            
  <chr>  <chr>                                                              
1 "\"test.pdf\""  ggsave(x_1 + x_2+ x_3+ x_4 + x_5 + x_6 + x_7 + x_8 + x_9, filename=file_out(\"test.pdf\"))

On the other hand, the current version keeps things more consistent between pairwise being TRUE and FALSE. I guess I don't know; I see advantages and disadvantages to either.

@wlandau
Member Author

wlandau commented Mar 16, 2018

It's a good point, but I think something like start and end are important for each pair when pairwise = TRUE (now the default) so people can more easily define their own reductions that conserve memory. The version of start and end you mentioned is much easier to do manually post hoc, and I hesitate to encumber the interface.
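
As a rough illustration of the post hoc approach, the finished command can be wrapped once with ordinary string manipulation. This sketch uses a plain data frame as a stand-in for a plan so that it runs without drake. wrap_begin and wrap_end are illustrative names, not drake arguments, and the command text is a made-up example.

```r
# Stand-in for the output of a non-pairwise reduction: a plain data frame
# with the same two columns as a drake plan (hypothetical example values).
plan <- data.frame(
  target = "x_sum",
  command = "x_1 + x_2 + x_3",
  stringsAsFactors = FALSE
)

# Wrap the finished command once, after the fact.
# `wrap_begin` and `wrap_end` are illustrative names, not drake arguments.
wrap_begin <- "ggsave("
wrap_end <- ", filename = file_out(\"test.pdf\"))"
plan$command <- paste0(wrap_begin, plan$command, wrap_end)

print(plan$command)
```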

4 participants