Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C-stack achieved at quickly as the number of variables increase #548

Closed
Bijaelo opened this issue Jul 24, 2020 · 8 comments · Fixed by #613
Closed

C-stack achieved at quickly as the number of variables increase #548

Bijaelo opened this issue Jul 24, 2020 · 8 comments · Fixed by #613

Comments

@Bijaelo
Copy link

Bijaelo commented Jul 24, 2020

I started looking into the recipes package, which seems to provide some incredible tools for not only preprocessing but also evaluating at a production level, if speed is not of the essence. Especially the latter part is something that is extremely useful.

In my current line of work however I often work around 300 variables, and this quickly turns out to be a problem for the recipes, as it hits the stack limit rather quickly

library(recipes)
nmax <- 300
data <- matrix(rnorm(nmax), ncol = nmax)
data <- as.data.frame(data)
#Success vector
v <- logical(nmax - 1)
for(i in seq_len(nmax)[-1]){
  e <- try(recipe(data, reformulate(paste0('V', seq(2, i), collapse = '+'), response = 'V1')), silent = TRUE)
  v[i - 1] <- !inherits(e, 'try-error')
  rm(e)
  gc()
}
#At which point does recipes fail?
which.min(unlist(v))
# [1] 144

The problem seems to be how ´recipes:::fun_calls` recursively evaluates formulas.

form <- reformulate(paste0('V', 2:200, collapse = '+'), 'V1')
recipes:::fun_calls(form)
#Error: C stack usage  7976588 is too close to the limit

while I am not entirely sure how this could be avoided, it seems like it should be possible to split the formula and evaluate in a less recursive way.

This was experienced in a Docker container (using rocker/tidyverse:4.0.0), with an allocated memory of 6 GB, 1.5 GB swap memory, 60 GB disk space and 4 CPU's.

SessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] recipes_0.1.13 dplyr_1.0.0   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6         plyr_1.8.6           pillar_1.4.4         compiler_4.0.0       gower_0.2.1          class_7.3-16         iterators_1.0.12     tools_4.0.0         
 [9] rpart_4.1-15         ipred_0.9-9          nlme_3.1-147         gtable_0.3.0         lubridate_1.7.8      lifecycle_0.2.0      tibble_3.0.3         lattice_0.20-41     
[17] pkgconfig_2.0.3      rlang_0.4.7          Matrix_1.2-18        foreach_1.5.0        DBI_1.1.0            rstudioapi_0.11      prodlim_2019.11.13   stringr_1.4.0       
[25] withr_2.2.0          pROC_1.16.2          generics_0.0.2       vctrs_0.3.2          stats4_4.0.0         grid_4.0.0           nnet_7.3-13          caret_6.0-86        
[33] tidyselect_1.1.0     data.table_1.12.8    glue_1.4.1           R6_2.4.1             survival_3.1-12      lava_1.6.7           reshape2_1.4.4       ggplot2_3.3.2       
[41] purrr_0.3.4          magrittr_1.5         ModelMetrics_1.2.2.2 scales_1.1.1         codetools_0.2-16     ellipsis_0.3.0       MASS_7.3-51.5        splines_4.0.0       
[49] colorspace_1.4-1     timeDate_3043.102    stringi_1.4.6        munsell_0.5.0        crayon_1.3.4

For completeness sake, this error can be avoided completely by using the roles argument rather than a formula., but it seems inconvenient to have one functionality simply crash the function.

@Bijaelo Bijaelo closed this as completed Jul 24, 2020
@Bijaelo Bijaelo reopened this Jul 24, 2020
@EmilHvitfeldt
Copy link
Member

I feel like this might be a copy of #467.

You can pass the data to recipe() without a formula, and then use update_role() to set outcomes and predictors

library(recipes)
nmax <- 300
data <- matrix(rnorm(nmax), ncol = nmax)
data <- as.data.frame(data)
#Success vector

recipe(data) %>%
  update_role(everything()) %>%
  update_role(V1, new_role = "outcome")
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor        299

Created on 2020-07-24 by the reprex package (v0.3.0)

3 things to note:

  • This is the same as specifying V1 ~ .
  • update_role() defaults to predictor.
  • update_role() overwrites, so it will only have "outcome" as its role.

@Bijaelo
Copy link
Author

Bijaelo commented Jul 25, 2020

Hi @EmilHvitfeldt, I added a last minute note when writing the post at the very end, also stating the alternative (using the role argument) which also ended up being the method I went with to get past this situation. But seeming as formula's havsbeen a core part of the R language since... forever, and as it has been included in the package, it seems sensible to make it as fireproof as possible. From the previous post it seems that topepo isn't interested in rewriting the function to avoid this bug, and despite agreeing with it being it being 'frustrating' this feature seems essential, if the goal of the package is to become a core part of every R programmers day-to-day work and learning curve (eg. Get it into university curriculum).

I spent a bit of twiddling, I succeeded in converting the fun_calls function and removing the recursive call. The basic idea was to replace the recursive call by imitating a stack containing the individual component of f and iterating until this stack has depleted by evaluating each component seperately. For convenience I made certain the function extends the stack using the LIFO (last in first out) principle, which yields the same result as recipes:::fun_calls such that identical could be used to compare a series of tests. The function has one new argument maxdepth = 1e5, as I decided to go for a for-loop over do-while for performance reasons (I find it highly unlikely that a function is encountered with a nested depth of more than 1e5).

I've also included a microbenchmark which shows that the difference in computation time is insignificant for anything but quosures (which I have no experience with, so my testing here is limited) while it seems to be slightly faster for "large calls". If it is of interest I'll try to figure out how to create a proper pull-request with this suggested function. For reference it takes about 700ms to evaluate a formula of 2500 parameters (not included) on my horribly tiny and dying laptop.

fun_calls <- compiler::cmpfun(function (f, maxdepth = 1e5) 
{
  if(missing(maxdepth) || !is.numeric(maxdepth))
    maxdepth <- 1e5
  stack <- list(f)
  out <- c()
  # For loops are slightly faster than while(TRUE)..break loops, even compiled
  for(j in seq_len(maxdepth)){
    f <- stack[[1]]
    stack <- stack[-1]
    # Handle rare cases where unnesting function bodies cause 
    if(!missing(f)){
      if (is.function(f)) {
        stack <- c(body(f), stack)
      }
      else if (rlang::is_quosure(f)) {
        stack <- c(rlang::quo_get_expr(f), stack)
      }
      else if (is.call(f)) {
        fname <- as.character(f[[1]])
        if (!identical(fname, ".Internal")){
          # Unnest calls (similar to unlist(lapply(f[-1, fun_calls]), use.names = FALSE)), but a bit more complicated to ensure 
          z <- f[-1]
          stacki <- vector('list', n <- length(z))
          for(i in seq_len(n)){
            stacki[[i]] <- z[[i]]
          }
          stack <- c(stacki, stack)
        }
        out <- c(out, fname)
      }
    }
    if(identical(stack, list())){
      # Avoid any "last iteration succes" errors being thrown
      j <- 0 
      break
    }
  }
  if(j == maxdepth)
    rlang::abort('Reached max depth during fun_calls. ')
  unique(out)
})

ff <- recipes:::fun_calls
form <- reformulate(paste0('V', 2:60, collapse = '+'), 'V1')
identical(fun_calls(form),ff(form))
#> [1] TRUE

form2 <- reformulate(paste0('log(exp(s(V', 2:60, ', 24)))', collapse = '+'), 'V1')
identical(fun_calls(form2),ff(form2))
#> [1] TRUE

identical(fun_calls(substr), ff(substr))
#> [1] TRUE

identical(fun_calls(data.frame),ff(data.frame))
#> [1] TRUE

identical(fun_calls(glm), ff(glm))
#> [1] TRUE

library(data.table)
identical(fun_calls(data.table),ff(data.table))
#> [1] TRUE

library(broom)
identical(fun_calls(tidy),ff(tidy))
#> [1] TRUE

library(rlang)
#> 
#> Attaching package: 'rlang'
#> The following object is masked from 'package:data.table':
#> 
#>     :=
quo <- quo(my_quosure)

identical(ff(quo),fun_calls(quo))
#> [1] TRUE

microbenchmark::microbenchmark(form_simple_recipes = ff(form),
                               form_simple_loop = fun_calls(form),
                               form_complex_recipes = ff(form2),
                               form_complex_loop = fun_calls(form2),
                               substr_recipes = ff(substr),
                               substr_loop = fun_calls(substr),
                               data.frame_recipes = ff(data.frame),
                               data.frame_loop = fun_calls(data.frame),
                               data.table_recipes = ff(data.table),
                               data.table_loop = fun_calls(data.table),
                               glm_recipes = ff(glm),
                               glm_loop = fun_calls(glm),
                               tidy_recipes = ff(tidy),
                               tidy_loop = fun_calls(tidy),
                               quosure_recipes = ff(quo),
                               quosure_loop = fun_calls(quo), control = list(warmup = 10))
#> Unit: microseconds
#>                  expr     min       lq      mean   median       uq     max
#>   form_simple_recipes  1496.2  2231.45  3392.592  2980.55  4072.95  8012.1
#>      form_simple_loop  1453.9  2484.35  3898.926  3340.05  4634.50 17725.3
#>  form_complex_recipes  7892.5  9482.55 12299.377 10977.00 14108.75 25754.2
#>     form_complex_loop  6928.4  8702.55 11665.488 10115.50 12371.80 36923.7
#>        substr_recipes   131.4   182.20   441.976   248.05   495.50  2482.6
#>           substr_loop   120.0   174.05   309.978   213.10   322.50  1459.5
#>    data.frame_recipes 11983.3 14894.65 18703.935 16339.65 20411.20 37573.2
#>       data.frame_loop 13270.0 16193.95 20043.610 17797.60 23027.55 40582.7
#>    data.table_recipes  2152.2  3254.10  4805.194  4387.00  5942.50 15006.0
#>       data.table_loop  2003.3  3342.50  5113.217  4642.65  5863.25 16101.3
#>           glm_recipes  4523.8  6292.45  8267.063  7581.15  9540.95 21667.2
#>              glm_loop  4775.1  7691.35  9531.024  8561.65 11214.10 16053.6
#>          tidy_recipes    42.4    62.90   134.212    97.25   122.40  1550.6
#>             tidy_loop    51.0    80.10   200.343   109.35   187.55  3382.8
#>       quosure_recipes     2.9    12.85    17.874    16.30    20.35    78.4
#>          quosure_loop    43.0    71.60   110.045    97.35   124.95   394.1

sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19041)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_Denmark.1252  LC_CTYPE=English_Denmark.1252   
#> [3] LC_MONETARY=English_Denmark.1252 LC_NUMERIC=C                    
#> [5] LC_TIME=English_Denmark.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] rlang_0.4.6       broom_0.5.6       data.table_1.12.8
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4.6         compiler_4.0.0       pillar_1.4.4        
#>  [4] gower_0.2.1          highr_0.8            class_7.3-16        
#>  [7] tools_4.0.0          rpart_4.1-15         digest_0.6.25       
#> [10] ipred_0.9-9          nlme_3.1-147         lubridate_1.7.8     
#> [13] evaluate_0.14        tibble_3.0.1         lifecycle_0.2.0     
#> [16] lattice_0.20-41      pkgconfig_2.0.3      Matrix_1.2-18       
#> [19] microbenchmark_1.4-7 yaml_2.2.1           prodlim_2019.11.13  
#> [22] xfun_0.13            withr_2.2.0          dplyr_0.8.5         
#> [25] stringr_1.4.0        knitr_1.28           generics_0.0.2      
#> [28] vctrs_0.3.0          recipes_0.1.13       nnet_7.3-13         
#> [31] grid_4.0.0           tidyselect_1.1.0     glue_1.4.0          
#> [34] R6_2.4.1             survival_3.1-12      rmarkdown_2.1       
#> [37] lava_1.6.7           tidyr_1.0.3          purrr_0.3.4         
#> [40] magrittr_1.5         backports_1.1.6      splines_4.0.0       
#> [43] ellipsis_0.3.0       htmltools_0.4.0      MASS_7.3-51.6       
#> [46] assertthat_0.2.1     timeDate_3043.102    stringi_1.4.6       
#> [49] crayon_1.3.4

Created on 2020-07-25 by the reprex package (v0.3.0)

@Bijaelo
Copy link
Author

Bijaelo commented Jul 27, 2020

From the project page I can see this was something that was looked upon back in 2017. It might be worth having additional eyes on this specific topic (@topepo ?). I would be glad to help extend this package, as the potential seems great.

@Bijaelo
Copy link
Author

Bijaelo commented Jul 30, 2020

Going further, currently - is not allowed in the formula. This seems to be due to get_rhs_vars, but could be alleviated by using

predictor_names <- attr(data_info, "term.labels")

which only includes the variable name used in the terms, rather than

predictor_names <- names(attr(data_info, "dataClasses"))

which includes all variables similar to unique(all.vars(rlang::f_rhs(form)))

Automatic transformations could similar be obtained from attr(data_info, 'variables') and the returned recipe could rather simply return an object with an initial step_mutate derived from the formula.

@Bijaelo
Copy link
Author

Bijaelo commented Aug 28, 2020

This issue has been open for some time now. Any update available?

@topepo
Copy link
Member

topepo commented Aug 28, 2020

We get to issues when we get to them.

The formula method does exactly what we want it to. In your case, the non-formula method is a much better approach. I'll put some more documentation into ?recipes to point people towards the non-formula method with large dimensions.

@Bijaelo
Copy link
Author

Bijaelo commented Aug 29, 2020

It seems odd @topepo to not go further with this. In the second comment I've come with an alternative for fun_calls which fixes the issue. While it does need testing (a bit of ad-hoc testing was provided), it seems that the step to close the issue is small, and would make the extra documentation unnecessary.

@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org

@github-actions github-actions bot locked and limited conversation to collaborators Feb 20, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants