-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C-stack achieved at quickly as the number of variables increase #548
Comments
I feel like this might be a copy of #467. You can pass the data to library(recipes)
nmax <- 300
data <- matrix(rnorm(nmax), ncol = nmax)
data <- as.data.frame(data)
#Success vector
recipe(data) %>%
update_role(everything()) %>%
update_role(V1, new_role = "outcome")
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> outcome 1
#> predictor 299 Created on 2020-07-24 by the reprex package (v0.3.0) 3 things to note:
|
Hi @EmilHvitfeldt, I added a last minute note when writing the post at the very end, also stating the alternative (using the I spent a bit of twiddling, I succeeded in converting the I've also included a microbenchmark which shows that the difference in computation time is insignificant for anything but quosures (which I have no experience with, so my testing here is limited) while it seems to be slightly faster for "large calls". If it is of interest I'll try to figure out how to create a proper pull-request with this suggested function. For reference it takes about 700ms to evaluate a formula of 2500 parameters (not included) on my horribly tiny and dying laptop. fun_calls <- compiler::cmpfun(function (f, maxdepth = 1e5)
{
if(missing(maxdepth) || !is.numeric(maxdepth))
maxdepth <- 1e5
stack <- list(f)
out <- c()
# For loops are slightly faster than while(TRUE)..break loops, even compiled
for(j in seq_len(maxdepth)){
f <- stack[[1]]
stack <- stack[-1]
# Handle rare cases where unnesting function bodies cause
if(!missing(f)){
if (is.function(f)) {
stack <- c(body(f), stack)
}
else if (rlang::is_quosure(f)) {
stack <- c(rlang::quo_get_expr(f), stack)
}
else if (is.call(f)) {
fname <- as.character(f[[1]])
if (!identical(fname, ".Internal")){
# Unnest calls (similar to unlist(lapply(f[-1, fun_calls]), use.names = FALSE)), but a bit more complicated to ensure
z <- f[-1]
stacki <- vector('list', n <- length(z))
for(i in seq_len(n)){
stacki[[i]] <- z[[i]]
}
stack <- c(stacki, stack)
}
out <- c(out, fname)
}
}
if(identical(stack, list())){
# Avoid any "last iteration succes" errors being thrown
j <- 0
break
}
}
if(j == maxdepth)
rlang::abort('Reached max depth during fun_calls. ')
unique(out)
})
ff <- recipes:::fun_calls
form <- reformulate(paste0('V', 2:60, collapse = '+'), 'V1')
identical(fun_calls(form),ff(form))
#> [1] TRUE
form2 <- reformulate(paste0('log(exp(s(V', 2:60, ', 24)))', collapse = '+'), 'V1')
identical(fun_calls(form2),ff(form2))
#> [1] TRUE
identical(fun_calls(substr), ff(substr))
#> [1] TRUE
identical(fun_calls(data.frame),ff(data.frame))
#> [1] TRUE
identical(fun_calls(glm), ff(glm))
#> [1] TRUE
library(data.table)
identical(fun_calls(data.table),ff(data.table))
#> [1] TRUE
library(broom)
identical(fun_calls(tidy),ff(tidy))
#> [1] TRUE
library(rlang)
#>
#> Attaching package: 'rlang'
#> The following object is masked from 'package:data.table':
#>
#> :=
quo <- quo(my_quosure)
identical(ff(quo),fun_calls(quo))
#> [1] TRUE
microbenchmark::microbenchmark(form_simple_recipes = ff(form),
form_simple_loop = fun_calls(form),
form_complex_recipes = ff(form2),
form_complex_loop = fun_calls(form2),
substr_recipes = ff(substr),
substr_loop = fun_calls(substr),
data.frame_recipes = ff(data.frame),
data.frame_loop = fun_calls(data.frame),
data.table_recipes = ff(data.table),
data.table_loop = fun_calls(data.table),
glm_recipes = ff(glm),
glm_loop = fun_calls(glm),
tidy_recipes = ff(tidy),
tidy_loop = fun_calls(tidy),
quosure_recipes = ff(quo),
quosure_loop = fun_calls(quo), control = list(warmup = 10))
#> Unit: microseconds
#> expr min lq mean median uq max
#> form_simple_recipes 1496.2 2231.45 3392.592 2980.55 4072.95 8012.1
#> form_simple_loop 1453.9 2484.35 3898.926 3340.05 4634.50 17725.3
#> form_complex_recipes 7892.5 9482.55 12299.377 10977.00 14108.75 25754.2
#> form_complex_loop 6928.4 8702.55 11665.488 10115.50 12371.80 36923.7
#> substr_recipes 131.4 182.20 441.976 248.05 495.50 2482.6
#> substr_loop 120.0 174.05 309.978 213.10 322.50 1459.5
#> data.frame_recipes 11983.3 14894.65 18703.935 16339.65 20411.20 37573.2
#> data.frame_loop 13270.0 16193.95 20043.610 17797.60 23027.55 40582.7
#> data.table_recipes 2152.2 3254.10 4805.194 4387.00 5942.50 15006.0
#> data.table_loop 2003.3 3342.50 5113.217 4642.65 5863.25 16101.3
#> glm_recipes 4523.8 6292.45 8267.063 7581.15 9540.95 21667.2
#> glm_loop 4775.1 7691.35 9531.024 8561.65 11214.10 16053.6
#> tidy_recipes 42.4 62.90 134.212 97.25 122.40 1550.6
#> tidy_loop 51.0 80.10 200.343 109.35 187.55 3382.8
#> quosure_recipes 2.9 12.85 17.874 16.30 20.35 78.4
#> quosure_loop 43.0 71.60 110.045 97.35 124.95 394.1
sessionInfo()
#> R version 4.0.0 (2020-04-24)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19041)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_Denmark.1252 LC_CTYPE=English_Denmark.1252
#> [3] LC_MONETARY=English_Denmark.1252 LC_NUMERIC=C
#> [5] LC_TIME=English_Denmark.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] rlang_0.4.6 broom_0.5.6 data.table_1.12.8
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 compiler_4.0.0 pillar_1.4.4
#> [4] gower_0.2.1 highr_0.8 class_7.3-16
#> [7] tools_4.0.0 rpart_4.1-15 digest_0.6.25
#> [10] ipred_0.9-9 nlme_3.1-147 lubridate_1.7.8
#> [13] evaluate_0.14 tibble_3.0.1 lifecycle_0.2.0
#> [16] lattice_0.20-41 pkgconfig_2.0.3 Matrix_1.2-18
#> [19] microbenchmark_1.4-7 yaml_2.2.1 prodlim_2019.11.13
#> [22] xfun_0.13 withr_2.2.0 dplyr_0.8.5
#> [25] stringr_1.4.0 knitr_1.28 generics_0.0.2
#> [28] vctrs_0.3.0 recipes_0.1.13 nnet_7.3-13
#> [31] grid_4.0.0 tidyselect_1.1.0 glue_1.4.0
#> [34] R6_2.4.1 survival_3.1-12 rmarkdown_2.1
#> [37] lava_1.6.7 tidyr_1.0.3 purrr_0.3.4
#> [40] magrittr_1.5 backports_1.1.6 splines_4.0.0
#> [43] ellipsis_0.3.0 htmltools_0.4.0 MASS_7.3-51.6
#> [46] assertthat_0.2.1 timeDate_3043.102 stringi_1.4.6
#> [49] crayon_1.3.4 Created on 2020-07-25 by the reprex package (v0.3.0) |
Going further, currently predictor_names <- attr(data_info, "term.labels") which only includes the variable name used in the terms, rather than predictor_names <- names(attr(data_info, "dataClasses")) which includes all variables similar to Automatic transformations could similar be obtained from |
This issue has been open for some time now. Any update available? |
We get to issues when we get to them. The formula method does exactly what we want it to. In your case, the non-formula method is a much better approach. I'll put some more documentation into |
It seems odd @topepo to not go further with this. In the second comment I've come with an alternative for |
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org |
I started looking into the recipes package, which seems to provide some incredible tools for not only preprocessing but also evaluating at a production level, if speed is not of the essence. Especially the latter part is something that is extremely useful.
In my current line of work however I often work around 300 variables, and this quickly turns out to be a problem for the recipes, as it hits the stack limit rather quickly
The problem seems to be how ´recipes:::fun_calls` recursively evaluates formulas.
while I am not entirely sure how this could be avoided, it seems like it should be possible to
split
the formula and evaluate in a less recursive way.This was experienced in a Docker container (using
rocker/tidyverse:4.0.0
), with an allocated memory of 6 GB, 1.5 GB swap memory, 60 GB disk space and 4 CPU's.For completeness sake, this error can be avoided completely by using the
roles
argument rather than a formula., but it seems inconvenient to have one functionality simply crash the function.The text was updated successfully, but these errors were encountered: