high dimensional datasets? #467

Closed
fkgruber opened this issue Feb 7, 2020 · 4 comments

Comments

@fkgruber

fkgruber commented Feb 7, 2020

Hello,
I'm running into problems using high-dimensional datasets (as in genomic problems) with the recipes package: creating a recipe from a formula fails with a protection stack overflow.

Minimal Reproducible Code:

library(tidymodels)
testdf = as_tibble(matrix(rnorm(500 * 20000), ncol = 20000))
rec = recipe(~., data = testdf)

The output is:

Error: protect(): protection stack overflow

Session Info:

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] yardstick_0.0.5  tibble_2.1.3     rsample_0.0.5    tidyr_1.0.0     
 [5] recipes_0.1.9    purrr_0.3.3      parsnip_0.0.5    infer_0.5.1     
 [9] ggplot2_3.2.1    dplyr_0.8.3      dials_0.0.4      scales_1.0.0    
[13] broom_0.5.2      tidymodels_0.0.3

loaded via a namespace (and not attached):
  [1] minqa_1.2.4          colorspace_1.4-1     class_7.3-15        
  [4] ggridges_0.5.1       rsconnect_0.8.13     markdown_0.9        
  [7] base64enc_0.1-3      tidytext_0.2.2       rstudioapi_0.10     
 [10] listenv_0.7.0        furrr_0.1.0          rstan_2.19.2        
 [13] SnowballC_0.6.0      DT_0.5               prodlim_2018.04.18  
 [16] lubridate_1.7.4      codetools_0.2-16     splines_3.5.3       
 [19] knitr_1.22           shinythemes_1.1.2    zeallot_0.1.0       
 [22] bayesplot_1.7.1      nloptr_1.2.1         pROC_1.16.1         
 [25] shiny_1.4.0          compiler_3.5.3       backports_1.1.5     
 [28] assertthat_0.2.1     Matrix_1.2-17        fastmap_1.0.1       
 [31] lazyeval_0.2.2       cli_1.1.0            later_1.0.0         
 [34] htmltools_0.4.0      prettyunits_1.0.2    tools_3.5.3         
 [37] igraph_1.2.4.1       gtable_0.3.0         glue_1.3.1          
 [40] reshape2_1.4.3       Rcpp_1.0.2           DiceDesign_1.8-1    
 [43] vctrs_0.2.0          nlme_3.1-139         crosstalk_1.0.0     
 [46] timeDate_3043.102    gower_0.2.1          xfun_0.6            
 [49] stringr_1.4.0        globals_0.12.4       ps_1.3.0            
 [52] lme4_1.1-21          mime_0.6             miniUI_0.1.1.1      
 [55] lifecycle_0.1.0      gtools_3.8.1         tidypredict_0.4.3   
 [58] future_1.12.0        MASS_7.3-51.4        zoo_1.8-5           
 [61] ipred_0.9-8          rstanarm_2.19.2      colourpicker_1.0    
 [64] promises_1.1.0       parallel_3.5.3       inline_0.3.15       
 [67] shinystan_2.5.0      tidyposterior_0.0.2  gridExtra_2.3       
 [70] loo_2.1.0            StanHeaders_2.21.0-1 rpart_4.1-15        
 [73] stringi_1.4.3        tokenizers_0.2.1     dygraphs_1.1.1.6    
 [76] boot_1.3-20          pkgbuild_1.0.3       lava_1.6.5          
 [79] rlang_0.4.1          pkgconfig_2.0.3      matrixStats_0.54.0  
 [82] lattice_0.20-38      rstantools_2.0.0     htmlwidgets_1.5.1   
 [85] processx_3.3.0       tidyselect_0.2.5     plyr_1.8.4          
 [88] magrittr_1.5         R6_2.4.0             generics_0.0.2      
 [91] pillar_1.4.2         withr_2.1.2          xts_0.11-2          
 [94] survival_2.44-1.1    nnet_7.3-12          janeaustenr_0.1.5   
 [97] crayon_1.3.4         grid_3.5.3           callr_3.2.0         
[100] threejs_0.3.1        digest_0.6.22        xtable_1.8-4        
[103] httpuv_1.5.2         stats4_3.5.3         munsell_0.5.0       
[106] tcltk_3.5.3          shinyjs_1.0         
> 
@EmilHvitfeldt
Member

This is happening because of the way R handles formulas with a very large number of variables. For now, you can pass your data to recipe() without a formula and set the roles manually; update_role() assigns the "predictor" role by default.

library(tidymodels)
testdf = as_tibble(matrix(rnorm(500 * 20000), ncol = 20000))
rec = recipe(testdf) %>%
  update_role(everything())
rec
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>  predictor      20000

Created on 2020-02-07 by the reprex package (v0.3.0)
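
If the wide data frame also contains an outcome column, the same workaround can be extended with a second update_role() call. The following is only a minimal sketch on a small stand-in dataset: the data frame `wide` and the column name `y` are assumptions for illustration and do not come from the original report.

```r
library(recipes)

# Hypothetical stand-in for a wide dataset: 50 rows, 200 predictor columns,
# plus an outcome column named `y` (all names here are illustrative).
set.seed(1)
wide <- as.data.frame(matrix(rnorm(50 * 200), ncol = 200))
wide$y <- rnorm(50)

rec <- recipe(wide) %>%
  update_role(everything()) %>%          # new_role defaults to "predictor"
  update_role(y, new_role = "outcome")   # then override the outcome column

# Tabulate the roles recorded in the recipe
role_counts <- table(summary(rec)$role)
role_counts
```

Because no formula is ever built, this path avoids the formula machinery entirely, which is why it should scale to many more columns than `recipe(y ~ ., data = wide)`.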

Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.0 (2019-04-26)
#>  os       macOS Mojave 10.14.6        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/Los_Angeles         
#>  date     2020-02-07                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version     date       lib
#>  assertthat      0.2.1       2019-03-21 [1]
#>  backports       1.1.5       2019-10-02 [1]
#>  base64enc       0.1-3       2015-07-28 [1]
#>  bayesplot       1.7.1       2019-12-01 [1]
#>  boot            1.3-23      2019-07-05 [1]
#>  broom         * 0.5.3       2019-12-14 [1]
#>  callr           3.4.1       2020-01-24 [1]
#>  class           7.3-15      2019-01-01 [1]
#>  cli             2.0.1.9000  2020-01-31 [1]
#>  codetools       0.2-16      2018-12-24 [1]
#>  colorspace      1.4-1       2019-03-18 [1]
#>  colourpicker    1.0         2017-09-27 [1]
#>  crayon          1.3.4       2020-01-31 [1]
#>  crosstalk       1.0.0       2016-12-21 [1]
#>  desc            1.2.0       2018-05-01 [1]
#>  devtools        2.2.1       2019-09-24 [1]
#>  dials         * 0.0.4.9000  2020-01-03 [1]
#>  DiceDesign      1.8-1       2019-07-31 [1]
#>  digest          0.6.23      2019-11-23 [1]
#>  dplyr         * 0.8.4       2020-01-31 [1]
#>  DT              0.10        2019-11-12 [1]
#>  dygraphs        1.1.1.6     2018-07-11 [1]
#>  ellipsis        0.3.0       2019-09-20 [1]
#>  evaluate        0.14        2019-05-28 [1]
#>  fansi           0.4.1       2020-01-08 [1]
#>  fastmap         1.0.1       2019-10-08 [1]
#>  foreach         1.4.7       2019-07-27 [1]
#>  fs              1.3.1       2019-05-06 [1]
#>  furrr           0.1.0       2018-05-16 [1]
#>  future          1.15.1      2019-11-25 [1]
#>  generics        0.0.2       2018-11-29 [1]
#>  ggplot2       * 3.3.0.9000  2020-01-31 [1]
#>  ggridges        0.5.1       2018-09-27 [1]
#>  globals         0.12.5      2019-12-07 [1]
#>  glue            1.3.1       2019-03-12 [1]
#>  gower           0.2.1       2019-05-14 [1]
#>  GPfit           1.0-8       2019-02-08 [1]
#>  gridExtra       2.3         2017-09-09 [1]
#>  gtable          0.3.0       2019-03-25 [1]
#>  gtools          3.8.1       2018-06-26 [1]
#>  highr           0.8         2019-03-20 [1]
#>  htmltools       0.4.0       2019-10-04 [1]
#>  htmlwidgets     1.5.1       2019-10-08 [1]
#>  httpuv          1.5.2       2019-09-11 [1]
#>  igraph          1.2.4.2     2019-11-27 [1]
#>  infer         * 0.5.1       2019-11-19 [1]
#>  inline          0.3.15      2018-05-18 [1]
#>  ipred           0.9-9       2019-04-28 [1]
#>  iterators       1.0.12      2019-07-26 [1]
#>  janeaustenr     0.1.5       2017-06-10 [1]
#>  knitr           1.27.2      2020-01-23 [1]
#>  later           1.0.0       2019-10-04 [1]
#>  lattice         0.20-38     2018-11-04 [1]
#>  lava            1.6.6       2019-08-01 [1]
#>  lhs             1.0.1       2019-02-03 [1]
#>  lifecycle       0.1.0       2019-08-01 [1]
#>  listenv         0.8.0       2019-12-05 [1]
#>  lme4            1.1-21      2019-03-05 [1]
#>  loo             2.1.0       2019-03-13 [1]
#>  lubridate       1.7.4       2018-04-11 [1]
#>  magrittr        1.5         2014-11-22 [1]
#>  markdown        1.1         2019-08-07 [1]
#>  MASS            7.3-51.4    2019-03-31 [1]
#>  Matrix          1.2-18      2019-11-27 [1]
#>  matrixStats     0.55.0      2019-09-07 [1]
#>  memoise         1.1.0       2017-04-21 [1]
#>  mime            0.9         2020-02-04 [1]
#>  miniUI          0.1.1.1     2018-05-18 [1]
#>  minqa           1.2.4       2014-10-09 [1]
#>  munsell         0.5.0       2018-06-12 [1]
#>  nlme            3.1-143     2019-12-10 [1]
#>  nloptr          1.2.1       2018-10-03 [1]
#>  nnet            7.3-12      2016-02-02 [1]
#>  parsnip       * 0.0.4.9000  2019-12-25 [1]
#>  pillar          1.4.3       2019-12-20 [1]
#>  pkgbuild        1.0.6       2019-10-09 [1]
#>  pkgconfig       2.0.3       2019-09-22 [1]
#>  pkgload         1.0.2       2018-10-29 [1]
#>  plyr            1.8.5       2019-12-10 [1]
#>  prettyunits     1.1.1       2020-01-24 [1]
#>  pROC            1.16.1      2020-01-14 [1]
#>  processx        3.4.1       2019-07-18 [1]
#>  prodlim         2019.11.13  2019-11-17 [1]
#>  promises        1.1.0       2019-10-04 [1]
#>  ps              1.3.0       2018-12-21 [1]
#>  purrr         * 0.3.3       2019-10-18 [1]
#>  R6              2.4.1       2019-11-12 [1]
#>  Rcpp            1.0.3       2019-11-08 [1]
#>  recipes       * 0.1.9       2020-01-16 [1]
#>  remotes         2.1.0.9000  2020-01-31 [1]
#>  reshape2        1.4.3       2017-12-11 [1]
#>  rlang           0.4.4       2020-01-28 [1]
#>  rmarkdown       2.1         2020-01-20 [1]
#>  rpart           4.1-15      2019-04-12 [1]
#>  rprojroot       1.3-2       2018-01-03 [1]
#>  rsample       * 0.0.5       2019-07-12 [1]
#>  rsconnect       0.8.16      2019-12-13 [1]
#>  rstan           2.19.2      2019-07-09 [1]
#>  rstanarm        2.19.2      2019-10-03 [1]
#>  rstantools      2.0.0       2019-09-15 [1]
#>  rstudioapi      0.10.0-9003 2020-01-31 [1]
#>  scales        * 1.1.0       2019-11-18 [1]
#>  sessioninfo     1.1.1       2018-11-05 [1]
#>  shiny           1.4.0       2019-10-10 [1]
#>  shinyjs         1.0         2018-01-08 [1]
#>  shinystan       2.5.0       2018-05-01 [1]
#>  shinythemes     1.1.2       2018-11-06 [1]
#>  SnowballC       0.6.0       2019-01-15 [1]
#>  StanHeaders     2.19.0      2019-09-07 [1]
#>  stringi         1.4.5       2020-01-11 [1]
#>  stringr         1.4.0       2019-02-10 [1]
#>  survival        3.1-8       2019-12-03 [1]
#>  testthat        2.3.1       2019-12-01 [1]
#>  threejs         0.3.1       2017-08-13 [1]
#>  tibble        * 2.1.3       2019-06-06 [1]
#>  tidymodels    * 0.0.3       2019-10-04 [1]
#>  tidyposterior   0.0.2       2018-11-15 [1]
#>  tidypredict     0.4.3       2019-09-03 [1]
#>  tidyr         * 1.0.2       2020-01-24 [1]
#>  tidyselect      1.0.0       2020-01-27 [1]
#>  tidytext        0.2.2.900   2019-10-19 [1]
#>  timeDate        3043.102    2018-02-21 [1]
#>  tokenizers      0.2.1       2018-03-29 [1]
#>  usethis         1.5.1.9000  2020-02-05 [1]
#>  vctrs           0.2.99.9005 2020-02-05 [1]
#>  withr           2.1.2.9000  2020-01-31 [1]
#>  workflows       0.0.0.9002  2019-12-21 [1]
#>  xfun            0.12        2020-01-13 [1]
#>  xtable          1.8-4       2019-04-21 [1]
#>  xts             0.11-2      2018-11-05 [1]
#>  yaml            2.2.1       2020-02-01 [1]
#>  yardstick     * 0.0.5.9000  2020-01-31 [1]
#>  zoo             1.8-6       2019-05-28 [1]
#>  source                               
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  Github (r-lib/cli@e9f041e)           
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  Github (r-lib/crayon@f4bc7b8)        
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  local                                
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  Github (tidyverse/ggplot2@81ffdd0)   
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  Github (yihui/knitr@ab191b0)         
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  Github (tidymodels/parsnip@2e5d3fa)  
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  Github (tidymodels/recipes@b40a0cf)  
#>  Github (r-lib/remotes@8d8d545)       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  Github (rstudio/rstudioapi@eab7bcc)  
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  Github (juliasilge/tidytext@525c1f7) 
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  Github (r-lib/usethis@ff34e40)       
#>  Github (r-lib/vctrs@9970a0b)         
#>  Github (r-lib/withr@16d47fd)         
#>  Github (tidymodels/workflows@305fe6a)
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  CRAN (R 3.6.0)                       
#>  local                                
#>  CRAN (R 3.6.0)                       
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

@fkgruber
Author

fkgruber commented Feb 7, 2020

Thanks! That worked.
Yes, I know R's formula handling is always an issue with a large number of variables. Is this something that you plan to improve in recipes?

@topepo
Member

topepo commented May 1, 2020

Probably not. It isn't something that is usually encountered and formulas become very expensive with a large number of columns.
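
One way to get a feel for the cost mentioned above is to look at the object that base R's terms() builds for a formula. The sketch below uses only base R; the column names V1..Vp are made up, and p is kept far below 20000 so that it runs quickly. It illustrates the growth pattern and is not a benchmark of recipes itself.

```r
# Sketch: the "factors" attribute of a terms object is a variables-by-terms
# matrix, so for a main-effects formula it has p * p entries.
for (p in c(100, 500, 1000)) {
  f <- reformulate(paste0("V", seq_len(p)))    # ~ V1 + V2 + ... + Vp
  elapsed <- system.time(tt <- terms(f))[["elapsed"]]
  fac <- attr(tt, "factors")
  cat(sprintf("p = %4d  factors matrix: %d x %d  elapsed: %.3f s\n",
              p, nrow(fac), ncol(fac), elapsed))
}
```

Since the matrix grows quadratically with the number of columns, a formula over 20000 variables would need on the order of 4e8 entries, which gives some intuition for why the non-formula interface is the suggested route.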

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators Feb 22, 2021