
Better performance with big datasets #277

Closed
pitkant opened this issue Sep 29, 2023 · 4 comments · Fixed by #278

pitkant (Member) commented Sep 29, 2023

As mentioned in #98, big datasets can cause R to hang or crash. I also experienced this myself when running tests on weekly data (see issue #200), which is obviously bigger than monthly/yearly data.

I have been testing the replacement of some old functions with data.table based implementations on the migr_asyappctzm dataset and got the following promising benchmarks:

get_eurostat_raw

  • readr::read_tsv: about 1 minute 20 seconds
  • data.table::fread: about 11 seconds

tidy_eurostat

  • tidyr::pivot_longer, dplyr::filter, gsub, ...: 1 min 23 sec
  • data.table::melt, na.omit, stringi::stri_replace_all_fixed, ...: 45 sec

saveRDS

  • compress = TRUE, stringsAsFactors = FALSE: 1 min 40 sec
  • compress = TRUE, stringsAsFactors = TRUE: 24 sec (cache file size 15.6 Mb)
  • compress = FALSE, stringsAsFactors = FALSE: 1 min 15 sec (cache file size 7.61 Gb)
  • compress = FALSE, stringsAsFactors = TRUE: 8 sec (cache file size 3.54 Gb)

So in the worst-case scenario the whole processing took 4 min 20 sec, whereas in the best-case scenario it took 1 min 20 sec. Some seconds could be shaved off by not compressing the cache file or by turning caching off, but I don't think it's worth it. The added benefit of using data.table and its := operations is that no copies of the object are made in memory, as operations are done in place:

A shallow copy is just a copy of the vector of column pointers (corresponding to the columns in a data.frame or data.table). The actual data is not physically copied in memory.

A deep copy on the other hand copies the entire data to another location in memory.

With data.table’s := operator, absolutely no copies are made in both (1) and (2), irrespective of R version you are using. This is because := operator updates data.table columns in-place (by reference).
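To make the quoted point concrete, here is a minimal illustration of in-place modification with := (illustrative only, not the package's actual implementation):

```r
# Illustrative only, not the package's actual code.
library(data.table)

dt <- data.table(values = c("1 234", ":", "5 678"))

# Base-R style replacement builds a modified copy of the column:
# dt$values <- gsub(" ", "", dt$values, fixed = TRUE)

# data.table's := updates the column in place (by reference),
# so no extra copy of the table is held in memory:
dt[, values := gsub(" ", "", values, fixed = TRUE)]
dt[values == ":", values := NA_character_]
```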

I think there might still be some quirks in utilising data.table as opposed to the old method, so I've made the data.table code optional for now. Feedback and suggestions are very much welcome in this issue or in the forthcoming PR that I will link to this issue.
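For reference, the raw-read comparison above could be reproduced roughly like this (the cached file name is an assumption, and data.table::fread needs the R.utils package installed to read .gz files directly):

```r
# Sketch of the raw-read benchmark; the cached file name is an assumption.
raw_file <- "migr_asyappctzm.tsv.gz"

system.time(raw_readr <- readr::read_tsv(raw_file))
system.time(raw_dt    <- data.table::fread(raw_file))
```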

pitkant (Member, Author) commented Oct 2, 2023

For even better performance with big datasets, one option would be to use packages like ff (see the bigdata2018 course materials, an example of loading big Kaggle data, and an R-bloggers blog post) or bigmemory (see the bigmemory overview vignette from 2010), or some database format such as RSQLite (see the RSQLite vignette). I would personally prefer the database approach.

An experimental branch could be created for testing this approach, but I would not include these features in 4.0.0.
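As a rough sketch of what the database approach could look like (all file, table and column names below are assumptions for illustration, not an implemented feature):

```r
# Rough sketch only; 'tidy_data' stands for an already downloaded and
# tidied dataset, and all names here are hypothetical.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "eurostat_cache.sqlite")

# Write the dataset into an on-disk SQLite table once...
dbWriteTable(con, "migr_asyappctzm", tidy_data, overwrite = TRUE)

# ...then later pull only the slice that is actually needed into RAM:
fi_2020 <- dbGetQuery(
  con,
  "SELECT * FROM migr_asyappctzm WHERE geo = 'FI' AND TIME_PERIOD >= '2020-01-01'"
)

dbDisconnect(con)
```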

pitkant (Member, Author) commented Oct 2, 2023

Big datasets with inadequate amounts of RAM seem to be quite torturous to the system. Tested on a cloud-based Windows 10 Enterprise machine with an Intel Xeon 6242R and 6 GB of allocated RAM, downloading and handling the aforementioned dataset without data.table failed after 20 minutes with the error "cannot allocate vector of size 614.9 Mb". It was actually surprising that the process was able to go on as long as it did, as it seemed to involve quite a lot of swapping to disk; reads and writes were pretty high while the function was running. Attempting to use the data.table functions crashed R as well, but it crashed a lot quicker, so at least it wastes less of your time...

Some potential improvements (a rough sketch of the first two ideas follows this list):

  • Should users be warned if the dataset gzip file is larger than some pre-set threshold? What would that threshold be?
  • Should this warning include a "Do you wish to continue?" style prompt?
  • Should we have the option to handle the dataset in chunks?
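Such a warning and prompt could look roughly like this; the helper name, the 50 MB threshold and the use of a HEAD request are all assumptions, not current package behaviour:

```r
# Hypothetical pre-download check; threshold and helper name are assumptions.
check_download_size <- function(url, threshold_mb = 50) {
  # Read the Content-Length header without downloading the file itself
  resp <- httr::HEAD(url)
  len <- httr::headers(resp)[["content-length"]]
  if (is.null(len)) return(TRUE)

  size_mb <- as.numeric(len) / 1024^2
  if (!is.na(size_mb) && size_mb > threshold_mb && interactive()) {
    message(sprintf("The compressed dataset is about %.0f MB.", size_mb))
    return(isTRUE(utils::askYesNo("Do you wish to continue?")))
  }
  TRUE
}
```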

Some more general notes on the problems I encountered while working with a virtual machine running Windows:

  • RTools is required for running command-line tools on Windows; this is needed, for example, for gzip. Maybe this requirement could be circumvented somehow (one possible workaround is sketched after this list).
  • I attempted to use my home folder as the working directory, but on the virtual machine it was a UNC path of the form "\\path.to.somewhere\username\R_files", which produced the error message "UNC paths are not supported". I used the system Downloads folder as a temporary working folder instead.
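One possible workaround for the gzip dependency (an assumption, not something the package currently does) would be to decompress inside R:

```r
# Decompress the downloaded .gz file in R, so no external gzip / RTools
# command is needed (hypothetical workaround, file name assumed).
R.utils::gunzip("migr_asyappctzm.tsv.gz",
                destname = "migr_asyappctzm.tsv",
                remove = FALSE, overwrite = TRUE)
```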
Session info
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=Finnish_Finland.utf8 
[2] LC_CTYPE=Finnish_Finland.utf8   
[3] LC_MONETARY=Finnish_Finland.utf8
[4] LC_NUMERIC=C                    
[5] LC_TIME=Finnish_Finland.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] eurostat_4.0.0.9005

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0   httr2_0.2.3        remotes_2.4.2.1   
 [4] purrr_1.0.2        generics_0.1.3     vctrs_0.6.3       
 [7] utf8_1.2.3         rlang_1.1.1        pkgbuild_1.3.1    
[10] e1071_1.7-13       pillar_1.9.0       glue_1.6.2        
[13] withr_2.5.1        rappdirs_0.3.3     bit64_4.0.5       
[16] readxl_1.4.3       lifecycle_1.0.3    plyr_1.8.8        
[19] stringr_1.5.0      cellranger_1.1.0   callr_3.7.1       
[22] tzdb_0.4.0         ps_1.7.1           parallel_4.2.1    
[25] curl_4.3.2         class_7.3-20       fansi_1.0.4       
[28] Rcpp_1.0.11        KernSmooth_2.23-20 readr_2.1.4       
[31] backports_1.4.1    classInt_0.4-10    vroom_1.6.3       
[34] jsonlite_1.8.7     bit_4.0.5          countrycode_1.5.0 
[37] hms_1.1.3          digest_0.6.33      stringi_1.7.12    
[40] processx_3.7.0     dplyr_1.1.3        rprojroot_2.0.3   
[43] bibtex_0.5.1       here_1.0.1         ISOweek_0.6-2     
[46] cli_3.6.1          tools_4.2.1        magrittr_2.0.3    
[49] proxy_0.4-27       tibble_3.2.1       RefManageR_1.4.0  
[52] tidyr_1.3.0        crayon_1.5.2       pkgconfig_2.0.3   
[55] regions_0.1.8      data.table_1.14.8  xml2_1.3.5        
[58] prettyunits_1.2.0  timechange_0.2.0   lubridate_1.9.3   
[61] assertthat_0.2.1   httr_1.4.7         rstudioapi_0.13   
[64] R6_2.5.1           compiler_4.2.1

pitkant (Member, Author) commented Oct 6, 2023

@ake123 and I ran tests with the migr_asyappctzm dataset and were able to handle it (albeit slowly) even on a Mac with 8 GB of RAM, both with use.data.table = TRUE and with use.data.table = FALSE. However, it did not work with stringsAsFactors = FALSE, so keeping that option enabled is advisable.
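A quick illustration of why stringsAsFactors = TRUE helps here (numbers are approximate; Eurostat columns repeat a small set of labels millions of times):

```r
# Repeated labels stored as character vs. as factor (approximate sizes).
x_chr <- sample(c("FI", "SE", "DE", "FR"), 1e7, replace = TRUE)
x_fct <- factor(x_chr)

object.size(x_chr)  # ~80 MB: one 8-byte pointer per element
object.size(x_fct)  # ~40 MB: one 4-byte integer code per element + 4 labels
```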

Handling big datasets scales pretty badly: "migr_asyappctzm" has 80,708,892 (about 80 million) values, whereas the biggest dataset in Eurostat, "ef_lsk_main", has 148,362,539 (about 148 million) values, and my computer with 16 GB of RAM was not able to handle the latter despite it having less than twice the number of values. Of course it might have more columns as well; I have not checked the data file contents that thoroughly.

Using packages that allow partial loading of datasets (such as SQLite and MonetDB) would seem like the next logical step. See Laura DeCicco's blog post on the US Geological Survey website for benchmarks and comparisons of different approaches: Working with pretty big data in R.
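Until then, chunked processing is another way to avoid holding the whole file in memory at once. A sketch with readr follows; the file name and the filter are made up, and the raw Eurostat TSV would still need its usual reshaping afterwards:

```r
# Process the file in 100,000-row chunks, keeping only the rows of interest.
library(readr)

keep_finland <- function(chunk, pos) {
  # Hypothetical filter; the real raw TSV has a combined dimension column.
  chunk[grepl(",FI$", chunk[[1]]), ]
}

fi_rows <- read_tsv_chunked(
  "migr_asyappctzm.tsv",
  callback = DataFrameCallback$new(keep_finland),
  chunk_size = 100000
)
```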

pitkant (Member, Author) commented Dec 20, 2023

There is still work to be done on this in subsequent releases, but the issue has been solved with a minimum viable solution for now. For further developments we will open a new issue. Closed with the CRAN release of package version 4.0.0.

@pitkant pitkant closed this as completed Dec 20, 2023