Better performance with big datasets #277
For even better performance with big datasets, I guess the option would be to use packages such as ff (see the bigdata2018 course materials, the example of loading big Kaggle data, and the r-bloggers blog post), bigmemory (see the bigmemory overview vignette from 2010), or some database format such as RSQLite (see the RSQLite vignette). I would personally prefer the database approach. An experimental branch could be created for testing it, but I would not include these features in 4.0.0.
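A minimal sketch of what the database approach could look like with RSQLite (just an illustration, not a proposed implementation; the table name, column name, and file path below are placeholders):

```r
library(DBI)
library(RSQLite)

# Open (or create) an on-disk SQLite file to act as a local cache
con <- dbConnect(RSQLite::SQLite(), "eurostat_cache.sqlite")

# 'dat' stands for the data.frame returned by the download step;
# writing it to a table moves it out of RAM and onto disk
dbWriteTable(con, "migr_asyappctzm", dat, overwrite = TRUE)

# Later calls can pull only the slice that is actually needed
fi_rows <- dbGetQuery(
  con,
  "SELECT * FROM migr_asyappctzm WHERE geo = 'FI'"
)

dbDisconnect(con)
```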
Big datasets with inadequate amounts of RAM seem to be quite torturous to the system. I tested this on a cloud-based Windows 10 Enterprise machine with an Intel Xeon 6242R and 6 GB of allocated RAM: downloading and handling the aforementioned dataset without data.table failed after 20 minutes with the error "cannot allocate vector of size 614.9 Mb". It was actually surprising that the process was able to go on as long as it did, as it seemed to involve quite a lot of swapping to HDD; disk reads and writes were pretty high while the function was running. When attempting to use the data.table functions it crashed R as well, but it crashed a lot quicker, so maybe it wastes less of your time... Some potential improvements:
Some more general notes on the problems I encountered while working with a Virtual Machine / Windows:
@ake123 and I ran some tests. Handling big datasets scales pretty badly: "migr_asyappctzm" has only 80,708,892 (80 million) values, whereas the biggest dataset in Eurostat, "ef_lsk_main", has 148,362,539 (148 million) values, and my computer with 16 GB of RAM was not able to handle the latter despite it having less than twice the amount of rows. Of course it might have more columns as well; I have not checked the data file contents that thoroughly. Using packages that allow partial loading of datasets (such as SQLite and MonetDB) would seem like the next logical step to take. See Laura DeCicco's blog post on the US Geological Survey website for benchmarks and comparisons of different approaches: Working with pretty big data in R
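For partial loading, something along these lines could work once the data has been dumped into an SQLite file (a sketch only; it assumes a table named ef_lsk_main with a geo column already exists in the database):

```r
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), "eurostat_cache.sqlite")

# A lazy table reference: nothing is loaded into memory yet
lsk <- tbl(con, "ef_lsk_main")

# Filtering and aggregation are translated to SQL and executed
# inside the database, so only the small result is collected in R
res <- lsk %>%
  filter(geo %in% c("FI", "SE")) %>%
  count(geo) %>%
  collect()

dbDisconnect(con)
```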
There is still work to be done on this issue in subsequent releases, but it has been solved with a minimum viable solution for now. For further developments we will open a new issue. Closed with the CRAN release of package version 4.0.0.
As mentioned in #98, big datasets can cause R to hang or crash. I experienced this myself when running tests on weekly data (see issue #200), which is obviously bigger than monthly/yearly data.
I have been testing replacing some old functions with data.table based implementations on the migr_asyappctzm dataset and got promising benchmarks (best case scenarios in bold) for the following steps:
- get_eurostat_raw
- tidy_eurostat
- saveRDS
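For reference, a rough sketch of how such timings can be produced (the file path and the compared expressions are only placeholders; the actual benchmarks above covered the full get_eurostat_raw / tidy_eurostat / saveRDS pipeline):

```r
library(microbenchmark)
library(data.table)

# Path to an already downloaded raw TSV file (placeholder)
raw_file <- "migr_asyappctzm.tsv"

# Time a base-R read against data.table::fread on the same file
microbenchmark(
  base_read = read.delim(raw_file, sep = "\t", stringsAsFactors = FALSE),
  dt_fread  = data.table::fread(raw_file, sep = "\t"),
  times = 3
)
```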
So in the worst case scenario the whole processing took 4 min 20 sec, whereas in the best case scenario it took 1 min 20 sec. Some seconds could be shaved off by not compressing the cache file or by turning caching off, but I don't think it's worth it. The added benefit of using data.table and its := operations was that no copies of the object are made in memory, as operations are done in place. I think there might still be some quirks in utilising data.table as opposed to the old method, so I've made the data.table code optional for now. Feedback and suggestions are very much welcome in this issue or the forthcoming PR I will link to this issue.