
File-backed data.tables #1336

Closed
zachmayer opened this issue Sep 16, 2015 · 18 comments
Labels
feature request · top request (One of our most-requested issues)

Comments

@zachmayer

SFrames are GraphLab Create's version of data.frames, and they have some impressive performance benchmarks on single machines.

I'd really love to see something similar for data.table that could use disk rather than RAM to store the data.

@arunsrinivasan
Member

Agreed. Probably for v2.0.0, depending on how much time and motivation we have.

@jaapwalhout

The links in @zachmayer's original post are no longer valid. The GitHub repo of GraphLab/Dato/Turi can be found here. Because GraphLab/Dato/Turi was acquired by Apple, the repo has since moved here. It looks like it has evolved into a library for developing machine learning models.

In case the two links above stop working, I've created a fork under my own profile.

@aquasync

One potential implementation strategy is via R's custom allocator mechanism. I constructed a file-backed data.table with individual columns backed by mmap-ed files, based on the code here.

See this gist, where I create the 2B-row dataset (~75GB) from the benchmarks and run some aggregations on my laptop (16GB RAM). There are many missing pieces that make this far from a user-friendly solution, though. Among them:

- R's custom allocator is used for the entire array object, so an R-implementation-specific header is prepended to the data;
- the mapping can't be shared between R sessions, even read-only, due to the former;
- data.table allocations for new objects (columns/indices) can't be hooked, so those won't be memory-mapped;
- there is no support for real string columns;
- column attributes must be persisted manually.

All those caveats aside, I've already found it quite useful when working with a large number of moderately sized datasets: each is sequentially memory-mapped, data.table is told the data are already sorted (attr(DT, 'order') = ...), and a "roll" join is performed to extract data with a given lookback, so that only the data needed for the binary search and the subsequent values has to be read from disk.
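
To make that workflow concrete, here is a minimal sketch of the data.table side of the pattern, assuming the mapped columns have already been assembled into a data.table (the custom-allocator/mmap step is C-level and not shown). I'm using setattr(DT, "sorted", ...) to declare the key that setkey() would otherwise compute; the column names and data are made up:

```r
library(data.table)

# Stand-in for a table whose columns are backed by mmap-ed files; assumed
# to be already sorted by (id, ts) on disk.
DT <- data.table(id = rep(1:2, each = 3),
                 ts = rep(as.Date("2015-09-01") + 0:2, times = 2),
                 px = rnorm(6))

# Declare the table pre-sorted so data.table skips the re-sort that would
# otherwise fault every page of the mapping into memory.
setattr(DT, "sorted", c("id", "ts"))

# Rolling join: for each query row, take the latest observation at or
# before `ts`. Only the pages touched by the binary search and the matched
# rows need to be read from disk.
queries <- data.table(id = 1:2, ts = as.Date("2015-09-02"))
res <- DT[queries, on = .(id, ts), roll = TRUE]
```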

@jonekeat

Is something similar to what @aquasync proposed already implemented? I have tried using the mmap package to memory-map each column in a list and then setDT, but the result does not work with data.table methods. I am looking for alternatives before resorting to databases/Spark or rewriting in C/C++.
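
For reference, the failing attempt looks roughly like this (a sketch using the CRAN mmap package; the file names and modes are made up). The mapped columns are objects of class "mmap" rather than ordinary atomic vectors, which is why data.table's C internals can't operate on them:

```r
library(mmap)
library(data.table)

# Map each on-disk column file as a typed, vector-like mmap object.
cols <- list(
  x = mmap("x.bin", mode = real64()),
  y = mmap("y.bin", mode = int32())
)

setDT(cols)
# data.table's grouping/join code expects ordinary atomic vectors in
# contiguous memory, so subsequent data.table methods fail on this object.
```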

@jangorecki
Member

@jonekeat disk.frame is possibly an alternative, but I haven't tried it myself.
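
For anyone landing here, the basic disk.frame pattern looks like this (a sketch; the file, outdir, and column names are made up). data.table syntax is evaluated per chunk, so grouped aggregations need a second pass to combine the chunk-level results:

```r
library(disk.frame)
library(data.table)

setup_disk.frame()  # start disk.frame's background workers

# Convert a large CSV into a chunked, on-disk frame.
dff <- csv_to_disk.frame("big.csv", outdir = "big.df")

# The j-expression runs on each chunk separately ...
partial <- dff[, .(s = sum(value)), by = id]

# ... so a second, in-memory pass combines the per-chunk sums.
result <- partial[, .(total = sum(s)), by = id]
```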

@GitHunter0

> @jonekeat disk.frame is possibly an alternative, but I haven't tried it myself.

disk.frame is the most promising R solution for this problem that I've seen so far. It would be very interesting to see the data.table and disk.frame contributors work together.

@r2evans

r2evans commented Apr 9, 2024

As a current-day workaround, what about using arrow::open_dataset and dtplyr or similar? The data is immutable, so "saving" data would need to be an explicit step, but at least fast access to on-disk data should be feasible. (I recognize this does not fully address all likely use cases for on-disk data.table operations; it is mostly a technique for mitigating large-data operations.)
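
Concretely, something along these lines (a sketch using arrow's dplyr interface directly; the dataset path and column names are assumptions). The pipeline is lazy until collect(), so arrow reads only the columns and row groups the query touches:

```r
library(arrow)
library(dplyr)
library(data.table)

# Open a directory of Parquet files without loading them into RAM.
ds <- open_dataset("data/big_parquet")

# Filters and projections are pushed down to the files.
res <- ds |>
  filter(year == 2023) |>
  group_by(id) |>
  summarise(total = sum(value)) |>
  collect()

setDT(res)  # continue with data.table semantics once the result fits in RAM
```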

@tdhock
Member

tdhock commented Apr 9, 2024

This is currently out of scope (https://github.com/Rdatatable/data.table/blob/master/GOVERNANCE.md#the-r-package), and I don't think anyone has the time/interest/skill to implement it, so I'm closing.

@tdhock tdhock closed this as completed Apr 9, 2024
@r2evans

r2evans commented Apr 9, 2024

I don't disagree; it's definitely big in scope. I offered my comment to illustrate alternative paths.

@MichaelChirico
Member

> This is currently out of scope (https://github.com/Rdatatable/data.table/blob/master/GOVERNANCE.md#the-r-package), and I don't think anyone has the time/interest/skill to implement it, so I'm closing.

To clarify: I'd be glad to have the scope expanded for this high-demand feature request, but as noted, the current maintainer core has no time/ability to support it. Outside contributions (and commitment to ownership) are welcome.
