
Share data.table among R sessions by reference #3104

Open
DeppLearning opened this issue Oct 11, 2018 · 18 comments
Labels: feature request, top request (One of our most-requested issues)

Comments

@DeppLearning

I'm looking into ways of sharing a data.table among several R processes on the same machine by reference. Is there already one that I missed? I'm looking for functionality analogous to this:

https://www.rdocumentation.org/packages/bigmemory/versions/3.12/topics/describe%2C%20attach.big.matrix

Thank you for the great work on this package.

@MichaelChirico (Member)

There might be some way to use address(DT)? And/or attr(DT, '.internal.selfref')?
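
For illustration, both are easy to inspect from a live session (a minimal sketch; the values printed are only meaningful inside the session that produced them):

library(data.table)
DT <- data.table(x = 1:5)

# address() reports where DT currently lives in this process's memory
address(DT)

# the hidden self-reference attribute is an externalptr that data.table
# uses internally to detect copies
attr(DT, ".internal.selfref")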

@DeppLearning (Author)

Thank you for the suggestion. I tried similar approaches, but after reading up I believe you simply can't do this directly, since these are externalptr objects, which are of no use outside their corresponding R session.

@franknarf1 (Contributor)

File-backed data.tables (#1336) might help (?), though I imagine they will have to be locked for editing/writing (like .SD is) under some conditions.

@st-pasha (Contributor)

There is only one way to share memory across processes: one of the processes has to allocate a new shared memory region using mmap and give it a unique id. This id then has to be communicated to the other process through traditional means (such as pipes). The other process can then mmap the same memory region via the id that it received. If you think of these ids as file names, then the process is exactly equivalent to one process creating a temporary file and the other process reading from that file, except that the file will be located in memory. Actually, it doesn't even have to be: you can open and share a regular file instead.
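
As a trivial illustration of that regular-file baseline (this copies the data rather than truly sharing it, and the path is just an example):

# process A: write the table to a path both processes agree on
library(data.table)
DT <- data.table(id = 1:3, x = c(0.1, 0.2, 0.3))
fwrite(DT, "/tmp/shared_dt.csv")

# process B: read it back; any modifications here happen on a private copy
DT2 <- fread("/tmp/shared_dt.csv")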

Now, the difficult part is to have a data.table object living in a file. This is something that requires intricate knowledge of R's internal API. How do you place a single R vector at a given memory address? How do you ensure that R will not attempt to resize or reclaim this vector? What to do with string columns, where each string is a reference into the global string cache (and the global string cache cannot be shared)?

These are all quite hard questions, and I don't know the answers. However, it is possible in principle. At least Python datatable solved exactly this problem successfully: a Frame can be saved into a .jay file in one process, then opened via memory mapping in any number of other processes, and the copy-on-write semantics ensure that no changes can be accidentally leaked.

@shrektan (Member)

Moreover, how would process B know whether the memory is still valid (rather than garbage), given that process A may already have been killed?

@sritchie73 (Contributor)

@st-pasha and @shrektan, the bigmemory R package that @weltherrschaf references solves this problem for matrix objects (i.e. vectors with dimensions attached), so it can be done. Presumably the bigmemory internals could be extended to store multiple vectors (because a data.table is a data.frame, and a data.frame is really a list() of vectors).
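
For concreteness, the descriptor workflow from the documentation linked above looks roughly like this (a minimal, untested sketch for a non-file-backed shared-memory matrix; the paths are just examples):

# session A: allocate a matrix in shared memory and export its descriptor
library(bigmemory)
x <- big.matrix(nrow = 1e6, ncol = 3, type = "double")
x[1, 1] <- 42
saveRDS(describe(x), "/tmp/x.desc")   # any channel works: a file, a pipe, ...

# session B (same machine): attach the same memory region via the descriptor
library(bigmemory)
y <- attach.big.matrix(readRDS("/tmp/x.desc"))
y[1, 1]   # 42, read from the shared region without copying the whole matrix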

There is one caveat to be aware of with this type of file-backed shared-memory object: some (all?) HPC clusters with multiple nodes hate these. If you create a file-backed shared-memory object and try to access it from multiple R sessions, the object essentially locks up those processes due to the consistency checks made by the HPC filesystem (because those processes might be spread over multiple nodes, even if you explicitly ask for the same node). This is something I had the (dis)pleasure of learning when trying to publish my NetRep R package during my PhD: after the paper was accepted and had passed software review, I discovered this problem and ended up having to rip out the internals and quickly learn C++ so I could parallelise the code (by casting to a C++ Armadillo matrix and writing multithreaded code that operated on those C++ objects in shared memory).

@sritchie73 (Contributor)

Actually, we could probably just do something simple with the bigmemory package: write a function that converts each column to a big.matrix object, and another that can load those matrices/columns and wrap them in a data.table in your new R session. I might play with this over the weekend.
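
Roughly along these lines (just a sketch: the helper names are made up, it assumes numeric columns only, and pulling the columns back out with [, 1] copies them out of shared memory rather than truly sharing them):

library(data.table)
library(bigmemory)

# hypothetical helper for session A: one single-column big.matrix per column
share_dt <- function(DT) {
  cols <- lapply(DT, function(col) as.big.matrix(matrix(col, ncol = 1)))
  list(columns = cols,                        # keep these alive in this session
       descriptors = lapply(cols, describe))  # pass the descriptors to session B
}

# hypothetical helper for session B: attach each column and rebuild a data.table
attach_dt <- function(descriptors) {
  as.data.table(lapply(descriptors, function(d) attach.big.matrix(d)[, 1]))
}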

@DeppLearning (Author)

DeppLearning commented Oct 12, 2018

I used bigmemory for a while and it's not ideal. Attaching a big.matrix can take quite a while. Additionally, the package tends to accumulate temp files that you might have to clean up yourself once in a while.
I think I'd prefer something like feather (https://github.com/wesm/feather), which apparently uses dplyr, is file-backed via the Apache Arrow format (https://arrow.apache.org/), and can be used directly from Python and a bunch of other languages. feather with data.table instead of dplyr would be great.

@ChristK

ChristK commented Oct 12, 2018

Since you mentioned feather, you may want to have a look at the fst package (https://github.com/fstpackage/fst) if you haven't already. The roadmap looks promising: fstpackage/fst#117
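
For reference, the basic fst round trip can already hand you a data.table directly (a minimal example):

library(fst)
library(data.table)

DT <- data.table(id = 1:1e6, x = rnorm(1e6))
write_fst(DT, "DT.fst")                           # compressed, column-oriented file on disk
DT2 <- read_fst("DT.fst", as.data.table = TRUE)   # columns and row ranges can also be read selectively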

@DeppLearning (Author)

fst is great; I used it in my last project and never had any issues with it. I didn't know about their roadmap, though. Feather/Apache Arrow is interesting due to its promise of sharing data by reference within quite a rich ecosystem of languages and services.

@st-pasha (Contributor)

It is my understanding that both fst and feather provide fast serialization/deserialization to/from file, but do not allow a data.frame to actually exist in a file. This may be a reasonable alternative, although it is not true sharing of the data.

At the same time, arrow attempts to implement a true file-backed data frame; however, this will not be an R data.frame (even for primitive types such as integer or numeric, Arrow's format is different from R's). As such, they'd probably need to re-implement all data.frame functionality from scratch...

@MichaelChirico (Member)

Follow here for the R implementation of arrow:

https://github.com/romainfrancois/arrow

@nbenn (Member)

nbenn commented Oct 29, 2018

@st-pasha

How do you place a single R vector at a given memory address?

The C API offers allocVector3() for this purpose.

How do you ensure that R will not attempt to resize or reclaim this vector?

Do you feel this is a problem? Properly PROTECTed objects shouldn't be garbage-collected or anything, I don't think. Do you have a specific problem in mind here, or is this more of a FUD-type statement?

What to do with string columns ... ?

This might be more of a headache. Maybe ALTREP? Maybe that also yields speedups in other areas, such as sort(), unique(), and is.na() for character vectors?

I'm mostly name-checking here. Does anyone with intimate familiarity with the data.table internals have an opinion on the feasibility of shared-memory data.tables?

@st-pasha (Contributor)

@nbenn Thanks, this hits the mark. If R has a custom memory allocator mechanism, then it will certainly know to call the user-provided custom de-allocator when the time comes.

@nbenn (Member)

nbenn commented Oct 30, 2018

@sritchie73 can you shed more light on your experiences with file-backed shared-memory objects in HPC environments? I gather you had problems with file-backed bigmemory::big.matrix objects, not with shared memory in general? I just did a quick test (small scale: 1-3 cores) with non-file-backed bigmemory::big.matrix objects, and the HPC results (LSF-managed cluster, CentOS 7, BL460c Gen9 nodes) are consistent with results obtained locally on a MacBook Pro.

I would not expect the file system to interfere with the management of shared memory. Furthermore, for applying a function to a data.table in parallel group-by fashion, for example, locking mechanisms are not necessary at all, as it is guaranteed that writes never go to the same location.

@franknarf1 (Contributor)

Furthermore, for applying a function to a data.table in parallel group-by fashion, for example, locking mechanisms are not necessary at all, as it is guaranteed that writes never go to the same location.

I guess this depends on i being empty or having no dupes, since you can do these group-by operations with overlapping groups:

library(data.table)
DT = data.table(id = 1:3)
mDT = data.table(id = c(1L, 2L, 2L, 3L), g = rep(1:2, each=2))

# writes to row 2 twice in a join
DT[mDT, on=.(id), g := i.g, by=.EACHI] 

# writes to row 2 twice with row number subset
DT[mDT$id, g := .BY[[1]], by=.(mDT$g)]

@sritchie73 (Contributor)

@nbenn Digging into my old emails, the file system was GPFS; the issue was something to do with a conflict between the way the Boost headers used by bigmemory implement the file-backed shared-memory objects (docs here) and the way GPFS handles calls to mmap(). See also this thread on the R-sig-hpc mailing list: first email here, remaining emails in this thread; note the replies from Jay Emerson, one of the authors and maintainers of the bigmemory package.

From my limited understanding and experience, it seemed like the filesystem would lock I/O access to the file-backed shared-memory objects if multiple processes were trying to access them. My understanding is that this was the filesystem's way of ensuring consistency of files across multiple physical nodes. This problem was present whether you actually created a backing file on disk or let bigmemory store that temporary file purely in memory. Explicitly requesting a single node from the SLURM scheduler also did not alleviate the issue.

The way I got around this was to move all my parallel code from R into C++. I wrote a multithreaded procedure where each thread gained access to my large matrices via a pointer passed to it. Using shared memory in this way worked fine. However, this was a completely different problem from sharing objects across R sessions.

@HikaGenji

What about using disk.frame?

https://github.com/xiaodaigh/disk.frame

It supports most of the dplyr verbs as well as data.table syntax.
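
Something along these lines, for example (a sketch from memory of the package's documented workflow, so the function names should be double-checked; nycflights13::flights is just stand-in example data):

library(disk.frame)
library(data.table)

setup_disk.frame()   # start the background workers
flights.df <- as.disk.frame(nycflights13::flights, outdir = "flights.df")

# data.table syntax is applied chunk-wise on disk and the results combined
flights.df[month == 1 & dest == "LAX"]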
