
Error on modifying by reference with data.table::set() in the context of future.apply::future_apply() or furrr::future_map() #5376

Open
ramiromagno opened this issue May 4, 2022 · 8 comments

Comments

@ramiromagno

Hi,

First of all, let me thank you for developing the amazing {data.table} package.

My case is that I have a list of data tables that I am trying to modify by reference with data.table::set() inside a loop using future.apply::future_lapply() and furrr::future_walk()/furrr::future_map().

However, I am getting an error when using future.apply::future_lapply() or furrr::future_walk()/furrr::future_map(). It works fine with lapply(), though.

I am not sure the problem is with the {data.table} package itself... I will post this same issue in the {furrr} and {future.apply} issue trackers and link it here.

The error is:

Error in data.table::set(snp_pairs, i = i, j = col, value = df[[col]]) : 
  This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()). Please run setDT() or setalloccol() on it first (to pre-allocate space for new columns) before assigning by reference to it.

You will need to install {daeqtlr} first:

# For now, install from https://github.com/maialab/daeqtlr
remotes::install_github("maialab/daeqtlr")

library(future.apply)
library(furrr)
library(daeqtlr)

plan(multisession)

snp_pairs <- read_snp_pairs(file = daeqtlr_example("snp_pairs.csv"))
zygosity <- read_snp_zygosity(file = daeqtlr_example("zygosity.csv"))
ae <- read_ae_ratios(file = daeqtlr_example("ae.csv"))

no_cores <- 6L
indices <- seq_len(nrow(snp_pairs))
partitioning_factor <- sort(indices %% no_cores) + 1
snp_pairs_lst1 <- split(snp_pairs, partitioning_factor)
snp_pairs_lst2 <- split(snp_pairs, partitioning_factor)
snp_pairs_lst3 <- split(snp_pairs, partitioning_factor)

for( i in seq_along(snp_pairs_lst1)) {
  data.table::setkeyv(snp_pairs_lst1[[i]], 'dae_snp')
  data.table::setkeyv(snp_pairs_lst2[[i]], 'dae_snp')
  data.table::setkeyv(snp_pairs_lst3[[i]], 'dae_snp')
}

# Runs fine without errors.
lapply(snp_pairs_lst1,
              FUN = daeqtl_mapping,
              zygosity = zygosity,
              ae = ae)

# Fails with error:
# 
# Error in data.table::set(snp_pairs, i = i, j = col, value =
# df[[col]]) : This data.table has either been loaded from disk (e.g. using
# readRDS()/load()) or constructed manually (e.g. using structure()). Please run
# setDT() or setalloccol() on it first (to pre-allocate space for new columns)
# before assigning by reference to it.
future_lapply(snp_pairs_lst2,
              FUN = daeqtl_mapping,
              zygosity = zygosity,
              ae = ae)

# Fails with the same error as `future_lapply`
# It won't work with `future_map` either.
future_walk(snp_pairs_lst3,
            .f = daeqtl_mapping,
            zygosity = zygosity,
            ae = ae)
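For what it's worth, the same class of failure can be reproduced without {daeqtlr}. A minimal sketch (the toy tables below are illustrative, not from the package above), at least with data.table versions current as of this thread:

```r
library(data.table)
library(future.apply)

plan(multisession, workers = 2L)

dt_list <- list(data.table(x = 1:3), data.table(x = 4:6))

# Works: the tables live in this session and keep their over-allocated
# column slots.
lapply(dt_list, function(dt) set(dt, j = "y", value = dt$x * 2L))

# Errors: each worker receives a deserialized copy whose over-allocation
# was lost in transit, so set() refuses to assign by reference.
future_lapply(dt_list, function(dt) set(dt, j = "y", value = dt$x * 2L))
```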


@ramiromagno (Author) commented May 4, 2022

After fiddling around, it seems that including

  n <- nrow(snp_pairs)
  # `setalloccol` is needed because of `future.apply::future_lapply()`,
  # otherwise https://github.com/Rdatatable/data.table/issues/5376.
  data.table::setalloccol(snp_pairs, extra_cols*n)

inside the source code of the mapped function, daeqtl_mapping(), makes the future_lapply() call run without errors. However, it does not change the data tables in snp_pairs_lst2 in place the way lapply() does with snp_pairs_lst1.
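A runnable sketch of that workaround (the real daeqtl_mapping() internals are not shown in this thread, so the set() call below is purely illustrative):

```r
library(data.table)

daeqtl_mapping_fixed <- function(snp_pairs, ...) {
  # The copy a parallel worker receives has truelength() == 0, so
  # pre-allocate spare column slots before any set() call.
  data.table::setalloccol(snp_pairs)
  data.table::set(snp_pairs, j = "mapped", value = TRUE)  # illustrative
  # Return the table: otherwise the modifications stay on the worker.
  snp_pairs
}
```

With future_lapply() the result then has to be captured, e.g. `snp_pairs_lst2 <- future_lapply(snp_pairs_lst2, daeqtl_mapping_fixed, ...)`.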

@ben-schwen (Member) commented May 4, 2022

I have no idea how the internals of future.apply work, but for parallel computing you basically have to copy the objects you want to modify to your worker nodes.
That would at least explain the

This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()).

Depending on how serialization works in future.apply, there might be a way to provide custom serialization/deserialization, although I'm not sure that's really something future.apply wants to support.

That the in-place change does not work after fixing the setalloccol problem is also expected, since you are modifying the data.table on your worker nodes and have to write it back at some point.
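The loss of over-allocation on serialization can be seen directly, without any parallel machinery; a minimal sketch:

```r
library(data.table)

dt <- data.table(x = 1:3)
truelength(dt)  # larger than ncol(dt): spare column slots are pre-allocated

# Round-trip through serialize()/unserialize(), which is essentially what a
# PSOCK-style worker transfer does:
dt2 <- unserialize(serialize(dt, NULL))
truelength(dt2)  # 0: the over-allocation did not survive the round trip

setalloccol(dt2)                # restore spare column slots...
set(dt2, j = "y", value = 4:6)  # ...and set() works again
```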

@ben-schwen (Member)
Also related to #5269 which caters for the call to setalloccol.

@ramiromagno (Author)
Without a call to setalloccol(), I realize now that truelength(x) returns 0 inside the mapped function. Introducing a call to setalloccol() there pre-allocates the extra column slots needed for set() to work without problems.

@jangorecki (Member)
If future.apply requires a copy of the data in your session, then modification in place will naturally not be possible. Unless you can pass a reference to an object, I don't think there is a workaround for it. See related issues #3104 and #1336.
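Given that, one pattern that does work is to treat the parallel call as copy-in/copy-out: restore the over-allocation on the worker, let the function do its set() calls there, and rebind the returned tables in the main session. A sketch, assuming daeqtl_mapping() returns the modified table:

```r
snp_pairs_lst2 <- future_lapply(
  snp_pairs_lst2,
  function(dt, zygosity, ae) {
    data.table::setalloccol(dt)  # the shipped copy lost its over-allocation
    daeqtl_mapping(dt, zygosity = zygosity, ae = ae)
  },
  zygosity = zygosity,
  ae = ae
)
```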

@HenrikBengtsson
Author of the futureverse here: FWIW, any type of parallel backend can be used with futures, e.g. forked parallelization via the mclapply() framework, background R processes via a PSOCK cluster, a background R process via the callr package, etc. So it's parallelization business as usual. This also means that one cannot make assumptions about running with shared memory or about what type of serialization is used.

It sounds like the problem here is related to the general problem of serializing a data.table object and re-using it in another R process (concurrently or later in time).

@iago-pssjd (Contributor) commented Dec 5, 2022

May this issue be related to the fact that updating data.table by reference using := inside a foreach loop does not seem to work?

@HenrikBengtsson
Yes, same problem if you run foreach in parallel. You can update a data.table on a parallel worker, but you cannot expect the update to be reflected in the main R session.
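The same copy-in/copy-out pattern applies to foreach: collect the tables the workers return instead of relying on := propagating back. A sketch with doParallel (the backend choice is illustrative):

```r
library(data.table)
library(doParallel)
library(foreach)

registerDoParallel(2L)
dt_list <- list(data.table(x = 1:3), data.table(x = 4:6))

dt_list <- foreach(dt = dt_list) %dopar% {
  data.table::setalloccol(dt)  # the worker's copy lost its over-allocation
  dt[, y := x * 2L]            # := modifies only the worker's copy...
  dt                           # ...so return it to the main session
}

stopImplicitCluster()
```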
