Parallelising in R efficiently #8

Closed
bschilder opened this issue Mar 21, 2023 · 4 comments
bschilder commented Mar 21, 2023

Parallelising over groups in R (e.g. each phenotype's gene list) can use a ton of memory, because every object referenced inside the loop can end up copied into each worker process, which multiplies your memory usage when working with large datasets.

gene_data <- HPOExplorer::load_phenotype_to_genes()
gene_data[, HPO_ID.LinkID := paste(HPO_ID, LinkID, sep = ".")]
nrow(gene_data) # 968,116
length(unique(gene_data$HPO_ID.LinkID)) # 621,906

Even on large-memory machines like the Threadripper (250Gb), this quickly gets overloaded because each parallel process stores its own 17.6Gb copy of the entire gene_data object, over and over again.
17.6Gb * 50 cores = 880Gb of memory! This means that parallelising chokes up the Threadripper's memory and slows it to a grinding halt, making processing the data even slower than if you had just single-threaded it. This is exactly what happened when I tried this version of the gen_overlap function:
https://github.com/neurogenomics/MultiEWCE/blob/25d26a41096902607a4f595343f2f585dad9f819/R/gen_overlap.R#L49

A better way might be to split the data first into chunks, and then iterate over the chunks:

# split.data.table is not exported by data.table, so pull it from the namespace
split.data.table <- utils::getFromNamespace("split.data.table", "data.table")
gene_data_split <- split.data.table(x = gene_data,
                                    by = "HPO_ID.LinkID")

Related posts

https://stackoverflow.com/questions/19082794/speed-up-data-table-group-by-using-multiple-cores-and-parallel-programming
https://stackoverflow.com/questions/14759905/data-table-and-parallel-computing

👇 Contains some useful benchmarking
Rdatatable/data.table#2223

Killing zombie processes

This happens when you launch some parallelised function in R and then decide to stop it midway. A bunch of "zombie" processes are left over.

Restarting the R session

Sometimes this works, other times not so much. It may fail when memory is so overloaded that the R session can't even restart.

via htop

Didn't seem to work.

via inline

Someone suggested this (sketched below), but it didn't seem to do anything for me on the Threadripper.

https://stackoverflow.com/questions/25388139/r-parallel-computing-and-zombie-processes
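
For reference, the suggestion in that thread is roughly the following (a sketch based on the linked answer, not something that worked here; it compiles a tiny C helper that reaps finished child processes of the current R session via waitpid, so it only applies on Unix-like systems, and the helper name is arbitrary):

library(inline)
includes <- "#include <sys/wait.h>"
code <- "int wstat; while (waitpid(-1, &wstat, WNOHANG) > 0) {};"
# compile and call a C function that reaps any terminated child processes
reap_children <- cfunction(body = code, includes = includes, convention = ".C")
reap_children()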

via future

Using future instead of parallel might give me more control over this, but it has to be done before launching the processes.
HenrikBengtsson/future#93

via Docker

To kill them, the only effective way I've found is to restart the container:

docker stop <id>
docker start <id>
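
Equivalently, a single docker restart <id> stops and starts the container in one step.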

@Al-Murphy probably relevant to you too

bschilder commented Mar 21, 2023

Splitting the data first did not help; it still caused a huge explosion of memory usage.

Found these tutorials on parallelising in R:
https://yxue-me.com/post/2019-05-12-a-glossary-of-parallel-computing-packages-in-r-2019/
https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html
https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html

future tutorial:
https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html

Forking makes the entire R environment available to each worker (copy-on-write on Unix), not just the variables used inside each loop, so every large object in the session can end up duplicated across workers.
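
A quick way to see the difference between forked and fresh workers (a toy sketch, run at the top level of a session; big_obj is just a stand-in for gene_data):

big_obj <- rnorm(1e7)

# Forked workers (Unix only) start as copy-on-write clones of the whole session,
# so big_obj is visible without ever being passed in.
parallel::mclapply(1:2, function(i) exists("big_obj"), mc.cores = 2)  # TRUE TRUE

# Fresh PSOCK/multisession workers only see what is explicitly exported to them.
cl <- parallel::makeCluster(2)
parallel::clusterApply(cl, 1:2, function(i) exists("big_obj"))  # FALSE FALSE
parallel::stopCluster(cl)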

@bschilder

Packages (basic calling patterns for several of these are sketched after the list):

  • parallel
  • future
  • future.apply
  • BiocParallel
  • clusterApply
  • furrr
  • foreach
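
For orientation, the basic calling pattern for several of these, applied to a toy workload (a sketch only; xs and f are placeholders, not the real gen_overlap_test inputs):

xs <- 1:8
f <- function(x) x^2  # placeholder workload

# parallel (fork-based on Unix)
parallel::mclapply(xs, f, mc.cores = 2)

# future + future.apply
future::plan(future::multisession, workers = 2)
future.apply::future_lapply(xs, f)

# furrr (purrr-style, reuses the future plan set above)
furrr::future_map(xs, f)

# BiocParallel
BiocParallel::bplapply(xs, f, BPPARAM = BiocParallel::SnowParam(workers = 2))

# foreach with a future-aware backend
library(foreach)
doFuture::registerDoFuture()
foreach(x = xs) %dopar% f(x)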

@bschilder

furrr seems to have some useful methods, like future_map2, which has a built-in progress bar and works directly with future:
https://furrr.futureverse.org/
http://zevross.com/blog/2019/02/12/dramatically-speed-up-your-r-purrr-functions-with-the-furrr-package/

bschilder commented Mar 21, 2023

Testing methods

1. parallel::mclapply

This is the least efficient method, as I reference the whole gene_data_split object inside the loop, which increases memory usage by ~1Gb every 2 seconds or so.

overlap <- parallel::mclapply(seq_along(gene_data_split),
                              function(i){
  list_name <- names(gene_data_split)[[i]]
  d <- gene_data_split[[i]]
  gen_overlap_test(ct_genes = ct_genes,
                   list_name = list_name,
                   dgenes = unique(d$Gene),
                   long_format = long_format,
                   bg = bg,
                   verbose = FALSE)
}, mc.cores = cores) |>
  data.table::rbindlist(use.names = TRUE,
                        idcol = list_name_column)

2. furrr::future_map2

This is more efficient because the worker function no longer references the entire gene_data_split object; each worker only receives its own chunk. But memory usage is still increasing by ~0.3Gb every 2 seconds or so.

future::plan(future::multisession, workers = cores)
overlap <- furrr::future_map2(.x = gene_data_split,
                              .y = names(gene_data_split),
                              .progress = TRUE,
                              .f = function(.x, .y){
                                gen_overlap_test(ct_genes = ct_genes,
                                                 list_name = .y,
                                                 dgenes = unique(.x$Gene),
                                                 long_format = long_format,
                                                 bg = bg,
                                                 verbose = FALSE)
                              }) |>
  data.table::rbindlist(use.names = TRUE,
                        idcol = list_name_column)

3. furrr::future_map

Only .x is passed in, not .y, so this is a bit more efficient than future_map2. Memory usage increases by ~0.1-0.3Gb per 2 seconds.

future::plan(future::multisession, workers = cores)
overlap <- furrr::future_map(.x = gene_data_split,
                             .progress = TRUE,
                             .f = function(.x){
                               gen_overlap_test(ct_genes = ct_genes,
                                                dgenes = unique(.x$Gene),
                                                long_format = long_format,
                                                bg = bg,
                                                verbose = FALSE)
                             }) |>
  data.table::rbindlist(use.names = TRUE,
                        idcol = list_name_column)

Further improvements in efficiency were achieved by reducing the size of the ct_genes object (only storing gene names, not quantiles) and by not storing the union gene set in the returned results object.
This gets memory usage to grow at only ~0.1Gb per 5-14 seconds. Much better! At this rate, I might actually be able to finish all iterations before memory maxes out at >252Gb. I also noticed this rate tends to slow down a bit as time goes on.

4. data.table

A non-parallel option that is still very fast and does not consume increasing amounts of memory over time.
It takes advantage of data.table's by argument and performs the grouped function calls within the object, which is more efficient than writing an lapply across a list of chunks.

overlap <- gene_data[,gen_overlap_test(ct_genes = ct_genes,
                                       list_name = get(list_name_column),
                                       dgenes = Gene,
                                       long_format = long_format,
                                       bg = bg),
                     by=c(list_name_column)]

Parallelising grouped operations within a data.table has been discussed as well:
Rdatatable/data.table#5376
Rdatatable/data.table#5077
Rdatatable/data.table#4284
Rdatatable/data.table#4077

It's also worth noting that you can use keyby instead of by. This seems to have more to do with how the table is ordered afterwards (via setting the key) than with any performance differences when doing grouped operations (see the toy example after these links):
https://stackoverflow.com/questions/61755966/data-table-summarizing-data-difference-between-by-and-keyby
Rdatatable/data.table#5602
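
A toy example of the difference (my own illustration, not from the threads above):

library(data.table)
dt <- data.table(g = c("b", "a", "b", "a"), x = 1:4)

dt[, .(total = sum(x)), by = g]     # groups appear in order of first occurrence; no key set
dt[, .(total = sum(x)), keyby = g]  # result is sorted by g and keyed on g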

Killing zombies

Using future-based packages like furrr opens up the possibility of freeing up the zombified cores by simply running this command. This means I don't have to restart my entire R session each time!

fut <- future::plan(future::multisession, workers = cores)

future::resetWorkers(fut)

This even works if I forgot to save the plan as a variable

future::plan(future::multisession, workers = cores)

future::resetWorkers(future::plan(future::multisession, workers = cores))

Shared memory

Another thing to keep in mind here is that the Threadripper differs from other kinds of clusters in that it is a shared-memory resource: all workers draw from the same pool of RAM. This complicates things, and more care needs to be taken to avoid the workers exhausting that shared memory.
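
One small guard that may help on a shared-memory machine like this (an untested suggestion on my part): future refuses to export globals above a configurable size limit, which at least makes an accidental copy of a huge object fail loudly rather than silently eating RAM across all workers.

# error if more than ~2 GB of globals would be shipped to each worker
options(future.globals.maxSize = 2 * 1024^3)
future::plan(future::multisession, workers = 10)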

@bschilder bschilder self-assigned this Mar 21, 2023
@bschilder bschilder added enhancement New feature or request documentation Improvements or additions to documentation labels Mar 21, 2023