
Memory-efficient posterior generation #263

Merged
merged 8 commits into from
Oct 31, 2023

Conversation

sjfleming
Member

It has become apparent that something in the v0.3.0 posterior generation process is consuming far more memory than previous versions did. See #251 and #248.

Conceptually, in v2:

  • for each minibatch
    • compute the posterior
    • estimate the noise counts
    • keep only the noise counts

Conceptually, in v3:

  • for each minibatch
    • compute the posterior
    • keep a sparse representation in memory
  • save the full sparse posterior as an h5 file
  • use an "estimator" applied to the full sparse posterior
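The v3 accumulation step (minus the h5 write) can be sketched roughly as follows. This is an illustrative stand-in, not CellBender's actual posterior code; the minibatch size, shapes, and values are all hypothetical:

```python
import numpy as np
from scipy import sparse

n_cells, n_genes = 100, 10
rows, cols, vals = [], [], []
for start in range(0, n_cells, 25):        # hypothetical minibatches of 25 cells
    # stand-in for "compute the posterior": one nonzero entry per cell
    rows.append(np.arange(start, start + 25))
    cols.append(np.arange(25) % n_genes)
    vals.append(np.full(25, 0.5))

# keep only a sparse representation in memory
posterior = sparse.coo_matrix(
    (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
    shape=(n_cells, n_genes),
).tocsr()
```

The real pipeline then saves this sparse posterior to an h5 file and applies an estimator to it.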

This refactor allows us to do a whole lot more. But it also involves computing and saving the full posterior, which was not attempted in v2. While that is perfectly doable (these posterior h5 files are usually under 2GB), it needs to be done more carefully than it was.

I think that extending python lists created unintended references to objects, keeping them alive in memory longer than I expected.
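To illustrate the kind of unintended reference that list extension can create, here is a small demonstration using NumPy views, which share storage with their parent array much like torch tensor slices do (the buffer sizes here are arbitrary):

```python
import numpy as np

chunks = []
for _ in range(3):
    big = np.zeros(1_000_000)      # large per-minibatch buffer
    chunks.append(big[:10])        # a view: silently keeps the whole buffer alive
assert chunks[0].base is not None  # each element still references its 1M-float parent

# Copying the slice severs the reference, so the big buffer can be freed
chunks = [c.copy() for c in chunks]
assert chunks[0].base is None
```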

Adopting another strategy: keep python lists of torch tensors holding the sparsified info. Append tensors to the lists each minibatch, then concatenate them once at the end. All of these tensors are cloned from the originals, detached, and kept in CPU memory.
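A minimal sketch of that strategy, assuming a dense per-minibatch result that gets sparsified by thresholding (the shapes and threshold are hypothetical; this is not the actual posterior code):

```python
import torch

minibatch_chunks = []
for _ in range(4):
    # stand-in for one minibatch's posterior computation
    dense = torch.randn(8, 8, requires_grad=True)
    vals = dense[dense.abs() > 1.0]    # keep only the "sparse" entries
    # detach + clone + move to CPU, so no reference to the autograd graph
    # or the original (possibly GPU) buffer is retained
    minibatch_chunks.append(vals.detach().clone().cpu())

# concatenate once at the end, rather than growing a tensor incrementally
all_vals = torch.cat(minibatch_chunks)
```

Because each appended tensor is detached and cloned, the per-minibatch buffers and their autograd graphs can be garbage-collected as soon as the loop iteration ends.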

Closes #248
Closes #251

@jg9zk

jg9zk commented Aug 29, 2023

This branch stopped memory from being used up during the for loop, but my job was still killed due to OOM sometime after. I'm running with 140 GB of memory, which should be plenty.

@sjfleming
Member Author

@jg9zk thanks for reporting. Any chance you could post the last few lines of the log file?

@jg9zk

jg9zk commented Aug 29, 2023

cellbender:remove-background: Working on chunk (377/383)
cellbender:remove-background: Working on chunk (378/383)
cellbender:remove-background: Working on chunk (379/383)
cellbender:remove-background: Working on chunk (380/383)
cellbender:remove-background: Working on chunk (381/383)
cellbender:remove-background: Working on chunk (382/383)
cellbender:remove-background: Working on chunk (383/383)
Killed

@jg9zk

jg9zk commented Aug 29, 2023

OOM seems to occur in either line 549 or 550 of posterior.py in commit 6fd8c23 (noise_offset_dict creation)

@sjfleming
Member Author

I was able to reproduce that same behavior @jg9zk

sjfleming and others added 2 commits August 29, 2023 23:21
* Speed up MCKP _gene_chunk_iterator() by a factor of 100
@jg9zk

jg9zk commented Aug 31, 2023

I tried commit 7fd0ac and it completed! However, it looks like counts are being added to the count matrix instead of being removed; I'll open a separate issue about that.

@sjfleming sjfleming marked this pull request as ready for review October 20, 2023 18:28
@sjfleming sjfleming merged commit 322971d into dev Oct 31, 2023
3 checks passed
sjfleming added a commit that referenced this pull request Oct 31, 2023
* Add WDL input to set number of retries. (#247)

* Move hash computation so that it is recomputed on retry, and now-invalid checkpoint is not loaded. (#258)

* Bug fix for WDL using MTX input (#246)

* Memory-efficient posterior generation (#263)

* Fix posterior and estimator integer overflow bugs on Windows (#259)

* Move from setup.py to pyproject.toml (#240)

* Fix bugs with report generation across platforms (#302)

---------

Co-authored-by: kshakir <github@kshakir.org>
Co-authored-by: alecw <alecw@users.noreply.github.com>