
GPU memory error in huge data #3

Open
ochiken-A1772 opened this issue Dec 1, 2024 · 3 comments

@ochiken-A1772

Thanks for your fantastic work.

I am attempting to run SPACE on a huge Xenium dataset.

However, the input is too large, and the model.encode call in train.py fails with a GPU memory allocation error.

Error output:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.43 GiB. GPU 0 has a total capacity of 47.44 GiB of which 35.40 GiB is free. Including non-PyTorch memory, this process has 12.03 GiB memory in use. Of the allocated memory 11.71 GiB is allocated by PyTorch, and 14.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I have four GPUs with 48 GB of memory each in my runtime environment, which is not enough for this process.
The tensors passed to model.encode have shape 1,534,691 × 330 for x and 2 × 8,804,994 for edge_index.
I have tried rewriting some of the scripts for this data, but without much knowledge of torch I have not been able to get it to work.

Any ideas would be appreciated.

@ericli0419
Collaborator

Thank you for your interest in our work, and sorry for the delayed response. Your Xenium dataset comprises 1,534,691 cells and 330 genes, which exceeds the current capacity of the SPACE model because of its full-batch training regimen.

For now, a simple compromise strategy is to slice the data into regions of about 80,000 cells based on spatial location and train them separately, for example along the lines of the sketch below.
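
For illustration, a minimal sketch of the tiling idea, assuming the data is loaded as an AnnData object with spatial coordinates stored in adata.obsm["spatial"]; the key name and target tile size are assumptions you may need to adapt:

```python
import numpy as np

def split_into_spatial_tiles(adata, target_cells=80_000):
    """Split an AnnData into roughly balanced spatial tiles of ~target_cells cells."""
    xy = adata.obsm["spatial"]                        # (n_cells, 2) x/y coordinates
    n_tiles = max(1, int(np.ceil(adata.n_obs / target_cells)))
    n_bins = max(1, int(np.ceil(np.sqrt(n_tiles))))   # n_bins x n_bins grid
    # Quantile edges keep tile populations roughly balanced even if density varies.
    x_edges = np.quantile(xy[:, 0], np.linspace(0, 1, n_bins + 1))[1:-1]
    y_edges = np.quantile(xy[:, 1], np.linspace(0, 1, n_bins + 1))[1:-1]
    tile_id = np.digitize(xy[:, 0], x_edges) * n_bins + np.digitize(xy[:, 1], y_edges)
    return [adata[tile_id == t].copy() for t in np.unique(tile_id)]

# tiles = split_into_spatial_tiles(adata)   # then run SPACE on each tile separately
```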

We are aware of the SPACE model's limitations in the number of cells it can handle. We are working on a new model that can handle spatial transcriptome data of more than 10 million cells, and we hope to make a public version available for people to try out before long.

@ochiken-A1772
Author

Thanks for replying, and for suggesting the compromise strategy.

What do you think about modifying the learning procedure around model.encode so that the data can be processed without trimming it?
For example, I would like to reduce the memory requirement by breaking the learning process into smaller batches and then merging the results, roughly as in the sketch below.
From a methods perspective, would you expect the output to be significantly different?
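
Roughly what I have in mind, as a sketch only: I am assuming model.encode(x, edge_index) as called in train.py, a recent torch_geometric with NeighborLoader, and a two-layer encoder; the fan-out and batch size are guesses.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

@torch.no_grad()
def encode_in_batches(model, x, edge_index, latent_dim, device, batch_size=4096):
    """Encode nodes in mini-batches with neighbourhood sampling instead of the full graph."""
    data = Data(x=x, edge_index=edge_index)
    loader = NeighborLoader(data, num_neighbors=[10, 10],   # fan-out per GNN layer (guess)
                            batch_size=batch_size, shuffle=False)
    z = torch.empty(x.size(0), latent_dim)
    for batch in loader:
        batch = batch.to(device)
        out = model.encode(batch.x, batch.edge_index)
        # Only the first `batch.batch_size` nodes are seed nodes; batch.n_id maps
        # the sampled subgraph back to global node indices.
        z[batch.n_id[:batch.batch_size]] = out[:batch.batch_size].cpu()
    return z
```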

I am not familiar with libraries such as PyTorch, so I would appreciate your opinion on these ideas.

Finally, thank you for your kind response to my vague question.

@ericli0419
Collaborator

I think it is feasible to incorporate a mini-batch strategy into the training of the SPACE model, and I anticipate that the outcomes will not be substantially different. However, generating suitable mini-batches may require empirical testing, as it involves segmenting the entire graph into smaller subsets; a rough sketch follows.
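
Something along these lines with Cluster-GCN style partitioning in torch_geometric is one option. This is a sketch only: compute_loss stands in for the SPACE training loss, and model, device, n_epochs, num_parts, and batch_size are placeholders to tune.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import ClusterData, ClusterLoader

data = Data(x=x, edge_index=edge_index)                  # full graph, kept on CPU
cluster_data = ClusterData(data, num_parts=200)          # METIS partition into subgraphs
loader = ClusterLoader(cluster_data, batch_size=4, shuffle=True)  # a few parts per step

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model.train()
for epoch in range(n_epochs):
    for sub in loader:                                   # each `sub` is a small subgraph
        sub = sub.to(device)
        optimizer.zero_grad()
        z = model.encode(sub.x, sub.edge_index)
        loss = compute_loss(model, z, sub)               # placeholder for the SPACE loss
        loss.backward()
        optimizer.step()
```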
