
GPU memory error in huge data #3

Open
ochiken-A1772 opened this issue Dec 1, 2024 · 3 comments

@ochiken-A1772

Thanks for your fantastic work.

I am attempting to run SPACE on a huge Xenium dataset.

However, the input is too large, and the model.encode call in train.py fails with a GPU memory allocation error.

Error output:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.43 GiB. GPU 0 has a total capacity of 47.44 GiB of which 35.40 GiB is free. Including non-PyTorch memory, this process has 12.03 GiB memory in use. Of the allocated memory 11.71 GiB is allocated by PyTorch, and 14.38 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I have four GPUs with 48 GB of memory each in my runtime environment, which is not enough for this process.
The tensors passed to model.encode have shape 1,534,691 × 330 for x and 2 × 8,804,994 for edge_index.
I have tried rewriting some of the scripts for this data, but without much knowledge of torch I have not been able to get it to work.

Any ideas would be appreciated.

@ericli0419
Collaborator

Thank you for your interest in our work, and sorry for the delayed response. Your Xenium dataset comprises 1,534,691 cells and 330 genes, which exceeds the current capacity of the SPACE model because of its full-batch training regimen.

For now, a simple compromise strategy is to slice the data into regions of about 80,000 cells based on spatial location and train them separately, for example along the lines of the sketch below.
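
For illustration, a minimal sketch of the tiling idea, assuming the data is loaded as an AnnData object with spatial coordinates stored in adata.obsm["spatial"]; the key name and target tile size are assumptions you may need to adapt:

```python
import numpy as np

def split_into_spatial_tiles(adata, target_cells=80_000):
    """Split an AnnData into roughly balanced spatial tiles of ~target_cells cells."""
    xy = adata.obsm["spatial"]                        # (n_cells, 2) x/y coordinates
    n_tiles = max(1, int(np.ceil(adata.n_obs / target_cells)))
    n_bins = max(1, int(np.ceil(np.sqrt(n_tiles))))   # n_bins x n_bins grid
    # Quantile edges keep tile populations roughly balanced even if density varies.
    x_edges = np.quantile(xy[:, 0], np.linspace(0, 1, n_bins + 1))[1:-1]
    y_edges = np.quantile(xy[:, 1], np.linspace(0, 1, n_bins + 1))[1:-1]
    tile_id = np.digitize(xy[:, 0], x_edges) * n_bins + np.digitize(xy[:, 1], y_edges)
    return [adata[tile_id == t].copy() for t in np.unique(tile_id)]

# tiles = split_into_spatial_tiles(adata)   # then run SPACE on each tile separately
```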

We are aware of the SPACE model's limitations in the number of cells it can handle. We are working on a new model that can handle spatial transcriptome data of more than 10 million cells, and we hope to make a public version available for people to try out before long.

@ochiken-A1772
Author

Thanks for replying, and for suggesting the compromise strategy.

What do you think about modifying the learning procedure around model.encode so that the data can be processed without trimming it?
For example, I would like to reduce the memory requirement by breaking the learning process into smaller batches and then merging the results, roughly as in the sketch below.
From a methods perspective, would you expect the output to be significantly different?
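
Roughly what I have in mind, as a sketch only: I am assuming model.encode(x, edge_index) as called in train.py, a recent torch_geometric with NeighborLoader, and a two-layer encoder; the fan-out and batch size are guesses.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

@torch.no_grad()
def encode_in_batches(model, x, edge_index, latent_dim, device, batch_size=4096):
    """Encode nodes in mini-batches with neighbourhood sampling instead of the full graph."""
    data = Data(x=x, edge_index=edge_index)
    loader = NeighborLoader(data, num_neighbors=[10, 10],   # fan-out per GNN layer (guess)
                            batch_size=batch_size, shuffle=False)
    z = torch.empty(x.size(0), latent_dim)
    for batch in loader:
        batch = batch.to(device)
        out = model.encode(batch.x, batch.edge_index)
        # Only the first `batch.batch_size` nodes are seed nodes; batch.n_id maps
        # the sampled subgraph back to global node indices.
        z[batch.n_id[:batch.batch_size]] = out[:batch.batch_size].cpu()
    return z
```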

I am not familiar with libraries such as PyTorch, so I would appreciate your opinion on these ideas.

Finally, thank you for your kind response to my vague question.

@ericli0419
Collaborator

I think it is feasible to incorporate a mini-batch strategy into the training of the SPACE model, and I anticipate that the outcomes will not be substantially different. However, generating suitable mini-batches may require empirical testing, as it involves segmenting the entire graph into smaller subsets; a rough sketch follows.
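
Something along these lines with Cluster-GCN style partitioning in torch_geometric is one option. This is a sketch only: compute_loss stands in for the SPACE training loss, and model, device, n_epochs, num_parts, and batch_size are placeholders to tune.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import ClusterData, ClusterLoader

data = Data(x=x, edge_index=edge_index)                  # full graph, kept on CPU
cluster_data = ClusterData(data, num_parts=200)          # METIS partition into subgraphs
loader = ClusterLoader(cluster_data, batch_size=4, shuffle=True)  # a few parts per step

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model.train()
for epoch in range(n_epochs):
    for sub in loader:                                   # each `sub` is a small subgraph
        sub = sub.to(device)
        optimizer.zero_grad()
        z = model.encode(sub.x, sub.edge_index)
        loss = compute_loss(model, z, sub)               # placeholder for the SPACE loss
        loss.backward()
        optimizer.step()
```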
