Memory leak with `h5py` from pip and conversion to `torch.Tensor` #215
Comments
Hello @834799106, thanks for putting an issue here. Based on your error I doubt the `SliceDataset` is the cause. In order to verify a memory leak we will need you to give us a reproducible example for your case, since you're not using the PyTorch Lightning modules. Also, please let us know what version of PyTorch you are using and any information you have on the memory usage throughout an epoch. Note: high memory at the start might be expected, as you have your model in memory. There is also some metadata about the dataset that is precomputed and stored in memory.
Hi @mmuckley, I was about to file an issue for a memory leak. I'm not sure about the issue of @834799106 though. This is the code:

```python
mask_func = create_mask_for_mask_type(...)
root_gt = "/data/project/fastMRI/Brain/multicoil_train"
sd = SliceDataset(root=root_gt, challenge="multicoil", ...)
for e in tqdm(dl):
    del e
```

While running this code I was monitoring the memory usage; even though I'm deleting the variable, the memory usage still increases constantly. Originally this was part of my other pipeline, where I'm only using the SliceDataset and not the whole Lightning module. If you would like to have a look, this is the code: https://github.com/soumickmj/NCC1701/blob/main/Engineering/datasets/fastMRI.py I was originally thinking maybe my code is creating the leak, but the other dataset modes (different code for reading other datasets) of my NCC1701 pipeline did not create the leak.
I also got a similar behaviour while using the Data Module:

```python
mask_func = create_mask_for_mask_type(...)
root_gt = "/data/project/fastMRI/Brain"
data_module = FastMriDataModule(data_path=root_gt, ...)
for e in tqdm(dl):
    ...
```
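For reference, a minimal sketch of what the collapsed snippet above appears to do, reusing the `UnetDataTransform` settings from the other examples in this thread; the exact constructor arguments used here are assumptions, not the poster's actual code:

```python
from pathlib import Path

from fastmri.data.subsample import create_mask_for_mask_type
from fastmri.data.transforms import UnetDataTransform
from fastmri.pl_modules import FastMriDataModule
from tqdm import tqdm

mask_func = create_mask_for_mask_type(
    mask_type_str="random", center_fractions=[0.08], accelerations=[8]
)
transform = UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False)

data_module = FastMriDataModule(
    data_path=Path("/data/project/fastMRI/Brain"),  # path from the comment above
    challenge="multicoil",
    train_transform=transform,
    val_transform=transform,
    test_transform=transform,
    batch_size=1,
    num_workers=10,  # assumed; not visible in the collapsed snippet
)

# Drain one epoch of the validation loader while monitoring process memory.
for batch in tqdm(data_module.val_dataloader()):
    del batch
```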
Hello @soumickmj, I ran your script on the knee validation data with memory-profiler and memory usage peaked pretty early at a little less than 5 GB (see attached), staying flat for the rest of the dataset afterwards (which does not suggest a leak). Perhaps you could try running on your system to verify with PyTorch 1.10? This is the code:

```python
from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm


@profile
def main():
    val_path = "/path/multicoil_val"
    mask_func = create_mask_for_mask_type(
        mask_type_str="random", center_fractions=[0.08], accelerations=[8]
    )
    sd = SliceDataset(
        root=val_path,
        challenge="multicoil",
        transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    )
    dl = DataLoader(sd, batch_size=1, shuffle=False, num_workers=10)
    for e in tqdm(dl):
        del e
        pass


if __name__ == "__main__":
    main()
```

You can run it with memory-profiler.
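The memory plots in this thread were presumably produced with memory-profiler's `mprof run` / `mprof plot` commands. As an alternative, here is a small self-contained sketch that samples memory programmatically via `memory_usage`, with a dummy dataset standing in for `SliceDataset`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from memory_profiler import memory_usage


def iterate(loader):
    # Drain the loader, discarding batches, to isolate data-loading memory.
    for batch in loader:
        del batch


if __name__ == "__main__":
    # Dummy data only; swap in SliceDataset + UnetDataTransform for the real test.
    dummy = TensorDataset(torch.randn(1000, 64, 64))
    loader = DataLoader(dummy, batch_size=1, num_workers=2)

    # Sample resident memory (in MiB) every 0.5 s while iterate() runs.
    samples = memory_usage((iterate, (loader,), {}), interval=0.5)
    print(f"peak memory: {max(samples):.1f} MiB")
```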
Hi @mmuckley, I copied the above code to my machine, but modified the batch size. The figure below is the result of running it. This is the code:

```python
from fastmri.data.transforms import UnetDataTransform
from fastmri.data import SliceDataset
from fastmri.data.subsample import create_mask_for_mask_type
import os
from torch.utils.data import DataLoader
from tqdm import tqdm


@profile
def main():
    val_path = "/data2/fastmri/mnt/multicoil_val"
    mask_func = create_mask_for_mask_type(
        mask_type_str="random", center_fractions=[0.08], accelerations=[8]
    )
    sd = SliceDataset(
        root=val_path,
        challenge="multicoil",
        transform=UnetDataTransform("multicoil", mask_func=mask_func, use_seed=False),
    )
    dl = DataLoader(sd, batch_size=4, shuffle=False, num_workers=10)
    for e in tqdm(sd):
        del e
        pass


if __name__ == "__main__":
    main()
```

Maybe it's the PyTorch version that's causing the problem. My PyTorch version is 1.8.1+cu111.
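Note that the loop above iterates the `SliceDataset` directly rather than the `DataLoader` that was just built, so `batch_size=4` and the worker processes never come into play; if the intent was to exercise the loader, the loop would be:

```python
for e in tqdm(dl):
    del e
```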
Sorry @mmuckley, I also got the same problem after running the memory profiler. With PyTorch 1.10.2 (py3.9_cuda11.3_cudnn8.2.0_0) I got the attached result. With PyTorch 1.11.0.dev20220129 (py3.9_cuda11.3_cudnn8.2.0_0, pytorch-nightly), which I usually need for my work due to the features I use, I got the other attached result. I did not run till the very end, as it was continuously increasing and would have crashed the server again, which has 250 GB of RAM. So I don't feel it's related to the PyTorch version. Just to let you know: the OS is Ubuntu 20.04.3 LTS and the Python version is 3.9.7.
Hi @mmuckley, am I doing something wrong somewhere?
Hello @soumickmj, I copied the wrong code. The paste I showed was with the dataloader, not the dataset. This is what I get with the dataset. You can see the max is about 720 MB.
Ah okay, no problem!
One thing I notice is that you are both using Python 3.9. I could try Python 3.9 and check with that perhaps. EDIT: Sorry, I see you also tried Python 3.7. Not sure what to do then...
Okay, I tested on Python 3.9 and I'm still getting the same behavior. To help a bit more with this, I'm including my complete environment.
Thanks @mmuckley. Then I created a "bare minimum" env without using your env, and this resulted in the old issue again. I compared the versions: initially the version of numpy was different (1.21.2), so I switched to the one you have (1.20.3). Do you have any idea what might be the reason?
I was in a similar situation. The problem was solved by installing your Py3.9 environment directly with conda. However, when creating a Py3.9 environment with conda normally and then running `pip install git+https://github.com/facebookresearch/fastMRI.git` plus pandas (it's not in the fastMRI package), the problem remains.
So my install process is as follows, in a few bash commands:

```bash
conda create -n memory_test_py39 python=3.9
conda activate memory_test_py39
conda install anaconda
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install -e .
```

where the editable `pip install -e .` is run from the fastMRI repository. In that case I will try to reproduce now with your minimal environments.
@soumickmj @834799106 I can now reproduce this with the minimal install environment. Reproduction environment here: https://gist.github.com/mmuckley/838289a388bc65a7adb23d67908635c9.
@mmuckley, this is really strange!
I do think it is related to the data transforms.
Aahahaha, yes, I can confirm that too. Then the problem is with the DataTransformations and not with the SliceDataset!
I might have found the source. I did not test using your transforms though, but my own transform, and it was showing a similar behaviour. EDIT: Here's my code.
I tried calling
Okay, I think I found the issue: it is due to the `h5py` installed from pip. If you install the minimal environment but use the `h5py` from conda instead, the problem goes away.
Thanks @mmuckley.
With the following `h5py` configuration:

```
Summary of the h5py configuration
---------------------------------
h5py 3.6.0
HDF5 1.10.6
Python 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.21.2
cython (built with) 0.29.24
numpy (built against) 1.16.6
HDF5 (built against) 1.10.6
```

So the `h5py` build looks like the deciding factor.
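For anyone comparing environments, the summary quoted above can be printed directly from `h5py` itself:

```python
import h5py

# Print the build/configuration summary; a pip wheel and a conda-installed
# h5py generally report different build details here.
print(h5py.version.info)
```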
Thanks @mmuckley, thanks for all your help! PS: maybe you can put a notice on the fastMRI homepage so people know about this, as the impact can be significant.
Okay, I opened #217 to do this. Feel free to propose any changes. For what it's worth, I did some small tests on adding extra copy commands to the tensor conversion.
Thanks!
Hi @mmuckley, the issue resurfaced after the conda version of h5py got updated as well. So basically, I changed what happens inside the `__getitem__` function of mri_data.py, in the `with h5py.File(fname, "r") as hf:` block. Maybe it's dirty (not sure if it will have some other implications, say in terms of speed), but for me it's working so far.
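The exact change is not shown in the comment; purely as an illustration of one possible copy-on-read pattern inside `__getitem__` (the `kspace` key matches the fastMRI HDF5 layout, while the helper name and everything else here are hypothetical):

```python
import h5py
import numpy as np
import torch


def load_kspace_slice(fname: str, dataslice: int) -> torch.Tensor:
    # Read one slice and force an independent NumPy copy so that no buffer
    # owned by h5py outlives the closed file handle.
    with h5py.File(fname, "r") as hf:
        kspace = np.array(hf["kspace"][dataslice])
    # torch.from_numpy shares memory with `kspace`, which at this point is a
    # plain NumPy array rather than an HDF5-backed buffer.
    return torch.from_numpy(kspace)
```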
Hello @soumickmj, I do not observe HDF5 being updated on conda.
Hi @mmuckley, I am running the VarNet demo for a small set of brain MRI data, and I am getting the following error after 3-4 iterations of the first epoch:

```
RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 11.91 GiB total capacity; 10.40 GiB already allocated; 74.81 MiB free; 10.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

My h5py version is 3.7.0. I am also attaching my memory profiler plot. Please tell me where I am going wrong.
Hello @Sarah-2021-scu, your memory usage is good. The error in your case is on the GPU, which is not related to this particular issue. It looks like your GPU may just be too small. You could try running the model with a lower cascade count.
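For context, the cascade count is a constructor argument of the reference VarNet, so a quick way to try a smaller model is something like the sketch below (other arguments left at their defaults; the value 6 is only an example):

```python
from fastmri.models import VarNet

# Fewer unrolled cascades means fewer intermediate activations kept on the GPU
# for backpropagation, at some cost in reconstruction quality.
model = VarNet(num_cascades=6)
```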
Thank you @mmuckley for your response. I am using 2 GPUs with 12 GB of memory each. I will lower the cascade count as well. The other options I have are:
I got the same situation while training the 'unet baseline model', and I followed this issue (#217), using the command it suggests.
I recently tried to do some experiments with my model on multi-coil fastMRI brain data. Due to the need for flexibility (and also because I don't have the extra time to learn how to use PyTorch Lightning), I didn't use PyTorch Lightning directly; I chose plain PyTorch instead. During iteration I only set num_workers=2, and my memory footprint was quite large at the beginning. As the number of iterations increased, an error occurred:

RuntimeError: DataLoader worker (PID 522908) is killed by signal: Killed.

I checked the training code of the other parts, but found no obvious memory accumulation error. Therefore, I thought there was a high probability of a problem in SliceDataset. I simply used "pass" to traverse the DataLoader loop, and found that the memory occupation kept rising.