CUDA out of memory #11

Thank you for your great work.
I am using an NVIDIA 3090 graphics card and trying to train on my own dataset, whose dimensions are consistent with KITTI. I attempted to modify the batch size, but it had no effect. The dataset I am using is in TIFF format. The error details are as follows:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 684.00 MiB (GPU 0; 23.69 GiB total capacity; 20.26 GiB already allocated; 386.12 MiB free; 21.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
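For errors like this, two common mitigations are the `max_split_size_mb` setting the message itself suggests and gradient accumulation to cut peak memory. A minimal sketch, with a dummy model, optimizer, and loader standing in for the real training objects:

```python
import os
# Must be set before the first CUDA allocation (e.g. at the top of train.py):
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# Dummy stand-ins so the sketch runs; swap in the real model, optimizer, loader.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loader = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]

accum_steps = 4  # effective batch size = per-step batch size * accum_steps
optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                      # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                 # one update per accum_steps micro-batches
        optimizer.zero_grad(set_to_none=True)
```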
Comments
Thank you for your interest!
Oh, thank you for your response. I'm sorry, that was my mistake: when adjusting the batch size, I only modified the default value in the training code and forgot to update it in the script. However, I have now encountered another issue: `Traceback (most recent call last): -- Process 0 terminated with the following error:`
Thank you for your interest! See `VA-DepthNet/vadepthnet/networks/vadepthnet.py`, line 185 (commit 44061e1). Could you please slightly increase the jitter by changing that line to:

```python
jitter = torch.eye(n=hw, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-6
```

or

```python
jitter = torch.eye(n=hw, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-2
```
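For context on why this helps, here is a minimal standalone sketch (not the repo's code): a Gram matrix built from a rank-deficient factor is numerically singular, so Cholesky fails, while lifting the diagonal by a small epsilon restores positive definiteness:

```python
import torch

# A 64x64 Gram matrix from a rank-4 factor is singular (not positive definite).
A = torch.randn(64, 4)
ATA = A @ A.t()

try:
    torch.linalg.cholesky(ATA)
except RuntimeError as err:              # torch.linalg.LinAlgError subclasses this
    print("plain Cholesky failed:", type(err).__name__)

# Adding eps * I ("jitter") lifts every eigenvalue by eps, so Cholesky succeeds.
jitter = torch.eye(ATA.shape[-1], dtype=ATA.dtype, device=ATA.device) * 1e-2
L = torch.linalg.cholesky(ATA + jitter)
print("jittered Cholesky succeeded, L shape:", tuple(L.shape))
```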
I replaced the code using the solution you provided, but I still encounter the same error.
Thank you!
It still doesn't seem to have any effect. |
Sorry, could you please print the ATA? (`VA-DepthNet/vadepthnet/networks/vadepthnet.py`, line 186, commit 44061e1.)
The printed ATA is as follows:
Hi, the numbers look fine.
Thank you for your selfless help. The print() results are as follows:
Hi, thank you for your reply.
Thank you for your prompt reply. I would like to know what exactly "invalid pixels" refers to. Are there specific characteristics that can be used to detect them?
I guess there might be pixels with NaN values after reading or preprocessing. Could you please try replacing the ATA below with your input image tensor:
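A hedged reconstruction of the kind of check being suggested (the tensor name `x` and its shape are assumptions; the exact snippet was not captured above):

```python
import torch

# Stand-in batch; inside train.py this would be the real input image tensor `x`.
x = torch.randn(1, 3, 352, 704)
print(torch.isnan(x).any(), torch.isinf(x).any(), x.min().item(), x.max().item())
```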
The code at `train.py`, line 296, is:

The results are:

```
tensor([[[[0.6406, 0.6406, 0.6406, ..., 0.5625, 0.5625, 0.5625],
```
Thank you for your information. (`VA-DepthNet/vadepthnet/networks/vadepthnet.py`, line 165, commit 44061e1.)
Thanks, the results are:
Thank you for your information. Could you add the following code at `VA-DepthNet/vadepthnet/networks/vadepthnet.py`, line 184 (commit 44061e1)?
The batch size is 1, and the result is: `tensor(False, device='cuda:0')`
It shows the predicted 'att' is already NaN.
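A general tool for localizing the first NaN-producing operation, independent of manual prints, is PyTorch's anomaly mode. A minimal sketch with a deliberately NaN-producing computation:

```python
import torch

# Anomaly mode makes autograd raise at the op whose backward produced NaN,
# with a traceback to the forward call. It is slow, so debugging only.
torch.autograd.set_detect_anomaly(True)

x = torch.zeros(3, requires_grad=True)
loss = (torch.sqrt(x) * 0.0).sum()   # forward looks healthy (loss == 0) ...
try:
    loss.backward()                  # ... but d(sqrt)/dx at 0 is inf, and 0*inf = NaN
except RuntimeError as err:
    print("anomaly detected:", str(err).splitlines()[0])
```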
No, the last attempt only had 3 outputs; I also don't know why 8 values were output this time. The code and printout this time show:

```
== Total number of parameters: 263110761
```
Thank you for your information. Could you add the following code at `VA-DepthNet/vadepthnet/networks/vadepthnet.py`, line 184 (commit 44061e1)? And do you use FP32 or FP16?
Thank you for your prompt reply. The print is: `tensor(False, device='cuda:0')`
The output shows that there are NaN values in your input image tensor x.
In the DataLoader I only modified the file-path concatenation; the preprocessing was left unchanged.
I see. I guess NaN values appear after reading the images.
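One quick way to test that guess is to scan the dataset once before training. A sketch, where the glob pattern and the PIL-based reader are assumptions to be adapted to the actual DataLoader:

```python
import glob

import numpy as np
from PIL import Image

# Flag any TIFF whose pixels are NaN/Inf straight after reading.
for path in sorted(glob.glob("data/**/*.tif*", recursive=True)):
    arr = np.asarray(Image.open(path), dtype=np.float32)
    if not np.isfinite(arr).all():
        print(path, "contains NaN/Inf")
```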
In the previous discussion, there were no abnormal values in `image` and `depth_gt` at that line of code.
Yes, but when you put the print inside the model's forward function, it shows the input image tensor contains NaN values.
Is this what you mean? There is a relatively long output; one of the output groups is as follows:

```
tensor([[[[0.6445, 0.6445, 0.6445, ..., 0.5078, 0.5078, 0.5078],
```
Yes, it looks fine. Are all the outputs False? Another question: how do you initialize the Swin Transformer, pretrained or random?
Yes, I searched the printed output for "True" and found no matches, so all the outputs should be "False". Additionally, I initialized the Swin Transformer with `swin_large_patch4_window12_384_22k.pth`.
I see, so the input image is fine. (`VA-DepthNet/vadepthnet/networks/vadepthnet.py`, line 294, commit 44061e1.) I guess there might be NaN values after the Swin Transformer. Do you use FP16 or FP32?
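One way to confirm where the NaN first appears without editing the forward pass is a forward hook on every submodule. A sketch that works on any `nn.Module`; the toy model below is a stand-in, and the hooks would be attached to the real VADepthNet instance instead:

```python
import torch
import torch.nn as nn

def attach_nan_hooks(model):
    """Print the name of the first module whose output contains NaN."""
    state = {"found": False}
    def make_hook(name):
        def hook(module, inputs, output):
            if (not state["found"] and isinstance(output, torch.Tensor)
                    and torch.isnan(output).any()):
                state["found"] = True
                print(f"NaN first appears in the output of: {name!r}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
with torch.no_grad():
    model[2].weight.fill_(float("nan"))   # force a NaN to appear mid-network
attach_nan_hooks(model)
model(torch.randn(2, 8))                  # prints that module '2' is the culprit
```

If FP16 is in use, overflow in intermediate activations is a common cause of NaNs in transformer features; re-running the failing batch in FP32 is the usual cross-check.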
I didn't make any additional settings; I just followed the training steps in the GitHub repository. The output is as follows:
I see. The input image tensor (x) is fine, but the extracted feature (x2) from the Swin Transformer is NaN. Could you try replacing the code at `VA-DepthNet/vadepthnet/networks/vadepthnet.py`, line 331 (commit 44061e1) with:

```python
print(torch.isnan(si_loss).any())
return outs['scale_1'], si_loss
```
The print() result is: `tensor(True, device='cuda:0')`
Only one output?
Yes. The more complete output is as follows:
I see. There is NaN in the SI loss at the first iteration. Could you add the following code at `VA-DepthNet/vadepthnet/networks/loss.py`, line 160 (commit 44061e1)?
Thank you for your patient response. The output after adding the code is as follows:

```
== Initial variables' sum: -130969.684, avg: -325.795
```
I see. The output from the network and the ground truth are both fine at the first iteration. Could you replace the code at `VA-DepthNet/vadepthnet/networks/loss.py`, line 162 (commit 44061e1)? After this, will the SI loss still be NaN?
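For reference, a common formulation of the scale-invariant log (SILog) loss is sketched below; this is an assumption, not necessarily this repo's exact implementation. It produces NaN whenever a predicted or ground-truth depth is <= 0, because logs are taken, which is why invalid pixels matter here:

```python
import torch

def silog_loss(pred, gt, lam=0.85, eps=1e-6):
    """SILog loss with the usual guards: mask invalid GT, clamp before log."""
    valid = gt > eps                       # drop zero/negative ground-truth pixels
    d = torch.log(pred.clamp(min=eps)[valid]) - torch.log(gt[valid])
    # (d^2).mean() - lam * d.mean()^2 = var(d) + (1 - lam) * d.mean()^2 >= 0
    return torch.sqrt((d ** 2).mean() - lam * d.mean() ** 2)

pred = torch.rand(1, 1, 8, 8) * 10 + 0.1
gt = torch.rand(1, 1, 8, 8) * 10           # any zeros here get masked out
print(silog_loss(pred, gt))
```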
The result of `print(torch.isnan(si_loss).any())` is:
I see. I'm not sure whether the NaN in the loss comes from the first, second, or third scale of prediction. (`VA-DepthNet/vadepthnet/networks/loss.py`, line 173, commit 44061e1.)
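To isolate the scale, one option is to evaluate the loss on each prediction separately before they are summed. A hedged sketch: the dict keys mirror the `outs['scale_1']` naming seen earlier, and `si_loss_fn` plus the random tensors stand in for the real loss and data:

```python
import torch

def si_loss_fn(pred, gt, lam=0.85, eps=1e-6):
    d = torch.log(pred.clamp(min=eps)) - torch.log(gt.clamp(min=eps))
    return torch.sqrt((d ** 2).mean() - lam * d.mean() ** 2)

# Stand-ins for the real multi-scale predictions and ground truth.
outs = {k: torch.rand(1, 1, 8, 8) + 0.1 for k in ("scale_1", "scale_2", "scale_4")}
depth_gt = torch.rand(1, 1, 8, 8) + 0.1

for name, pred in outs.items():
    loss = si_loss_fn(pred, depth_gt)
    print(name, float(loss), bool(torch.isnan(loss)))   # flags the NaN scale
```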
I printed the feat and loss1, and the results are: