
CUDA out of memory #11

Open
guwenxiang1 opened this issue Dec 5, 2023 · 42 comments

@guwenxiang1

Thank you for your great work.
I am using an NVIDIA 3090 graphics card and trying to train with my own dataset. The dimensions of the dataset are consistent with KITTI. I attempted to modify the batch size, but it had no effect. The error details are as follows:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 684.00 MiB (GPU 0; 23.69 GiB total capacity; 20.26 GiB already allocated; 386.12 MiB free; 21.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Additionally, the dataset I am using is in TIFF format.
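For reference, the max_split_size_mb hint from the error message can be set before CUDA initializes. A minimal sketch, where the 128 MiB value is an assumption to tune rather than a guaranteed fix:

import os

# Apply the allocator hint from the OOM message; it must be set before the
# first CUDA allocation. 128 MiB is an assumed starting value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the variable is set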

@cnexah
Owner

cnexah commented Dec 6, 2023

Thank you for your interest!
I also use a 3090. As I recall, each 3090 can support batch size=1 for training.
Could you please try to set batch size=1?

@guwenxiang1
Author

Oh, thank you for your response. I'm sorry, that was my mistake. When adjusting the batch size, I only modified the default value in the training code but forgot to update it in the script. However, now I have encountered another issue:

Traceback (most recent call last):
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 403, in
main()
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 397, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 298, in main_worker
depth_est, loss = model(image, depth_gt)
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/networks/vadepthnet.py", line 301, in forward
d = self.vlayer(x)
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/networks/vadepthnet.py", line 186, in forward
x = torch.linalg.solve(ATA+jitter, ATB)
torch._C._LinAlgError: torch.linalg.solve: (Batch element 0): The solver failed because the input matrix is singular.

@cnexah
Owner

cnexah commented Dec 6, 2023

Thank you for your interest!

jitter = torch.eye(n=h*w, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-12

Could you please slightly increase the jitter by changing the above line to:
jitter = torch.eye(n=h*w, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-6
or
jitter = torch.eye(n=h*w, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-2
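For context, the jitter term acts as a Tikhonov-style diagonal regularizer. A standalone sketch, separate from the repo's code, of why it helps:

import torch

# A deliberately singular normal-equation matrix: the solve fails without jitter.
ATA = torch.zeros(1, 3, 3)
ATB = torch.ones(1, 3, 1)

try:
    torch.linalg.solve(ATA, ATB)
except RuntimeError as err:  # torch._C._LinAlgError subclasses RuntimeError
    print("without jitter:", err)

# A small multiple of the identity makes the system well-posed, at the cost
# of slightly biasing the solution.
jitter = torch.eye(3).unsqueeze(0) * 1e-6
x = torch.linalg.solve(ATA + jitter, ATB)
print("with jitter:", x.flatten())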

@guwenxiang1
Author

I replaced the code using the solution you provided, but I still encounter the same error.

@cnexah
Owner

cnexah commented Dec 7, 2023

Thank you!
Could you please try to further increase the jitter, such as:
jitter = torch.eye(n=h*w, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-1

@guwenxiang1
Author

It still doesn't seem to have any effect.

@cnexah
Owner

cnexah commented Dec 7, 2023

Sorry, could you please print the ATA?

x, _ = torch.solve(ATB, ATA+jitter)

@guwenxiang1
Author

The printed ATA is as follows:
tensor([[[ 0.9441, -0.2098, -0.2658, ..., 0.0000, 0.0000, 0.0000],
[-0.2098, 1.1580, -0.1993, ..., 0.0000, 0.0000, 0.0000],
[-0.2658, -0.1993, 1.4527, ..., 0.0000, 0.0000, 0.0000],
...,
[ 0.0000, 0.0000, 0.0000, ..., 1.5859, -0.2865, -0.2426],
[ 0.0000, 0.0000, 0.0000, ..., -0.2865, 1.3876, -0.2634],
[ 0.0000, 0.0000, 0.0000, ..., -0.2426, -0.2634, 2.0849]],

    [[ 0.9383, -0.2406, -0.2330,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2406,  1.1164, -0.2512,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2330, -0.2512,  1.4258,  ...,  0.0000,  0.0000,  0.0000],
     ...,
     [ 0.0000,  0.0000,  0.0000,  ...,  1.3777, -0.2286, -0.2445],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2286,  1.0807, -0.2032],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2445, -0.2032,  1.9325]],

    [[ 0.9032, -0.2028, -0.2351,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2028,  1.0213, -0.1952,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2351, -0.1952,  1.2983,  ...,  0.0000,  0.0000,  0.0000],
     ...,
     [ 0.0000,  0.0000,  0.0000,  ...,  1.3727, -0.2265, -0.2391],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2265,  1.2497, -0.2450],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2391, -0.2450,  2.0582]],

    ...,

    [[ 0.9478, -0.2418, -0.2251,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2418,  1.2317, -0.2842,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2251, -0.2842,  1.5352,  ...,  0.0000,  0.0000,  0.0000],
     ...,
     [ 0.0000,  0.0000,  0.0000,  ...,  1.3787, -0.2229, -0.2946],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2229,  1.0788, -0.2006],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2946, -0.2006,  1.9220]],

    [[ 1.0579, -0.2751, -0.2774,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2751,  1.2474, -0.2404,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2774, -0.2404,  1.5164,  ...,  0.0000,  0.0000,  0.0000],
     ...,
     [ 0.0000,  0.0000,  0.0000,  ...,  1.3541, -0.1987, -0.2450],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.1987,  1.0521, -0.1937],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2450, -0.1937,  1.7845]],

    [[ 0.9832, -0.2376, -0.2400,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2376,  1.2417, -0.2379,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2400, -0.2379,  1.4434,  ...,  0.0000,  0.0000,  0.0000],
     ...,
     [ 0.0000,  0.0000,  0.0000,  ...,  1.4390, -0.2949, -0.2152],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2949,  1.2868, -0.2860],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2152, -0.2860,  1.9313]]],
   device='cuda:0', grad_fn=<BmmBackward0>)

@cnexah
Owner

cnexah commented Dec 13, 2023

Hi, the numbers look fine.
I was wondering if there are some invalid values in the matrix?
Could you print the following results?
print(torch.isnan(ATA).any())
print(torch.isinf(ATA).any())
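A small helper along these lines can make the repeated checks in this thread less error-prone. A sketch, with a hypothetical name:

import torch

def report_invalid(name, t):
    # Hypothetical helper: print whether a tensor contains NaN or Inf values.
    print(f"{name}: nan={torch.isnan(t).any().item()}, inf={torch.isinf(t).any().item()}")

# usage: report_invalid('ATA', ATA)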

@guwenxiang1
Author

Thank you for your selfless help. The print() results are as follows:
tensor(True, device='cuda:0')
tensor(False, device='cuda:0')

@cnexah
Owner

cnexah commented Dec 14, 2023

Hi, thank you for your reply.
It shows that there are NaN values in the ATA matrix.
Could you please check the input image? I guess there are invalid pixels.
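For example, an offline scan of the TIFF files for invalid pixels. A rough sketch, where the glob pattern is a placeholder and Pillow is assumed as the reader:

import glob

import numpy as np
from PIL import Image

for path in glob.glob("/path/to/dataset/**/*.tif*", recursive=True):
    arr = np.asarray(Image.open(path), dtype=np.float64)
    if np.isnan(arr).any() or np.isinf(arr).any():
        print("invalid pixels in:", path)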

@guwenxiang1
Author

Thank you for your prompt reply. I would like to know what exactly invalid pixels refer to. Are there any specific characteristics that can be used to determine the presence of invalid pixels?

@cnexah
Owner

cnexah commented Dec 14, 2023

I guess there might be pixels with NaN values after reading or preprocessing.

Could you please try to replace the ATA below with your input image tensor:
print(torch.isnan(ATA).any())

@guwenxiang1
Author

The code at train.py line 296 is:
image = torch.autograd.Variable(sample_batched['image'].cuda(args.gpu, non_blocking=True))
depth_gt = torch.autograd.Variable(sample_batched['depth'].cuda(args.gpu, non_blocking=True))

print(torch.isnan(image).any())
print(torch.isnan(depth_gt).any())
print(torch.isinf(image).any())
print(torch.isinf(depth_gt).any())
print(image)
print(depth_gt)

The results are:
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor([[[[-0.3602, -0.3602, -0.3602, ..., 0.4789, 0.4945, 0.5257],
[-0.3924, -0.4247, -0.4409, ..., 0.4789, 0.4945, 0.5413],
[-0.3763, -0.4086, -0.4247, ..., 0.3852, 0.5101, 0.6346],
...,
[-1.4412, -1.4238, -1.4412, ..., 0.0072, 0.1180, 0.1811],
[-1.4238, -1.4065, -1.4238, ..., 0.2440, 0.2597, 0.3068],
[-1.4238, -1.4065, -1.4065, ..., 0.3382, 0.4633, 0.4008]],

     [[-0.0337, -0.0501, -0.0501,  ...,  0.4697,  0.5822,  0.6303],
      [-0.0665, -0.0994, -0.1323,  ...,  0.4536,  0.5340,  0.6463],
      [-0.0337, -0.0665, -0.0994,  ...,  0.3406,  0.5340,  0.7422],
      ...,
      [-1.1466, -1.1291, -1.1466,  ...,  0.5340,  0.6623,  0.7103],
      [-1.1466, -1.1116, -1.1291,  ...,  0.7901,  0.8060,  0.8219],
      [-1.1641, -1.1291, -1.1291,  ...,  0.8697,  0.9809,  0.9174]],

     [[-0.1725, -0.1880, -0.1880,  ...,  0.3506,  0.5175,  0.6082],
      [-0.1880, -0.2348, -0.2504,  ...,  0.3202,  0.4721,  0.6082],
      [-0.1725, -0.2036, -0.2348,  ...,  0.2286,  0.4266,  0.6685],
      ...,
      [-1.0191, -1.0026, -1.0356,  ...,  0.3658,  0.4721,  0.4873],
      [-1.0356, -1.0026, -1.0191,  ...,  0.5629,  0.5931,  0.5780],
      [-1.0522, -1.0191, -1.0191,  ...,  0.6534,  0.6985,  0.6534]]]],
   device='cuda:0')

tensor([[[[0.6406, 0.6406, 0.6406, ..., 0.5625, 0.5625, 0.5625],
[0.6406, 0.6406, 0.6367, ..., 0.5625, 0.5625, 0.5586],
[0.6367, 0.6367, 0.6367, ..., 0.5586, 0.5625, 0.5586],
...,
[0.5820, 0.5820, 0.5820, ..., 0.6172, 0.6172, 0.6172],
[0.5820, 0.5781, 0.5820, ..., 0.6172, 0.6172, 0.6172],
[0.5820, 0.5781, 0.5820, ..., 0.6172, 0.6172, 0.6172]]]],
device='cuda:0')

@cnexah
Owner

cnexah commented Dec 15, 2023

Thank you for your information.
It shows that the input is fine. There are no NaN values.
Could you add the following code:
print(torch.isnan(x).any())
print(torch.isnan(att).any())
print(torch.isnan(grad).any())
to:

@guwenxiang1
Author

Thx, the results are:
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')

@cnexah
Owner

cnexah commented Dec 15, 2023

Thank you for your information.

Could you add the following code:
print(torch.isnan(self.a).any())
print(torch.isnan(att).any())
print(torch.isnan(A).any())
print(torch.isnan(ATA).any())
to:

@guwenxiang1
Author

The batch size is 1, and the results are:

tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
/home/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
tensor(False, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')

@cnexah
Owner

cnexah commented Dec 15, 2023

It shows the predicted 'att' has already become NaN.
Why are there 8 outputs? Did you have 6 outputs in your last attempt?

@guwenxiang1
Author

No, the last attempt only had 3 outputs. I also don't know why 8 values were output this time.
The last attempt printed:

== Total number of parameters: 263110761
== Total number of learning parameters: 263110761
== Model Initialized on GPU: 0
== Initial variables' sum: -131044.828, avg: -325.982
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
Traceback (most recent call last):
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 405, in
main()
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 399, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

and this time the code and its output are:
ATB = torch.bmm(AT, B)
print('=================================')
print(torch.isnan(self.a).any())
print(torch.isnan(att).any())
print(torch.isnan(A).any())
print(torch.isnan(ATA).any())
print('=================================')
jitter = torch.eye(n=h*w, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-1

== Total number of parameters: 263110761
== Total number of learning parameters: 263110761
== Model Initialized on GPU: 0
== Initial variables' sum: -131075.842, avg: -326.059

tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')

/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))

tensor(False, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')

Traceback (most recent call last):
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 405, in
main()
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 399, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

@cnexah
Owner

cnexah commented Dec 15, 2023

Thank you for your information.

Could you add the following code:
print(torch.isnan(x).any())
print(torch.isnan(att).any())
print(torch.isnan(A).any())
print(torch.isnan(ATA).any())
to:

And do you use FP32 or FP16?

@guwenxiang1
Author

Thank you for your prompt reply. The print is:

tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')

/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))

tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')

Traceback (most recent call last):

@cnexah
Owner

cnexah commented Dec 15, 2023

The output shows that there are NaN values in your input image tensor x.
Could you check your input image reading and preprocessing?

@guwenxiang1
Author

In the DataLoader I only modified the file path concatenation; no changes were made to the preprocessing.

@cnexah
Owner

cnexah commented Dec 15, 2023

I see. I guess there are NaN values after reading the images.

@guwenxiang1
Author

In the previous discussion, there were no abnormal values in image and depth_gt at the line
depth_est, loss = model(image, depth_gt)
right?

@cnexah
Owner

cnexah commented Dec 15, 2023

Yes, but when you put the print function inside the model forward function, it shows the input image tensor contains NaN values.
Or could you remove lines 295 to 303 in train.py and check the input tensor again?
It might be important to run for a longer time.
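One way to do that is a standalone pass over the whole loader, independent of training. A sketch, where train_loader names the existing DataLoader:

import torch

# Check every batch the loader produces, not just the first few steps.
for step, sample in enumerate(train_loader):
    for key in ('image', 'depth'):
        t = sample[key]
        if torch.isnan(t).any() or torch.isinf(t).any():
            print(f"batch {step}: invalid values in '{key}'")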

@guwenxiang1
Author

Is this what you mean?
The code is as follows:

            image = torch.autograd.Variable(sample_batched['image'].cuda(args.gpu, non_blocking=True)) 
            depth_gt = torch.autograd.Variable(sample_batched['depth'].cuda(args.gpu, non_blocking=True))
            print(torch.isnan(image).any())
            print(torch.isnan(depth_gt).any())
            print(torch.isinf(image).any())
            print(torch.isinf(depth_gt).any())
            print(image)
            print(depth_gt)
            # depth_est, loss = model(image, depth_gt)
            # loss.backward()
            # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            # optimizer.step()

and there is a relatively long output. One of the output groups is as follows:
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor([[[[-1.1755, -1.1591, -1.1755, ..., 0.6245, 0.7728, 0.9203],
[-1.2086, -1.2418, -1.2086, ..., 0.6096, 0.7135, 0.8466],
[-1.2751, -1.2584, -1.2751, ..., 0.4305, 0.5500, 0.6690],
...,
[-1.1261, -1.1921, -1.2252, ..., 0.5649, 0.6096, 0.6690],
[-1.1261, -1.1921, -1.2584, ..., 0.5947, 0.5947, 0.7432],
[-1.1261, -1.2086, -1.2751, ..., 0.5202, 0.5649, 0.7432]],

     [[-0.7488, -0.7122, -0.7122,  ...,  1.0808,  1.2984,  1.4815],
      [-0.7671, -0.7855, -0.7488,  ...,  1.0304,  1.1982,  1.3818],
      [-0.8407, -0.8039, -0.8039,  ...,  0.8447,  0.9967,  1.1647],
      ...,
      [-0.8223, -0.8962, -0.9333,  ...,  1.1814,  1.2483,  1.3318],
      [-0.8407, -0.9147, -0.9892,  ...,  1.2149,  1.2483,  1.4317],
      [-0.8592, -0.9333, -1.0267,  ...,  1.1312,  1.1982,  1.3984]],

     [[-0.6364, -0.5874, -0.5874,  ...,  0.1597,  0.3301,  0.5145],
      [-0.6200, -0.6037, -0.5874,  ...,  0.1129,  0.2528,  0.4378],
      [-0.6200, -0.5874, -0.5548,  ...,  0.0348,  0.1129,  0.2837],
      ...,
      [-0.7350, -0.7681, -0.7847,  ...,  0.7430,  0.8036,  0.8942],
      [-0.7681, -0.7847, -0.8345,  ...,  0.8187,  0.8641,  0.9996],
      [-0.7847, -0.8179, -0.8847,  ...,  0.7582,  0.7733,  0.9093]]]],
   device='cuda:0')

tensor([[[[0.6445, 0.6445, 0.6445, ..., 0.5078, 0.5078, 0.5078],
[0.6445, 0.6406, 0.6406, ..., 0.5078, 0.5078, 0.5078],
[0.6406, 0.6406, 0.6406, ..., 0.5078, 0.5039, 0.5039],
...,
[0.6367, 0.6367, 0.6367, ..., 0.4883, 0.4844, 0.4844],
[0.6367, 0.6367, 0.6367, ..., 0.4883, 0.4844, 0.4883],
[0.6367, 0.6367, 0.6367, ..., 0.4844, 0.4844, 0.4844]]]],
device='cuda:0')

@cnexah
Owner

cnexah commented Dec 15, 2023

Yes, it looks fine. Are all the outputs False?

Another question: how do you initialize the Swin Transformer? Pretrained or random?

@guwenxiang1
Author

Yes, I searched for "True" in the command line, but there were no matches in the printed content. So all the outputs should be "False". Additionally, I initialized the Swin Transformer with swin_large_patch4_window12_384_22k.pth.

@cnexah
Owner

cnexah commented Dec 15, 2023

I see, so the input image is fine.
Could you add the following code:
print(torch.isnan(x).any())
print(torch.isnan(x2).any())
print(torch.isnan(x3).any())
print(torch.isnan(x4).any())
print(torch.isnan(x5).any())
to:

I guess there might be NaN values after the Swin Transformer. Do you use FP16 or FP32?

@guwenxiang1
Author

I didn't make any additional settings; I just followed the training steps in the GitHub repository. The output is as follows:
== Initial variables' sum: -130997.640, avg: -325.865
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
tensor(False, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
Traceback (most recent call last):

@cnexah
Owner

cnexah commented Dec 15, 2023

I see. The input image tensor (x) is fine, but the extracted feature (x2) from the Swin Transformer is NaN.
I guess the NaN starts from the second iteration, after the backward propagation of the loss.

Could you try to replace the code:

return outs['scale_1'], var_loss + si_loss

by:
print(torch.isnan(si_loss).any())
return outs['scale_1'], si_loss

or
print(torch.isnan(si_loss).any())
return outs['scale_1'], var_loss + si_loss
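If the NaN indeed appears only after the first backward pass, PyTorch's anomaly detection can name the backward operation that produced it. A generic sketch to wrap around the training step:

import torch

# Slow, so enable only while debugging.
torch.autograd.set_detect_anomaly(True)

# Or scope it to a single step:
# with torch.autograd.detect_anomaly():
#     depth_est, loss = model(image, depth_gt)
#     loss.backward()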

@guwenxiang1
Author

The print() result is: tensor(True, device='cuda:0')

@cnexah
Owner

cnexah commented Dec 15, 2023

Only one output?

@guwenxiang1
Author

Yes. The more complete output is as follows:
== Initial variables' sum: -131047.728, avg: -325.989
/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
tensor(True, device='cuda:0')
Traceback (most recent call last):

@cnexah
Owner

cnexah commented Dec 15, 2023

I see. There is NaN in the si loss at the first iteration.
It might come from the output or the ground truth.

Could you add the following code:
print(torch.isnan(depth_prediction).any())
print(torch.isnan(reshaped_gt).any())
print(torch.isnan(diff).any())
to

@guwenxiang1
Author

Thank you for your patient response. The output after adding the code is as follows:

== Initial variables' sum: -130969.684, avg: -325.795
/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
Traceback (most recent call last):

@cnexah
Owner

cnexah commented Dec 15, 2023

I see. The output from the network and the ground truth are both fine at the first iteration.
Could you replace the following code:

num_pixels = num_pixels.reshape(num_pixels.shape[0], -1).sum(dim=-1) + 1e-6

by:
num_pixels = num_pixels.reshape(num_pixels.shape[0], -1).sum(dim=-1) + 1e-1

After this, is the si loss still NaN?
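For context, here is a generic sketch of the failure mode this epsilon guards against; it is not the repo's exact loss. If a sample has no valid ground-truth pixels, the per-sample mean divides by roughly zero and the loss turns NaN, so filtering out such samples is usually more robust than enlarging the epsilon:

import torch

def masked_si_loss(pred, gt, mask, eps=1e-6, lam=0.85):
    # Clamp before log so zero or negative depths cannot inject NaN/Inf.
    d = (torch.log(pred.clamp(min=eps)) - torch.log(gt.clamp(min=eps))) * mask
    d = d.reshape(d.shape[0], -1)
    n = mask.reshape(mask.shape[0], -1).sum(dim=-1) + eps  # valid-pixel count
    # Scale-invariant form; lam=0.85 is common in depth papers and is an
    # assumption here, not necessarily the repo's value.
    return ((d ** 2).sum(-1) / n - lam * (d.sum(-1) / n) ** 2).mean()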

@guwenxiang1
Author

The result of print(torch.isnan(si_loss).any()) is:
== Initial variables' sum: -130963.132, avg: -325.779
/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
tensor(True, device='cuda:0')
Traceback (most recent call last):

@cnexah
Owner

cnexah commented Dec 15, 2023

I see. I'm not sure whether the NaN in the loss comes from the first, second, or third scale of prediction.
Could you add the following code:
return loss1
to:

@guwenxiang1
Author

I printed the feat and loss1, and the results are:
== Initial variables' sum: -131122.057, avg: -326.174
/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
feat
{'scale_1': tensor([[[[inf, inf, inf, ..., inf, inf, inf],
[inf, inf, inf, ..., inf, inf, inf],
[inf, inf, inf, ..., inf, inf, inf],
...,
[inf, inf, inf, ..., inf, inf, inf],
[inf, inf, inf, ..., inf, inf, inf],
[inf, inf, inf, ..., inf, inf, inf]]]], device='cuda:0',
grad_fn=<...>)}
tensor(nan, device='cuda:0', grad_fn=<...>)
tensor(True, device='cuda:0')
Traceback (most recent call last):
