
CUDA out of memory #11

Open
guwenxiang1 opened this issue Dec 5, 2023 · 42 comments

@guwenxiang1

Thank you for your great work.
I am using an NVIDIA 3090 graphics card and trying to train with my own dataset. The dimensions of the dataset are consistent with KITTI. I attempted to modify the batch size, but it had no effect. The error details are as follows:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 684.00 MiB (GPU 0; 23.69 GiB total capacity; 20.26 GiB already allocated; 386.12 MiB free; 21.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Additionally, the dataset I am using is in TIFF format.
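For reference, the max_split_size_mb hint from the error message can be set before CUDA initializes. A minimal sketch, where the 128 MiB value is an assumption to tune rather than a guaranteed fix:

import os

# Apply the allocator hint from the OOM message; it must be set before the
# first CUDA allocation. 128 MiB is an assumed starting value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the variable is set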

@cnexah
Owner

cnexah commented Dec 6, 2023

Thank you for your interest!
I also use a 3090. As I recall, each 3090 can support batch size=1 for training.
Could you please try to set batch size=1?

@guwenxiang1
Author

Oh, thank you for your response. I'm sorry, that was my mistake. When adjusting the batch size, I only modified the default value in the training code but forgot to update it in the script. However, now I have encountered another issue:

Traceback (most recent call last):
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 403, in
main()
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 397, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 298, in main_worker
depth_est, loss = model(image, depth_gt)
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/networks/vadepthnet.py", line 301, in forward
d = self.vlayer(x)
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/networks/vadepthnet.py", line 186, in forward
x = torch.linalg.solve(ATA+jitter, ATB)
torch._C._LinAlgError: torch.linalg.solve: (Batch element 0): The solver failed because the input matrix is singular.

@cnexah
Owner

cnexah commented Dec 6, 2023

Thank you for your interest!

jitter = torch.eye(n=h*w, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-12

Could you please slightly increase the jitter by changing the above line to:
jitter = torch.eye(n=h*w, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-6
or
jitter = torch.eye(n=h*w, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-2
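For context, the jitter term acts as a Tikhonov-style diagonal regularizer. A standalone sketch, separate from the repo's code, of why it helps:

import torch

# A deliberately singular normal-equation matrix: the solve fails without jitter.
ATA = torch.zeros(1, 3, 3)
ATB = torch.ones(1, 3, 1)

try:
    torch.linalg.solve(ATA, ATB)
except RuntimeError as err:  # torch._C._LinAlgError subclasses RuntimeError
    print("without jitter:", err)

# A small multiple of the identity makes the system well-posed, at the cost
# of slightly biasing the solution.
jitter = torch.eye(3).unsqueeze(0) * 1e-6
x = torch.linalg.solve(ATA + jitter, ATB)
print("with jitter:", x.flatten())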

@guwenxiang1
Author

I replaced the code using the solution you provided, but I still encounter the same error.

@cnexah
Owner

cnexah commented Dec 7, 2023

Thank you!
Could you please try to further increase the jitter, such as:
jitter = torch.eye(n=h*w, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-1

@guwenxiang1
Author

It still doesn't seem to have any effect.

@cnexah
Owner

cnexah commented Dec 7, 2023

Sorry, could you please print the ATA?

x, _ = torch.solve(ATB, ATA+jitter)

@guwenxiang1
Author

The printed ATA is as follows:
tensor([[[ 0.9441, -0.2098, -0.2658, ..., 0.0000, 0.0000, 0.0000],
[-0.2098, 1.1580, -0.1993, ..., 0.0000, 0.0000, 0.0000],
[-0.2658, -0.1993, 1.4527, ..., 0.0000, 0.0000, 0.0000],
...,
[ 0.0000, 0.0000, 0.0000, ..., 1.5859, -0.2865, -0.2426],
[ 0.0000, 0.0000, 0.0000, ..., -0.2865, 1.3876, -0.2634],
[ 0.0000, 0.0000, 0.0000, ..., -0.2426, -0.2634, 2.0849]],

    [[ 0.9383, -0.2406, -0.2330,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2406,  1.1164, -0.2512,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2330, -0.2512,  1.4258,  ...,  0.0000,  0.0000,  0.0000],
     ...,
     [ 0.0000,  0.0000,  0.0000,  ...,  1.3777, -0.2286, -0.2445],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2286,  1.0807, -0.2032],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2445, -0.2032,  1.9325]],

    [[ 0.9032, -0.2028, -0.2351,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2028,  1.0213, -0.1952,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2351, -0.1952,  1.2983,  ...,  0.0000,  0.0000,  0.0000],
     ...,
     [ 0.0000,  0.0000,  0.0000,  ...,  1.3727, -0.2265, -0.2391],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2265,  1.2497, -0.2450],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2391, -0.2450,  2.0582]],

    ...,

    [[ 0.9478, -0.2418, -0.2251,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2418,  1.2317, -0.2842,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2251, -0.2842,  1.5352,  ...,  0.0000,  0.0000,  0.0000],
     ...,
     [ 0.0000,  0.0000,  0.0000,  ...,  1.3787, -0.2229, -0.2946],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2229,  1.0788, -0.2006],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2946, -0.2006,  1.9220]],

    [[ 1.0579, -0.2751, -0.2774,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2751,  1.2474, -0.2404,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2774, -0.2404,  1.5164,  ...,  0.0000,  0.0000,  0.0000],
     ...,
     [ 0.0000,  0.0000,  0.0000,  ...,  1.3541, -0.1987, -0.2450],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.1987,  1.0521, -0.1937],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2450, -0.1937,  1.7845]],

    [[ 0.9832, -0.2376, -0.2400,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2376,  1.2417, -0.2379,  ...,  0.0000,  0.0000,  0.0000],
     [-0.2400, -0.2379,  1.4434,  ...,  0.0000,  0.0000,  0.0000],
     ...,
     [ 0.0000,  0.0000,  0.0000,  ...,  1.4390, -0.2949, -0.2152],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2949,  1.2868, -0.2860],
     [ 0.0000,  0.0000,  0.0000,  ..., -0.2152, -0.2860,  1.9313]]],
   device='cuda:0', grad_fn=<BmmBackward0>)

@cnexah
Owner

cnexah commented Dec 13, 2023

Hi, the numbers look fine.
I was wondering if there are some invalid values in the matrix?
Could you print the following results?
print(torch.isnan(ATA).any())
print(torch.isinf(ATA).any())
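A small helper along these lines can make the repeated checks in this thread less error-prone. A sketch, with a hypothetical name:

import torch

def report_invalid(name, t):
    # Hypothetical helper: print whether a tensor contains NaN or Inf values.
    print(f"{name}: nan={torch.isnan(t).any().item()}, inf={torch.isinf(t).any().item()}")

# usage: report_invalid('ATA', ATA)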

@guwenxiang1
Author

Thank you for your selfless help. The print() results are as follows:
tensor(True, device='cuda:0')
tensor(False, device='cuda:0')

@cnexah
Owner

cnexah commented Dec 14, 2023

Hi, thank you for your reply.
It shows that there are NaN values in the ATA matrix.
Could you please check the input image? I guess there are invalid pixels.
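For example, an offline scan of the TIFF files for invalid pixels. A rough sketch, where the glob pattern is a placeholder and Pillow is assumed as the reader:

import glob

import numpy as np
from PIL import Image

for path in glob.glob("/path/to/dataset/**/*.tif*", recursive=True):
    arr = np.asarray(Image.open(path), dtype=np.float64)
    if np.isnan(arr).any() or np.isinf(arr).any():
        print("invalid pixels in:", path)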

@guwenxiang1
Author

Thank you for your prompt reply. I would like to know what exactly invalid pixels refer to. Are there any specific characteristics that can be used to determine the presence of invalid pixels?

@cnexah
Owner

cnexah commented Dec 14, 2023

I guess there might be pixels with NaN values after reading or preprocessing.

Could you please try to replace the ATA below with your input image tensor:
print(torch.isnan(ATA).any())

@guwenxiang1
Author

The code at train.py line 296 is:
image = torch.autograd.Variable(sample_batched['image'].cuda(args.gpu, non_blocking=True))
depth_gt = torch.autograd.Variable(sample_batched['depth'].cuda(args.gpu, non_blocking=True))

print(torch.isnan(image).any())
print(torch.isnan(depth_gt).any())
print(torch.isinf(image).any())
print(torch.isinf(depth_gt).any())
print(image)
print(depth_gt)

The results are:
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor([[[[-0.3602, -0.3602, -0.3602, ..., 0.4789, 0.4945, 0.5257],
[-0.3924, -0.4247, -0.4409, ..., 0.4789, 0.4945, 0.5413],
[-0.3763, -0.4086, -0.4247, ..., 0.3852, 0.5101, 0.6346],
...,
[-1.4412, -1.4238, -1.4412, ..., 0.0072, 0.1180, 0.1811],
[-1.4238, -1.4065, -1.4238, ..., 0.2440, 0.2597, 0.3068],
[-1.4238, -1.4065, -1.4065, ..., 0.3382, 0.4633, 0.4008]],

     [[-0.0337, -0.0501, -0.0501,  ...,  0.4697,  0.5822,  0.6303],
      [-0.0665, -0.0994, -0.1323,  ...,  0.4536,  0.5340,  0.6463],
      [-0.0337, -0.0665, -0.0994,  ...,  0.3406,  0.5340,  0.7422],
      ...,
      [-1.1466, -1.1291, -1.1466,  ...,  0.5340,  0.6623,  0.7103],
      [-1.1466, -1.1116, -1.1291,  ...,  0.7901,  0.8060,  0.8219],
      [-1.1641, -1.1291, -1.1291,  ...,  0.8697,  0.9809,  0.9174]],

     [[-0.1725, -0.1880, -0.1880,  ...,  0.3506,  0.5175,  0.6082],
      [-0.1880, -0.2348, -0.2504,  ...,  0.3202,  0.4721,  0.6082],
      [-0.1725, -0.2036, -0.2348,  ...,  0.2286,  0.4266,  0.6685],
      ...,
      [-1.0191, -1.0026, -1.0356,  ...,  0.3658,  0.4721,  0.4873],
      [-1.0356, -1.0026, -1.0191,  ...,  0.5629,  0.5931,  0.5780],
      [-1.0522, -1.0191, -1.0191,  ...,  0.6534,  0.6985,  0.6534]]]],
   device='cuda:0')

tensor([[[[0.6406, 0.6406, 0.6406, ..., 0.5625, 0.5625, 0.5625],
[0.6406, 0.6406, 0.6367, ..., 0.5625, 0.5625, 0.5586],
[0.6367, 0.6367, 0.6367, ..., 0.5586, 0.5625, 0.5586],
...,
[0.5820, 0.5820, 0.5820, ..., 0.6172, 0.6172, 0.6172],
[0.5820, 0.5781, 0.5820, ..., 0.6172, 0.6172, 0.6172],
[0.5820, 0.5781, 0.5820, ..., 0.6172, 0.6172, 0.6172]]]],
device='cuda:0')

@cnexah
Owner

cnexah commented Dec 15, 2023

Thank you for your information.
It shows that the input is fine. There are no NaN values.
Could you add the following code:
print(torch.isnan(x).any())
print(torch.isnan(att).any())
print(torch.isnan(grad).any())
to:

@guwenxiang1
Author

Thx, the results are:
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')

@cnexah
Owner

cnexah commented Dec 15, 2023

Thank you for your information.

Could you add the following code:
print(torch.isnan(self.a).any())
print(torch.isnan(att).any())
print(torch.isnan(A).any())
print(torch.isnan(ATA).any())
to:

@guwenxiang1
Author

The batch size is 1, and the results are:

tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
/home/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
tensor(False, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')

@cnexah
Owner

cnexah commented Dec 15, 2023

It shows the predicted 'att' has already become NaN.
Why are there 8 outputs? Did you have 6 outputs in your last attempt?

@guwenxiang1
Author

No, the last attempt only had 3 outputs. I also don't know why 8 values were output this time.
The last attempt printed:

== Total number of parameters: 263110761
== Total number of learning parameters: 263110761
== Model Initialized on GPU: 0
== Initial variables' sum: -131044.828, avg: -325.982
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
Traceback (most recent call last):
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 405, in
main()
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 399, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

and this time the code and its output are:
ATB = torch.bmm(AT, B)
print('=================================')
print(torch.isnan(self.a).any())
print(torch.isnan(att).any())
print(torch.isnan(A).any())
print(torch.isnan(ATA).any())
print('=================================')
jitter = torch.eye(n=h*w, dtype=x.dtype, device=x.device).unsqueeze(0) * 1e-1

== Total number of parameters: 263110761
== Total number of learning parameters: 263110761
== Model Initialized on GPU: 0
== Initial variables' sum: -131075.842, avg: -326.059

tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')

/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))

tensor(False, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')

Traceback (most recent call last):
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 405, in
main()
File "/data2/gwx/MonoDepthEst/VA-DepthNet/vadepthnet/train.py", line 399, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

@cnexah
Owner

cnexah commented Dec 15, 2023

Thank you for your information.

Could you add the following code:
print(torch.isnan(x).any())
print(torch.isnan(att).any())
print(torch.isnan(A).any())
print(torch.isnan(ATA).any())
to:

And do you use FP32 or FP16?

@guwenxiang1
Author

Thank you for your prompt reply. The print is:

tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')

/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))

tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')

Traceback (most recent call last):

@cnexah
Owner

cnexah commented Dec 15, 2023

The output shows that there are NaN values in your input image tensor x.
Could you check your input image reading and preprocessing?

@guwenxiang1
Author

In the DataLoader I only modified the file path concatenation; no changes were made to the preprocessing.

@cnexah
Owner

cnexah commented Dec 15, 2023

I see. I guess there are NaN values after reading the images.

@guwenxiang1
Author

In the previous discussion, there were no abnormal values in image and depth_gt at the line
depth_est, loss = model(image, depth_gt)
right?

@cnexah
Owner

cnexah commented Dec 15, 2023

Yes, but when you put the print function inside the model forward function, it shows the input image tensor contains NaN values.
Or could you remove lines 295 to 303 in train.py and check the input tensor again?
It might be important to run for a longer time.
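One way to do that is a standalone pass over the whole loader, independent of training. A sketch, where train_loader names the existing DataLoader:

import torch

# Check every batch the loader produces, not just the first few steps.
for step, sample in enumerate(train_loader):
    for key in ('image', 'depth'):
        t = sample[key]
        if torch.isnan(t).any() or torch.isinf(t).any():
            print(f"batch {step}: invalid values in '{key}'")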

@guwenxiang1
Author

Is this what you mean?
The code is as follows:

            image = torch.autograd.Variable(sample_batched['image'].cuda(args.gpu, non_blocking=True)) 
            depth_gt = torch.autograd.Variable(sample_batched['depth'].cuda(args.gpu, non_blocking=True))
            print(torch.isnan(image).any())
            print(torch.isnan(depth_gt).any())
            print(torch.isinf(image).any())
            print(torch.isinf(depth_gt).any())
            print(image)
            print(depth_gt)
            # depth_est, loss = model(image, depth_gt)
            # loss.backward()
            # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            # optimizer.step()

and there is a relatively long output. One of the output groups is as follows:
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor([[[[-1.1755, -1.1591, -1.1755, ..., 0.6245, 0.7728, 0.9203],
[-1.2086, -1.2418, -1.2086, ..., 0.6096, 0.7135, 0.8466],
[-1.2751, -1.2584, -1.2751, ..., 0.4305, 0.5500, 0.6690],
...,
[-1.1261, -1.1921, -1.2252, ..., 0.5649, 0.6096, 0.6690],
[-1.1261, -1.1921, -1.2584, ..., 0.5947, 0.5947, 0.7432],
[-1.1261, -1.2086, -1.2751, ..., 0.5202, 0.5649, 0.7432]],

     [[-0.7488, -0.7122, -0.7122,  ...,  1.0808,  1.2984,  1.4815],
      [-0.7671, -0.7855, -0.7488,  ...,  1.0304,  1.1982,  1.3818],
      [-0.8407, -0.8039, -0.8039,  ...,  0.8447,  0.9967,  1.1647],
      ...,
      [-0.8223, -0.8962, -0.9333,  ...,  1.1814,  1.2483,  1.3318],
      [-0.8407, -0.9147, -0.9892,  ...,  1.2149,  1.2483,  1.4317],
      [-0.8592, -0.9333, -1.0267,  ...,  1.1312,  1.1982,  1.3984]],

     [[-0.6364, -0.5874, -0.5874,  ...,  0.1597,  0.3301,  0.5145],
      [-0.6200, -0.6037, -0.5874,  ...,  0.1129,  0.2528,  0.4378],
      [-0.6200, -0.5874, -0.5548,  ...,  0.0348,  0.1129,  0.2837],
      ...,
      [-0.7350, -0.7681, -0.7847,  ...,  0.7430,  0.8036,  0.8942],
      [-0.7681, -0.7847, -0.8345,  ...,  0.8187,  0.8641,  0.9996],
      [-0.7847, -0.8179, -0.8847,  ...,  0.7582,  0.7733,  0.9093]]]],
   device='cuda:0')

tensor([[[[0.6445, 0.6445, 0.6445, ..., 0.5078, 0.5078, 0.5078],
[0.6445, 0.6406, 0.6406, ..., 0.5078, 0.5078, 0.5078],
[0.6406, 0.6406, 0.6406, ..., 0.5078, 0.5039, 0.5039],
...,
[0.6367, 0.6367, 0.6367, ..., 0.4883, 0.4844, 0.4844],
[0.6367, 0.6367, 0.6367, ..., 0.4883, 0.4844, 0.4883],
[0.6367, 0.6367, 0.6367, ..., 0.4844, 0.4844, 0.4844]]]],
device='cuda:0')

@cnexah
Owner

cnexah commented Dec 15, 2023

Yes, it looks fine. Are all the outputs False?

Another question: how do you initialize the Swin Transformer? Pretrained or random?

@guwenxiang1
Author

Yes, I searched for "True" in the command line, but there were no matches in the printed content. So all the outputs should be "False". Additionally, I initialized the Swin Transformer with swin_large_patch4_window12_384_22k.pth.

@cnexah
Owner

cnexah commented Dec 15, 2023

I see, so the input image is fine.
Could you add the following code:
print(torch.isnan(x).any())
print(torch.isnan(x2).any())
print(torch.isnan(x3).any())
print(torch.isnan(x4).any())
print(torch.isnan(x5).any())
to:

I guess there might be NaN values after the Swin Transformer. Do you use FP16 or FP32?

@guwenxiang1
Author

I didn't make any additional settings; I just followed the training steps in the GitHub repository. The output is as follows:
== Initial variables' sum: -130997.640, avg: -325.865
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
tensor(False, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
tensor(True, device='cuda:0')
Traceback (most recent call last):

@cnexah
Owner

cnexah commented Dec 15, 2023

I see. The input image tensor (x) is fine, but the extracted feature (x2) from the Swin Transformer is NaN.
I guess the NaN starts from the second iteration, after the backward propagation of the loss.

Could you try to replace the code:

return outs['scale_1'], var_loss + si_loss

by:
print(torch.isnan(si_loss).any())
return outs['scale_1'], si_loss

or
print(torch.isnan(si_loss).any())
return outs['scale_1'], var_loss + si_loss
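If the NaN indeed appears only after the first backward pass, PyTorch's anomaly detection can name the backward operation that produced it. A generic sketch to wrap around the training step:

import torch

# Slow, so enable only while debugging.
torch.autograd.set_detect_anomaly(True)

# Or scope it to a single step:
# with torch.autograd.detect_anomaly():
#     depth_est, loss = model(image, depth_gt)
#     loss.backward()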

@guwenxiang1
Author

The print() result is: tensor(True, device='cuda:0')

@cnexah
Owner

cnexah commented Dec 15, 2023

Only one output?

@guwenxiang1
Author

Yes. The more complete output is as follows:
== Initial variables' sum: -131047.728, avg: -325.989
/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
tensor(True, device='cuda:0')
Traceback (most recent call last):

@cnexah
Owner

cnexah commented Dec 15, 2023

I see. There is NaN in the si loss at the first iteration.
It might come from the output or the ground truth.

Could you add the following code:
print(torch.isnan(depth_prediction).any())
print(torch.isnan(reshaped_gt).any())
print(torch.isnan(diff).any())
to

@guwenxiang1
Author

Thank you for your patient response. The output after adding the code is as follows:

== Initial variables' sum: -130969.684, avg: -325.795
/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
Traceback (most recent call last):

@cnexah
Owner

cnexah commented Dec 15, 2023

I see. The output from the network and the ground truth are both fine at the first iteration.
Could you replace the following code:

num_pixels = num_pixels.reshape(num_pixels.shape[0], -1).sum(dim=-1) + 1e-6

by:
num_pixels = num_pixels.reshape(num_pixels.shape[0], -1).sum(dim=-1) + 1e-1

After this, is the si loss still NaN?
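For context, here is a generic sketch of the failure mode this epsilon guards against; it is not the repo's exact loss. If a sample has no valid ground-truth pixels, the per-sample mean divides by roughly zero and the loss turns NaN, so filtering out such samples is usually more robust than enlarging the epsilon:

import torch

def masked_si_loss(pred, gt, mask, eps=1e-6, lam=0.85):
    # Clamp before log so zero or negative depths cannot inject NaN/Inf.
    d = (torch.log(pred.clamp(min=eps)) - torch.log(gt.clamp(min=eps))) * mask
    d = d.reshape(d.shape[0], -1)
    n = mask.reshape(mask.shape[0], -1).sum(dim=-1) + eps  # valid-pixel count
    # Scale-invariant form; lam=0.85 is common in depth papers and is an
    # assumption here, not necessarily the repo's value.
    return ((d ** 2).sum(-1) / n - lam * (d.sum(-1) / n) ** 2).mean()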

@guwenxiang1
Author

The result of print(torch.isnan(si_loss).any()) is:
== Initial variables' sum: -130963.132, avg: -325.779
/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
tensor(True, device='cuda:0')
Traceback (most recent call last):

@cnexah
Owner

cnexah commented Dec 15, 2023

I see. I'm not sure whether the NaN in the loss comes from the first, second, or third scale of prediction.
Could you add the following code:
return loss1
to:

@guwenxiang1
Author

I printed the feat and loss1, and the results are:
== Initial variables' sum: -131122.057, avg: -326.174
/home/ouc/anaconda3/envs/midas-py310/lib/python3.10/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
feat
{'scale_1': tensor([[[[inf, inf, inf, ..., inf, inf, inf],
[inf, inf, inf, ..., inf, inf, inf],
[inf, inf, inf, ..., inf, inf, inf],
...,
[inf, inf, inf, ..., inf, inf, inf],
[inf, inf, inf, ..., inf, inf, inf],
[inf, inf, inf, ..., inf, inf, inf]]]], device='cuda:0',
grad_fn=<...>)}
tensor(nan, device='cuda:0', grad_fn=<...>)
tensor(True, device='cuda:0')
Traceback (most recent call last):
