CUDA OOM #47

zhengbi-yong · 2024-11-27T07:53:05Z

When I run python demo.py --input demo_data/lady-running --output_dir demo_tmp --seq_name lady-running
I get a torch.OutOfMemoryError:

(monst3r)  sisyphus@sisyphus-dual4090  ~/Projects/monst3r   main ±  python demo.py --input demo_data/lady-running --output_dir demo_tmp --seq_name lady-running            
/home/sisyphus/Projects/monst3r/dust3r/cloud_opt/base_opt.py:399: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  @torch.cuda.amp.autocast(enabled=False)
... loading model from checkpoints/MonST3R_PO-TA-S-W_ViTLarge_BaseDecoder_512_dpt.pth
/home/sisyphus/Projects/monst3r/dust3r/model.py:29: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  ckpt = torch.load(model_path, map_location='cpu')
instantiating : AsymmetricCroCo3DStereo(pos_embed='RoPE100', patch_embed_cls='PatchEmbedDust3R', img_size=(512, 512), head_type='dpt', output_mode='pts3d', depth_mode=('exp', -inf, inf), conf_mode=('exp', 1, inf), enc_embed_dim=1024, enc_depth=24, enc_num_heads=16, dec_embed_dim=768, dec_depth=12, dec_num_heads=12, freeze='encoder', landscape_only=False)
Freezing encoder parameters
<All keys matched successfully>
Outputting stuff in demo_tmp
>> Loading a list of 65 items
 - Adding demo_data/lady-running/00000.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00001.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00002.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00003.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00004.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00005.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00006.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00007.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00008.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00009.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00010.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00011.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00012.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00013.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00014.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00015.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00016.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00017.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00018.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00019.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00020.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00021.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00022.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00023.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00024.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00025.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00026.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00027.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00028.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00029.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00030.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00031.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00032.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00033.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00034.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00035.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00036.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00037.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00038.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00039.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00040.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00041.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00042.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00043.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00044.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00045.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00046.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00047.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00048.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00049.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00050.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00051.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00052.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00053.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00054.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00055.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00056.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00057.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00058.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00059.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00060.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00061.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00062.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00063.jpg with resolution 854x480 --> 512x288
 - Adding demo_data/lady-running/00064.jpg with resolution 854x480 --> 512x288
 (Found 65 images)
>> Inference with model on 600 image pairs
  0%|                     | 0/600 [00:00<?, ?it/s]/home/sisyphus/Projects/monst3r/dust3r/inference.py:70: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=bool(use_amp)):
/home/sisyphus/Projects/monst3r/dust3r/model.py:209: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
/home/sisyphus/Projects/monst3r/dust3r/inference.py:74: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.amp.autocast(enabled=False):
100%|███████████| 600/600 [00:25<00:00, 23.93it/s]
precomputing flow...
/home/sisyphus/Projects/monst3r/third_party/raft.py:64: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(args.model)
Loaded pretrained RAFT model from third_party/RAFT/models/Tartan-C-T-TSKH-spring540x960-M.pth
  0%|                      | 0/50 [00:00<?, ?it/s]/home/sisyphus/anaconda3/envs/monst3r/lib/python3.11/site-packages/torch/functional.py:534: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1729647429097/work/aten/src/ATen/native/TensorShape.cpp:3595.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
100%|█████████████| 50/50 [00:22<00:00,  2.26it/s]
flow precomputed
100%|███████████| 300/300 [00:11<00:00, 25.06it/s]
propagate in video: 100%|█| 65/65 [00:01<00:00, 36
propagate in video: 100%|█| 65/65 [00:01<00:00, 37
 init edge (36*,41*) score=np.float64(75.08370971679688)
 init edge (36,39*) score=np.float64(72.14512634277344)
 init edge (36,43*) score=np.float64(71.0331802368164)
 init edge (31*,36) score=np.float64(69.88085174560547)
 init edge (35*,36) score=np.float64(69.50623321533203)
 init edge (33*,36) score=np.float64(68.75778198242188)
 init edge (35,34*) score=np.float64(68.12627410888672)
 init edge (35,32*) score=np.float64(67.92353057861328)
 init edge (36,45*) score=np.float64(65.36090087890625)
 init edge (28*,35) score=np.float64(64.32600402832031)
 init edge (29*,36) score=np.float64(63.106346130371094)
 init edge (27*,36) score=np.float64(62.209922790527344)
 init edge (27,30*) score=np.float64(59.32157897949219)
 init edge (26*,31) score=np.float64(54.36586380004883)
 init edge (39,48*) score=np.float64(54.24318313598633)
 init edge (43,52*) score=np.float64(51.04729080200195)
 init edge (28,25*) score=np.float64(50.92348098754883)
 init edge (27,24*) score=np.float64(50.71531295776367)
 init edge (41,50*) score=np.float64(48.895809173583984)
 init edge (59*,52) score=np.float64(47.21723175048828)
 init edge (59,64*) score=np.float64(45.50202560424805)
 init edge (24,23*) score=np.float64(43.378780364990234)
 init edge (27,22*) score=np.float64(42.379398345947266)
 init edge (21*,24) score=np.float64(38.36738967895508)
 init edge (27,20*) score=np.float64(34.135589599609375)
 init edge (19*,24) score=np.float64(32.27302169799805)
 init edge (18*,19) score=np.float64(28.75078582763672)
 init edge (15*,22) score=np.float64(26.890655517578125)
 init edge (17*,18) score=np.float64(25.864116668701172)
 init edge (16*,23) score=np.float64(24.763416290283203)
 init edge (13*,18) score=np.float64(24.73274040222168)
 init edge (10*,15) score=np.float64(20.788616180419922)
 init edge (11*,16) score=np.float64(19.95123863220215)
 init edge (6*,11) score=np.float64(19.225000381469727)
 init edge (1*,10) score=np.float64(18.201091766357422)
 init edge (4*,11) score=np.float64(17.85396957397461)
 init edge (35,42*) score=np.float64(74.03719329833984)
 init edge (35,40*) score=np.float64(72.90853881835938)
 init edge (35,38*) score=np.float64(72.0782241821289)
 init edge (35,44*) score=np.float64(70.10966491699219)
 init edge (37*,40) score=np.float64(67.8114013671875)
 init edge (37,46*) score=np.float64(61.162193298339844)
 init edge (42,47*) score=np.float64(53.49862289428711)
 init edge (42,51*) score=np.float64(52.294639587402344)
 init edge (40,49*) score=np.float64(51.81222152709961)
 init edge (56*,59) score=np.float64(50.528812408447266)
 init edge (51,60*) score=np.float64(48.431884765625)
 init edge (46,55*) score=np.float64(47.660011291503906)
 init edge (56,61*) score=np.float64(47.41448974609375)
 init edge (59,54*) score=np.float64(47.35690689086914)
 init edge (54,57*) score=np.float64(45.83323669433594)
 init edge (54,63*) score=np.float64(45.600563049316406)
 init edge (14*,17) score=np.float64(28.082401275634766)
 init edge (12*,17) score=np.float64(26.88484764099121)
 init edge (5*,10) score=np.float64(20.78989601135254)
 init edge (2*,11) score=np.float64(20.3303165435791)
 init edge (2,9*) score=np.float64(19.79378318786621)
 init edge (7*,12) score=np.float64(19.537267684936523)
 init edge (3*,12) score=np.float64(19.106111526489258)
 init edge (5,8*) score=np.float64(18.861652374267578)
 init edge (0*,9) score=np.float64(16.9484920501709)
 init edge (53*,60) score=np.float64(50.2554817199707)
 init edge (55,58*) score=np.float64(48.09707260131836)
 init edge (53,62*) score=np.float64(47.796112060546875)
flow loss: 9.280871391296387
 init loss = 0.14837846159934998
Global alignement - optimizing for:
['pw_poses', 'im_depthmaps', 'im_poses', 'im_focals']
 10%| | 30/300 [00:04<00:38,  6.97it/s, lr=0.00913
Traceback (most recent call last):
  File "/home/sisyphus/Projects/monst3r/demo.py", line 338, in <module>
    scene, outfile, imgs = recon_fun(
                           ^^^^^^^^^^
  File "/home/sisyphus/Projects/monst3r/demo.py", line 124, in get_reconstructed_scene
    loss = scene.compute_global_alignment(init='mst', niter=niter, schedule=schedule, lr=lr)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/anaconda3/envs/monst3r/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/Projects/monst3r/dust3r/cloud_opt/base_opt.py", line 414, in compute_global_alignment
    return global_alignment_loop(self, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/Projects/monst3r/dust3r/cloud_opt/base_opt.py", line 479, in global_alignment_loop
    loss, lr = global_alignment_iter(net, bar.n, niter, lr_base, lr_min, optimizer, schedule, 
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/Projects/monst3r/dust3r/cloud_opt/base_opt.py", line 511, in global_alignment_iter
    loss = net(epoch=cur_iter)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/anaconda3/envs/monst3r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/anaconda3/envs/monst3r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/Projects/monst3r/dust3r/cloud_opt/optimizer.py", line 520, in forward
    ego_flow_1_2, _ = self.depth_wrapper(R1, T1, R2, T2, disp_1, K_2, inv_K_1)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/anaconda3/envs/monst3r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/anaconda3/envs/monst3r/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/Projects/monst3r/dust3r/utils/goem_opt.py", line 526, in forward
    return warp_by_disp(src_R, src_t, tgt_R, tgt_t, K, src_disp, self.coord, inv_K, debug_mode, use_depth)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sisyphus/Projects/monst3r/dust3r/utils/goem_opt.py", line 233, in warp_by_disp
    tgt_coord = torch.matmul(H_mat, coord) + flat_disp * \
                                             ^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1014.00 MiB. GPU 0 has a total capacity of 23.54 GiB of which 89.12 MiB is free. Including non-PyTorch memory, this process has 22.29 GiB memory in use. Of the allocated memory 20.25 GiB is allocated by PyTorch, and 1.53 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I am using an RTX 4090.

lxin98 · 2024-11-27T14:26:14Z

Same issue here. I am still trying to figure out how to disable the flow_loss. If only 9 images are used, the problem does not occur.

Junyi42 · 2024-11-27T22:25:23Z

Hi @lxin98,

To disable the flow_loss, you can set flow_loss_weight=0 in the following line:

monst3r/demo.py

Line 357 in ce88f01

flow_loss_weight=0.01,

Currently, running the "lady-running" sequence with the default setup requires 33 GB of VRAM. Therefore, CUDA OOM is expected on hardware with 24 GB of VRAM (e.g., RTX 4090).
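To see how close the failing run was to the card's limit, the figures in the OOM message from the original report can be extracted and compared. The following is a minimal sketch: parse_oom and the key names are hypothetical helpers for illustration, not part of PyTorch or MonST3R, and the message is shortened to the sentences that carry numbers.

```python
import re

# Sentences copied from the traceback in this issue (non-numeric parts omitted).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 1014.00 MiB. "
    "GPU 0 has a total capacity of 23.54 GiB of which 89.12 MiB is free. "
    "Of the allocated memory 20.25 GiB is allocated by PyTorch, "
    "and 1.53 GiB is reserved by PyTorch but unallocated."
)

_UNIT_TO_MIB = {"KiB": 1.0 / 1024.0, "MiB": 1.0, "GiB": 1024.0}
_SIZE = r"([\d.]+) (KiB|MiB|GiB)"

# Hypothetical field names; each pattern targets one sentence fragment.
_PATTERNS = {
    "requested": r"Tried to allocate " + _SIZE,
    "capacity": r"total capacity of " + _SIZE,
    "free": r"of which " + _SIZE + r" is free",
    "allocated": r"allocated memory " + _SIZE + r" is allocated",
    "reserved_unallocated": r"and " + _SIZE + r" is reserved",
}


def parse_oom(message):
    """Return the sizes found in an OOM message, all converted to MiB."""
    sizes = {}
    for key, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            sizes[key] = float(value) * _UNIT_TO_MIB[unit]
    return sizes


info = parse_oom(OOM_MESSAGE)
shortfall_mib = info["requested"] - info["free"]
print(f"requested {info['requested']:.2f} MiB, free {info['free']:.2f} MiB")
print(f"shortfall for this single allocation: {shortfall_mib:.2f} MiB")
```

This only measures the one allocation that failed. Since the default setup is stated to need about 33 GB in total, allocator tuning such as expandable_segments is unlikely to be enough on a 24 GB card.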

These are the tricks to overcome the OOM issue:

(w/o performance drop): implement a window-wise optimization instead of the current implementation, which batches all pairs together (this takes time to implement)
(w/ a little performance drop): turn off the flow loss, as described above
(w/ performance drop): use a smaller image size, see "Memory issue" #18
(w/ performance drop): use a sparser video graph, see "GPU Memory OOM" #28
(much slower): process on the CPU, see "fix demo.py with --device cpu and no CUDA device available" #46
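As a rough illustration of why a sparser video graph or fewer frames helps: the log in this issue reports 600 image pairs for 65 frames, and its "init edge" lines only ever link frames that are 1, 3, 5, 7, or 9 apart. The sketch below assumes exactly that pairing rule, with each pair used in both directions. This is an inference from the log, not taken from the MonST3R source, and count_directed_pairs with its offsets argument is a hypothetical name.

```python
def count_directed_pairs(n_frames, offsets=(1, 3, 5, 7, 9)):
    """Count image pairs when frame i is linked to frame i + d for each offset d.

    Assumption (inferred from the log, not from the MonST3R code): the graph is
    non-cyclic and every undirected pair is evaluated in both directions.
    """
    undirected = sum(max(n_frames - d, 0) for d in offsets)
    return 2 * undirected


print(count_directed_pairs(65))                     # 600, matches the log
print(count_directed_pairs(30))                     # 250
print(count_directed_pairs(9))                      # 40
print(count_directed_pairs(65, offsets=(1, 3, 5)))  # 372
```

Under this assumption, using fewer frames or dropping the largest offsets shrinks the set of pairs that global alignment has to hold at once, which is consistent with the report that 9 images run without OOM.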

Hope this helps!

Best

zhengbi-yong · 2024-11-28T11:56:09Z

Thank you. I disabled the flow_loss, and it worked.

lxin98 · 2024-11-29T01:56:40Z

Thank you for your excellent work and assistance @Junyi42

npmhung · 2024-12-12T22:11:55Z

I tried turning off the flow-based loss. However, the result for the lady-running demo is then much worse: the model no longer detects the lady as a moving object. The dynamic mask is completely black, which indicates that the whole scene is treated as static.

Note: I only used the first 30 frames of the video.

lxin98 · 2024-12-13T11:15:07Z

@npmhung
That's right, I've also encountered this situation.

Junyi42 · 2024-12-26T22:23:55Z

Hi,

I have added an implementation of memory-efficient global alignment in #59, which has a smaller impact on performance.

Hope this helps.

YunjieYu · 2025-03-05T13:54:17Z

@npmhung @lxin98 @zhengbi-yong Hi everyone, I just submitted a merge request for window-wise optimization (#72), which is waiting for the author's review. With it, one can directly optimize a long video with a larger number of frames and obtain the expected results. Please enjoy these changes!
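The general idea of window-wise processing can be sketched as follows: instead of evaluating the loss over all pairs in one batch, the pairs are visited in fixed-size windows and the per-window losses are summed. This is a toy illustration only, not the implementation in #72. iter_windows, toy_pair_loss, and windowed_loss are hypothetical names, and the pair list assumes the odd-offset graph inferred from the log in this issue.

```python
from itertools import islice


def iter_windows(items, window_size):
    """Yield consecutive chunks of at most window_size items."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    iterator = iter(items)
    while True:
        window = list(islice(iterator, window_size))
        if not window:
            return
        yield window


def toy_pair_loss(pair):
    """Stand-in for a per-pair alignment loss (purely illustrative)."""
    i, j = pair
    return abs(i - j) * 0.5


def windowed_loss(pairs, window_size):
    """Sum the loss window by window and track the largest batch held at once."""
    total = 0.0
    peak_batch = 0
    for window in iter_windows(pairs, window_size):
        peak_batch = max(peak_batch, len(window))
        total += sum(toy_pair_loss(p) for p in window)
    return total, peak_batch


# Assumed graph: 65 frames linked at offsets 1, 3, 5, 7, 9 (300 undirected pairs).
pairs = [(i, i + d) for d in (1, 3, 5, 7, 9) for i in range(65 - d)]
batched = sum(toy_pair_loss(p) for p in pairs)
total, peak_batch = windowed_loss(pairs, window_size=64)
print(len(pairs), batched, total, peak_batch)
```

The summed loss is the same as in the batched case, while at most window_size pairs are processed at a time. In a real PyTorch loop, calling backward on each window's loss before moving to the next one would presumably be what keeps peak memory bounded.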

zhengbi-yong closed this as completed Nov 28, 2024

This was referenced Mar 22, 2025

Minor typos in the training code #71

Merged

implement a window-wise mode for improving the memory-speed trade-off #72

Merged
