
How to Train: torch.cuda.OutOfMemoryError: CUDA out of memory #144

Open

Shaggy0815 opened this issue Aug 18, 2024 · 0 comments

Shaggy0815 commented Aug 18, 2024

Hi all,
I am following the instructions to install PixArt-Sigma on my local Linux server. At step 1.3, "You are ready to train!", I now get the following error:

python3 -m torch.distributed.launch --nproc_per_node=1 --master_port=12345 \
          train_scripts/train.py \
          configs/pixart_sigma_config/PixArt_sigma_xl2_img512_internalms.py \
          --load-from output/pretrained_models/PixArt-Sigma-XL-2-512-MS.pth \
          --work-dir output/your_first_pixart-exp \
          --debug
/home/test/.local/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
/home/test/.local/lib/python3.10/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
2024-08-18 10:48:52,906 - PixArt - INFO - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

2024-08-18 10:48:52,941 - PixArt - INFO - Config: 
data_root = 'pixart-sigma-toy-dataset'
data = dict(
    type='InternalDataMSSigma',
    root='InternData',
    image_list_json=['data_info.json'],
    transform='default_train',
    load_vae_feat=False,
    load_t5_feat=False)
image_size = 512
train_batch_size = 2
eval_batch_size = 16
use_fsdp = False
valid_num = 0
fp32_attention = True
model = 'PixArtMS_XL_2'
aspect_ratio_type = 'ASPECT_RATIO_512'
multi_scale = True
pe_interpolation = 1.0
qk_norm = False
kv_compress = False
kv_compress_config = dict(sampling=None, scale_factor=1, kv_compress_layer=[])
num_workers = 10
train_sampling_steps = 1000
visualize = True
deterministic_validation = False
eval_sampling_steps = 500
model_max_length = 300
lora_rank = 4
num_epochs = 10
gradient_accumulation_steps = 1
grad_checkpointing = True
gradient_clip = 0.01
gc_step = 1
auto_lr = dict(rule='sqrt')
validation_prompts = [
    'dog',
    'portrait photo of a girl, photograph, highly detailed face, depth of field',
    'Self-portrait oil painting, a beautiful cyborg with golden hair, 8k',
    'Astronaut in a jungle, cold color palette, muted colors, detailed, 8k',
    'A photo of beautiful mountain with realistic sunset and blue lake, highly detailed, masterpiece'
]
optimizer = dict(
    type='CAMEWrapper',
    lr=2e-05,
    weight_decay=0.0,
    eps=(1e-30, 1e-16),
    betas=(0.9, 0.999, 0.9999))
lr_schedule = 'constant'
lr_schedule_args = dict(num_warmup_steps=1000)
save_image_epochs = 1
save_model_epochs = 5
save_model_steps = 2500
sample_posterior = True
mixed_precision = 'fp16'
scale_factor = 0.13025
ema_rate = 0.9999
tensorboard_mox_interval = 50
log_interval = 1
cfg_scale = 4
mask_type = 'null'
num_group_tokens = 0
mask_loss_coef = 0.0
load_mask_index = False
vae_pretrained = 'output/pretrained_models/pixart_sigma_sdxlvae_T5_diffusers/vae'
load_from = 'output/pretrained_models/PixArt-Sigma-XL-2-512-MS.pth'
resume_from = None
snr_loss = False
real_prompt_ratio = 0.5
class_dropout_prob = 0.1
work_dir = 'output/your_first_pixart-exp'
s3_work_dir = None
micro_condition = False
seed = 43
skip_step = 0
loss_type = 'huber'
huber_c = 0.001
num_ddim_timesteps = 50
w_max = 15.0
w_min = 3.0
ema_decay = 0.95
image_list_json = ['data_info.json']

2024-08-18 10:48:52,941 - PixArt - INFO - World_size: 1, seed: 43
2024-08-18 10:48:52,941 - PixArt - INFO - Initializing: DDP for training
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.28s/it]
Traceback (most recent call last):
  File "/home/test/AI/PixArt-sigma/train_scripts/train.py", line 359, in <module>
    args.pipeline_load_from, subfolder="text_encoder", torch_dtype=torch.float16).to(accelerator.device)
  File "/home/test/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
    return super().to(*args, **kwargs)
  File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 7.78 GiB total capacity; 7.17 GiB already allocated; 69.38 MiB free; 7.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8550) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-18_10:49:19
  host      : test-MD34764
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8550)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

The GPU is an NVIDIA RTX 2060 Super with 8 GB of VRAM.
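(For reference, here is a minimal check, using only plain PyTorch and nothing PixArt-specific, to see how much of those 8 GB is actually free on the device before training starts; other processes such as the desktop session may already be holding part of it.)

import torch

# Query free and total memory on the current CUDA device, in bytes.
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 2**30:.2f} GiB, total: {total / 2**30:.2f} GiB")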

I understand that there is an issue with memory allocation; however, I am not sure how to resolve it, and in the first place (at least in my mind) it should not happen anyway :)
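(As a side note, the allocator tweak that the error message itself suggests can be tried by setting PYTORCH_CUDA_ALLOC_CONF before torch is first imported, or by exporting it in the shell before launching. This is only a sketch: it mitigates fragmentation but does not add capacity, and the value 128 below is just an illustrative guess, not a recommendation from the repo.)

import os

# Must be set before the first "import torch" in the process.
# 128 MiB is an arbitrary example value for max_split_size_mb.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch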

Can anyone help, please?
Thanks a lot in advance.
