CUDA out of memory #9
Comments
Try setting model_max_length to a smaller value and see if that helps.
Setting model_max_length to 10 still runs out of memory. How did you get this running on a 24G card?
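For context on why shrinking model_max_length is the usual first fix: a per-example attention mask (and the attention-score matrix) grows with the square of the sequence length, so halving the length cuts that term by 4x. A minimal arithmetic sketch (the 2-byte bf16 entry size is an assumption):

```python
# How a seq_len x seq_len per-example attention mask grows with sequence length.
def mask_bytes(seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes for one seq_len x seq_len mask, assuming dtype_bytes per entry."""
    return seq_len * seq_len * dtype_bytes

for length in (512, 2048, 8192):
    print(f"{length:5d} tokens -> {mask_bytes(length) / 2**20:.1f} MiB per example")
# Doubling seq_len quadruples this term.
```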
The environment matches requirements.txt, except that bitsandbytes-0.41.3 is used.
Running bash fintune_lora_llama3_8B_chat.sh on a single 80G H100 fails with CUDA out of memory.
The full log is below. What could be causing this?
[2024-10-22 12:37:30,959] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-10-22 12:37:32,556] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-10-22 12:37:32,556] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-10-22 12:37:32,556] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
quantization_config: None
[2024-10-22 12:37:39,510] [INFO] [partition_parameters.py:453:exit] finished initializing model with 8.03B parameters
Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00, 1.70s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
['o_proj', 'k_proj', 'q_proj', 'v_proj']
None
trainable params: 54,525,952 || all params: 8,084,787,200 || trainable%: 0.6744265575722265
Loading data...
Formatting inputs...Skip in lazy mode
/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
warnings.warn(
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/haojk/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
Emitting ninja build file /home/haojk/.cache/torch_extensions/py38_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.6103599071502686 seconds
Using /home/haojk/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
Emitting ninja build file /home/haojk/.cache/torch_extensions/py38_cu118/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3512094020843506 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
Using /home/haojk/.cache/torch_extensions/py38_cu118 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00031566619873046875 seconds
0%|          | 0/3300 [00:00<?, ?it/s]
Traceback (most recent call last):
File "../finetune_llama3.py", line 434, in <module>
train()
File "../finetune_llama3.py", line 427, in train
trainer.train()
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/transformers/trainer.py", line 2902, in training_step
loss = self.compute_loss(model, inputs)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/transformers/trainer.py", line 2925, in compute_loss
outputs = model(**inputs)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
loss = self.module(*inputs, **kwargs)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/peft/peft_model.py", line 918, in forward
return self.base_model(
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 94, in forward
return self.model.forward(*args, **kwargs)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 1176, in forward
outputs = self.model(
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 993, in forward
causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)
File "/home/haojk/miniconda3/envs/guangke/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 1074, in _update_causal_mask
causal_mask = self.causal_mask[None, None, :, :].repeat(batch_size, 1, 1, 1).to(dtype) * min_dtype
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 GiB (GPU 0; 79.11 GiB total capacity; 50.06 GiB already allocated; 20.25 GiB free; 52.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
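For what it's worth, the 32.00 GiB figure is consistent with the `repeat(batch_size, 1, 1, 1)` call in the traceback materializing a full-context causal mask for every batch element. One combination that reproduces the number exactly, assuming an 8192x8192 mask (Llama 3's context window), float32 entries, and an effective batch of 128 — all three values are assumptions, not read from the log:

```python
# Back-of-envelope check on the 32.00 GiB allocation from the traceback.
# Assumed (hypothetical) values: 8192-token mask, float32 entries, batch 128.
seq_len = 8192
dtype_bytes = 4
batch_size = 128

total_bytes = batch_size * seq_len * seq_len * dtype_bytes
print(total_bytes / 2**30, "GiB")  # 32.0 GiB, matching the error message
```

Since `_update_causal_mask` here builds the mask from `self.causal_mask` (a buffer sized by the model's context window, per the traceback) rather than from the actual input length, this would also explain why lowering model_max_length alone did not help. Shrinking the batch size, or upgrading transformers (if I recall correctly, later releases removed this static mask buffer), would be the usual workarounds.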