Some confusion about <CUDA out of memory> #39
Comments
Hi, can you send me your logs? Last time I tested mBART-50 training, 1024 was the maximum batch size I could train with on my 32 GB GPU, and even then I would run out of memory on and off. So 40 GB should be OK with 2048-size batches, but I can't be 100% sure. It's 99% not a problem with my code.

As for why this happens sporadically: 2048 might be at the edge of your GPU's maximum capacity, and when the PyTorch allocator tries to allocate memory while at the edge of capacity, OOMs happen. That said, a crash at 300k seems like a one-off thing that happens with fairseq too.

Also, you seem to have modified the pretraining script pretrain_nmt_new.py. I also have questions about your command:
If I were you I would do the following:
Then I would go to the following code block:
I would change the elif part to use AutoTokenizer, or create an if/else under the elif part to use the MBART-50 tokenizer when "50" is present in the tokenizer name. This is a temporary fix, I know. I will consider fixing the whole flow to make it easier to resume crashed fine-tuning runs of official models. A rough sketch of what that if/else might look like is given below.
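The following is only a sketch of the suggested change, assuming a Hugging Face transformers tokenizer-loading block; the function and variable names are illustrative, not the toolkit's actual code:

```python
# Hypothetical sketch of the suggested if/else; not the repository's actual code.
from transformers import AutoTokenizer, MBart50TokenizerFast

def load_tokenizer(tokenizer_name_or_path: str):
    if "50" in tokenizer_name_or_path:
        # Official mBART-50 checkpoints ship with their own tokenizer class.
        return MBart50TokenizerFast.from_pretrained(tokenizer_name_or_path)
    # Otherwise let AutoTokenizer resolve the right class from the checkpoint's config.
    return AutoTokenizer.from_pretrained(tokenizer_name_or_path)
```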
A small edit: your previous run should have produced a folder with the suffix "_deploy". Set --tokenizer_name_or_path to this folder and you can resume training. I am also considering splitting the --use_official_pretrained flag into two, one for the tokenizer and one for the model, so that you can use the official tokenizer with a non-official, locally fine-tuned model. That would be perfect for your situation, which needs to resume a crashed fine-tuning run of an official model, and it would eliminate the need to modify the code.
Thanks for your patience @prajdabre. The following is part of my run_train.log:
-- Process 0 terminated with the following error:
I don't know if this is the log you want to see.
Let me explain your question 1, why --encoder_ffn_dim=128: last time I just set a random value for --encoder_ffn_dim as a test, and found that in the generated
As for your question 2: this morning, when I found my task had crashed, I backed up all generated models, logs, and the _deploy folder, and then copied /****/mbart-50-v1_deploy/pytorch_model.bin into the pretrain_model/mbart-50 directory to replace the open-source model downloaded from Hugging Face. The reason I did this is that I want to resume the crashed task and continue pretraining from my previous model. But after reading your reply above, I think there may be a problem with this approach, and I have to think about how to adjust it correctly.
As for the models generated before the task crashed, there is also a large model of about 6.9 GB, consisting of
And as for the script: for this part, I commented out some code that saves the model and loads the checkpoint,
and for this part, I did something similar to achieve the same goal.
No other changes were made to this script.
And for resuming the crashed task, would it be more reasonable to load this
Hi, I just pushed some code a moment ago with some explanation in the commit which I am pasting here. Changes:
With these changes, you can keep your original fine-tuning command (the one which ran until 300k iterations) with the following changes: --locally_fine_tuned_model --batch_size 1024 --multistep_optimizer_steps 2. The --multistep_optimizer_steps 2 option will simulate a 2048 batch size. Please try it and let me know if it works; if it does, you should see the model resuming from iteration 300000.
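For intuition, here is a minimal gradient-accumulation sketch of what an option like --multistep_optimizer_steps does conceptually; the tiny model, optimizer, and data below are stand-ins, not the toolkit's actual implementation:

```python
# Illustrative gradient-accumulation sketch (not the toolkit's actual code).
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                           # stand-in for the real seq2seq model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
accumulation_steps = 2                             # e.g. --multistep_optimizer_steps 2

# Dummy data: four mini-batches of the smaller size (1024).
data_loader = [(torch.randn(1024, 16), torch.randn(1024, 4)) for _ in range(4)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data_loader):
    loss = loss_fn(model(x), y) / accumulation_steps   # scale so accumulated gradients average out
    loss.backward()                                    # gradients accumulate across mini-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                               # one update per 2 mini-batches, ~2048 effective batch
        optimizer.zero_grad()
```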
Thank you for the reply @prajdabre. I am using your latest versions of pretrain_nmt.py and common_utils.py, and have one point of confusion: this part corresponds to reloading the latest local model_state of
And the following part corresponds to reloading the optimizer, right?
And the following is my script's command:
I would also like to confirm with you that in your new version of pretrain_nmt.py I changed just two parts;
I made these two changes in order to resolve the problem in #35
Oh, I see. I found that in my command I had accidentally set this parameter
Exactly! You got it! Hope it runs smoothly now.
@prajdabre It seems to run successfully after I removed that parameter.
Yeah, thanks.
I think I figured out why your model run crashed. The part you commented out contains del checkpoint_dict. This dict maintains a copy of the model parameters, and that copy takes up GPU space. Originally it would be deleted, but since you commented that out it remains in memory, leading to the OOM error. Note that this issue affected your previous run, which trained up to 300k iterations, but it won't exist in the current run since the checkpoint_dict is deleted. Hope that makes sense.
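To illustrate the pattern being described, here is a sketch of the load-then-delete flow with a toy model; the names and structure are placeholders, not the repository's exact code:

```python
# Sketch of the load-then-delete checkpoint pattern; toy model as a stand-in.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters())
torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, "ckpt.pt")

checkpoint_dict = torch.load("ckpt.pt")                  # tensors load onto the device they were saved from
model.load_state_dict(checkpoint_dict["model"])          # weights are copied into the model
optimizer.load_state_dict(checkpoint_dict["optimizer"])

# checkpoint_dict still holds a full extra copy of every tensor. If those
# tensors sit on the GPU, keeping the dict around wastes memory and can push
# a near-capacity run into an OOM, which is why deleting it matters.
del checkpoint_dict
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```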
In fact, I also suspected this was the reason, but I was not sure, and this is not the first time the OOM problem has occurred. I will keep observing for a while, and I hope the problem is now solved. Current GPU memory usage is stable at 30 GB out of 40 GB. Also, because of --multistep_optimizer_steps 2, the time consumed per 100 batches is less than doubled.
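As an aside, here is a small snippet one could drop into a training loop to log the kind of headroom numbers mentioned above; it only uses standard PyTorch allocator queries and is not part of the toolkit:

```python
import torch

def log_gpu_memory(tag: str, device: int = 0) -> None:
    """Print allocated/reserved/peak memory in GiB for one GPU."""
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib   # tensors currently in use
    reserved = torch.cuda.memory_reserved(device) / gib     # memory held by the caching allocator
    peak = torch.cuda.max_memory_allocated(device) / gib    # high-water mark since the last reset
    print(f"[{tag}] allocated={allocated:.1f} GiB reserved={reserved:.1f} GiB peak={peak:.1f} GiB")

# Example: call this every N batches inside the training loop.
if torch.cuda.is_available():
    log_gpu_memory("batch 100")
```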
Hi, when I use train_mbart_model.sh to continue pretraining mBART-50, after 300k batches the following error occurred:

RuntimeError: CUDA out of memory. Tried to allocate 1.90 GiB (GPU 0; 39.44 GiB total capacity; 19.93 GiB already allocated; 1.31 GiB free; 36.14 GiB reserved in total by PyTorch)

At this point, one epoch had been completed. I then restarted the code to continue pretraining with the batch size reduced from 2048 to 1024, using the checkpoint that had just been generated. I'm confused about why this problem suddenly appeared after the task had run for so long, or whether there is some hidden problem in the program. Since I don't know why this happens, I'm worried that it will occur again after the task restarts, and I don't know whether only changing the batch size will help.
I use 4 A100 GPUs (NVIDIA-SMI 515.48.07, Driver Version 515.48.07, CUDA Version 11.7), each with 40 GB of memory, and during this time there were probably no other tasks preempting resources.

My script settings are:
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3 # Change to the GPU ID corresponding to a GPU that is free.
export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'
nohup python pretrain_nmt_new.py -n 1 -nr 0 -g 4 \
  --model_path gen_model/mbart_v1/mbart-50-v1 \
  --tokenizer_name_or_path pretrain_model/mbart-50 \
  --langs en_XX,es_XX,vi_VN,id_ID,th_TH,pt_XX,zh_CN,ko_KR \
  --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko \
  --encoder_layers 12 --decoder_layers 12 \
  --encoder_attention_heads=12 --decoder_attention_heads=12 \
  --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 \
  --batch_size 2048 \
  --use_official_pretrained --pretrained_model pretrain_model/mbart-50 \
  --no_reload_optimizer_ctr_and_scheduler \
  --long_save_every 50000 --num_batches 10000000 --save_intermediate_checkpoints \
  --data_sampling_temperature 1.0 --hard_truncate_length 512 --max_length 512 \
  --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &
```