NCCL TIMEOUT when using axolotl full fine-tuning on Mixtral 7B x 8 #2256
Comments
Transferring, as this appears to be more of an accelerate-related issue (I think).
What GPU are you using?
I am using H800 80G with NVLink, and have set NCCL to use NVLink for P2P.
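For reference, a minimal sketch (not the reporter's actual launch script) of pinning NCCL's P2P transport to NVLink via environment variables; these must be set before the process group is created, and the exact values are assumptions to adapt to your topology.

```python
import os

# Must be exported before torch.distributed/accelerate initializes NCCL,
# e.g. at the top of the training script or in the launcher environment.
os.environ["NCCL_P2P_LEVEL"] = "NVL"  # use P2P only between GPUs linked by NVLink
os.environ["NCCL_DEBUG"] = "INFO"     # log which transport NCCL actually selects

import torch.distributed as dist

# Rank/world size come from the launcher (torchrun, accelerate launch, deepspeed).
dist.init_process_group(backend="nccl")
```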
With py-spy, I found it hangs in the backward or autograd function.
From the hub it can be seen that QLoRA works, but the full fine-tuning (FFT) setting didn't.
Hello @dumpmemory, we were able to do full fine-tuning for the Mixtral MoE model; this seems like an issue with axolotl.
Does it work without offloading?
It can't fit into 80G memory without offloading.
Could you provide your settings? In my case, training hangs after around 1 hour of training time.
After updating NCCL to 2.19.3 with NVIDIA 23.10, it ran for 7 hours!
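As a sanity check, it is worth confirming which NCCL version the running PyTorch build actually uses, since it is not necessarily the system or container NCCL; a quick sketch:

```python
import torch

# NCCL version PyTorch loads at runtime, e.g. (2, 19, 3)
print("NCCL:", torch.cuda.nccl.version())
print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
```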
I have updated NCCL to 2.19.3. With DeepSpeed zero3_offload it hangs; with zero2_offload it OOMs on GPU.
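For reference, a minimal sketch of what a ZeRO-3 CPU-offload config looks like when passed to the Hugging Face Trainer as a Python dict; the "auto" values are resolved by the Trainer, and none of these values are the reporter's actual settings.

```python
from transformers import TrainingArguments

# Illustrative ZeRO-3 + CPU offload config; tune/extend for a real run.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# TrainingArguments accepts either a path to a JSON file or a dict here.
args = TrainingArguments(output_dir="out", bf16=True, deepspeed=ds_config)
```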
I face similar problems. And I noticed that the functions are not actually stuck at backward: one process gets stuck during inference, all the other processes wait for it, and then it looks as if it is stuck at backward. I will try your solution and report the results soon.
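One low-overhead way to see which rank is the one that stalls, without attaching py-spy to every process, is to have each rank dump its own stacks when it stops making progress; a sketch using the standard-library faulthandler (the 600-second window is an arbitrary assumption):

```python
import faulthandler
import sys

def arm_hang_dump(timeout_s: int = 600) -> None:
    # Arm (or re-arm) a timer that dumps all thread stacks of this process to
    # stderr after `timeout_s` seconds. Calling it again resets the timer, so
    # if it is re-armed at every training step, only a rank that stops stepping
    # ever reaches the dump.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=sys.stderr)

# Inside the training loop:
#     for step, batch in enumerate(dataloader):
#         arm_hang_dump()
#         loss = training_step(batch)
# On clean shutdown:
#     faulthandler.cancel_dump_traceback_later()
```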
Is there a script for reproducing?
I am facing the same issue! But I use the Hugging Face transformers Trainer with DeepSpeed ZeRO-3, and the hang can be reproduced with the same random seed. It trains up to step 21 and then no more logs are printed; GPU memory stays full (e.g., 40G) with a utilization rate of 100%, but NO DATA EXCHANGE is observed on the RDMA network. So the process effectively dies at that moment, until the DDP timeout later (e.g., 1800 seconds) when the watchdog forces the entire process down.
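If the goal is just to rule out a slow-but-alive step (rather than a real deadlock), the 1800-second watchdog can be raised; with the transformers Trainer that is the ddp_timeout argument, and with accelerate it can be passed through InitProcessGroupKwargs. A sketch with illustrative values, not a fix for the underlying hang:

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs
from transformers import TrainingArguments

# transformers Trainer: raise the collective timeout from the default 1800 s.
args = TrainingArguments(output_dir="out", ddp_timeout=7200)

# accelerate: the equivalent when the Accelerator creates the process group.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=7200))]
)
```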
BTW, I am training a modified version of the Mixtral MoE model, but I keep the original MoE block.
Any update?
@dumpmemory Can you share your complete DeepSpeed config?
And are there any other requirements for the launch scripts? And for model and optimizer initialization?
Please check my responses in hiyouga/LLaMA-Factory#1845
@pacman100 Is there a script for reproducing?
It hangs after several steps using ZeRO-3. You can try this patch for similar cases.
Can you try deepspeedai/DeepSpeed#4966?
System Info
transformers version: 4.36.0

Who can help?
@pacman100

Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction

Expected behavior
no timeout