
NCCL timeout when using axolotl full fine-tuning of Mixtral 8x7B #2256

Closed

dumpmemory opened this issue Dec 14, 2023 · 22 comments
Comments

@dumpmemory

dumpmemory commented Dec 14, 2023

System Info

  • transformers version: 4.36.0
  • Platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.23.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0a0+b5021ba (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: Yes

Who can help?

@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Use https://github.com/OpenAccess-AI-Collective/axolotl for full fine-tuning of Mixtral 8x7B with DeepSpeed ZeRO-3 offload (a rough config sketch follows below).
  2. Faced an NCCL timeout during backward after about 1.5 hours of training.
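
For reference, a rough sketch of the kind of ZeRO-3 CPU-offload config this refers to (the exact file and values from my run are not included here, so treat these as placeholder assumptions):

```python
# Hedged sketch only: a generic ZeRO-3 CPU-offload DeepSpeed config of the kind
# axolotl is pointed at; the exact file/values used in this run are not shown here.
import json

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("zero3_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```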

Expected behavior

No timeout.

@amyeroberts

Transferring as this appears to be more of an Accelerate-related issue (I think).

@amyeroberts amyeroberts transferred this issue from huggingface/transformers Dec 14, 2023
@muellerzr
Collaborator

What GPU are you using?

@dumpmemory
Author

dumpmemory commented Dec 15, 2023

What GPU are you using?

I am using H800 80 GB with NVLink, and have set NCCL to use NVLink for P2P.
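
Roughly, the relevant environment is set like this before launch (a sketch; the debug flags are just what I would add to narrow down a hang, not confirmed settings from this run):

```python
# Sketch: steer NCCL P2P onto NVLink and turn on logging that helps localize hangs.
# These must be set in the environment before torch.distributed initializes NCCL.
import os

os.environ.setdefault("NCCL_P2P_LEVEL", "NVL")              # use NVLink for peer-to-peer transfers
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # log NCCL initialization and errors
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,COLL")     # focus logs on init and collectives
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # PyTorch-side collective mismatch checks
```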

@dumpmemory
Author

With py-spy, I found it hangs in the backward / autograd function.
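
Something along these lines to capture the stacks (a sketch; the PIDs are placeholders for the hung trainer processes):

```python
# Sketch with placeholder PIDs: use `py-spy dump` to capture each rank's Python
# stack and see which rank is blocked inside backward/autograd.
import subprocess

def dump_ranks(pids):
    """Dump the current stack of each training process (PIDs are placeholders)."""
    for pid in pids:
        print(f"=== process {pid} ===")
        subprocess.run(["py-spy", "dump", "--pid", str(pid)], check=False)

if __name__ == "__main__":
    dump_ranks([11111, 11112])  # replace with the real PIDs of the hung ranks
```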

@dumpmemory
Author

From the hub, it can be seen that QLoRA works, but the FFT (full fine-tuning) setting doesn't.

@pacman100
Contributor

Hello @dumpmemory, we were able to do full fine-tuning of the Mixtral MoE model; this seems like an issue with axolotl.

@pacman100
Contributor

Does it work without offloading?

@dumpmemory
Author

Does it work without offloading?

It can't fit into 80 GB of memory without offloading.
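
Rough back-of-the-envelope numbers (my own estimate, not measurements): with roughly 47B parameters, bf16 weights/gradients plus fp32 master weights and Adam moments come to about 16 bytes of persistent state per parameter, so even fully partitioned across 8 GPUs that is about 93 GB per GPU before activations:

```python
# Back-of-the-envelope estimate with assumed numbers, not measurements.
params = 46.7e9                        # approx. total parameters of Mixtral 8x7B
bytes_per_param = 2 + 2 + 4 + 4 + 4    # bf16 weights + bf16 grads + fp32 master + two fp32 Adam moments
gpus = 8
per_gpu_gb = params * bytes_per_param / gpus / 1e9
print(f"~{per_gpu_gb:.0f} GB of persistent state per GPU under ZeRO-3")  # ≈ 93 GB, over the 80 GB budget
```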

@dumpmemory
Author

dumpmemory commented Dec 15, 2023

Hello @dumpmemory, we were able to do full fine-tuning of the Mixtral MoE model; this seems like an issue with axolotl.

Could you provide your settings? In my case, training hangs at around 1 hour of training time, and
I am using the DeepSpeed CPU AdamW optimizer.

@dumpmemory
Author

After updating NCCL to 2.19.3 with NVIDIA 23.10, it has been running for 7 hours!

@awzhgw

awzhgw commented Jan 14, 2024

After updating NCCL to 2.19.3 with NVIDIA 23.10, it has been running for 7 hours!

I have updated NCCL to 2.19.3. With DeepSpeed zero3_offload it hangs; with zero2_offload it OOMs on the GPU.

@tingxueronghua

After updating NCCL to 2.19.3 with NVIDIA 23.10, it has been running for 7 hours!

I face similar problems... And I noticed that it is not actually stuck at backward: in fact, one process gets stuck during inference and all the other processes wait for it, which then makes it look like it is stuck at backward.

I will try your solution and report the results soon.

@xs1997zju

Is there a script for reproducing this?

@yuleiqin

I am facing the same issue! But I use the Hugging Face transformers Trainer with DeepSpeed ZeRO-3, and the hang can be reproduced with the same random seed. It trains up to step 21 and then no further logs are printed; GPU memory stays full (e.g., 40 GB) with a utilization rate of 100%, but NO DATA EXCHANGE is observed on the RDMA network. So the process is effectively dead at that moment until, a DDP timeout later (e.g., 1800 seconds), the watchdog forces the entire process down.
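
Side note while debugging: the 1800-second watchdog can be lengthened so the job stays up long enough to inspect. With the Trainer/Accelerate stack, something like the sketch below should work (it only buys time and does not fix the hang); if I remember correctly, `TrainingArguments(ddp_timeout=...)` serves the same purpose.

```python
# Hedged sketch: raise the process-group (NCCL watchdog) timeout to get more
# time to attach py-spy/debuggers before the job is torn down. This does not
# fix the underlying hang.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=7200))]
)
```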

@yuleiqin

I am facing the same issue! But I use the Hugging Face transformers Trainer with DeepSpeed ZeRO-3, and the hang can be reproduced with the same random seed. It trains up to step 21 and then no further logs are printed; GPU memory stays full (e.g., 40 GB) with a utilization rate of 100%, but NO DATA EXCHANGE is observed on the RDMA network. So the process is effectively dead at that moment until, a DDP timeout later (e.g., 1800 seconds), the watchdog forces the entire process down.

BTW, I am training a modified version of Mixtral MoE, but I keep the original MoE block.

@xs1997zju

Any update?

@xs1997zju

@dumpmemory Can you share your complete DeepSpeed config?

@xs1997zju

And are there any other requirements for the launch scripts, as well as for model and optimizer initialization?

@dumpmemory
Author

Please check my responses in hiyouga/LLaMA-Factory#1845.

@xs1997zju

Hello @dumpmemory, we were able to do full fine-tuning of the Mixtral MoE model; this seems like an issue with axolotl.

@pacman100 Is there a script for reproducing this?

@ftgreat

ftgreat commented Jan 24, 2024

Training hangs after several steps using ZeRO-3. You can try this patch for similar cases:
hiyouga/LLaMA-Factory#2315

@dumpmemory
Author

Training hangs after several steps using ZeRO-3. You can try this patch for similar cases: hiyouga/LLaMA-Factory#2315

Can you try this: deepspeedai/DeepSpeed#4966?
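
If I understand that PR correctly, it adds a way to mark the sparse MoE block as a ZeRO-3 "leaf" module so its expert parameters are gathered as one unit instead of on demand; a hedged sketch of the usage is below, but please check the PR for the exact API.

```python
# Hedged sketch, assuming the PR exposes `set_z3_leaf_modules` under deepspeed.utils
# (verify the exact import path/signature against the PR). Marking the MoE block as
# a ZeRO-3 leaf avoids ranks waiting on parameter all-gathers that other ranks never
# issue when experts are activated unevenly.
from deepspeed.utils import set_z3_leaf_modules
from transformers import AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
```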
