
NCCL timeout when using axolotl full fine-tuning of Mixtral 8x7B #2256

Closed

dumpmemory opened this issue Dec 14, 2023 · 22 comments
Comments

@dumpmemory

dumpmemory commented Dec 14, 2023

System Info

  • transformers version: 4.36.0
  • Platform: Linux-5.4.119-19.0009.28-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.23.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0a0+b5021ba (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: Yes

Who can help?

@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Use https://github.com/OpenAccess-AI-Collective/axolotl for full fine-tuning of Mixtral 8x7B with DeepSpeed ZeRO-3 offload (a rough config sketch follows below).
  2. Faced an NCCL timeout during backward after about 1.5 hours of training.
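
For reference, a rough sketch of the kind of ZeRO-3 CPU-offload config this refers to (the exact file and values from my run are not included here, so treat these as placeholder assumptions):

```python
# Hedged sketch only: a generic ZeRO-3 CPU-offload DeepSpeed config of the kind
# axolotl is pointed at; the exact file/values used in this run are not shown here.
import json

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("zero3_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```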

Expected behavior

No timeout.

@amyeroberts

Transferring as this appears to be more of an Accelerate-related issue (I think).

@amyeroberts amyeroberts transferred this issue from huggingface/transformers Dec 14, 2023
@muellerzr
Collaborator

What GPU are you using?

@dumpmemory
Author

dumpmemory commented Dec 15, 2023

What GPU are you using?

I am using H800 80 GB with NVLink, and have set NCCL to use NVLink for P2P.
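
Roughly, the relevant environment is set like this before launch (a sketch; the debug flags are just what I would add to narrow down a hang, not confirmed settings from this run):

```python
# Sketch: steer NCCL P2P onto NVLink and turn on logging that helps localize hangs.
# These must be set in the environment before torch.distributed initializes NCCL.
import os

os.environ.setdefault("NCCL_P2P_LEVEL", "NVL")              # use NVLink for peer-to-peer transfers
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # log NCCL initialization and errors
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,COLL")     # focus logs on init and collectives
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # PyTorch-side collective mismatch checks
```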

@dumpmemory
Author

With py-spy, I found it hangs in the backward / autograd function.
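
Something along these lines to capture the stacks (a sketch; the PIDs are placeholders for the hung trainer processes):

```python
# Sketch with placeholder PIDs: use `py-spy dump` to capture each rank's Python
# stack and see which rank is blocked inside backward/autograd.
import subprocess

def dump_ranks(pids):
    """Dump the current stack of each training process (PIDs are placeholders)."""
    for pid in pids:
        print(f"=== process {pid} ===")
        subprocess.run(["py-spy", "dump", "--pid", str(pid)], check=False)

if __name__ == "__main__":
    dump_ranks([11111, 11112])  # replace with the real PIDs of the hung ranks
```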

@dumpmemory
Author

From the hub, it can be seen that QLoRA works, but the FFT (full fine-tuning) setting doesn't.

@pacman100
Contributor

Hello @dumpmemory, we were able to do full fine-tuning of the Mixtral MoE model; this seems like an issue with axolotl.

@pacman100
Contributor

Does it work without offloading?

@dumpmemory
Author

Does it work without offloading?

It can't fit into 80 GB of memory without offloading.
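
Rough back-of-the-envelope numbers (my own estimate, not measurements): with roughly 47B parameters, bf16 weights/gradients plus fp32 master weights and Adam moments come to about 16 bytes of persistent state per parameter, so even fully partitioned across 8 GPUs that is about 93 GB per GPU before activations:

```python
# Back-of-the-envelope estimate with assumed numbers, not measurements.
params = 46.7e9                        # approx. total parameters of Mixtral 8x7B
bytes_per_param = 2 + 2 + 4 + 4 + 4    # bf16 weights + bf16 grads + fp32 master + two fp32 Adam moments
gpus = 8
per_gpu_gb = params * bytes_per_param / gpus / 1e9
print(f"~{per_gpu_gb:.0f} GB of persistent state per GPU under ZeRO-3")  # ≈ 93 GB, over the 80 GB budget
```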

@dumpmemory
Author

dumpmemory commented Dec 15, 2023

Hello @dumpmemory, we were able to do full fine-tuning of the Mixtral MoE model; this seems like an issue with axolotl.

Could you provide your settings? In my case, training hangs at around 1 hour of training time, and
I am using the DeepSpeed CPU AdamW optimizer.

@dumpmemory
Author

After updating NCCL to 2.19.3 with NVIDIA 23.10, it has been running for 7 hours!

@awzhgw

awzhgw commented Jan 14, 2024

After updating NCCL to 2.19.3 with NVIDIA 23.10, it has been running for 7 hours!

I have updated NCCL to 2.19.3. With DeepSpeed zero3_offload it hangs; with zero2_offload it OOMs on the GPU.

@tingxueronghua

After updating NCCL to 2.19.3 with NVIDIA 23.10, it has been running for 7 hours!

I face similar problems... And I noticed that it is not actually stuck at backward: in fact, one process gets stuck during inference and all the other processes wait for it, which then makes it look like it is stuck at backward.

I will try your solution and report the results soon.

@xs1997zju

Is there a script for reproducing this?

@yuleiqin

I am facing the same issue! But I use the Hugging Face transformers Trainer with DeepSpeed ZeRO-3, and the hang can be reproduced with the same random seed. It trains up to step 21 and then no further logs are printed; GPU memory stays full (e.g., 40 GB) with a utilization rate of 100%, but NO DATA EXCHANGE is observed on the RDMA network. So the process is effectively dead at that moment until, a DDP timeout later (e.g., 1800 seconds), the watchdog forces the entire process down.
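
Side note while debugging: the 1800-second watchdog can be lengthened so the job stays up long enough to inspect. With the Trainer/Accelerate stack, something like the sketch below should work (it only buys time and does not fix the hang); if I remember correctly, `TrainingArguments(ddp_timeout=...)` serves the same purpose.

```python
# Hedged sketch: raise the process-group (NCCL watchdog) timeout to get more
# time to attach py-spy/debuggers before the job is torn down. This does not
# fix the underlying hang.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=7200))]
)
```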

@yuleiqin

I am facing the same issue! But I use the Hugging Face transformers Trainer with DeepSpeed ZeRO-3, and the hang can be reproduced with the same random seed. It trains up to step 21 and then no further logs are printed; GPU memory stays full (e.g., 40 GB) with a utilization rate of 100%, but NO DATA EXCHANGE is observed on the RDMA network. So the process is effectively dead at that moment until, a DDP timeout later (e.g., 1800 seconds), the watchdog forces the entire process down.

BTW, I am training a modified version of Mixtral MoE, but I keep the original MoE block.

@xs1997zju

Any update?

@xs1997zju

@dumpmemory Can you share your complete DeepSpeed config?

@xs1997zju

And are there any other requirements for the launch scripts, as well as for model and optimizer initialization?

@dumpmemory
Author

Please check my responses in hiyouga/LLaMA-Factory#1845.

@xs1997zju

Hello @dumpmemory, we were able to do full fine-tuning of the Mixtral MoE model; this seems like an issue with axolotl.

@pacman100 Is there a script for reproducing this?

@ftgreat

ftgreat commented Jan 24, 2024

Training hangs after several steps using ZeRO-3. You can try this patch for similar cases:
hiyouga/LLaMA-Factory#2315

@dumpmemory
Author

Training hangs after several steps using ZeRO-3. You can try this patch for similar cases: hiyouga/LLaMA-Factory#2315

Can you try this: deepspeedai/DeepSpeed#4966?
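
If I understand that PR correctly, it adds a way to mark the sparse MoE block as a ZeRO-3 "leaf" module so its expert parameters are gathered as one unit instead of on demand; a hedged sketch of the usage is below, but please check the PR for the exact API.

```python
# Hedged sketch, assuming the PR exposes `set_z3_leaf_modules` under deepspeed.utils
# (verify the exact import path/signature against the PR). Marking the MoE block as
# a ZeRO-3 leaf avoids ranks waiting on parameter all-gathers that other ranks never
# issue when experts are activated unevenly.
from deepspeed.utils import set_z3_leaf_modules
from transformers import AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
```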
