Fused kernel compilation could get stuck #82
Comments
Same problem, stuck here.
Skip it by setting --no-scaled-masked-softmax-fusion.
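For context on why the build can hang at all, here is a minimal sketch (not Megatron-LM's actual build code; the extension name, source files, and paths are illustrative) of how a fused kernel is typically JIT-compiled with torch.utils.cpp_extension.load. PyTorch serializes concurrent JIT builds through a lock file in the build directory, so a stale lock left by an interrupted run can leave a later call waiting forever; --no-scaled-masked-softmax-fusion sidesteps this by not building the kernel at all.

```python
import os
from torch.utils import cpp_extension

# Illustrative only: JIT-compile a fused softmax extension into a local
# build directory. The extension name and source files are placeholders,
# not the actual Megatron-LM sources.
build_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "build")
os.makedirs(build_dir, exist_ok=True)

scaled_masked_softmax = cpp_extension.load(
    name="scaled_masked_softmax_cuda",
    sources=["scaled_masked_softmax.cpp", "scaled_masked_softmax_cuda.cu"],
    build_directory=build_dir,  # a stale lock file in here can stall the build
    verbose=True,
)
```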
Delete megatron/fused_kernels/build.
Same issue.
Experiencing the same issue here, even if the observed behaviour was different on different nodes of the cluster (not sure whether it was caused by different software stacks or different GPUs). Deleting megatron/fused_kernels/build fixed it for me as well.
We will be addressing this issue soon by moving to the same prebuilt kernels from Apex and not requiring this custom kernel build step. I'll close this issue when that happens.
Marking as stale. No activity in 60 days. |
Got stuck compiling fused_kernels when training on multiple nodes, but it works fine on a single node. Why?
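One plausible explanation, sketched below purely as an illustration (this is the common "rank 0 compiles, everyone else waits" pattern, not necessarily Megatron-LM's exact code): on a single node one successful build unblocks every rank, while on multiple nodes every rank still has to get through the barrier, so a build that stalls, or a barrier that misbehaves across nodes (as a later comment suggests), blocks the whole job.

```python
import torch.distributed as dist
from torch.utils import cpp_extension

def load_fused_kernels(build_dir):
    """Hypothetical helper showing the usual pattern: one rank JIT-compiles
    the extension, all other ranks wait at a barrier before using it."""
    if dist.get_rank() == 0:
        # JIT build as in the sketch above; runs on rank 0 only.
        cpp_extension.load(
            name="scaled_masked_softmax_cuda",
            sources=["scaled_masked_softmax.cpp", "scaled_masked_softmax_cuda.cu"],
            build_directory=build_dir,
        )
    # Every rank blocks here until rank 0 finishes (or forever, if the build
    # stalls on a stale lock or the barrier itself hangs across nodes).
    dist.barrier()
```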
Marking as stale. No activity in 60 days. |
Same here. Did you solve this problem?
+1, same issue here.
Same here.
Actually, it seems to be a problem with the PyTorch barrier; simply setting NCCL_P2P_DISABLE=1 worked for me.
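If it helps anyone trying this, a minimal sketch of applying that workaround from Python; exporting the variable in the launch environment works just as well, as long as it happens before any NCCL communicator is created.

```python
import os
import torch.distributed as dist

# Disable NCCL peer-to-peer transport. This must be set before the process
# group (and therefore any NCCL communicator) is initialized to take effect.
os.environ["NCCL_P2P_DISABLE"] = "1"

# Assumes the usual launcher-provided env vars (MASTER_ADDR, MASTER_PORT,
# RANK, WORLD_SIZE) are already set, e.g. by torchrun.
dist.init_process_group(backend="nccl")
dist.barrier()  # the barrier around the kernel build should now complete
```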
awesome to hear, will try this, thanks! |
I ran into this problem on one of my nodes.
Marking as stale. No activity in 60 days. |
Hi,
I've noticed that the program can get stuck at "using torch.float16 for parameters ...". I found that it was stuck compiling fused_kernels, and deleting megatron/fused_kernels/build seems to fix the problem. I'm not sure what causes this.
I'm posting this in the hope it could be helpful.
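For anyone hitting the same hang, here is a minimal sketch of the clean-up described above; it assumes the JIT build output lives under megatron/fused_kernels/build in your checkout.

```python
import shutil
from pathlib import Path

# Assumed location of the fused-kernel JIT build output; adjust if your
# checkout differs.
build_dir = Path("megatron/fused_kernels/build")

# Removing the directory also removes any stale lock file left behind by an
# interrupted compile, so the next run rebuilds cleanly instead of waiting
# on a lock that will never be released.
shutil.rmtree(build_dir, ignore_errors=True)
```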