Currently, neither Megatron SP nor DeepSpeed SP appears to be correctly implemented in Megatron-DeepSpeed. This may have worked at some point, but as new features were added, conflicts arose between the two: for example, flags that were originally meant to check for Megatron SP now actually check for DeepSpeed SP, and some code paths consequently gather along the wrong dimension (the one DeepSpeed SP shards). Importantly, SP also needs to compose with TP and PP to be useful for large-scale training.
A port of Megatron-LM from 10/23 implements SP successfully, but it lacks some features, such as ones related to MoE, as mentioned in issue #44.
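To make the dimension-confusion failure mode concrete, here is a minimal sketch. It is not taken from the repository; the helper names and the simulated shard layouts are hypothetical. The point is only that Megatron-style SP shards activations of shape `[seq, batch, hidden]` along the sequence dimension (dim 0), while a DeepSpeed-Ulysses-style exchange leaves each rank holding the full sequence but a slice of the hidden/head dimension, so a gather written for one layout concatenates the wrong dimension for the other:

```python
import torch

def megatron_sp_shard(x: torch.Tensor, rank: int, world: int) -> torch.Tensor:
    """Megatron-style SP: split the sequence dimension (dim 0) of [s, b, h]."""
    return x.chunk(world, dim=0)[rank]

def ulysses_sp_shard(x: torch.Tensor, rank: int, world: int) -> torch.Tensor:
    """DeepSpeed-Ulysses-style layout after its all-to-all (simulated here):
    each rank holds the full sequence but only h/world of the hidden dim."""
    return x.chunk(world, dim=-1)[rank]

s, b, h, world = 8, 2, 16, 4
x = torch.randn(s, b, h)

meg_shards = [megatron_sp_shard(x, r, world) for r in range(world)]
uly_shards = [ulysses_sp_shard(x, r, world) for r in range(world)]

# Correct gathers: concatenate along the dimension that was actually split.
assert torch.equal(torch.cat(meg_shards, dim=0), x)
assert torch.equal(torch.cat(uly_shards, dim=-1), x)

# The bug pattern described above: a code path guarded by the wrong SP flag
# gathers Ulysses-style shards as if they were Megatron shards, producing
# [s*world, b, h/world] instead of [s, b, h].
wrong = torch.cat(uly_shards, dim=0)
print(wrong.shape, "!=", x.shape)
```

In a collective setting the same mistake would show up as a mismatched `all_gather` dimension rather than a bad `torch.cat`, which is consistent with the wrong-dimension behavior reported above.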
Hi, the source of the SP hang seems to be related to this commit. With everything else held constant, the commits before it work, but the ones after it hang.