
Sequence parallelism #45

Open
hatanp opened this issue Jul 11, 2024 · 2 comments
@hatanp
Collaborator

hatanp commented Jul 11, 2024

Currently it seems that neither Megatron SP nor DeepSpeed SP is correctly implemented in Megatron-DeepSpeed. This may have worked at some point, but as new features were added the two have come into conflict: for example, flags that were originally meant to check for Megatron SP ended up checking for DeepSpeed SP instead, and some collectives gather along the wrong dimension (the one DeepSpeed SP uses). Importantly, SP also needs to work together with TP and PP to be useful for large-scale training.

A ported Megatron-LM from 10/23 implements SP successfully, but lacks some features, such as the MoE-related ones mentioned in issue #44.
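To illustrate what "gathering along the wrong dimension" can look like, here is a hypothetical sketch, not the actual Megatron-DeepSpeed code: assuming Megatron-style activations of shape [seq, batch, hidden] that are sharded along the sequence dimension within the tensor-parallel group, the re-assembling all-gather has to concatenate over dim 0; concatenating over the hidden dimension instead, as a DeepSpeed-SP-style code path might, produces a tensor with the wrong shape and contents.

```python
# Hypothetical illustration only; function names and layout are assumptions.
import torch
import torch.distributed as dist

def gather_sequence_parallel(x: torch.Tensor, group) -> torch.Tensor:
    """Correct for a Megatron-SP-style layout: rebuild the full sequence along dim 0."""
    world = dist.get_world_size(group=group)
    parts = [torch.empty_like(x) for _ in range(world)]
    dist.all_gather(parts, x.contiguous(), group=group)
    return torch.cat(parts, dim=0)   # sequence dimension

def gather_wrong_dim(x: torch.Tensor, group) -> torch.Tensor:
    """Bug pattern: same collective, but concatenated along the hidden dimension."""
    world = dist.get_world_size(group=group)
    parts = [torch.empty_like(x) for _ in range(world)]
    dist.all_gather(parts, x.contiguous(), group=group)
    return torch.cat(parts, dim=-1)  # wrong: hidden dimension
```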

@Eugene29

Hi, the source of the SP hang seems to be related to this commit. With everything else held constant, the commits before it work, but the ones after it hang.

@hatanp
Collaborator Author

hatanp commented Aug 30, 2024

That is a separate known issue. There is a barrier that currently only tensor-parallel rank 0 joins, which causes the hang; the fix is relatively easy but not yet implemented.
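For illustration, the hang pattern being described would look roughly like the following. This is a minimal sketch with made-up helper names, not the actual Megatron-DeepSpeed code: a collective barrier that only one rank of the group ever enters blocks forever, because the barrier waits for every rank in the group.

```python
# Minimal sketch of the deadlock pattern: a barrier guarded by a rank check.
# `tp_rank` / `tp_group` are placeholders for however the framework exposes
# the tensor-parallel rank and process group.
import torch.distributed as dist

def buggy_sync(tp_rank: int, tp_group) -> None:
    # Only TP rank 0 reaches the barrier, but dist.barrier() waits for
    # every rank in tp_group, so rank 0 blocks forever and training hangs.
    if tp_rank == 0:
        dist.barrier(group=tp_group)

def fixed_sync(tp_rank: int, tp_group) -> None:
    # Every rank in the group must enter the collective for it to complete.
    dist.barrier(group=tp_group)
```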
