Since it seems like my comment in the FSDP PR for NeMo 2.0 got lost amid the code review activity, I opened this issue to raise visibility for the current problems with FSDP integration in NeMo.
To summarize the original comment, FSDP support is really not well-integrated across the code base. Many scripts do not work with it because they only implement NLPDDPStrategy, checkpoint conversion was a huge undertaking to get working (PR pending), and some feature flags straight-up don't work or have bugs.
The original reason for the comment was a desire for better integration in NeMo 2.X, since FSDP actually achieves better performance with default settings compared to 4D parallelism. I would also like to help with this; however, because of past and current experiences with PRs getting ignored, my motivation to fix things in this public code without prior discussion is highly limited. I don't want to have to post redundant comments to even more PRs to prevent them from getting closed as "stale due to no activity".
Thank you for the honest answer! I know it's a separate issue, but I wish you could address the PR reviewing issue as well. What are ways to actually get PRs looked at as a non-NVIDIA employee?