-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FSDP arguments and example script to train model with FSDP-v2 #52
Conversation
Hi @wenchenvincent @gurpreet-dhami @lcskrishna
Please review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryang-amd Thanks for the timely updates. I added some more specific comments that. And I have a general suggestion. When you change the existing train_llama2.sh and train_llama3.sh, please make minimal changes and keep the existing default value for those parameters. For the parameters that you need to override, override it through command line and add the command options in the README.
eed35e2
to
00c73dc
Compare
Hi @wenchenvincent,
Please review when it's convenient for you. Thanks! |
Hi @wenchenvincent,
Please let me know if there are some other modifications needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryang-amd
No need to remove this - DATA_PATH=${DATA_PATH:-"$DATA_DIR/bookcorpus_text_sentence"}
We can always override the data path through arguments. Right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryang-amd
No need to remove this - DATA_PATH=${DATA_PATH:-"$DATA_DIR/bookcorpus_text_sentence"}
We can always override the data path through arguments. Right ?
Yes, changed them back accordingly. |
@wenchenvincent @gurpreet-dhami |
Hey @wenchenvincent @gurpreet-dhami , Please have a look when you have time. |
@ryang-amd , @wenchenvincent : shall we merge train_llama2 and train_llama3 files ? |
Let's not address this issue in this PR. We can consolidate that if possible in another PR. |
Hey @wenchenvincent @gurpreet-dhami,
My colleague @nsakkine found a potential bug when using the tools/preprocess_data.py to download bookcorpus dataset. Basically when using the default value of Megatron-LM/tools/preprocess_data.py Lines 347 to 348 in dea104b
|
@gurpreet-dhami authored these I think. @gurpreet-dhami could you check? |
That line of code is from NV upstream: https://github.com/ROCm/Megatron-LM/blame/dea104b977b34f30b3b4a6d003352601b8721c92/tools/preprocess_data.py#L347-L348 |
…th tp, sq and adam opti
…te instructions in readme
b5a8379
to
f9d9686
Compare
Hi @wenchenvincent @gurpreet-dhami, Small changes have been made according to Wen's latest comments:
For the bookcorpus dataset downloading issue with |
Thanks for addressing those. Could you also address the question whether FSDP and TP could be used together? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the iterations of updates. LGTM.
Hi,
cc @wenchenvincent @gurpreet-dhami
I updated the README file with fsdp-related argument and example commands.
Meanwhile also included the training script for training Llama2 with FSDP-v2 enabled.
Please review and let me know if there needs more explanations.