
Add FSDP arguments and example script to train model with FSDP-v2 #52

Merged Feb 13, 2025 (13 commits)

Conversation

ryang-amd

Hi,
cc @wenchenvincent @gurpreet-dhami

I updated the README with FSDP-related arguments and example commands, and also included a training script for training Llama2 with FSDP-v2 enabled.

Please review and let me know if anything needs more explanation.

ryang-amd (Author)

Hi @wenchenvincent @gurpreet-dhami @lcskrishna
I've updated the existing scripts and README:

  • added FSDP-v2 parameter settings to train_llama2.sh and train_llama3.sh
  • set TOKENIZER_TYPE in the Llama2 training script to select among tokenizer types
  • updated the scripts' config names for the different models

I also tested locally, and both scripts work well.

Please review.
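The on/off switch described above can be sketched as a guarded argument block. This is a minimal sketch, not the PR's actual code: the `FSDP`/`TP` variable names are assumptions, and `--use-torch-fsdp2` / `--use-distributed-optimizer` are the flags recent Megatron-LM versions expose for FSDP-v2 and the distributed optimizer.

```shell
# Hypothetical FSDP-v2 toggle for a train_llama*.sh-style script.
FSDP="${FSDP:-0}"   # 1 enables FSDP-v2 (variable name is an assumption)
TP="${TP:-8}"       # tensor-parallel size

if [ "$FSDP" -eq 1 ]; then
    # FSDP shards parameters itself, so disable tensor parallelism
    # and the distributed optimizer when it is enabled.
    TP=1
    EXTRA_ARGS="--use-torch-fsdp2"
else
    EXTRA_ARGS="--use-distributed-optimizer"
fi

echo "TP=$TP EXTRA_ARGS=$EXTRA_ARGS"
```

Callers would then run `FSDP=1 bash train_llama2.sh` to switch modes without editing the script.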

wenchenvincent (Collaborator) left a comment


@ryang-amd Thanks for the timely updates. I added some more specific comments, and I have one general suggestion: when you change the existing train_llama2.sh and train_llama3.sh, please make minimal changes and keep the existing default values for those parameters. For the parameters you need to override, override them on the command line and document the options in the README.

ryang-amd force-pushed the example_fsdpv2 branch 2 times, most recently from eed35e2 to 00c73dc on February 5, 2025 11:45
ryang-amd (Author)

Hi @wenchenvincent,
Thanks for reviewing and giving me suggestions!
I've made changes according to your instructions.

  • train_llama2.sh and train_llama3.sh now contain only minimal changes compared to the originals.
  • I've also added instructions to the README for configuring FSDP and downloading the Llama3.1 tokenizer.

Please review when it's convenient for you. Thanks!

ryang-amd (Author)

Hi @wenchenvincent,

  • added two new arguments, ROPE_FUSION and DATA_TYPE
  • updated the logic among FSDP, TP, SQ and the optimizer selection when FSDP is off
  • updated the examples of the corresponding arguments in the README

Please let me know if there are some other modifications needed.

gurpreet-dhami (Collaborator) left a comment


@ryang-amd
No need to remove this - DATA_PATH=${DATA_PATH:-"$DATA_DIR/bookcorpus_text_sentence"}

We can always override the data path through arguments, right?
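The default-with-override pattern being discussed here is plain POSIX parameter expansion: `${VAR:-default}` keeps an existing value and falls back to the default otherwise. A minimal sketch (the `/data` fallback for `DATA_DIR` is an assumption for illustration):

```shell
# Keep any value the caller exported; otherwise fall back to the default.
DATA_DIR="${DATA_DIR:-/data}"
DATA_PATH="${DATA_PATH:-$DATA_DIR/bookcorpus_text_sentence}"
echo "$DATA_PATH"
```

Running e.g. `DATA_PATH=/my/wiki_text_sentence bash train_llama2.sh` then overrides the default without editing the script, which is why the line does not need to be removed.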


ryang-amd (Author)

@ryang-amd No need to remove this - DATA_PATH=${DATA_PATH:-"$DATA_DIR/bookcorpus_text_sentence"}

We can always override the data path through arguments, right?

Yes, changed them back accordingly.

ryang-amd (Author)

ryang-amd commented Feb 7, 2025

@wenchenvincent @gurpreet-dhami
May I ask why there are duplicated lines (lines 14-15 and 17-18) in prepare_dataset.sh#L17? I assume it is a typo. I removed them in the latest script.

ryang-amd (Author)

Hey @wenchenvincent @gurpreet-dhami ,
I also updated the dataset preparation code so that users can choose to download either the wiki or the bookcorpus dataset.
The instructions in the README are updated accordingly.

Please have a look when you have time.

gurpreet-dhami (Collaborator)

gurpreet-dhami commented Feb 7, 2025

@ryang-amd, @wenchenvincent: shall we merge the train_llama2 and train_llama3 files?
Is there any difference other than the model parameters?

wenchenvincent (Collaborator)

@ryang-amd, @wenchenvincent: shall we merge the train_llama2 and train_llama3 files? Is there any difference other than the model parameters?

Let's not address this issue in this PR. We can consolidate that if possible in another PR.

ryang-amd (Author)

ryang-amd commented Feb 10, 2025

Hey @wenchenvincent @gurpreet-dhami,
I made two changes:

  1. I've updated one line in the bookcorpus download step because it does not download properly. This modification can be reverted to the original version once the download issue is fixed. (See below.)
  2. Updated the README file.

My colleague @nsakkine found a potential bug when using tools/preprocess_data.py to preprocess the bookcorpus dataset: with the default value of --partitions (partitions=1), the data is not processed correctly (the .bin file comes out empty). The following lines appear to cause the problem. Do you know why they were written this way, or whom we should contact to fix them?

if args.partitions == 1:
return

wenchenvincent (Collaborator)

@wenchenvincent @gurpreet-dhami May I ask why there are duplicated lines (lines 14-15 and 17-18) in prepare_dataset.sh#L17? I assume it is a typo. I removed them in the latest script.

@gurpreet-dhami authored these I think. @gurpreet-dhami could you check?

wenchenvincent (Collaborator)

Hey @wenchenvincent @gurpreet-dhami, I made two changes:

  1. I've updated one line in the bookcorpus download step because it does not download properly. This modification can be reverted to the original version once the download issue is fixed. (See below.)
  2. Updated the README file.

My colleague @nsakkine found a potential bug when using tools/preprocess_data.py to preprocess the bookcorpus dataset: with the default value of --partitions (partitions=1), the data is not processed correctly (the .bin file comes out empty). The following lines appear to cause the problem. Do you know why they were written this way, or whom we should contact to fix them?

if args.partitions == 1:
return

That line of code is from NV upstream: https://github.com/ROCm/Megatron-LM/blame/dea104b977b34f30b3b4a6d003352601b8721c92/tools/preprocess_data.py#L347-L348

ryang-amd (Author)

Hi @wenchenvincent @gurpreet-dhami,

Small changes have been made according to Wen's latest comments:

  • set recompute=0 by default
  • changed DATA_TYPE to MOCK_DATA
  • updated the README accordingly

For the bookcorpus preprocessing issue with --partitions=1, we can keep the updated code where I pass --partitions=2 as an additional argument, so that users can prepare the dataset without errors. We don't know yet why partitions=1 doesn't work; it only works when the line is run twice, which is also why there were duplicated lines in prepare_dataset.sh. (Thanks @gurpreet-dhami for today's discussion!)
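The workaround amounts to a single preprocessing call with a non-default partition count. This is a hedged sketch, not the PR's exact script: only tools/preprocess_data.py and --partitions are confirmed by this thread; the input/output paths, tokenizer type, and --workers value below are assumptions based on Megatron-LM's standard preprocessing options.

```shell
# Hypothetical invocation; --partitions 2 is the change this PR relies on,
# since the default of 1 currently yields an empty .bin file.
python tools/preprocess_data.py \
    --input "$DATA_DIR/bookcorpus_megatron.json" \
    --output-prefix "$DATA_DIR/bookcorpus" \
    --tokenizer-type HuggingFaceTokenizer \
    --workers 4 \
    --partitions 2
```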

wenchenvincent (Collaborator)

Hi @wenchenvincent @gurpreet-dhami,

Small changes have been made according to Wen's latest comments:

  • set recompute=0 by default
  • changed DATA_TYPE to MOCK_DATA
  • updated the README accordingly

For the bookcorpus preprocessing issue with --partitions=1, we can keep the updated code where I pass --partitions=2 as an additional argument, so that users can prepare the dataset without errors. We don't know yet why partitions=1 doesn't work; it only works when the line is run twice, which is also why there were duplicated lines in prepare_dataset.sh. (Thanks @gurpreet-dhami for today's discussion!)

Thanks for addressing those. Could you also address the question of whether FSDP and TP can be used together?

wenchenvincent (Collaborator) left a comment


Thanks for the iterations of updates. LGTM.

lcskrishna merged commit 2fc81b6 into ROCm:rocm_dev on Feb 13, 2025
1 of 3 checks passed
4 participants