
Add FSDP arguments and example script to train model with FSDP-v2 #52

Merged Feb 13, 2025 (13 commits)

Conversation

ryang-amd

Hi,
cc @wenchenvincent @gurpreet-dhami

I updated the README with FSDP-related arguments and example commands, and also included a training script for training Llama2 with FSDP-v2 enabled.

Please review and let me know if anything needs more explanation.

ryang-amd (Author)

Hi @wenchenvincent @gurpreet-dhami @lcskrishna
I've updated the existing scripts and README:

  • added FSDP-v2 parameter settings to train_llama2.sh and train_llama3.sh
  • set TOKENIZER_TYPE in the Llama2 training script to select among tokenizer types
  • updated the scripts' config names for the different models

I also tested locally, and both scripts work well.

Please review.
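The on/off switch described above can be sketched as a guarded argument block. This is a minimal sketch, not the PR's actual code: the `FSDP`/`TP` variable names are assumptions, and `--use-torch-fsdp2` / `--use-distributed-optimizer` are the flags recent Megatron-LM versions expose for FSDP-v2 and the distributed optimizer.

```shell
# Hypothetical FSDP-v2 toggle for a train_llama*.sh-style script.
FSDP="${FSDP:-0}"   # 1 enables FSDP-v2 (variable name is an assumption)
TP="${TP:-8}"       # tensor-parallel size

if [ "$FSDP" -eq 1 ]; then
    # FSDP shards parameters itself, so disable tensor parallelism
    # and the distributed optimizer when it is enabled.
    TP=1
    EXTRA_ARGS="--use-torch-fsdp2"
else
    EXTRA_ARGS="--use-distributed-optimizer"
fi

echo "TP=$TP EXTRA_ARGS=$EXTRA_ARGS"
```

Callers would then run `FSDP=1 bash train_llama2.sh` to switch modes without editing the script.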

wenchenvincent (Collaborator) left a comment


@ryang-amd Thanks for the timely updates. I added some more specific comments, and I have one general suggestion: when you change the existing train_llama2.sh and train_llama3.sh, please make minimal changes and keep the existing default values for those parameters. For the parameters you need to override, override them on the command line and document the options in the README.

ryang-amd force-pushed the example_fsdpv2 branch 2 times, most recently from eed35e2 to 00c73dc on February 5, 2025 11:45
ryang-amd (Author)

Hi @wenchenvincent,
Thanks for reviewing and giving me suggestions!
I've made changes according to your instructions.

  • train_llama2.sh and train_llama3.sh now contain only minimal changes compared to the originals.
  • I've also added instructions to the README for configuring FSDP and downloading the Llama3.1 tokenizer.

Please review when it's convenient for you. Thanks!

ryang-amd (Author)

Hi @wenchenvincent,

  • added two new arguments, ROPE_FUSION and DATA_TYPE
  • updated the logic among FSDP, TP, SQ and the optimizer selection when FSDP is off
  • updated the examples of the corresponding arguments in the README

Please let me know if there are some other modifications needed.

gurpreet-dhami (Collaborator) left a comment


@ryang-amd
No need to remove this - DATA_PATH=${DATA_PATH:-"$DATA_DIR/bookcorpus_text_sentence"}

We can always override the data path through arguments, right?
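The default-with-override pattern being discussed here is plain POSIX parameter expansion: `${VAR:-default}` keeps an existing value and falls back to the default otherwise. A minimal sketch (the `/data` fallback for `DATA_DIR` is an assumption for illustration):

```shell
# Keep any value the caller exported; otherwise fall back to the default.
DATA_DIR="${DATA_DIR:-/data}"
DATA_PATH="${DATA_PATH:-$DATA_DIR/bookcorpus_text_sentence}"
echo "$DATA_PATH"
```

Running e.g. `DATA_PATH=/my/wiki_text_sentence bash train_llama2.sh` then overrides the default without editing the script, which is why the line does not need to be removed.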


ryang-amd (Author)

@ryang-amd No need to remove this - DATA_PATH=${DATA_PATH:-"$DATA_DIR/bookcorpus_text_sentence"}

We can always override the data path through arguments, right?

Yes, changed them back accordingly.

ryang-amd (Author)

ryang-amd commented Feb 7, 2025

@wenchenvincent @gurpreet-dhami
May I ask why there are duplicated lines (lines 14-15 and 17-18) in prepare_dataset.sh#L17? I assume it is a typo. I removed them in the latest script.

ryang-amd (Author)

Hey @wenchenvincent @gurpreet-dhami ,
I also updated the dataset preparation code so that users can choose to download either the wiki or the bookcorpus dataset.
The instructions in the README are updated accordingly.

Please have a look when you have time.

gurpreet-dhami (Collaborator)

gurpreet-dhami commented Feb 7, 2025

@ryang-amd, @wenchenvincent: shall we merge the train_llama2 and train_llama3 files?
Is there any difference other than the model parameters?

wenchenvincent (Collaborator)

@ryang-amd, @wenchenvincent: shall we merge the train_llama2 and train_llama3 files? Is there any difference other than the model parameters?

Let's not address this issue in this PR. We can consolidate that if possible in another PR.

ryang-amd (Author)

ryang-amd commented Feb 10, 2025

Hey @wenchenvincent @gurpreet-dhami,
I made two changes:

  1. I've updated one line in the bookcorpus download step because it does not download properly. This modification can be reverted to the original version once the download issue is fixed. (See below.)
  2. Updated the README file.

My colleague @nsakkine found a potential bug when using tools/preprocess_data.py to preprocess the bookcorpus dataset: with the default value of --partitions (partitions=1), the data is not processed correctly (the .bin file comes out empty). The following lines appear to cause the problem. Do you know why they were written this way, or whom we should contact to fix them?

if args.partitions == 1:
return

wenchenvincent (Collaborator)

@wenchenvincent @gurpreet-dhami May I ask why there are duplicated lines (lines 14-15 and 17-18) in prepare_dataset.sh#L17? I assume it is a typo. I removed them in the latest script.

@gurpreet-dhami authored these I think. @gurpreet-dhami could you check?

wenchenvincent (Collaborator)

Hey @wenchenvincent @gurpreet-dhami, I made two changes:

  1. I've updated one line in the bookcorpus download step because it does not download properly. This modification can be reverted to the original version once the download issue is fixed. (See below.)
  2. Updated the README file.

My colleague @nsakkine found a potential bug when using tools/preprocess_data.py to preprocess the bookcorpus dataset: with the default value of --partitions (partitions=1), the data is not processed correctly (the .bin file comes out empty). The following lines appear to cause the problem. Do you know why they were written this way, or whom we should contact to fix them?

if args.partitions == 1:
return

That line of code is from NV upstream: https://github.com/ROCm/Megatron-LM/blame/dea104b977b34f30b3b4a6d003352601b8721c92/tools/preprocess_data.py#L347-L348

ryang-amd (Author)

Hi @wenchenvincent @gurpreet-dhami,

Small changes have been made according to Wen's latest comments:

  • set recompute=0 by default
  • changed DATA_TYPE to MOCK_DATA
  • updated the README accordingly

For the bookcorpus preprocessing issue with --partitions=1, we can keep the updated code where I pass --partitions=2 as an additional argument, so that users can prepare the dataset without errors. We don't know yet why partitions=1 doesn't work; it only works when the line is run twice, which is also why there were duplicated lines in prepare_dataset.sh. (Thanks @gurpreet-dhami for today's discussion!)
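The workaround amounts to a single preprocessing call with a non-default partition count. This is a hedged sketch, not the PR's exact script: only tools/preprocess_data.py and --partitions are confirmed by this thread; the input/output paths, tokenizer type, and --workers value below are assumptions based on Megatron-LM's standard preprocessing options.

```shell
# Hypothetical invocation; --partitions 2 is the change this PR relies on,
# since the default of 1 currently yields an empty .bin file.
python tools/preprocess_data.py \
    --input "$DATA_DIR/bookcorpus_megatron.json" \
    --output-prefix "$DATA_DIR/bookcorpus" \
    --tokenizer-type HuggingFaceTokenizer \
    --workers 4 \
    --partitions 2
```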

wenchenvincent (Collaborator)

Hi @wenchenvincent @gurpreet-dhami,

Small changes have been made according to Wen's latest comments:

  • set recompute=0 by default
  • changed DATA_TYPE to MOCK_DATA
  • updated the README accordingly

For the bookcorpus preprocessing issue with --partitions=1, we can keep the updated code where I pass --partitions=2 as an additional argument, so that users can prepare the dataset without errors. We don't know yet why partitions=1 doesn't work; it only works when the line is run twice, which is also why there were duplicated lines in prepare_dataset.sh. (Thanks @gurpreet-dhami for today's discussion!)

Thanks for addressing those. Could you also address the question of whether FSDP and TP can be used together?

wenchenvincent (Collaborator) left a comment


Thanks for the iterations of updates. LGTM.

lcskrishna merged commit 2fc81b6 into ROCm:rocm_dev on Feb 13, 2025
1 of 3 checks passed
4 participants