modify bookcorpus dataset downloading code (a suspected bug) and update readme
ryang-amd committed Feb 10, 2025
1 parent 32ee129 commit b5a8379
Showing 2 changed files with 11 additions and 3 deletions.
2 changes: 1 addition & 1 deletion examples/llama/prepare_dataset.sh
@@ -37,7 +37,7 @@ if [ "$DATASET" == "bookcorpus" ]; then
echo "Downloading bookcorpus dataset to ${DATASET_PATH}..."
python3 examples/llama/prepare_bookcorpus_megatron_dataset.py --out-dir ${DATASET_PATH}
python3 tools/preprocess_data.py --input ${DATASET_PATH}/bookcorpus_megatron.json --tokenizer-type GPTSentencePieceTokenizer \
- --tokenizer-model ${TOKENIZER_MODEL} --output-prefix ${DATASET_PATH}/bookcorpus --workers `nproc` --split-sentences
+ --tokenizer-model ${TOKENIZER_MODEL} --output-prefix ${DATASET_PATH}/bookcorpus --workers `nproc` --split-sentences --partitions 2
fi

echo "Finishing data preparation!"
12 changes: 10 additions & 2 deletions examples/llama/readme.md
@@ -90,10 +90,18 @@ You can use either mock data or real data for training.
DATASET=bookcorpus bash examples/llama/prepare_dataset.sh #for bookcorpus dataset
```

+ Then you could launch training using the following commands:
+ ```bash
+ TEE_OUTPUT=1 MBS=1 BS=8 TP=8 TE_FP8=0 FSDP=1 SEQ_LENGTH=8192 TOKENIZER_TYPE=Llama2Tokenizer DATA_DIR=./tmp/data/bookcorpus bash examples/llama/train_llama2.sh #for downloaded bookcorpus dataset
+
+ TEE_OUTPUT=1 MBS=1 BS=8 TP=8 TE_FP8=0 FSDP=1 SEQ_LENGTH=8192 TOKENIZER_TYPE=Llama2Tokenizer DATA_DIR=./tmp/data/wiki DATA_PATH=./tmp/data/wiki/wikipedia_20220301.en.train.jsonl_text_document bash examples/llama/train_llama2.sh #for downloaded wikipedia dataset
+
+ ```
+
- **Note:**
- If using `Wikipedia-en` data for training Megatron-LM, in the training script, you need to set data path to specific file name that is pointing to `.bin` or `.idx` file, for example:
+ When training Megatron-LM, in the training script, you need to set data path to the specific file name that is pointing to `.bin` or `.idx` file, for example:
```bash
- DATA_PATH=${DATA_DIR}/wikipedia_20220301.en/wikipedia_20220301.en.train.jsonl_text_document
+ DATA_PATH=${DATA_DIR}/wikipedia_20220301.en.train.jsonl_text_document
```
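
Editorial aside, not part of this commit: the note in this hunk can be checked before launching training. As the example path suggests, `DATA_PATH` is given without an extension and serves as the common prefix of a `.bin`/`.idx` pair. The sketch below assumes that layout; `check_data_prefix` is a hypothetical helper name and does not exist in the repository.

```bash
#!/usr/bin/env bash
# Sketch only: check_data_prefix is a hypothetical helper, not repository code.
# Assumption: a dataset prefix is usable only if both <prefix>.bin and
# <prefix>.idx exist as regular files.
check_data_prefix() {
    local prefix="$1"
    local ext
    for ext in bin idx; do
        if [ ! -f "${prefix}.${ext}" ]; then
            # Report the first missing file on stderr and signal failure.
            echo "missing ${prefix}.${ext}" >&2
            return 1
        fi
    done
    return 0
}

# Example use before launching training (paths taken from the readme note):
#   DATA_PATH=${DATA_DIR}/wikipedia_20220301.en.train.jsonl_text_document
#   check_data_prefix "${DATA_PATH}" || exit 1
```

Failing early on a missing file gives a clearer message than a failure later inside the training script.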

### 3.3 Tokenizer