modify bookcorpus dataset downloading code (a suspected bug) and update readme
ROCm · Feb 10, 2025 · b5a8379
1 parent 32ee129

Showing 2 changed files with 11 additions and 3 deletions.
diff --git a/examples/llama/prepare_dataset.sh b/examples/llama/prepare_dataset.sh
@@ -37,7 +37,7 @@ if [ "$DATASET" == "bookcorpus" ]; then
     echo "Downloading bookcorpus dataset to ${DATASET_PATH}..."
     python3 examples/llama/prepare_bookcorpus_megatron_dataset.py --out-dir ${DATASET_PATH}
     python3 tools/preprocess_data.py --input ${DATASET_PATH}/bookcorpus_megatron.json  --tokenizer-type GPTSentencePieceTokenizer \
-    --tokenizer-model ${TOKENIZER_MODEL} --output-prefix ${DATASET_PATH}/bookcorpus --workers `nproc` --split-sentences
+    --tokenizer-model ${TOKENIZER_MODEL} --output-prefix ${DATASET_PATH}/bookcorpus --workers `nproc` --split-sentences --partitions 2
 fi
 
 echo "Finishing data preparation!"
diff --git a/examples/llama/readme.md b/examples/llama/readme.md
@@ -90,10 +90,18 @@ You can use either mock data or real data for training.
   DATASET=bookcorpus bash examples/llama/prepare_dataset.sh #for bookcorpus dataset
   ```
 
+  Then you could launch training using the following commands:
+  ```bash
+  TEE_OUTPUT=1 MBS=1 BS=8 TP=8 TE_FP8=0 FSDP=1 SEQ_LENGTH=8192 TOKENIZER_TYPE=Llama2Tokenizer DATA_DIR=./tmp/data/bookcorpus bash examples/llama/train_llama2.sh #for downloaded bookcorpus dataset
+
+  TEE_OUTPUT=1 MBS=1 BS=8 TP=8 TE_FP8=0 FSDP=1 SEQ_LENGTH=8192 TOKENIZER_TYPE=Llama2Tokenizer DATA_DIR=./tmp/data/wiki DATA_PATH=./tmp/data/wiki/wikipedia_20220301.en.train.jsonl_text_document bash examples/llama/train_llama2.sh #for downloaded wikipedia dataset
+
+  ```
+
 - **Note:**
-  If using `Wikipedia-en` data for training Megatron-LM, in the training script, you need to set data path to specific file name that is pointing to `.bin` or `.idx` file, for example:
+  When training Megatron-LM, in the training script, you need to set data path to the specific file name that is pointing to `.bin` or `.idx` file, for example:
   ```bash
-  DATA_PATH=${DATA_DIR}/wikipedia_20220301.en/wikipedia_20220301.en.train.jsonl_text_document
+  DATA_PATH=${DATA_DIR}/wikipedia_20220301.en.train.jsonl_text_document
   ``` 
 
 ### 3.3 Tokenizer
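
Review note: the readme change above sets `DATA_PATH` to a file prefix with no extension, with the matching `.bin` and `.idx` files expected to sit next to each other under that prefix. A minimal sketch of a pre-flight check follows, assuming that layout. The helper name `check_data_prefix` is hypothetical and is not part of this commit or of Megatron-LM.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this commit): confirm that a data prefix
# has both a .bin and a .idx file before training is launched.
check_data_prefix() {
    local prefix="$1"
    local ext
    for ext in bin idx; do
        if [ ! -f "${prefix}.${ext}" ]; then
            # Report the first missing file and signal failure to the caller.
            echo "missing: ${prefix}.${ext}" >&2
            return 1
        fi
    done
    echo "ok: ${prefix}"
}
```

It could be called just before the training command, for example `check_data_prefix "${DATA_DIR}/wikipedia_20220301.en.train.jsonl_text_document" || exit 1`, so that a wrong prefix fails early instead of partway through startup.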