# Llama2/Llama3 Model Pretraining Instructions

This guide will walk you through setting up and running pretraining for Llama2 and Llama3 models using Docker, including single-node and multi-node setups.

## 1. Environment Setup

### Download Docker Image
Download the Docker image required for Llama2/Llama3 pretraining:

```bash
docker pull <image-name>
```

### Launch Docker Container
Launch the Docker container using the following command:

```bash
docker run --gpus all --rm -it <image-name> bash
```
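
If you plan to train on real data, it is convenient to mount the dataset directory into the container at launch time. The sketch below is an assumption, not part of the original instructions: the host path `/data/bookcorpus` is a placeholder, and the `--ipc=host`/`--shm-size` flags are commonly needed for PyTorch dataloader workers but are not taken from the original command.

```bash
# Hypothetical variant: mount a host dataset directory at the path the
# training script expects (see the dataset configuration below).
# The host path and the --ipc/--shm-size flags are assumptions.
docker run --gpus all --rm -it \
  --ipc=host --shm-size=16g \
  -v /data/bookcorpus:/root/.cache/data \
  <image-name> bash
```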

## 2. Configurations in the Training Script

The training script is located in the `Megatron/examples/llama` directory. Below are the key configurations you need to adjust for your system.

### Network Interface
Update the network interface configuration to match your system's settings.

- First, run `ip a` to identify your network interface.
- Then, set the environment variables as follows:

```bash
export NCCL_SOCKET_IFNAME=ens50f0np0
export GLOO_SOCKET_IFNAME=ens50f0np0
```

Replace `ens50f0np0` with your system's actual network interface.
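
If you are not sure which interface to use, the interface that carries the default route is usually the right choice on a single-NIC machine. The snippet below is a generic shell sketch, not part of the original script:

```bash
# Detect the interface used for the default route; on multi-NIC machines,
# pick the fabric you actually want NCCL/Gloo traffic to use instead.
NET_IF=$(ip -o -4 route show to default | awk '{print $5}')
export NCCL_SOCKET_IFNAME=$NET_IF
export GLOO_SOCKET_IFNAME=$NET_IF
```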

### Dataset Configuration
You can choose between mock data and real data for training.

#### For Mock Data:
Replace the `--data-path $DATA_PATH` argument with `--mock-data` in the script:

```bash
--mock-data
```

#### For Real Data:
Set the correct path to the dataset by updating the following variables in the script:

```bash
DATA_DIR="/root/.cache/data"  # change to where your dataset is stored
DATA_PATH=${DATA_DIR}/bookcorpus_text_sentence
```
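
The `bookcorpus_text_sentence` prefix follows the naming produced by Megatron's `tools/preprocess_data.py`. If you still need to build the binary dataset yourself, the sketch below shows the general shape of that step; the input file name, tokenizer choice, and flag values are assumptions, and the exact output suffix depends on the preprocessing flags, so adapt it to your data.

```bash
# Hypothetical preprocessing step, run from the Megatron repository root.
# Adjust the input path, tokenizer, and worker count to your setup.
python tools/preprocess_data.py \
  --input /root/.cache/data/bookcorpus.json \
  --output-prefix /root/.cache/data/bookcorpus \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-model meta-llama/Llama-3.1-8B \
  --workers 8 \
  --append-eod
```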

### Tokenizer
- For **Llama2 training**, use `Llama2Tokenizer`.
- For **Llama3 training**, use `HuggingFaceTokenizer`.

For Llama3, set `TOKENIZER_MODEL` to the Hugging Face model ID, as shown below:

```bash
TOKENIZER_MODEL=meta-llama/Llama-3.1-8B  # for Llama3
```
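
For Llama2, `Llama2Tokenizer` expects a local SentencePiece `tokenizer.model` file rather than a Hugging Face model ID. Whether the script also takes this path via `TOKENIZER_MODEL` is an assumption here, and the path itself is a placeholder:

```bash
TOKENIZER_MODEL=/root/models/llama2-7b/tokenizer.model  # for Llama2 (placeholder path; assumed variable)
```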

### Multi-node Training Configuration
If you are running multi-node training, set the following parameters (an example of per-node values follows the code block):

- `MASTER_ADDR`: Change `localhost` to the master node's hostname.
- `NNODES`: Set the number of nodes you want to train on (e.g., 2, 4, 8).
- `NODE_RANK`: Set the rank of each node (from 0 to NNODES-1).

```bash
MASTER_ADDR="${MASTER_ADDR:-localhost}"
NNODES="${NNODES:-1}"
NODE_RANK="${NODE_RANK:-0}"
```
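
For example, a 2-node run could export the following values before launching the script on each node (`node0` is a placeholder for the master node's hostname):

```bash
# On the master node (rank 0):
export MASTER_ADDR=node0   # placeholder master hostname
export NNODES=2
export NODE_RANK=0

# On the second node (rank 1):
export MASTER_ADDR=node0   # same master hostname on every node
export NNODES=2
export NODE_RANK=1
```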

## 3. How to Run

### Single-Node Training
To run training on a single node, execute the following command:

```bash
TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh
```
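
The knobs in this command are passed to `train_llama2.sh` as environment variables. The annotated form below spells them out; the interpretations follow the usual Megatron naming conventions and are assumptions, not taken from the script itself:

```bash
# Equivalent to the one-line command above, with each knob annotated.
export TEE_OUTPUT=1        # also stream the training log to the terminal
export MBS=5               # micro-batch size per GPU
export BS=120              # global batch size
export TP=8                # tensor-parallel degree
export TE_FP8=0            # 0 = BF16, 1 = FP8 (see Section 4)
export NO_TORCH_COMPILE=1  # disable torch.compile
export SEQ_LENGTH=4096     # sequence length in tokens
bash train_llama2.sh
```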

Sample output:

![Single Node Output](image.png)

### Multi-node Training
For multi-node training, launch the same Docker container on each node (2, 4, etc.). Then run the training script inside the container on each node, starting with the master node and then the slave nodes:

1. **Master Node**:
   ```bash
   TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh
   ```

2. **Slave Node**:
   ```bash
   TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh
   ```
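
Since the script initializes `MASTER_ADDR`, `NNODES`, and `NODE_RANK` from the environment (see the defaults in Section 2), you can also pass them on the command line instead of editing the script on every node. This is a sketch for a 2-node run, with `node0` as a placeholder master hostname:

```bash
# On the master node (rank 0):
MASTER_ADDR=node0 NNODES=2 NODE_RANK=0 \
TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh

# On the second node (rank 1):
MASTER_ADDR=node0 NNODES=2 NODE_RANK=1 \
TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh
```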

Sample output for a 2-node setup:

- **Master Node**:
  ![Master Node Output](image-1.png)

- **Slave Node**:
  ![Slave Node Output](image-3.png)

## 4. Key Variables to Pay Attention To

Here are some important variables to configure during training; a combined example follows the list.

- `TE_FP8`: `0` - use BF16, `1` - use FP8.
- `GEMM_TUNING`: `1` - enable GEMM tuning, which boosts performance by selecting the best GEMM kernels.
- `USE_FLASH_ATTN`: `1` - enable Flash Attention for faster attention computation.
- `ENABLE_PROFILING`: `1` - enable PyTorch profiling for performance analysis.
- `transformer-impl`: Set to `transformer_engine` to use the Transformer Engine (TE), or `local` to disable TE.
- `MODEL_SIZE`: `7B` or `70B` for Llama2 models; `8B` or `70B` for Llama3/3.1 models.
- `TOTAL_ITERS`: Total number of training iterations (e.g., `10`).
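
Putting several of these knobs together, a single-node Llama2 7B smoke test with FP8, GEMM tuning, and Flash Attention enabled might look like the sketch below. The combination is illustrative only, not a tuned recipe:

```bash
# Illustrative combination of the variables above; adjust batch sizes and
# parallelism to your hardware.
TEE_OUTPUT=1 GEMM_TUNING=1 USE_FLASH_ATTN=1 TE_FP8=1 \
MODEL_SIZE=7B TOTAL_ITERS=10 \
MBS=4 BS=64 TP=8 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh
```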

---

### Notes

- Make sure that all the required Docker and hardware configurations (e.g., GPU setup, NCCL) are properly set up before starting the training process.
- Monitor resource utilization closely when training across multiple nodes to ensure optimal performance.

---

### Conclusion

You should now be ready to pretrain Llama2 or Llama3 models either on a single node or across multiple nodes. Make sure to carefully configure your environment and script settings to match your hardware and dataset.