# Llama2/Llama3 Model Pretraining Instructions

This guide will walk you through setting up and running pretraining for Llama2 and Llama3 models using Docker, including single-node and multi-node setups.

## 1. Environment Setup

### Download Docker Image
Download the Docker image required for Llama2/Llama3 pretraining:

```bash
docker pull <image-name>
```

### Launch Docker Container
Launch the Docker container using the following command:

```bash
docker run --gpus all --rm -it <image-name> bash
```
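
If you plan to train on real data, it is convenient to mount the dataset directory into the container at launch time. The sketch below is an assumption, not part of the original instructions: the host path `/data/bookcorpus` is a placeholder, and the `--ipc=host`/`--shm-size` flags are commonly needed for PyTorch dataloader workers but are not taken from the original command.

```bash
# Hypothetical variant: mount a host dataset directory at the path the
# training script expects (see the dataset configuration below).
# The host path and the --ipc/--shm-size flags are assumptions.
docker run --gpus all --rm -it \
  --ipc=host --shm-size=16g \
  -v /data/bookcorpus:/root/.cache/data \
  <image-name> bash
```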

## 2. Configurations in the Training Script

The training script is located in the `Megatron/examples/llama` directory. Below are the key configurations you need to adjust for your system.

### Network Interface
Update the network interface configuration to match your system's settings.

- First, run `ip a` to identify your network interface.
- Then, set the environment variables as follows:

```bash
export NCCL_SOCKET_IFNAME=ens50f0np0
export GLOO_SOCKET_IFNAME=ens50f0np0
```

Replace `ens50f0np0` with your system's actual network interface.
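
If you are not sure which interface to use, the interface that carries the default route is usually the right choice on a single-NIC machine. The snippet below is a generic shell sketch, not part of the original script:

```bash
# Detect the interface used for the default route; on multi-NIC machines,
# pick the fabric you actually want NCCL/Gloo traffic to use instead.
NET_IF=$(ip -o -4 route show to default | awk '{print $5}')
export NCCL_SOCKET_IFNAME=$NET_IF
export GLOO_SOCKET_IFNAME=$NET_IF
```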

### Dataset Configuration
You can choose between mock data and real data for training.

#### For Mock Data:
Replace the `--data-path $DATA_PATH` argument with `--mock-data` in the script:

```bash
--mock-data
```

#### For Real Data:
Set the correct path to the dataset by updating the following variables in the script:

```bash
DATA_DIR="/root/.cache/data"  # change to where your dataset is stored
DATA_PATH=${DATA_DIR}/bookcorpus_text_sentence
```
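
The `bookcorpus_text_sentence` prefix follows the naming produced by Megatron's `tools/preprocess_data.py`. If you still need to build the binary dataset yourself, the sketch below shows the general shape of that step; the input file name, tokenizer choice, and flag values are assumptions, and the exact output suffix depends on the preprocessing flags, so adapt it to your data.

```bash
# Hypothetical preprocessing step, run from the Megatron repository root.
# Adjust the input path, tokenizer, and worker count to your setup.
python tools/preprocess_data.py \
  --input /root/.cache/data/bookcorpus.json \
  --output-prefix /root/.cache/data/bookcorpus \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-model meta-llama/Llama-3.1-8B \
  --workers 8 \
  --append-eod
```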

### Tokenizer
- For **Llama2 training**, use `Llama2Tokenizer`.
- For **Llama3 training**, use `HuggingFaceTokenizer`.

For Llama3, set `TOKENIZER_MODEL` to the Hugging Face model ID, as shown below:

```bash
TOKENIZER_MODEL=meta-llama/Llama-3.1-8B  # for Llama3
```
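
For Llama2, `Llama2Tokenizer` expects a local SentencePiece `tokenizer.model` file rather than a Hugging Face model ID. Whether the script also takes this path via `TOKENIZER_MODEL` is an assumption here, and the path itself is a placeholder:

```bash
TOKENIZER_MODEL=/root/models/llama2-7b/tokenizer.model  # for Llama2 (placeholder path; assumed variable)
```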

### Multi-node Training Configuration
If you are running multi-node training, set the following parameters (an example of per-node values follows the code block):

- `MASTER_ADDR`: Change `localhost` to the master node's hostname.
- `NNODES`: Set the number of nodes you want to train on (e.g., 2, 4, 8).
- `NODE_RANK`: Set the rank of each node (from 0 to NNODES-1).

```bash
MASTER_ADDR="${MASTER_ADDR:-localhost}"
NNODES="${NNODES:-1}"
NODE_RANK="${NODE_RANK:-0}"
```
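
For example, a 2-node run could export the following values before launching the script on each node (`node0` is a placeholder for the master node's hostname):

```bash
# On the master node (rank 0):
export MASTER_ADDR=node0   # placeholder master hostname
export NNODES=2
export NODE_RANK=0

# On the second node (rank 1):
export MASTER_ADDR=node0   # same master hostname on every node
export NNODES=2
export NODE_RANK=1
```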

## 3. How to Run

### Single-Node Training
To run training on a single node, execute the following command:

```bash
TEE_OUTPUT=1 MBS=5 BS=120 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh
```
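
The knobs in this command are passed to `train_llama2.sh` as environment variables. The annotated form below spells them out; the interpretations follow the usual Megatron naming conventions and are assumptions, not taken from the script itself:

```bash
# Equivalent to the one-line command above, with each knob annotated.
export TEE_OUTPUT=1        # also stream the training log to the terminal
export MBS=5               # micro-batch size per GPU
export BS=120              # global batch size
export TP=8                # tensor-parallel degree
export TE_FP8=0            # 0 = BF16, 1 = FP8 (see Section 4)
export NO_TORCH_COMPILE=1  # disable torch.compile
export SEQ_LENGTH=4096     # sequence length in tokens
bash train_llama2.sh
```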

Sample output:

![Single Node Output](image.png)

### Multi-node Training
For multi-node training, launch the same Docker container on each node (2, 4, etc.). Then run the training script inside the container on each node, starting with the master node and then the slave nodes:

1. **Master Node**:
   ```bash
   TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh
   ```

2. **Slave Node**:
   ```bash
   TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh
   ```
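
Since the script initializes `MASTER_ADDR`, `NNODES`, and `NODE_RANK` from the environment (see the defaults in Section 2), you can also pass them on the command line instead of editing the script on every node. This is a sketch for a 2-node run, with `node0` as a placeholder master hostname:

```bash
# On the master node (rank 0):
MASTER_ADDR=node0 NNODES=2 NODE_RANK=0 \
TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh

# On the second node (rank 1):
MASTER_ADDR=node0 NNODES=2 NODE_RANK=1 \
TEE_OUTPUT=1 MBS=4 BS=64 TP=8 TE_FP8=0 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh
```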

Sample output for a 2-node setup:

- **Master Node**:
  ![Master Node Output](image-1.png)

- **Slave Node**:
  ![Slave Node Output](image-3.png)

## 4. Key Variables to Pay Attention To

Here are some important variables to configure during training; a combined example follows the list.

- `TE_FP8`: `0` - use BF16, `1` - use FP8.
- `GEMM_TUNING`: `1` - enable GEMM tuning, which boosts performance by selecting the best GEMM kernels.
- `USE_FLASH_ATTN`: `1` - enable Flash Attention for faster attention computation.
- `ENABLE_PROFILING`: `1` - enable PyTorch profiling for performance analysis.
- `transformer-impl`: Set to `transformer_engine` to use the Transformer Engine (TE), or `local` to disable TE.
- `MODEL_SIZE`: `7B` or `70B` for Llama2 models; `8B` or `70B` for Llama3/3.1 models.
- `TOTAL_ITERS`: Total number of training iterations (e.g., `10`).
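
Putting several of these knobs together, a single-node Llama2 7B smoke test with FP8, GEMM tuning, and Flash Attention enabled might look like the sketch below. The combination is illustrative only, not a tuned recipe:

```bash
# Illustrative combination of the variables above; adjust batch sizes and
# parallelism to your hardware.
TEE_OUTPUT=1 GEMM_TUNING=1 USE_FLASH_ATTN=1 TE_FP8=1 \
MODEL_SIZE=7B TOTAL_ITERS=10 \
MBS=4 BS=64 TP=8 NO_TORCH_COMPILE=1 SEQ_LENGTH=4096 bash train_llama2.sh
```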

---

### Notes

- Make sure that all the required Docker and hardware configurations (e.g., GPU setup, NCCL) are properly set up before starting the training process.
- Monitor resource utilization closely when training across multiple nodes to ensure optimal performance.

---

### Conclusion

You should now be ready to pretrain Llama2 or Llama3 models either on a single node or across multiple nodes. Make sure to carefully configure your environment and script settings to match your hardware and dataset.