pt_scripts_en
- This code is only applicable to a specific PEFT version. Please install PEFT from source at commit id 13e53fc, available here. We cannot guarantee that the model will train normally with other versions of PEFT.
- Make sure to pull the latest version of the repository before running: git pull
Training script: scripts/training/run_clm_pt_with_peft.py
Go to the scripts/training directory of the project and run bash run_pt.sh to start pre-training. A single GPU is used by default. Before running, modify the script and specify the relevant parameters; the parameter values in the script are for debugging reference only. The content of run_pt.sh is as follows:
########Parameter settings########
lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05
pretrained_model=path/to/hf/llama-2/dir
chinese_tokenizer_path=path/to/chinese/llama-2/tokenizer/dir
dataset_dir=path/to/pt/data/dir
data_cache=temp_data_cache_dir
per_device_train_batch_size=1
training_steps=100
gradient_accumulation_steps=1
output_dir=output_dir
block_size=512
deepspeed_config_file=ds_zero2_no_offload.json
########Launch command########
torchrun --nnodes 1 --nproc_per_node 1 run_clm_pt_with_peft.py \
--deepspeed ${deepspeed_config_file} \
--model_name_or_path ${pretrained_model} \
--tokenizer_name_or_path ${chinese_tokenizer_path} \
--dataset_dir ${dataset_dir} \
--data_cache_dir ${data_cache} \
--per_device_train_batch_size ${per_device_train_batch_size} \
--do_train \
--seed $RANDOM \
--fp16 \
--max_steps ${training_steps} \
--lr_scheduler_type cosine \
--learning_rate ${lr} \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--save_strategy steps \
--save_total_limit 3 \
--save_steps 500 \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--preprocessing_num_workers 8 \
--block_size ${block_size} \
--output_dir ${output_dir} \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--lora_rank ${lora_rank} \
--lora_alpha ${lora_alpha} \
--trainable ${lora_trainable} \
--modules_to_save ${modules_to_save} \
--lora_dropout ${lora_dropout} \
--torch_dtype float16 \
--save_safetensors False \
--load_in_kbits 16 \
--gradient_checkpointing \
--ddp_find_unused_parameters False
Some of the parameters are explained below:
- --dataset_dir: Directory of pre-training data, which can contain multiple plain-text files ending with txt
- --data_cache_dir: Directory for storing data cache files
- --use_flash_attention_2: Enables FlashAttention-2 training
- --load_in_kbits: The selectable options are [16,8,4], which means using fp16 or 8-bit/4-bit quantization for model training. The default is fp16 training.

The other listed training-related hyperparameters, especially the learning rate and the parameters related to the total batch size, are for reference only. Please configure them according to your data and hardware conditions in actual use.
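As a rough guide to the batch-size-related settings, the total batch size per optimizer step is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs. The sketch below computes it from the debugging values in run_pt.sh. Note that num_gpus is a name introduced here for illustration (it equals --nnodes times --nproc_per_node) and is not a variable in the script.

```shell
# Values copied from run_pt.sh (debugging defaults).
per_device_train_batch_size=1
gradient_accumulation_steps=1
block_size=512

# Assumption: num_gpus is a hypothetical helper variable, not part of run_pt.sh.
# It is the product of --nnodes and --nproc_per_node (1 x 1 in the default script).
num_gpus=1

# Sequences consumed per optimizer step across all GPUs.
total_batch_size=$(( per_device_train_batch_size * gradient_accumulation_steps * num_gpus ))

# Each training sequence holds at most block_size tokens.
max_tokens_per_step=$(( total_batch_size * block_size ))

echo "total batch size: ${total_batch_size} sequences per optimizer step"
echo "max tokens per optimizer step: ${max_tokens_per_step}"
```

When the number of GPUs changes, adjusting gradient_accumulation_steps is one way to keep the total batch size roughly constant.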
[Must be carefully checked] The table below lists the training modes supported by the script. Please pass model_name_or_path according to the situation that applies to you. In this project, the LLaMA-2 and Alpaca-2 models use the same tokenizer, so no distinction is made between them. Modes not listed in the table are not supported; if you want to modify the script to use them, please debug it yourself.
Purpose | model_name_or_path | tokenizer_name_or_path | Final model vocabulary size |
---|---|---|---|
Train Chinese LLaMA-2 LoRA based on original LLaMA-2 | Original HF format LLaMA-2 | Chinese LLaMA-2's tokenizer (55296) | 55296 |
Continue pre-training on new LoRA based on Chinese LLaMA-2 | HF format complete Chinese LLaMA-2 | Chinese LLaMA-2's tokenizer (55296) | 55296 |
Continue pre-training on new LoRA based on Chinese Alpaca-2 | HF format complete Chinese Alpaca-2 | Chinese LLaMA-2's tokenizer (55296) | 55296 |
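To help decide which row of the table applies, it can be useful to check the vocabulary size recorded in the base model directory. The sketch below is a minimal helper, assuming the directory is in HF format and its config.json contains a vocab_size field; the function name get_vocab_size is hypothetical and not part of this project.

```shell
# Hypothetical helper: print the vocab_size recorded in an HF-format model directory.
# $1 = path to the model directory (must contain config.json).
get_vocab_size() {
  # Extract the integer that follows the "vocab_size" key.
  sed -n 's/.*"vocab_size"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1/config.json"
}
```

For example, get_vocab_size path/to/hf/llama-2/dir prints the value stored for the directory passed as model_name_or_path. A complete Chinese LLaMA-2 or Alpaca-2 model would be expected to report the 55296 listed in the table.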
- If your machine's memory is tight, you can remove --modules_to_save ${modules_to_save} \ from the script, i.e., do not train embed_tokens and lm_head (these two parts have a large number of parameters) and only train the LoRA parameters.
  - This operation can only be performed when training based on Chinese LLaMA-2 or Alpaca-2.
- Reducing block_size can also reduce memory usage during training, for example by setting block_size to 256.
- Enabling gradient_checkpointing can effectively reduce VRAM usage, but it may slow down training.
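The first two memory-saving changes can be applied to a copy of the launch script rather than by hand. The following is a minimal sketch, not part of the project: the function name make_low_memory_script and the output file name are hypothetical. As noted above, dropping --modules_to_save is only valid when training based on Chinese LLaMA-2 or Alpaca-2.

```shell
# Hypothetical helper: write a reduced-memory copy of a launch script.
# $1 = source script, $2 = destination script, $3 = new block_size (default 256).
make_low_memory_script() {
  src=$1
  dst=$2
  new_block_size=${3:-256}
  # First expression: delete the line that passes --modules_to_save to the trainer.
  # The modules_to_save=... assignment is left in place; it is simply unused.
  # Second expression: rewrite the block_size assignment.
  sed -e '/--modules_to_save/d' \
      -e "s/^block_size=.*/block_size=${new_block_size}/" \
      "$src" > "$dst"
}
```

For example, make_low_memory_script run_pt.sh run_pt_low_mem.sh writes a modified copy and leaves run_pt.sh untouched, so the two versions can be compared with diff before launching.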
For multi-node, multi-GPU training, please refer to the following launch method:
torchrun \
--nnodes ${num_nodes} \
--nproc_per_node ${num_gpu_per_node} \
--node_rank ${node_rank} \
--master_addr ${master_addr} \
--master_port ${master_port} \
run_clm_pt_with_peft.py \
--deepspeed ${deepspeed_config_file} \
...
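The variables in the launch command have to be set on every node before torchrun is called. The sketch below shows one possible set of values for the first node of a two-node job. All values, including the address and port, are placeholders chosen for illustration and must be replaced with those of your own cluster. world_size is a name introduced here and is not used by the launch command.

```shell
# Placeholder values for illustration only; adapt them to your cluster.
num_nodes=2               # total number of machines in the job
num_gpu_per_node=8        # processes (GPUs) started on each machine
node_rank=0               # 0 on the first node, 1 on the second node
master_addr=10.0.0.1      # hypothetical address of the node with rank 0
master_port=29500         # any free port, identical on all nodes

# Total number of training processes across the job.
world_size=$(( num_nodes * num_gpu_per_node ))
echo "launching node ${node_rank} of ${num_nodes}, ${world_size} processes in total"
```

Only node_rank differs between machines; the other values should be identical on every node.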
The LoRA weights and configuration after training are stored in ${output_dir}/pt_lora_model, which can be used for the subsequent merging process.