Liger kernel breaks fine-tuning #5542

Closed
1 task done
arit2 opened this issue Sep 25, 2024 · 5 comments
Labels
solved (This problem has been already solved)

Comments

arit2 commented Sep 25, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

LLaMA Factory, version 0.9.1.dev0
liger_kernel 0.3.0
transformers 4.45.0.dev0

Reproduction

llamafactory-cli train ./examples/train_lora/qwen2vl_loraplus_dpo_2b_20_09.yaml

### model

model_name_or_path: Qwen/Qwen2-VL-2B-Instruct

### method

stage: dpo
do_train: true
finetuning_type: lora
lora_target: all
pref_beta: 0.3
pref_loss: sigmoid

### dataset

dataset: obrazy_rlhf_v__proba
buffer_size: 1
preprocessing_batch_size: 1
streaming: true
val_size: 260
#accelerator_config:
dispatch_batches: false

template: qwen2_vl
cutoff_len: 2748
#max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 1

### output

output_dir: saves/qwen2_vl-2b_loraplus/25v1_beta0_5_orig
logging_steps: 500
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train

per_device_train_batch_size: 1
gradient_checkpointing: true
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 3.0

flash_attn: auto
lr_scheduler_type: cosine
max_grad_norm: 1.0
loraplus_lr_ratio: 16.0
enable_liger_kernel: true

warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
max_steps: 2200

### eval

per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 200

Expected behavior

Running the training with the Liger kernel enabled (enable_liger_kernel: true) fails with the following error:

[rank0]: AttributeError: 'NoneType' object has no attribute 'to'

My versions:
liger_kernel 0.3.0
llamafactory 0.9.1.dev0
transformers 4.45.0.dev0

09/25/2024 12:07:58 - INFO - llamafactory.model.model_utils.liger_kernel - Liger kernel has been applied to the model.
09/25/2024 12:07:58 - INFO - llamafactory.model.model_utils.liger_kernel - Liger kernel has been applied to the model.
[INFO|modeling_utils.py:3702] 2024-09-25 12:07:58,644 >> loading weights file model.safetensors from cache at /home/python/.cache/huggingface/hub/models--Qwen--Qwen2-VL-2B-Instruct/snapshots/aca78372505e6cb469c4fa6a35c60265b00ff5a4/model.safetensors.index.json
[INFO|modeling_utils.py:1621] 2024-09-25 12:07:58,653 >> Instantiating Qwen2VLForConditionalGeneration model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1097] 2024-09-25 12:07:58,654 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}

[WARNING|logging.py:328] 2024-09-25 12:07:58,688 >> Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46
Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 5.88s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 5.88s/it]
[INFO|modeling_utils.py:4544] 2024-09-25 12:08:10,541 >> All model checkpoint weights were used when initializing Qwen2VLForConditionalGeneration.

[INFO|modeling_utils.py:4552] 2024-09-25 12:08:10,541 >> All the weights of Qwen2VLForConditionalGeneration were initialized from the model checkpoint at Qwen/Qwen2-VL-2B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2VLForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:1052] 2024-09-25 12:08:10,685 >> loading configuration file generation_config.json from cache at /home/python/.cache/huggingface/hub/models--Qwen--Qwen2-VL-2B-Instruct/snapshots/aca78372505e6cb469c4fa6a35c60265b00ff5a4/generation_config.json
[INFO|configuration_utils.py:1097] 2024-09-25 12:08:10,685 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"temperature": 0.01,
"top_k": 1,
"top_p": 0.001
}

09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/25/2024 12:08:10 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.misc - Found linear modules: o_proj,down_proj,q_proj,k_proj,gate_proj,up_proj,v_proj
09/25/2024 12:08:10 - INFO - llamafactory.model.model_utils.misc - Found linear modules: q_proj,v_proj,o_proj,gate_proj,down_proj,k_proj,up_proj
09/25/2024 12:08:11 - INFO - llamafactory.model.loader - trainable params: 9,232,384 || all params: 2,218,217,984 || trainable%: 0.4162
09/25/2024 12:08:11 - INFO - llamafactory.model.loader - trainable params: 9,232,384 || all params: 2,218,217,984 || trainable%: 0.4162
max_steps is given, it will override any value given in num_train_epochs
[WARNING|trainer.py:617] 2024-09-25 12:08:11,039 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:667] 2024-09-25 12:08:11,039 >> Using auto half precision backend
09/25/2024 12:08:11 - INFO - llamafactory.train.trainer_utils - Using LoRA+ optimizer with loraplus lr ratio 16.00.
09/25/2024 12:08:11 - INFO - llamafactory.train.trainer_utils - Using LoRA+ optimizer with loraplus lr ratio 16.00.
[INFO|trainer.py:2212] 2024-09-25 12:08:13,575 >> ***** Running training *****
[INFO|trainer.py:2213] 2024-09-25 12:08:13,575 >> Num examples = 4,400
[INFO|trainer.py:2214] 2024-09-25 12:08:13,575 >> Num Epochs = 9,223,372,036,854,775,807
[INFO|trainer.py:2215] 2024-09-25 12:08:13,575 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2218] 2024-09-25 12:08:13,575 >> Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:2219] 2024-09-25 12:08:13,575 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2220] 2024-09-25 12:08:13,575 >> Total optimization steps = 2,200
[INFO|trainer.py:2221] 2024-09-25 12:08:13,578 >> Number of trainable parameters = 9,232,384
0%| | 0/2200 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]: launch()
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 56, in run_exp
[rank0]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 81, in run_dpo
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2021, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2357, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 3454, in training_step
[rank0]: loss = self.compute_loss(model, inputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/env/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1408, in compute_loss
[rank0]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 232, in get_batch_loss_metrics
[rank0]: ) = self.concatenated_forward(model, batch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 182, in concatenated_forward
[rank0]: all_logits: "torch.Tensor" = model(**batch, return_dict=True, use_cache=False).logits.to(torch.float32)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'NoneType' object has no attribute 'to'
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]: launch()
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/tuner.py", line 56, in run_exp
[rank1]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 81, in run_dpo
[rank1]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2021, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 2357, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/env/lib/python3.11/site-packages/transformers/trainer.py", line 3454, in training_step
[rank1]: loss = self.compute_loss(model, inputs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/env/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 1408, in compute_loss
[rank1]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 232, in get_batch_loss_metrics
[rank1]: ) = self.concatenated_forward(model, batch)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/python/factory/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 182, in concatenated_forward
[rank1]: all_logits: "torch.Tensor" = model(**batch, return_dict=True, use_cache=False).logits.to(torch.float32)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AttributeError: 'NoneType' object has no attribute 'to'
0%| | 0/2200 [00:13<?, ?it/s]
E0925 12:08:30.915000 140353497219136 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3061541) of binary: /home/python/factory/env/bin/python3
Traceback (most recent call last):
File "/home/python/factory/env/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/python/factory/env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
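
The last application frame in the traceback calls `.to(torch.float32)` on `.logits`, so the error means the forward pass returned an output object whose `logits` field was `None`. The sketch below reproduces that failure mode with stand-in classes (no real model is involved) and shows a guard that fails with a clearer message. `FakeOutput`, `fake_forward`, and `require_logits` are hypothetical names, and the idea that a fused-loss code path returns a loss without materializing logits is an assumption that this thread does not confirm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeOutput:
    """Hypothetical stand-in for a model output object."""
    loss: Optional[float] = None
    logits: Optional[List[float]] = None


def fake_forward(fused_loss: bool) -> FakeOutput:
    # Assumption: a fused-loss path computes the loss directly and
    # leaves `logits` unset, while the regular path fills it in.
    if fused_loss:
        return FakeOutput(loss=0.69, logits=None)
    return FakeOutput(loss=0.69, logits=[0.1, 0.9])


def require_logits(output: FakeOutput) -> List[float]:
    # Fail early with a descriptive message instead of the bare
    # "'NoneType' object has no attribute 'to'" seen in the traceback.
    if output.logits is None:
        raise RuntimeError(
            "Forward pass returned no logits; the trainer needs them "
            "to compute per-token log-probabilities."
        )
    return output.logits
```

Calling `require_logits(fake_forward(fused_loss=True))` raises the descriptive `RuntimeError`, while the non-fused path returns the logits unchanged.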

Others

No response

github-actions bot added the "pending" label (This problem is yet to be addressed) Sep 25, 2024

d223302 commented Sep 27, 2024

I encountered the same issue when using DPO to fine-tune Qwen2-VL.
Here is my environment:

- `llamafactory` version: 0.9.1.dev0
- Platform: Linux-6.6.13-1-lts-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.0+cu121
- Transformers version: 4.45.0.dev0
- Datasets version: 2.21.0
- Accelerate version: 0.34.2
- PEFT version: 0.12.0
- TRL version: 0.9.6
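
For anyone comparing environments, a version list like the one above can be gathered programmatically. This is a minimal sketch using the standard-library `importlib.metadata`; `collect_versions` is a hypothetical helper, and the distribution names in the usage example are assumptions that may need adjusting to match how the packages were installed.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def collect_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed under this distribution name.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names below are assumptions; edit as needed.
    wanted = ["llamafactory", "liger-kernel", "transformers", "trl", "peft"]
    for name, version in collect_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```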

Muyang12345 commented

I also encountered the same issue. The problem seems to be caused by 'enable_liger_kernel: true'. I want to use this parameter to reduce the memory footprint.

hiyouga added the "bug" label (Something isn't working) Sep 30, 2024
Owner

hiyouga commented Sep 30, 2024

fixed

hiyouga added the "solved" label (This problem has been already solved) and removed the "bug" (Something isn't working) and "pending" (This problem is yet to be addressed) labels Sep 30, 2024
rabiitmiao commented

How do I fix it?
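
One workaround consistent with the reports in this thread is to switch `enable_liger_kernel` off for training stages whose trainer reads `.logits`, at the cost of the memory savings the kernel provides. The sketch below operates on a plain dict of config values. `disable_liger_for_logit_stages` is a hypothetical helper, and the set of stages treated as not needing logits is an assumption rather than something stated in this issue.

```python
from typing import Any, Dict

# Assumption: only these stages can train without materialized logits.
STAGES_WITHOUT_LOGITS = {"pt", "sft"}


def disable_liger_for_logit_stages(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the config with the Liger kernel disabled when the
    stage is assumed to need logits (for example, dpo)."""
    patched = dict(config)  # shallow copy; the caller's dict is left untouched
    needs_logits = patched.get("stage") not in STAGES_WITHOUT_LOGITS
    if needs_logits and patched.get("enable_liger_kernel"):
        patched["enable_liger_kernel"] = False
    return patched


dpo_config = {"stage": "dpo", "enable_liger_kernel": True}
print(disable_liger_for_logit_stages(dpo_config))
```

The same effect can be had by editing the YAML file by hand and setting `enable_liger_kernel: false`.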

hurongliang pushed a commit to hurongliang/LLaMA-Factory that referenced this issue Oct 5, 2024
* 'main' of github.com:hurongliang/LLaMA-Factory: (61 commits)
  update wechat
  fix hiyouga#5542
  add patch processor func
  lint
  Update constants.py
  Update template.py
  fix chat template Exaone3.0
  Update README_zh.md
  Update README.md
  update docs Support model Exaone3.0
  add Exaone3.0 template
  Update common.py
  Update README_zh.md
  Update README.md
  Update README.md
  Update constants.py
  Update test_mm_plugin.py
  fix template
  fix template
  fix constants
  ...
camposs1979 commented

> I also encountered the same issue. The problem seems to be caused by 'enable_liger_kernel: true'. I want to use this parameter to reduce the memory footprint.

> fixed

Not yet. The latest code (downloaded on Dec 25) still has the same issue:
[WARNING|:208] 2024-12-25 16:05:55,347 >> ==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
\ /| Num examples = 17,970 | Num Epochs = 3
O^O/ _/ \ Batch size per device = 1 | Gradient Accumulation steps = 1
\ / Total batch size = 1 | Total steps = 53,910
"-____-" Number of trainable parameters = 67,108,864
0%| | 0/53910 [00:00<?, ?it/s]Traceback (most recent call last):
File "/root/miniconda3/envs/llamaf-env/bin/llamafactory-cli", line 8, in <module>
sys.exit(main())
File "/root/autodl-tmp/LLaMA-Factory-main/src/llamafactory/cli.py", line 112, in main
run_exp()
File "/root/autodl-tmp/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 65, in run_exp
run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "/root/autodl-tmp/LLaMA-Factory-main/src/llamafactory/train/dpo/workflow.py", line 83, in run_dpo
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "", line 157, in train
File "", line 374, in _fast_inner_training_loop
File "", line 31, in _unsloth_training_step
File "/root/autodl-tmp/LLaMA-Factory-main/src/llamafactory/train/dpo/trainer.py", line 280, in compute_loss
loss = super().compute_loss(model, inputs, return_outputs)
File "/root/miniconda3/envs/llamaf-env/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1408, in compute_loss
loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
File "/root/autodl-tmp/LLaMA-Factory-main/src/llamafactory/train/dpo/trainer.py", line 244, in get_batch_loss_metrics
) = self.concatenated_forward(model, batch)
File "/root/autodl-tmp/LLaMA-Factory-main/src/llamafactory/train/dpo/trainer.py", line 194, in concatenated_forward
all_logits: "torch.Tensor" = model(**batch, return_dict=True, use_cache=False).logits.to(torch.float32)
AttributeError: 'NoneType' object has no attribute 'to'
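
This traceback ends at the same `.logits.to(...)` call as the original report. The thread does not say what the earlier fix changed, so the following is only a sketch of one defensive pattern: when the training stage needs logits, pass a flag that turns off the fused loss, but only if the patch function actually accepts that flag. The patch functions here are stand-ins, `apply_patch` is a hypothetical helper, and the flag name `fused_linear_cross_entropy` is an assumption about the kernel library's interface.

```python
import inspect
from typing import Any, Callable, Dict


def new_style_patch(fused_linear_cross_entropy: bool = True) -> Dict[str, Any]:
    """Stand-in for a patch function that exposes the (assumed) flag."""
    return {"fused_linear_cross_entropy": fused_linear_cross_entropy}


def old_style_patch() -> Dict[str, Any]:
    """Stand-in for an older patch function without the flag."""
    return {}


def apply_patch(patch_fn: Callable[..., Dict[str, Any]], require_logits: bool) -> Dict[str, Any]:
    kwargs: Dict[str, Any] = {}
    accepted = inspect.signature(patch_fn).parameters
    # Only pass the flag when the stage needs logits and the function supports it.
    if require_logits and "fused_linear_cross_entropy" in accepted:
        kwargs["fused_linear_cross_entropy"] = False
    return patch_fn(**kwargs)


print(apply_patch(new_style_patch, require_logits=True))
print(apply_patch(old_style_patch, require_logits=True))
```

Checking the signature first keeps the call compatible with versions of a patch function that predate the flag.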

6 participants