
[BUG] Resuming training from a checkpoint raises an error #364

Closed
a-3pig opened this issue Jul 9, 2024 · 4 comments
Labels
bug Something isn't working

Comments


a-3pig commented Jul 9, 2024

Feel free to ask any kind of questions in the issues page, but please use English since other users may find your questions valuable.

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots / log
If applicable, add screenshots / logs to help explain your problem.

Additional context
Add any other context about the problem here.

a-3pig added the bug label Jul 9, 2024
a-3pig changed the title from "[BUG]在cheakpoint" to "[BUG] Resuming training from a checkpoint raises an error" Jul 9, 2024
a-3pig (Author) commented Jul 9, 2024


distributed_backend=gloo
All distributed processes registered. Starting with 2 processes

E:\fish-speech1.2\fish-speech\fishenv\env\lib\site-packages\lightning\pytorch\callbacks\model_checkpoint.py:652: Checkpoint directory E:\fish-speech1.2\fish-speech\results\lora_text2semantic_20240704_152159\checkpoints exists and is not empty.
Restoring states from the checkpoint path at results\lora_text2semantic_20240704_152159\checkpoints\step_000001000.ckpt
[2024-07-09 12:31:13,333][fish_speech.utils.utils][ERROR] - [rank: 0]
Traceback (most recent call last):
  File "E:\fish-speech1.2\fish-speech\fish_speech\utils\utils.py", line 66, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "E:\fish-speech1.2\fish-speech\fish_speech\train.py", line 108, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
  File "E:\fish-speech1.2\fish-speech\fishenv\env\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "E:\fish-speech1.2\fish-speech\fishenv\env\lib\site-packages\lightning\pytorch\trainer\call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "E:\fish-speech1.2\fish-speech\fishenv\env\lib\site-packages\lightning\pytorch\strategies\launchers\subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "E:\fish-speech1.2\fish-speech\fishenv\env\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "E:\fish-speech1.2\fish-speech\fishenv\env\lib\site-packages\lightning\pytorch\trainer\trainer.py", line 955, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "E:\fish-speech1.2\fish-speech\fishenv\env\lib\site-packages\lightning\pytorch\trainer\connectors\checkpoint_connector.py", line 398, in _restore_modules_and_callbacks
    self.restore_model()
  File "E:\fish-speech1.2\fish-speech\fishenv\env\lib\site-packages\lightning\pytorch\trainer\connectors\checkpoint_connector.py", line 275, in restore_model
    self.trainer.strategy.load_model_state_dict(
  File "E:\fish-speech1.2\fish-speech\fishenv\env\lib\site-packages\lightning\pytorch\strategies\strategy.py", line 371, in load_model_state_dict
    self.lightning_module.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "E:\fish-speech1.2\fish-speech\fishenv\env\lib\site-packages\torch\nn\modules\module.py", line 2191, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TextToSemantic:
Missing key(s) in state_dict: "model.embeddings.weight", "model.codebook_embeddings.weight", "model.layers.0.attention.wqkv.weight", "model.layers.0.attention.wo.weight", "model.layers.0.feed_forward.w1.weight", "model.layers.0.feed_forward.w3.weight", "model.layers.0.feed_forward.w2.weight", "model.layers.0.ffn_norm.weight", "model.layers.0.attention_norm.weight", "model.layers.1.attention.wqkv.weight", "model.layers.1.attention.wo.weight", "model.layers.1.feed_forward.w1.weight", "model.layers.1.feed_forward.w3.weight", "model.layers.1.feed_forward.w2.weight", "model.layers.1.ffn_norm.weight", "model.layers.1.attention_norm.weight", "model.layers.2.attention.wqkv.weight", "model.layers.2.attention.wo.weight", "model.layers.2.feed_forward.w1.weight", "model.layers.2.feed_forward.w3.weight", "model.layers.2.feed_forward.w2.weight", "model.layers.2.ffn_norm.weight", "model.layers.2.attention_norm.weight", "model.layers.3.attention.wqkv.weight", "model.layers.3.attention.wo.weight", "model.layers.3.feed_forward.w1.weight", "model.layers.3.feed_forward.w3.weight", "model.layers.3.feed_forward.w2.weight", "model.layers.3.ffn_norm.weight", "model.layers.3.attention_norm.weight", "model.layers.4.attention.wqkv.weight", "model.layers.4.attention.wo.weight", "model.layers.4.feed_forward.w1.weight", "model.layers.4.feed_forward.w3.weight", "model.layers.4.feed_forward.w2.weight", "model.layers.4.ffn_norm.weight", "model.layers.4.attention_norm.weight", "model.layers.5.attention.wqkv.weight", "model.layers.5.attention.wo.weight", "model.layers.5.feed_forward.w1.weight", "model.layers.5.feed_forward.w3.weight", "model.layers.5.feed_forward.w2.weight", "model.layers.5.ffn_norm.weight", "model.layers.5.attention_norm.weight", "model.layers.6.attention.wqkv.weight", "model.layers.6.attention.wo.weight", "model.layers.6.feed_forward.w1.weight", "model.layers.6.feed_forward.w3.weight", "model.layers.6.feed_forward.w2.weight", "model.layers.6.ffn_norm.weight", "model.layers.6.attention_norm.weight", "model.layers.7.attention.wqkv.weight", "model.layers.7.attention.wo.weight", "model.layers.7.feed_forward.w1.weight", "model.layers.7.feed_forward.w3.weight", "model.layers.7.feed_forward.w2.weight", "model.layers.7.ffn_norm.weight", "model.layers.7.attention_norm.weight", "model.layers.8.attention.wqkv.weight", "model.layers.8.attention.wo.weight", "model.layers.8.feed_forward.w1.weight", "model.layers.8.feed_forward.w3.weight", "model.layers.8.feed_forward.w2.weight", "model.layers.8.ffn_norm.weight", "model.layers.8.attention_norm.weight", "model.layers.9.attention.wqkv.weight", "model.layers.9.attention.wo.weight", "model.layers.9.feed_forward.w1.weight", "model.layers.9.feed_forward.w3.weight", "model.layers.9.feed_forward.w2.weight", "model.layers.9.ffn_norm.weight", "model.layers.9.attention_norm.weight", "model.layers.10.attention.wqkv.weight", "model.layers.10.attention.wo.weight", "model.layers.10.feed_forward.w1.weight", "model.layers.10.feed_forward.w3.weight", "model.layers.10.feed_forward.w2.weight", "model.layers.10.ffn_norm.weight", "model.layers.10.attention_norm.weight", "model.layers.11.attention.wqkv.weight", "model.layers.11.attention.wo.weight", "model.layers.11.feed_forward.w1.weight", "model.layers.11.feed_forward.w3.weight", "model.layers.11.feed_forward.w2.weight", "model.layers.11.ffn_norm.weight", "model.layers.11.attention_norm.weight", "model.layers.12.attention.wqkv.weight", "model.layers.12.attention.wo.weight", "model.layers.12.feed_forward.w1.weight", 
"model.layers.12.feed_forward.w3.weight", "model.layers.12.feed_forward.w2.weight", "model.layers.12.ffn_norm.weight", "model.layers.12.attention_norm.weight", "model.layers.13.attention.wqkv.weight", "model.layers.13.attention.wo.weight", "model.layers.13.feed_forward.w1.weight", "model.layers.13.feed_forward.w3.weight", "model.layers.13.feed_forward.w2.weight", "model.layers.13.ffn_norm.weight", "model.layers.13.attention_norm.weight", "model.layers.14.attention.wqkv.weight", "model.layers.14.attention.wo.weight", "model.layers.14.feed_forward.w1.weight", "model.layers.14.feed_forward.w3.weight", "model.layers.14.feed_forward.w2.weight", "model.layers.14.ffn_norm.weight", "model.layers.14.attention_norm.weight", "model.layers.15.attention.wqkv.weight", "model.layers.15.attention.wo.weight", "model.layers.15.feed_forward.w1.weight", "model.layers.15.feed_forward.w3.weight", "model.layers.15.feed_forward.w2.weight", "model.layers.15.ffn_norm.weight", "model.layers.15.attention_norm.weight", "model.layers.16.attention.wqkv.weight", "model.layers.16.attention.wo.weight", "model.layers.16.feed_forward.w1.weight", "model.layers.16.feed_forward.w3.weight", "model.layers.16.feed_forward.w2.weight", "model.layers.16.ffn_norm.weight", "model.layers.16.attention_norm.weight", "model.layers.17.attention.wqkv.weight", "model.layers.17.attention.wo.weight", "model.layers.17.feed_forward.w1.weight", "model.layers.17.feed_forward.w3.weight", "model.layers.17.feed_forward.w2.weight", "model.layers.17.ffn_norm.weight", "model.layers.17.attention_norm.weight", "model.layers.18.attention.wqkv.weight", "model.layers.18.attention.wo.weight", "model.layers.18.feed_forward.w1.weight", "model.layers.18.feed_forward.w3.weight", "model.layers.18.feed_forward.w2.weight", "model.layers.18.ffn_norm.weight", "model.layers.18.attention_norm.weight", "model.layers.19.attention.wqkv.weight", "model.layers.19.attention.wo.weight", "model.layers.19.feed_forward.w1.weight", "model.layers.19.feed_forward.w3.weight", "model.layers.19.feed_forward.w2.weight", "model.layers.19.ffn_norm.weight", "model.layers.19.attention_norm.weight", "model.layers.20.attention.wqkv.weight", "model.layers.20.attention.wo.weight", "model.layers.20.feed_forward.w1.weight", "model.layers.20.feed_forward.w3.weight", "model.layers.20.feed_forward.w2.weight", "model.layers.20.ffn_norm.weight", "model.layers.20.attention_norm.weight", "model.layers.21.attention.wqkv.weight", "model.layers.21.attention.wo.weight", "model.layers.21.feed_forward.w1.weight", "model.layers.21.feed_forward.w3.weight", "model.layers.21.feed_forward.w2.weight", "model.layers.21.ffn_norm.weight", "model.layers.21.attention_norm.weight", "model.layers.22.attention.wqkv.weight", "model.layers.22.attention.wo.weight", "model.layers.22.feed_forward.w1.weight", "model.layers.22.feed_forward.w3.weight", "model.layers.22.feed_forward.w2.weight", "model.layers.22.ffn_norm.weight", "model.layers.22.attention_norm.weight", "model.layers.23.attention.wqkv.weight", "model.layers.23.attention.wo.weight", "model.layers.23.feed_forward.w1.weight", "model.layers.23.feed_forward.w3.weight", "model.layers.23.feed_forward.w2.weight", "model.layers.23.ffn_norm.weight", "model.layers.23.attention_norm.weight", "model.norm.weight", "model.output.weight", "model.fast_embeddings.weight", "model.fast_layers.0.attention.wqkv.weight", "model.fast_layers.0.attention.wo.weight", "model.fast_layers.0.feed_forward.w1.weight", "model.fast_layers.0.feed_forward.w3.weight", 
"model.fast_layers.0.feed_forward.w2.weight", "model.fast_layers.0.ffn_norm.weight", "model.fast_layers.0.attention_norm.weight", "model.fast_layers.1.attention.wqkv.weight", "model.fast_layers.1.attention.wo.weight", "model.fast_layers.1.feed_forward.w1.weight", "model.fast_layers.1.feed_forward.w3.weight", "model.fast_layers.1.feed_forward.w2.weight", "model.fast_layers.1.ffn_norm.weight", "model.fast_layers.1.attention_norm.weight", "model.fast_layers.2.attention.wqkv.weight", "model.fast_layers.2.attention.wo.weight", "model.fast_layers.2.feed_forward.w1.weight", "model.fast_layers.2.feed_forward.w3.weight", "model.fast_layers.2.feed_forward.w2.weight", "model.fast_layers.2.ffn_norm.weight", "model.fast_layers.2.attention_norm.weight", "model.fast_layers.3.attention.wqkv.weight", "model.fast_layers.3.attention.wo.weight", "model.fast_layers.3.feed_forward.w1.weight", "model.fast_layers.3.feed_forward.w3.weight", "model.fast_layers.3.feed_forward.w2.weight", "model.fast_layers.3.ffn_norm.weight", "model.fast_layers.3.attention_norm.weight", "model.fast_norm.weight", "model.fast_output.weight".

a-3pig (Author) commented Jul 9, 2024

(two screenshots attached)

AnyaCoder (Collaborator) commented

LoRA does not currently support resuming training from a checkpoint.

a-3pig (Author) commented Jul 9, 2024

LoRA does not currently support resuming training from a checkpoint.

Understood.
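
Until resume support lands, one possible workaround (a hedged sketch, not an official fish-speech workflow) is to rebuild the LoRA-wrapped module exactly as train.py does, restore whatever checkpoint keys do match with strict=False, and then start a fresh fit without ckpt_path so Lightning skips its own strict restore:

    import torch
    from torch import nn

    def restore_compatible_weights(model: nn.Module, ckpt_path: str) -> None:
        """Load only the keys of a Lightning checkpoint that match `model`.

        strict=False restores every matching key and returns the rest,
        instead of raising the RuntimeError shown above.
        """
        state = torch.load(ckpt_path, map_location="cpu")["state_dict"]
        missing, unexpected = model.load_state_dict(state, strict=False)
        print(f"{len(missing)} missing / {len(unexpected)} unexpected keys skipped")

Note this restores weights only, not the optimizer state or step counter, so it is a warm start rather than a true resume.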
