GPU is busy but no training progress is shown [BUG] #390

Closed
chaoqunxie opened this issue Jul 17, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@chaoqunxie (Author)


4.3 M Trainable params
390 M Non-trainable params
394 M Total params
1,577.969 Total estimated model params size (MB)
Sanity Checking: | | 0/? [00:00<?, ?it/s][2024-07-17 08:54:37,928][fish_speech.datasets.text][INFO] - [rank: 0] Reading 2 / 1 files
[2024-07-17 08:54:37,928][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:37,929][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-17 08:54:37,929][fish_speech.datasets.text][INFO] - [rank: 0] Read total 2 groups of data
[2024-07-17 08:54:37,932][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:37,933][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-17 08:54:37,967][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:37,967][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-17 08:54:43,468][fish_speech.datasets.text][INFO] - [rank: 0] Reading 2 / 1 files
[2024-07-17 08:54:43,469][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:43,469][fish_speech.datasets.text][INFO] - [rank: 0] Read total 2 groups of data
[2024-07-17 08:54:43,469][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-17 08:54:43,471][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:43,472][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-17 08:54:43,492][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:43,492][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
(screenshot attached)

chaoqunxie added the bug label on Jul 17, 2024
@chaoqunxie (Author)

It is just slow; after waiting quite a while I noticed there was progress.
(screenshot attached)

@chaoqunxie (Author)

After this point, nothing more gets printed.
(screenshot attached)

@chaoqunxie (Author)

It is very slow.

@Stardust-minus (Member)

What GPU are you using?

@chaoqunxie
Copy link
Author

> What GPU are you using?

A modded 2080 Ti with 22 GB. After it runs for a while it throws an error, and only rebooting the machine fixes it. I feel like it is wrecking my GPU.
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/exp/fish_speech/train.py", line 135, in main
train(cfg)
File "/exp/fish_speech/utils/utils.py", line 77, in wrap
raise ex
File "/exp/fish_speech/utils/utils.py", line 66, in wrap
metric_dict, object_dict = task_func(cfg=cfg)
File "/exp/fish_speech/train.py", line 108, in train
trainer.fit(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 60, in _call_and_handle_interrupt
trainer._teardown()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1009, in _teardown
self.strategy.teardown()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 419, in teardown
super().teardown()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/parallel.py", line 133, in teardown
super().teardown()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 531, in teardown
_optimizers_to_device(self.optimizers, torch.device("cpu"))
File "/usr/local/lib/python3.10/site-packages/lightning/fabric/utilities/optimizer.py", line 28, in _optimizers_to_device
_optimizer_to_device(opt, device)
File "/usr/local/lib/python3.10/site-packages/lightning/fabric/utilities/optimizer.py", line 34, in _optimizer_to_device
optimizer.state[p] = apply_to_collection(v, Tensor, move_data_to_device, device, allow_frozen=True)
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 52, in apply_to_collection
return _apply_to_collection_slow(
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 104, in _apply_to_collection_slow
v = _apply_to_collection_slow(
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 96, in _apply_to_collection_slow
return function(data, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/fabric/utilities/apply_func.py", line 103, in move_data_to_device
return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 64, in apply_to_collection
return function(data, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/fabric/utilities/apply_func.py", line 97, in batch_to
data_output = data.to(device, **kwargs)
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[W CUDAGuardImpl.h:118] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f42ad77a897 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f42ad72ab25 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f42adb6f718 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1d8d6 (0x7f42adb3a8d6 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x1f5e3 (0x7f42adb3c5e3 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x1f922 (0x7f42adb3c922 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: + 0x5a5950 (0x7f42ac086950 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x6a36f (0x7f42ad75f36f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f42ad7581cb in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f42ad758379 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: c10d::Reducer::~Reducer() + 0x5c4 (0x7f4299cc69d4 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f42ac7c9552 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7f42abf51788 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #13: + 0xcec001 (0x7f42ac7cd001 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #14: + 0x47b773 (0x7f42abf5c773 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: + 0x47c6f1 (0x7f42abf5d6f1 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
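
As the messages above suggest, CUDA_LAUNCH_BLOCKING and HYDRA_FULL_ERROR can be set before launching training to pin down where the failure happens. A minimal sketch, assuming the /exp/fish_speech/train.py entry point shown in the traceback:

```python
# Sketch only: enable the debug flags suggested by the error output above.
# Both must be set before CUDA / Hydra are initialized, e.g. at the very top
# of /exp/fish_speech/train.py or exported in the shell that launches it.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # report CUDA errors at the failing launch, not later
os.environ["HYDRA_FULL_ERROR"] = "1"      # print the complete Hydra stack trace
```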

@chaoqunxie (Author)

(screenshot attached)

@UltramanSleepless

@chaoqunxie
How long does the training take? I also have a modded 2080 Ti.

@chaoqunxie (Author)

chaoqunxie commented Jul 23, 2024

> @chaoqunxie
> How long does the training take? I also have a modded 2080 Ti.

It reached step 3000 after 4 hours and 15 minutes.
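
For a rough sense of throughput, those numbers work out as follows (assuming the 4 h 15 min covers exactly the first 3000 steps):

```python
# Back-of-the-envelope throughput from the figures reported above.
steps = 3000
elapsed_s = 4 * 3600 + 15 * 60             # 4 h 15 min = 15300 s
print(f"{elapsed_s / steps:.2f} s/step")   # ~5.10 s/step
print(f"{steps / elapsed_s:.3f} steps/s")  # ~0.196 steps/s
```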
