GPU is busy but no training progress is shown [BUG] #390

Closed
chaoqunxie opened this issue Jul 17, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@chaoqunxie (Author)


4.3 M Trainable params
390 M Non-trainable params
394 M Total params
1,577.969 Total estimated model params size (MB)
Sanity Checking: | | 0/? [00:00<?, ?it/s][2024-07-17 08:54:37,928][fish_speech.datasets.text][INFO] - [rank: 0] Reading 2 / 1 files
[2024-07-17 08:54:37,928][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:37,929][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-17 08:54:37,929][fish_speech.datasets.text][INFO] - [rank: 0] Read total 2 groups of data
[2024-07-17 08:54:37,932][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:37,933][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-17 08:54:37,967][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:37,967][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-17 08:54:43,468][fish_speech.datasets.text][INFO] - [rank: 0] Reading 2 / 1 files
[2024-07-17 08:54:43,469][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:43,469][fish_speech.datasets.text][INFO] - [rank: 0] Read total 2 groups of data
[2024-07-17 08:54:43,469][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-17 08:54:43,471][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:43,472][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
[2024-07-17 08:54:43,492][fish_speech.datasets.text][INFO] - [rank: 0] Reading 1 / 1 files
[2024-07-17 08:54:43,492][fish_speech.datasets.text][INFO] - [rank: 0] Read total 1 groups of data
(screenshot attached)

chaoqunxie added the bug label on Jul 17, 2024
@chaoqunxie (Author)

It is just slow; after waiting quite a while I noticed there was progress.
(screenshot attached)

@chaoqunxie (Author)

After this point, nothing more gets printed.
(screenshot attached)

@chaoqunxie (Author)

It is very slow.

@Stardust-minus (Member)

What GPU are you using?

@chaoqunxie
Copy link
Author

> What GPU are you using?

A modded 2080 Ti with 22 GB. After it runs for a while it throws an error, and only rebooting the machine fixes it. I feel like it is wrecking my GPU.
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/exp/fish_speech/train.py", line 135, in main
train(cfg)
File "/exp/fish_speech/utils/utils.py", line 77, in wrap
raise ex
File "/exp/fish_speech/utils/utils.py", line 66, in wrap
metric_dict, object_dict = task_func(cfg=cfg)
File "/exp/fish_speech/train.py", line 108, in train
trainer.fit(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 60, in _call_and_handle_interrupt
trainer._teardown()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1009, in _teardown
self.strategy.teardown()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 419, in teardown
super().teardown()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/parallel.py", line 133, in teardown
super().teardown()
File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 531, in teardown
_optimizers_to_device(self.optimizers, torch.device("cpu"))
File "/usr/local/lib/python3.10/site-packages/lightning/fabric/utilities/optimizer.py", line 28, in _optimizers_to_device
_optimizer_to_device(opt, device)
File "/usr/local/lib/python3.10/site-packages/lightning/fabric/utilities/optimizer.py", line 34, in _optimizer_to_device
optimizer.state[p] = apply_to_collection(v, Tensor, move_data_to_device, device, allow_frozen=True)
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 52, in apply_to_collection
return _apply_to_collection_slow(
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 104, in _apply_to_collection_slow
v = _apply_to_collection_slow(
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 96, in _apply_to_collection_slow
return function(data, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/fabric/utilities/apply_func.py", line 103, in move_data_to_device
return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
File "/usr/local/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 64, in apply_to_collection
return function(data, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/lightning/fabric/utilities/apply_func.py", line 97, in batch_to
data_output = data.to(device, **kwargs)
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[W CUDAGuardImpl.h:118] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f42ad77a897 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f42ad72ab25 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f42adb6f718 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1d8d6 (0x7f42adb3a8d6 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x1f5e3 (0x7f42adb3c5e3 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x1f922 (0x7f42adb3c922 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: + 0x5a5950 (0x7f42ac086950 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x6a36f (0x7f42ad75f36f in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f42ad7581cb in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #9: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f42ad758379 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: c10d::Reducer::~Reducer() + 0x5c4 (0x7f4299cc69d4 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #11: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f42ac7c9552 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #12: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7f42abf51788 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #13: + 0xcec001 (0x7f42ac7cd001 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #14: + 0x47b773 (0x7f42abf5c773 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #15: + 0x47c6f1 (0x7f42abf5d6f1 in /usr/local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
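
As the messages above suggest, CUDA_LAUNCH_BLOCKING and HYDRA_FULL_ERROR can be set before launching training to pin down where the failure happens. A minimal sketch, assuming the /exp/fish_speech/train.py entry point shown in the traceback:

```python
# Sketch only: enable the debug flags suggested by the error output above.
# Both must be set before CUDA / Hydra are initialized, e.g. at the very top
# of /exp/fish_speech/train.py or exported in the shell that launches it.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # report CUDA errors at the failing launch, not later
os.environ["HYDRA_FULL_ERROR"] = "1"      # print the complete Hydra stack trace
```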

@chaoqunxie (Author)

(screenshot attached)

@UltramanSleepless

@chaoqunxie
How long does the training take? I also have a modded 2080 Ti.

@chaoqunxie (Author)

chaoqunxie commented Jul 23, 2024

> @chaoqunxie
> How long does the training take? I also have a modded 2080 Ti.

It reached step 3000 after 4 hours and 15 minutes.
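
For a rough sense of throughput, those numbers work out as follows (assuming the 4 h 15 min covers exactly the first 3000 steps):

```python
# Back-of-the-envelope throughput from the figures reported above.
steps = 3000
elapsed_s = 4 * 3600 + 15 * 60             # 4 h 15 min = 15300 s
print(f"{elapsed_s / steps:.2f} s/step")   # ~5.10 s/step
print(f"{steps / elapsed_s:.3f} steps/s")  # ~0.196 steps/s
```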
