The full error output is as follows:
Training: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 640/640 [08:48<00:00, 1.22it/s]##--------------------- Dev
--------------------------------------------------------------------------------
f1 = 0.20078740157511782
precision = 0.22666666666701038
recall = 0.1802120141345653
**--------------------- Dev End
Traceback (most recent call last):
File "train.py", line 365, in
main()
File "train.py", line 322, in main
dev_metric = evaluate(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "train.py", line 48, in evaluate
outputs_gathered = postprocess_gplinker(
File "/sharedFolder/GPLinker_pytorch-dev/utils/postprocess.py", line 8, in postprocess_gplinker
batch_outputs[0].cpu().numpy(),
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f736ca5fd62 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1c4d3 (0x7f736ccc24d3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f736ccc2ee2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f736ca49314 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: + 0x29e239 (0x7f73c9422239 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: + 0xadf291 (0x7f73c9c63291 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f73c9c63592 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #7: /usr/bin/python3() [0x5aee8a]
frame #8: /usr/bin/python3() [0x5ed1a0]
frame #9: /usr/bin/python3() [0x544188]
frame #10: /usr/bin/python3() [0x5441da]
frame #11: /usr/bin/python3() [0x5441da]
frame #12: /usr/bin/python3() [0x5441da]
frame #13: PyDict_SetItemString + 0x538 (0x5ce7c8 in /usr/bin/python3)
frame #14: PyImport_Cleanup + 0x79 (0x685179 in /usr/bin/python3)
frame #15: Py_FinalizeEx + 0x7f (0x68040f in /usr/bin/python3)
frame #16: Py_RunMain + 0x32d (0x6b7a1d in /usr/bin/python3)
frame #17: Py_BytesMain + 0x2d (0x6b7c8d in /usr/bin/python3)
frame #18: __libc_start_main + 0xf3 (0x7f73db4910b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb12e in /usr/bin/python3)
Traceback (most recent call last):
File "train.py", line 365, in
main()
File "train.py", line 352, in main
accelerator.wait_for_everyone()
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 496, in wait_for_everyone
wait_for_everyone()
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils.py", line 530, in wait_for_everyone
torch.distributed.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2716, in barrier
work.wait()
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[W CUDAGuardImpl.h:113] Warning: CUDA warning: the launch timed out and was terminated (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd613fdbd62 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1c4d3 (0x7fd61423e4d3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fd61423eee2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fd613fc5314 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x4a (0x7fd670de549a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x63 (0x7fd617714f33 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x9 (0x7fd6177150c9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xe6c6d6 (0x7fd67156c6d6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #8: + 0xe6c72a (0x7fd67156c72a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: + 0x2a6c10 (0x7fd6709a6c10 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: + 0x2a7e7e (0x7fd6709a7e7e in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: /usr/bin/python3() [0x5ed1a0]
frame #12: /usr/bin/python3() [0x544188]
frame #13: /usr/bin/python3() [0x5441da]
frame #14: /usr/bin/python3() [0x5441da]
frame #15: /usr/bin/python3() [0x5441da]
frame #16: /usr/bin/python3() [0x5441da]
frame #17: PyDict_SetItemString + 0x538 (0x5ce7c8 in /usr/bin/python3)
frame #18: PyImport_Cleanup + 0x79 (0x685179 in /usr/bin/python3)
frame #19: Py_FinalizeEx + 0x7f (0x68040f in /usr/bin/python3)
frame #20: Py_RunMain + 0x32d (0x6b7a1d in /usr/bin/python3)
frame #21: Py_BytesMain + 0x2d (0x6b7c8d in /usr/bin/python3)
frame #22: __libc_start_main + 0xf3 (0x7fd682a0c0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #23: _start + 0x2e (0x5fb12e in /usr/bin/python3)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 31892) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
[1]:
time : 2022-07-19_03:27:00
host : 7fc5b780751c
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 31893)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 31893
Root Cause (first observed failure):
[0]:
time : 2022-07-19_03:27:00
host : 7fc5b780751c
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 31892)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 31892
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 41, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 378, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 176, in multi_gpu_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '2', 'train.py', '--model_type', 'bert', '--pretrained_model_name_or_path', 'bert-base-chinese', '--method', 'gplinker', '--logging_steps', '200', '--num_train_epochs', '20', '--learning_rate', '3e-5', '--num_warmup_steps_or_radios', '0.1', '--gradient_accumulation_steps', '1', '--per_device_train_batch_size', '32', '--per_device_eval_batch_size', '32', '--seed', '42', '--save_steps', '10804', '--output_dir', './outputs', '--max_length', '128', '--topk', '1', '--num_workers', '8', '--model_cache_dir', '/mnt/f/hf/models']' returned non-zero exit status 1.
I can see the traceback points at `batch_outputs[0].cpu().numpy()`, but what I can't figure out is why there is no problem at all on a single GPU.
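
As the error text itself suggests, one way to narrow this down is to re-run with `CUDA_LAUNCH_BLOCKING=1`: kernel launches then execute synchronously, so the traceback should point at the kernel that actually timed out rather than at the later `batch_outputs[0].cpu()` call that merely observes the failure. A minimal sketch, assuming the job was started with `accelerate launch train.py` using the flags visible in the subprocess command above:

```bash
# Same invocation as in the log, prefixed with CUDA_LAUNCH_BLOCKING=1 so that
# kernel launches are synchronous and the failing call shows up in the traceback.
# For debugging only: synchronous launches are noticeably slower.
CUDA_LAUNCH_BLOCKING=1 accelerate launch train.py \
    --model_type bert \
    --pretrained_model_name_or_path bert-base-chinese \
    --method gplinker \
    --logging_steps 200 \
    --num_train_epochs 20 \
    --learning_rate 3e-5 \
    --num_warmup_steps_or_radios 0.1 \
    --gradient_accumulation_steps 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --seed 42 \
    --save_steps 10804 \
    --output_dir ./outputs \
    --max_length 128 \
    --topk 1 \
    --num_workers 8 \
    --model_cache_dir /mnt/f/hf/models
```

The environment variable is inherited by the worker processes that `accelerate` spawns, so it takes effect on both ranks.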
