
No errors on a single GPU, but running on two GPUs with Accelerate fails with the following error: #6

Open
fmdmm opened this issue Jul 19, 2022 · 0 comments

fmdmm commented Jul 19, 2022

I can see it points at "batch_outputs[0].cpu().numpy()", but why is there no problem on a single GPU? I can't figure it out.
(screenshot of the code in question)
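For context, here is a minimal sketch of the kind of call the traceback points at, assuming the evaluation loop hands a list of CUDA tensors to the post-processing step (the function and variable names below are illustrative, not the repo's exact code). The key point is that `.cpu()` forces a device-to-host copy and therefore synchronizes with the GPU, so a CUDA error raised asynchronously by an earlier kernel launch only surfaces at this line; the line itself is usually not the real culprit.

```python
import numpy as np
import torch


def postprocess_sketch(batch_outputs):
    """Illustrative stand-in for postprocess_gplinker (hypothetical, not the repo's code).

    `batch_outputs` is assumed to be a list or tuple of CUDA tensors
    produced by the model during evaluation.
    """
    results = []
    for out in batch_outputs:
        # .cpu() copies the tensor to host memory and synchronizes the CUDA
        # stream, so any asynchronous error from an earlier kernel launch is
        # reported here rather than where it actually occurred.
        results.append(out.detach().cpu().numpy())
    return results
```

Running with `CUDA_LAUNCH_BLOCKING=1`, as the traceback itself suggests, makes kernel launches synchronous, so the stack trace should then point at the actual failing operation.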

The full error is as follows:
Training: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 640/640 [08:48<00:00, 1.22it/s]##--------------------- Dev
--------------------------------------------------------------------------------
f1 = 0.20078740157511782
precision = 0.22666666666701038
recall = 0.1802120141345653

**--------------------- Dev End
Traceback (most recent call last):
File "train.py", line 365, in
main()
File "train.py", line 322, in main
dev_metric = evaluate(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "train.py", line 48, in evaluate
outputs_gathered = postprocess_gplinker(
File "/sharedFolder/GPLinker_pytorch-dev/utils/postprocess.py", line 8, in postprocess_gplinker
batch_outputs[0].cpu().numpy(),
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f736ca5fd62 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1c4d3 (0x7f736ccc24d3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f736ccc2ee2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f736ca49314 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: + 0x29e239 (0x7f73c9422239 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: + 0xadf291 (0x7f73c9c63291 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f73c9c63592 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #7: /usr/bin/python3() [0x5aee8a]
frame #8: /usr/bin/python3() [0x5ed1a0]
frame #9: /usr/bin/python3() [0x544188]
frame #10: /usr/bin/python3() [0x5441da]
frame #11: /usr/bin/python3() [0x5441da]
frame #12: /usr/bin/python3() [0x5441da]
frame #13: PyDict_SetItemString + 0x538 (0x5ce7c8 in /usr/bin/python3)
frame #14: PyImport_Cleanup + 0x79 (0x685179 in /usr/bin/python3)
frame #15: Py_FinalizeEx + 0x7f (0x68040f in /usr/bin/python3)
frame #16: Py_RunMain + 0x32d (0x6b7a1d in /usr/bin/python3)
frame #17: Py_BytesMain + 0x2d (0x6b7c8d in /usr/bin/python3)
frame #18: __libc_start_main + 0xf3 (0x7f73db4910b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb12e in /usr/bin/python3)

Traceback (most recent call last):
File "train.py", line 365, in
main()
File "train.py", line 352, in main
accelerator.wait_for_everyone()
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 496, in wait_for_everyone
wait_for_everyone()
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils.py", line 530, in wait_for_everyone
torch.distributed.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2716, in barrier
work.wait()
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[W CUDAGuardImpl.h:113] Warning: CUDA warning: the launch timed out and was terminated (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd613fdbd62 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1c4d3 (0x7fd61423e4d3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fd61423eee2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fd613fc5314 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x4a (0x7fd670de549a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x63 (0x7fd617714f33 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x9 (0x7fd6177150c9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xe6c6d6 (0x7fd67156c6d6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #8: + 0xe6c72a (0x7fd67156c72a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: + 0x2a6c10 (0x7fd6709a6c10 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: + 0x2a7e7e (0x7fd6709a7e7e in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: /usr/bin/python3() [0x5ed1a0]
frame #12: /usr/bin/python3() [0x544188]
frame #13: /usr/bin/python3() [0x5441da]
frame #14: /usr/bin/python3() [0x5441da]
frame #15: /usr/bin/python3() [0x5441da]
frame #16: /usr/bin/python3() [0x5441da]
frame #17: PyDict_SetItemString + 0x538 (0x5ce7c8 in /usr/bin/python3)
frame #18: PyImport_Cleanup + 0x79 (0x685179 in /usr/bin/python3)
frame #19: Py_FinalizeEx + 0x7f (0x68040f in /usr/bin/python3)
frame #20: Py_RunMain + 0x32d (0x6b7a1d in /usr/bin/python3)
frame #21: Py_BytesMain + 0x2d (0x6b7c8d in /usr/bin/python3)
frame #22: __libc_start_main + 0xf3 (0x7fd682a0c0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #23: _start + 0x2e (0x5fb12e in /usr/bin/python3)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 31892) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2022-07-19_03:27:00
host : 7fc5b780751c
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 31893)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 31893

Root Cause (first observed failure):
[0]:
time : 2022-07-19_03:27:00
host : 7fc5b780751c
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 31892)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 31892

Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 41, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 378, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 176, in multi_gpu_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '2', 'train.py', '--model_type', 'bert', '--pretrained_model_name_or_path', 'bert-base-chinese', '--method', 'gplinker', '--logging_steps', '200', '--num_train_epochs', '20', '--learning_rate', '3e-5', '--num_warmup_steps_or_radios', '0.1', '--gradient_accumulation_steps', '1', '--per_device_train_batch_size', '32', '--per_device_eval_batch_size', '32', '--seed', '42', '--save_steps', '10804', '--output_dir', './outputs', '--max_length', '128', '--topk', '1', '--num_workers', '8', '--model_cache_dir', '/mnt/f/hf/models']' returned non-zero exit status 1.
