
No errors on a single GPU, but running on two GPUs with Accelerate fails with the following error: #6

Open
fmdmm opened this issue Jul 19, 2022 · 0 comments

fmdmm commented Jul 19, 2022

I can see it points at "batch_outputs[0].cpu().numpy()", but why is there no problem on a single GPU? I can't figure it out.
(screenshot of the code in question)
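For context, here is a minimal sketch of the kind of call the traceback points at, assuming the evaluation loop hands a list of CUDA tensors to the post-processing step (the function and variable names below are illustrative, not the repo's exact code). The key point is that `.cpu()` forces a device-to-host copy and therefore synchronizes with the GPU, so a CUDA error raised asynchronously by an earlier kernel launch only surfaces at this line; the line itself is usually not the real culprit.

```python
import numpy as np
import torch


def postprocess_sketch(batch_outputs):
    """Illustrative stand-in for postprocess_gplinker (hypothetical, not the repo's code).

    `batch_outputs` is assumed to be a list or tuple of CUDA tensors
    produced by the model during evaluation.
    """
    results = []
    for out in batch_outputs:
        # .cpu() copies the tensor to host memory and synchronizes the CUDA
        # stream, so any asynchronous error from an earlier kernel launch is
        # reported here rather than where it actually occurred.
        results.append(out.detach().cpu().numpy())
    return results
```

Running with `CUDA_LAUNCH_BLOCKING=1`, as the traceback itself suggests, makes kernel launches synchronous, so the stack trace should then point at the actual failing operation.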

The full error is as follows:
Training: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 640/640 [08:48<00:00, 1.22it/s]##--------------------- Dev
--------------------------------------------------------------------------------
f1 = 0.20078740157511782
precision = 0.22666666666701038
recall = 0.1802120141345653

**--------------------- Dev End
Traceback (most recent call last):
File "train.py", line 365, in
main()
File "train.py", line 322, in main
dev_metric = evaluate(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "train.py", line 48, in evaluate
outputs_gathered = postprocess_gplinker(
File "/sharedFolder/GPLinker_pytorch-dev/utils/postprocess.py", line 8, in postprocess_gplinker
batch_outputs[0].cpu().numpy(),
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f736ca5fd62 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1c4d3 (0x7f736ccc24d3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7f736ccc2ee2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f736ca49314 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: + 0x29e239 (0x7f73c9422239 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: + 0xadf291 (0x7f73c9c63291 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f73c9c63592 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #7: /usr/bin/python3() [0x5aee8a]
frame #8: /usr/bin/python3() [0x5ed1a0]
frame #9: /usr/bin/python3() [0x544188]
frame #10: /usr/bin/python3() [0x5441da]
frame #11: /usr/bin/python3() [0x5441da]
frame #12: /usr/bin/python3() [0x5441da]
frame #13: PyDict_SetItemString + 0x538 (0x5ce7c8 in /usr/bin/python3)
frame #14: PyImport_Cleanup + 0x79 (0x685179 in /usr/bin/python3)
frame #15: Py_FinalizeEx + 0x7f (0x68040f in /usr/bin/python3)
frame #16: Py_RunMain + 0x32d (0x6b7a1d in /usr/bin/python3)
frame #17: Py_BytesMain + 0x2d (0x6b7c8d in /usr/bin/python3)
frame #18: __libc_start_main + 0xf3 (0x7f73db4910b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb12e in /usr/bin/python3)

Traceback (most recent call last):
File "train.py", line 365, in
main()
File "train.py", line 352, in main
accelerator.wait_for_everyone()
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 496, in wait_for_everyone
wait_for_everyone()
File "/usr/local/lib/python3.8/dist-packages/accelerate/utils.py", line 530, in wait_for_everyone
torch.distributed.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 2716, in barrier
work.wait()
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[W CUDAGuardImpl.h:113] Warning: CUDA warning: the launch timed out and was terminated (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fd613fdbd62 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: + 0x1c4d3 (0x7fd61423e4d3 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a2 (0x7fd61423eee2 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7fd613fc5314 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: std::vector<at::Tensor, std::allocator<at::Tensor> >::~vector() + 0x4a (0x7fd670de549a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x63 (0x7fd617714f33 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::WorkNCCL::~WorkNCCL() + 0x9 (0x7fd6177150c9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xe6c6d6 (0x7fd67156c6d6 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #8: + 0xe6c72a (0x7fd67156c72a in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: + 0x2a6c10 (0x7fd6709a6c10 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: + 0x2a7e7e (0x7fd6709a7e7e in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #11: /usr/bin/python3() [0x5ed1a0]
frame #12: /usr/bin/python3() [0x544188]
frame #13: /usr/bin/python3() [0x5441da]
frame #14: /usr/bin/python3() [0x5441da]
frame #15: /usr/bin/python3() [0x5441da]
frame #16: /usr/bin/python3() [0x5441da]
frame #17: PyDict_SetItemString + 0x538 (0x5ce7c8 in /usr/bin/python3)
frame #18: PyImport_Cleanup + 0x79 (0x685179 in /usr/bin/python3)
frame #19: Py_FinalizeEx + 0x7f (0x68040f in /usr/bin/python3)
frame #20: Py_RunMain + 0x32d (0x6b7a1d in /usr/bin/python3)
frame #21: Py_BytesMain + 0x2d (0x6b7c8d in /usr/bin/python3)
frame #22: __libc_start_main + 0xf3 (0x7fd682a0c0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #23: _start + 0x2e (0x5fb12e in /usr/bin/python3)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 31892) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2022-07-19_03:27:00
host : 7fc5b780751c
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 31893)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 31893

Root Cause (first observed failure):
[0]:
time : 2022-07-19_03:27:00
host : 7fc5b780751c
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 31892)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 31892

Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 41, in main
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 378, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 176, in multi_gpu_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '2', 'train.py', '--model_type', 'bert', '--pretrained_model_name_or_path', 'bert-base-chinese', '--method', 'gplinker', '--logging_steps', '200', '--num_train_epochs', '20', '--learning_rate', '3e-5', '--num_warmup_steps_or_radios', '0.1', '--gradient_accumulation_steps', '1', '--per_device_train_batch_size', '32', '--per_device_eval_batch_size', '32', '--seed', '42', '--save_steps', '10804', '--output_dir', './outputs', '--max_length', '128', '--topk', '1', '--num_workers', '8', '--model_cache_dir', '/mnt/f/hf/models']' returned non-zero exit status 1.
