Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cargo executed over SSH gets stuck when its connection is dropped #8714

Open
link2xt opened this issue Sep 18, 2020 · 11 comments
Open

cargo executed over SSH gets stuck when its connection is dropped #8714

link2xt opened this issue Sep 18, 2020 · 11 comments
Labels
C-bug Category: bug S-triage Status: This issue is waiting on initial triage.

Comments

@link2xt
Copy link

link2xt commented Sep 18, 2020

CircleCI at https://github.com/deltachat/deltachat-core-rust/ is set up to run this process remotely via SSH on another CI server:

/home/ci/.rustup/toolchains/1.45.0-x86_64-unknown-linux-gnu/bin/cargo test --all --target=x86_64-unknown-linux-gnu

with a target directory set via CARGO_TARGET_DIR per-branch.
You can see the script at https://github.com/deltachat/deltachat-core-rust/blob/master/ci_scripts/remote_tests_rust.sh for reference.

When CI job is cancelled, e.g. because someone force-pushed to the branch, SSH connection to remote CI server is closed. cargo keeps running, and holds the lock.

Edit: read the second post, the problem appears even after switching to per-build target directories, so it should be easier to understand what happens. There is only one cargo command executed in the target directory, yet it gets stuck.

Here is an example of the process: ``` $ ls -la /proc/9697/fd total 0 dr-x------ 2 ci ci 0 Sep 18 21:18 . dr-xr-xr-x 9 ci ci 0 Sep 18 20:19 .. lr-x------ 1 ci ci 64 Sep 18 21:18 0 -> 'pipe:[25969510]' l-wx------ 1 ci ci 64 Sep 18 21:18 1 -> 'pipe:[25969511]' l-wx------ 1 ci ci 64 Sep 18 21:18 2 -> 'pipe:[25969512]' lrwx------ 1 ci ci 64 Sep 18 21:18 3 -> 'socket:[25975864]' lrwx------ 1 ci ci 64 Sep 18 21:18 4 -> 'socket:[25975865]' lr-x------ 1 ci ci 64 Sep 18 21:18 5 -> 'pipe:[25975866]' l-wx------ 1 ci ci 64 Sep 18 21:18 6 -> 'pipe:[25975866]' lrwx------ 1 ci ci 64 Sep 18 21:18 7 -> /home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/debug/.cargo-lock ``` 9697 is a cargo process.

Then another CI job start another cargo process 30694 and it deadlocks too, holding two locks:

$ ls -la /proc/30694/fd
total 0
dr-x------ 2 ci ci  0 Sep 18 21:18 .
dr-xr-xr-x 9 ci ci  0 Sep 18 20:18 ..
lr-x------ 1 ci ci 64 Sep 18 21:18 0 -> 'pipe:[25969005]'
l-wx------ 1 ci ci 64 Sep 18 21:18 1 -> 'pipe:[25969006]'
l-wx------ 1 ci ci 64 Sep 18 21:18 2 -> 'pipe:[25969007]'
lrwx------ 1 ci ci 64 Sep 18 21:18 3 -> 'socket:[25971805]'
lrwx------ 1 ci ci 64 Sep 18 21:18 4 -> 'socket:[25971806]'
lr-x------ 1 ci ci 64 Sep 18 21:18 5 -> 'pipe:[25971807]'
l-wx------ 1 ci ci 64 Sep 18 21:18 6 -> 'pipe:[25971807]'
lrwx------ 1 ci ci 64 Sep 18 21:18 7 -> /home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/debug/.cargo-lock
lrwx------ 1 ci ci 64 Sep 18 21:18 8 -> /home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/x86_64-unknown-linux-gnu/debug/.cargo-lock

At this point, locks look like this:

$ fuser /home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/debug/.cargo-lock
/home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/debug/.cargo-lock:  9697 30694

$ fuser /home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/x86_64-unknown-linux-gnu/debug/.cargo-lock
/home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/x86_64-unknown-linux-gnu/debug/.cargo-lock: 30694

Then I killed 30694, to release the lock /home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/x86_64-unknown-linux-gnu/debug/.cargo-lock

After than, first cargo process 9697 opened the second lock file:

$ ls -l /proc/9697/fd
total 0
lr-x------ 1 ci ci 64 Sep 18 21:18 0 -> 'pipe:[25969510]'
l-wx------ 1 ci ci 64 Sep 18 21:18 1 -> 'pipe:[25969511]'
l-wx------ 1 ci ci 64 Sep 18 21:18 2 -> 'pipe:[25969512]'
lrwx------ 1 ci ci 64 Sep 18 21:18 3 -> 'socket:[25975864]'
lrwx------ 1 ci ci 64 Sep 18 21:18 4 -> 'socket:[25975865]'
lr-x------ 1 ci ci 64 Sep 18 21:18 5 -> 'pipe:[25975866]'
l-wx------ 1 ci ci 64 Sep 18 21:18 6 -> 'pipe:[25975866]'
lrwx------ 1 ci ci 64 Sep 18 21:18 7 -> /home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/debug/.cargo-lock
lrwx------ 1 ci ci 64 Sep 18 22:09 8 -> /home/ci/ci_builds/deltachat-core-rust/temp-store/remote_tests_rust/target-rust/x86_64-unknown-linux-gnu/debug/.cargo-lock

I connected to 9697 and collected backtraces:

(gdb) bt
#0  0x00007fa754991ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7ffdc6f0ed40, expected=0, futex_word=0x55822666e6a8) at ../sysdeps/unix/sysv/linux/futex-internal.h:142
#1  __pthread_cond_wait_common (abstime=0x7ffdc6f0ee18, mutex=0x55822737ea50, cond=0x55822666e680) at pthread_cond_wait.c:533
#2  __pthread_cond_timedwait (cond=0x55822666e680, mutex=0x55822737ea50, abstime=0x7ffdc6f0ee18) at pthread_cond_wait.c:667
#3  0x0000558224ef7168 in std::sys::unix::condvar::Condvar::wait_timeout () at src/libstd/sys/unix/condvar.rs:100
#4  0x0000558224791f81 in std::sync::condvar::Condvar::wait_timeout_while ()
#5  0x0000558224a0dda1 in cargo::util::queue::Queue<T>::pop ()
#6  0x00005582247c439e in cargo::core::compiler::job_queue::DrainState::drain_the_queue ()
#7  0x0000558224690a5f in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once ()
#8  0x000055822469395d in crossbeam_utils::thread::scope ()
#9  0x00005582247c232f in cargo::core::compiler::job_queue::JobQueue::execute ()
#10 0x00005582246f4fc5 in cargo::core::compiler::context::Context::compile ()
#11 0x00005582246feb18 in cargo::ops::cargo_compile::compile_ws ()
#12 0x00005582246fe85e in cargo::ops::cargo_compile::compile ()
#13 0x0000558224998119 in cargo::ops::cargo_test::run_tests ()
#14 0x000055822460a7cf in cargo::commands::test::exec ()
#15 0x000055822461e3e5 in cargo::cli::main ()
#16 0x0000558224629fe5 in cargo::main ()
#17 0x0000558224603ea3 in std::rt::lang_start::{{closure}} ()
#18 0x0000558224ef5b08 in std::rt::lang_start_internal::{{closure}} () at src/libstd/rt.rs:52
#19 std::panicking::try::do_call () at src/libstd/panicking.rs:297
#20 std::panicking::try () at src/libstd/panicking.rs:274
#21 std::panic::catch_unwind () at src/libstd/panic.rs:394
#22 std::rt::lang_start_internal () at src/libstd/rt.rs:51
#23 0x000055822462c4c2 in main ()

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7fa7551c1b00 (LWP 9697) "cargo" 0x00007fa754991ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7ffdc6f0ed40, expected=0, futex_word=0x55822666e6a8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:142
  2    Thread 0x7fa753e71700 (LWP 11622) "cargo" 0x00007fa75448fcf9 in __GI___poll (fds=0x7fa753e70970, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fa753e71700 (LWP 11622))]
#0  0x00007fa75448fcf9 in __GI___poll (fds=0x7fa753e70970, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) bt
#0  0x00007fa75448fcf9 in __GI___poll (fds=0x7fa753e70970, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x0000558224ec65d6 in jobserver::imp::Client::acquire_allow_interrupts ()
#2  0x0000558224ec810f in jobserver::HelperState::for_each_request ()
#3  0x0000558224ec830c in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#4  0x0000558224ec90b8 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#5  0x0000558224eff71a in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#6  <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#7  std::sys::unix::thread::Thread::new::thread_start () at src/libstd/sys/unix/thread.rs:87
#8  0x00007fa75498b6db in start_thread (arg=0x7fa753e71700) at pthread_create.c:463
#9  0x00007fa75449ca3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

So it looks like two concurrent cargo processes trying to get two locks concurrently locked each other. But after killing the second process, first process remains locked. I have collected backtraces, maybe it is possible to infer from them the state of this process and why it can't proceed.

I have found a similar closed issue #7200

@link2xt link2xt added the C-bug Category: bug label Sep 18, 2020
@link2xt
Copy link
Author

link2xt commented Sep 19, 2020

Even with a separate build directory, one cargo process managed to get stuck when its SSH connection was dropped:

pstree output:

systemd,1
  |-bash,5010
  |   `-bash,5014 ci_scripts/run-rust-test.sh
  |       `-cargo,31534 test --all --release --target=x86_64-unknown-linux-gnu
  |           `-{cargo},31535
$ gdb -q -p 31535
Attaching to process 31535
Reading symbols from /home/ci/.rustup/toolchains/1.45.0-x86_64-unknown-linux-gnu/bin/cargo...done.
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libdl-2.27.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/librt-2.27.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...Reading symbols from /usr/lib/debug/.build-id/bc/3c06107774266c5f7db3f1f380a3da68af90fa.debug...done.
done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...(no debugging symbols found)...done.
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libc-2.27.so...done.
done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.27.so...done.
done.
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libm-2.27.so...done.
done.
0x00007fcf922df9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5560f2feaf78) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88      ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7fcf917bf700 (LWP 31535) "cargo" 0x00007fcf922df9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5560f2feaf78) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
(gdb) bt
#0  0x00007fcf922df9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5560f2feaf78) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5560f358e950, cond=0x5560f2feaf50) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5560f2feaf50, mutex=0x5560f358e950) at pthread_cond_wait.c:655
#3  0x00005560f0b351ad in jobserver::HelperState::for_each_request ()
#4  0x00005560f0b3530c in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#5  0x00005560f0b360b8 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#6  0x00005560f0b6c71a in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#7  <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#8  std::sys::unix::thread::Thread::new::thread_start () at src/libstd/sys/unix/thread.rs:87
#9  0x00007fcf922d96db in start_thread (arg=0x7fcf917bf700) at pthread_create.c:463
#10 0x00007fcf91deaa3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
$ sudo ls -la /proc/31535/fd/
total 0
dr-x------ 2 ci ci  0 Sep 19 19:26 .
dr-xr-xr-x 9 ci ci  0 Sep 19 19:26 ..
lr-x------ 1 ci ci 64 Sep 19 19:34 0 -> 'pipe:[28191488]'
l-wx------ 1 ci ci 64 Sep 19 19:34 1 -> 'pipe:[28191489]'
l-wx------ 1 ci ci 64 Sep 19 19:34 2 -> 'pipe:[28191490]'
lrwx------ 1 ci ci 64 Sep 19 19:34 3 -> 'socket:[28206371]'
lrwx------ 1 ci ci 64 Sep 19 19:34 4 -> 'socket:[28206372]'
lr-x------ 1 ci ci 64 Sep 19 19:34 5 -> 'pipe:[28209269]'
l-wx------ 1 ci ci 64 Sep 19 19:34 6 -> 'pipe:[28209269]'
lrwx------ 1 ci ci 64 Sep 19 19:34 7 -> /home/ci/ci_builds/deltachat-core-rust/fix-contacts-changed-docs/remote_tests_rust/27077/target/release/.cargo-lock
lrwx------ 1 ci ci 64 Sep 19 19:34 8 -> /home/ci/ci_builds/deltachat-core-rust/fix-contacts-changed-docs/remote_tests_rust/27077/target/x86_64-unknown-linux-gnu/release/.cargo-lock

Command line:

/home/ci/.rustup/toolchains/1.45.0-x86_64-unknown-linux-gnu/bin/cargo test --all --release --target=x86_64-unknown-linux-gnu

From FD 5, you can read this once:

$ cat /proc/31535/fd/5
|||||||||||

Other FDs don't emit anyhing.

Parent process backtrace:

$ gdb -q -p 31534
Attaching to process 31534
[New LWP 31535]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fcf922dfed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7fff436d3ba0, expected=0, futex_word=0x5560f2e90348) at ../sysdeps/unix/sysv/linux/futex-internal.h:142
142     ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7fcf92b0fb00 (LWP 31534) "cargo" 0x00007fcf922dfed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7fff436d3ba0, expected=0, futex_word=0x5560f2e90348)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:142
  2    Thread 0x7fcf917bf700 (LWP 31535) "cargo" 0x00007fcf922df9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5560f2feaf78) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
(gdb) bt
#0  0x00007fcf922dfed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7fff436d3ba0, expected=0, futex_word=0x5560f2e90348) at ../sysdeps/unix/sysv/linux/futex-internal.h:142
#1  __pthread_cond_wait_common (abstime=0x7fff436d3c78, mutex=0x5560f34ed100, cond=0x5560f2e90320) at pthread_cond_wait.c:533
#2  __pthread_cond_timedwait (cond=0x5560f2e90320, mutex=0x5560f34ed100, abstime=0x7fff436d3c78) at pthread_cond_wait.c:667
#3  0x00005560f0b64168 in std::sys::unix::condvar::Condvar::wait_timeout () at src/libstd/sys/unix/condvar.rs:100
#4  0x00005560f03fef81 in std::sync::condvar::Condvar::wait_timeout_while ()
#5  0x00005560f067ada1 in cargo::util::queue::Queue<T>::pop ()
#6  0x00005560f043139e in cargo::core::compiler::job_queue::DrainState::drain_the_queue ()
#7  0x00005560f02fda5f in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once ()
#8  0x00005560f030095d in crossbeam_utils::thread::scope ()
#9  0x00005560f042f32f in cargo::core::compiler::job_queue::JobQueue::execute ()
#10 0x00005560f0361fc5 in cargo::core::compiler::context::Context::compile ()
#11 0x00005560f036bb18 in cargo::ops::cargo_compile::compile_ws ()
#12 0x00005560f036b85e in cargo::ops::cargo_compile::compile ()
#13 0x00005560f0605119 in cargo::ops::cargo_test::run_tests ()
#14 0x00005560f02777cf in cargo::commands::test::exec ()
#15 0x00005560f028b3e5 in cargo::cli::main ()
#16 0x00005560f0296fe5 in cargo::main ()
#17 0x00005560f0270ea3 in std::rt::lang_start::{{closure}} ()
#18 0x00005560f0b62b08 in std::rt::lang_start_internal::{{closure}} () at src/libstd/rt.rs:52
#19 std::panicking::try::do_call () at src/libstd/panicking.rs:297
#20 std::panicking::try () at src/libstd/panicking.rs:274
#21 std::panic::catch_unwind () at src/libstd/panic.rs:394
#22 std::rt::lang_start_internal () at src/libstd/rt.rs:51
#23 0x00005560f02994c2 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x7fcf917bf700 (LWP 31535))]
#0  0x00007fcf922df9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5560f2feaf78) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88      in ../sysdeps/unix/sysv/linux/futex-internal.h
(gdb) bt
#0  0x00007fcf922df9f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5560f2feaf78) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5560f358e950, cond=0x5560f2feaf50) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5560f2feaf50, mutex=0x5560f358e950) at pthread_cond_wait.c:655
#3  0x00005560f0b351ad in jobserver::HelperState::for_each_request ()
#4  0x00005560f0b3530c in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#5  0x00005560f0b360b8 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#6  0x00005560f0b6c71a in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#7  <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#8  std::sys::unix::thread::Thread::new::thread_start () at src/libstd/sys/unix/thread.rs:87
#9  0x00007fcf922d96db in start_thread (arg=0x7fcf917bf700) at pthread_create.c:463
#10 0x00007fcf91deaa3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Parent process has same command line, here is its fds, same as the child process:

$ ls -la /proc/31534/fd
total 0
dr-x------ 2 ci ci  0 Sep 19 19:42 .
dr-xr-xr-x 9 ci ci  0 Sep 19 18:51 ..
lr-x------ 1 ci ci 64 Sep 19 19:42 0 -> 'pipe:[28191488]'
l-wx------ 1 ci ci 64 Sep 19 19:42 1 -> 'pipe:[28191489]'
l-wx------ 1 ci ci 64 Sep 19 19:42 2 -> 'pipe:[28191490]'
lrwx------ 1 ci ci 64 Sep 19 19:42 3 -> 'socket:[28206371]'
lrwx------ 1 ci ci 64 Sep 19 19:42 4 -> 'socket:[28206372]'
lr-x------ 1 ci ci 64 Sep 19 19:42 5 -> 'pipe:[28209269]'
l-wx------ 1 ci ci 64 Sep 19 19:42 6 -> 'pipe:[28209269]'
lrwx------ 1 ci ci 64 Sep 19 19:42 7 -> /home/ci/ci_builds/deltachat-core-rust/fix-contacts-changed-docs/remote_tests_rust/27077/target/release/.cargo-lock
lrwx------ 1 ci ci 64 Sep 19 19:42 8 -> /home/ci/ci_builds/deltachat-core-rust/fix-contacts-changed-docs/remote_tests_rust/27077/target/x86_64-unknown-linux-gnu/release/.cargo-lock

@link2xt link2xt changed the title Two cargo processes running in the same directory deadlock cargo executed over SSH gets stuck when its connection is dropped Sep 19, 2020
@alexcrichton
Copy link
Member

Thanks for the report! I personally primarily work by ssh'ing into a computer somewhere else, and for years my compilations will randomly get stuck and I've got to ctrl-c to make any progress. I've always written it off as weirdness, but I wonder if this is related!

This issue was first about concurrent Cargo processes, but that seems like it's no longer the case? You're able to get Cargo stuck by just killing an ssh connection? So are you effectively starting a build, killing the SSH connection (e.g. via kill?), and then Cargo hangs and doesn't make any progress?

@link2xt
Copy link
Author

link2xt commented Sep 29, 2020

So are you effectively starting a build, killing the SSH connection (e.g. via kill?), and then Cargo hangs and doesn't make any progress?

SSH connection is started from some CircleCI worker. Relevant config line is at https://github.com/deltachat/deltachat-core-rust/blob/7ddf3ba754068dc80f67c2f36433fb1c825409be/.circleci/config.yml#L145

SSH connection from withing the bash script started at CircleCI: https://github.com/deltachat/deltachat-core-rust/blob/7ddf3ba754068dc80f67c2f36433fb1c825409be/ci_scripts/remote_tests_rust.sh#L19

When CircleCI cancels the job, it probably kills bash and ssh on its side, I don't know how exactly it happens. Maybe it's not even kill, but some virtual network interface is removed from docker bridge or something like this. But cargo is probably losing its stdin, stdout and stderr, handled by sshd.

@alexcrichton
Copy link
Member

Hm so for the last stack trace you posted has two stuck threads, one is the "jobserver helper thread" which just infinitely waits for tokens (thread 2) so that's expected to be blocked. The first thread, however, is the main event loop of Cargo as it waits for messages from subprocesses to handle. The lack of remaining threads, however, means that Cargo's not actually waiting on anything since everything exited.

The only thing I can guess is that Cargo is handling a signal from the shell that nothing else is. My best guess for this is SIGHUP since poking around some other sites seems to indicate that's what's sent when the remote end disconnects from an SSH session. I don't think, though, that Cargo is handling that signal. I don't know why Cargo wouldn't be dying from that signal.

Would it be possible to set up some more debugging on your end? Ideally you'd set CARGO_LOG=trace for the process which eventually hangs, and then when it's hanging and not returning can you gist the output of the logs?

@Arnavion
Copy link

The drain_the_queue stack is easily reproducible for me by running cargo build 2>&1 | head -n 1 :

$ clear; rm -rf target/ && cargo build 2>&1 | head -n 1

   Compiling libc v0.2.86
# hangs here

$ pidof cargo

130492

$ pidof rustc

# only one cargo process and no rustc processes

$ rust-gdb --pid "$(pidof cargo)"

(gdb) thread apply all bt

Thread 2 (LWP 105180 "cargo"):
#0  0x00007fc2ac530aaa in __futex_abstimed_wait_common64 () from /lib64/libpthread.so.0
#1  0x00007fc2ac52a270 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2  0x000055a0c3a62e71 in jobserver::HelperState::for_each_request ()
#3  0x000055a0c3a6318c in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#4  0x000055a0c3a63c4f in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#5  0x000055a0c3aaa41a in alloc::boxed::{{impl}}::call_once<(),FnOnce<()>,alloc::alloc::Global> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/alloc/src/boxed.rs:1328
#6  alloc::boxed::{{impl}}::call_once<(),alloc::boxed::Box<FnOnce<()>, alloc::alloc::Global>,alloc::alloc::Global> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/alloc/src/boxed.rs:1328
#7  std::sys::unix::thread::{{impl}}::new::thread_start () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/sys/unix/thread.rs:71
#8  0x00007fc2ac524299 in start_thread () from /lib64/libpthread.so.0
#9  0x00007fc2ac300af3 in clone () from /lib64/libc.so.6

Thread 1 (LWP 105166 "cargo"):
#0  0x00007fc2ac530aaa in __futex_abstimed_wait_common64 () from /lib64/libpthread.so.0
#1  0x00007fc2ac52a584 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2  0x000055a0c3aa33c8 in std::sys::unix::condvar::Condvar::wait_timeout () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/sys/unix/condvar.rs:98
#3  0x000055a0c338ea77 in std::sync::condvar::Condvar::wait_timeout_while ()
#4  0x000055a0c321f821 in cargo::util::queue::Queue<T>::pop ()
#5  0x000055a0c33d5903 in cargo::core::compiler::job_queue::DrainState::drain_the_queue ()
#6  0x000055a0c338ecd3 in std::panic::catch_unwind ()
#7  0x000055a0c32fdafe in crossbeam_utils::thread::scope ()
#8  0x000055a0c33d3856 in cargo::core::compiler::job_queue::JobQueue::execute ()
#9  0x000055a0c3285a15 in cargo::core::compiler::context::Context::compile ()
#10 0x000055a0c350dd01 in cargo::ops::cargo_compile::compile_ws ()
#11 0x000055a0c350da5e in cargo::ops::cargo_compile::compile ()
#12 0x000055a0c317f67d in cargo::commands::build::exec ()
#13 0x000055a0c31219f9 in cargo::cli::main ()
#14 0x000055a0c3189438 in cargo::main ()
#15 0x000055a0c3178333 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#16 0x000055a0c3178359 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::ha5e70e382477cfc4 ()
#17 0x000055a0c3aa1177 in core::ops::function::impls::{{impl}}::call_once<(),Fn<()>> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/core/src/ops/function.rs:259
#18 std::panicking::try::do_call<&Fn<()>,i32> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/panicking.rs:379
#19 std::panicking::try<i32,&Fn<()>> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/panicking.rs:343
#20 std::panic::catch_unwind<&Fn<()>,i32> () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/panic.rs:396
#21 std::rt::lang_start_internal () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/rt.rs:51
#22 0x000055a0c318bc82 in main ()

This is with:

cargo 1.50.0 (f04e7fab7 2021-02-04)
release: 1.50.0
commit-hash: f04e7fab73128592a4063983c302da788bdfaba5
commit-date: 2021-02-04

... on Linux.

Before I attached the debugger I thought it might be triggered by cargo unable to write to stdout. My actual use case is to run watch -- cargo build as part of my development workflow, and it has the same problem if there's more than one screenful of output from cargo build. Both head and watch stop reading cargo build's stdout (head by closing its stdin, watch by closing the reader end of the pipefd it connected to the child process's stdout) and just wait for the process to exit.

But given none of the threads are blocked on anything related to writing to stdout, I'm not sure that's the reason.

@Arnavion
Copy link

I strace'd the thing along with CARGO_LOG=trace. The trace logs would also go to stderr and be ignored, but at least the strace would capture them.

rm -rf target/ && CARGO_LOG=trace strace -fo ~/foo/cargo.log -s 9999 -- ~/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/cargo build 2>&1 | head -n1

Near the end, this gives:

37199 write(2, "[2021-02-23T23:34:56Z INFO  cargo::core::compiler::job_queue] start 0: Unit { pkg: Package { id: PackageId { name: \"autocfg\", version: \"1.0.1\", source: \"registry `https://github.com/rust-lang/crates.io-index`\" }, ..: \"..\" }, target: TargetInner { ..: lib_target(\"autocfg\", [\"lib\"], \"/home/arnavion/.cargo/registry/src/github.com-1ecc6299db9ec823/autocfg-1.0.1/src/lib.rs\", Edition2015) }, profile: Profile { incremental: false, ..: default_dev() }, kind: Host, mode: Build, features: [], is_std: false, dep_hash: 2202906307356721367 }\n", 535 <unfinished ...>

37199 <... write resumed>)              = -1 EPIPE (Broken pipe)

37199 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=37199, si_uid=1000} ---

37199 write(2, " ", 1 <unfinished ...>

37199 <... write resumed>)              = -1 EPIPE (Broken pipe)

37199 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=37199, si_uid=1000} ---

37199 getcwd("/home/arnavion/src/rust-test", 512) = 29

37199 write(2, "[2021-02-23T23:34:56Z DEBUG cargo] display_error; err=Broken pipe (os error 32)\n    \n    Stack backtrace:\n       0: cargo::core::shell::ShellOut::message_stderr\n       1: cargo::core::shell::Shell::status\n       2: cargo::core::compiler::job_queue::DrainState::drain_the_queue\n       3: std::panic::catch_unwind\n       4: crossbeam_utils::thread::scope\n       5: cargo::core::compiler::job_queue::JobQueue::execute\n       6: cargo::core::compiler::context::Context::compile\n       7: cargo::ops::cargo_compile::compile_ws\n       8: cargo::ops::cargo_compile::compile\n       9: cargo::commands::build::exec\n      10: cargo::cli::main\n      11: cargo::main\n      12: std::sys_common::backtrace::__rust_begin_short_backtrace\n      13: std::rt::lang_start::{{closure}}\n      14: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once\n                 at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/core/src/ops/function.rs:259:13\n          std::panicking::try::do_call\n                 at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/std/src/panicking.rs:379:40\n          std::panicking::try\n                 at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/std/src/panicking.rs:343:19\n          std::panic::catch_unwind\n                 at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/std/src/panic.rs:396:14\n          std::rt::lang_start_internal\n                 at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/std/src/rt.rs:51:25\n      15: main\n      16: __libc_start_main\n      17: <unknown>\n", 1577) = -1 EPIPE (Broken pipe)

37199 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=37199, si_uid=1000} ---

37199 write(2, "error", 5)              = -1 EPIPE (Broken pipe)

37199 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=37199, si_uid=1000} ---

37199 write(2, "warning", 7)            = -1 EPIPE (Broken pipe)

37199 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=37199, si_uid=1000} ---

37199 write(2, "[2021-02-23T23:34:56Z INFO  cargo::core::compiler::job_queue] tokens in use: 0, rustc_tokens: [], waiting_rustcs: [] (events this tick: 1)\n", 139) = -1 EPIPE (Broken pipe)

37199 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=37199, si_uid=1000} ---

37199 write(2, "[2021-02-23T23:34:56Z INFO  cargo::core::compiler::job_queue] tokens in use: 1, rustc_tokens: [], waiting_rustcs: [] (events this tick: 0)\n", 139) = -1 EPIPE (Broken pipe)

37199 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=37199, si_uid=1000} ---

37199 write(6, "|", 1)                  = 1

37199 futex(0x5646e7290358, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=223467, tv_nsec=337539739}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)

37199 futex(0x5646e71856c0, FUTEX_WAKE_PRIVATE, 1) = 0

The futex syscalls then repeat forever.

So the traces it wrote there are:

[2021-02-23T23:34:56Z INFO  cargo::core::compiler::job_queue] start 0: Unit { pkg: Package { id: PackageId { name: "autocfg", version: "1.0.1", source: "registry `https://github.com/rust-lang/crates.io-index`" }, ..: ".." }, target: TargetInner { ..: lib_target("autocfg", ["lib"], "/home/arnavion/.cargo/registry/src/github.com-1ecc6299db9ec823/autocfg-1.0.1/src/lib.rs", Edition2015) }, profile: Profile { incremental: false, ..: default_dev() }, kind: Host, mode: Build, features: [], is_std: false, dep_hash: 2202906307356721367 }

[2021-02-23T23:34:56Z DEBUG cargo] display_error; err=Broken pipe (os error 32)
    
    Stack backtrace:
       0: cargo::core::shell::ShellOut::message_stderr
       1: cargo::core::shell::Shell::status
       2: cargo::core::compiler::job_queue::DrainState::drain_the_queue
       3: std::panic::catch_unwind
       4: crossbeam_utils::thread::scope
       5: cargo::core::compiler::job_queue::JobQueue::execute
       6: cargo::core::compiler::context::Context::compile
       7: cargo::ops::cargo_compile::compile_ws
       8: cargo::ops::cargo_compile::compile
       9: cargo::commands::build::exec
      10: cargo::cli::main
      11: cargo::main
      12: std::sys_common::backtrace::__rust_begin_short_backtrace
      13: std::rt::lang_start::{{closure}}
      14: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
                 at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/core/src/ops/function.rs:259:13
          std::panicking::try::do_call
                 at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/std/src/panicking.rs:379:40
          std::panicking::try
                 at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/std/src/panicking.rs:343:19
          std::panic::catch_unwind
                 at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/std/src/panic.rs:396:14
          std::rt::lang_start_internal
                 at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/std/src/rt.rs:51:25
      15: main
      16: __libc_start_main
      17: <unknown>

error

warning

[2021-02-23T23:34:56Z INFO  cargo::core::compiler::job_queue] tokens in use: 0, rustc_tokens: [], waiting_rustcs: [] (events this tick: 1)

[2021-02-23T23:34:56Z INFO  cargo::core::compiler::job_queue] tokens in use: 1, rustc_tokens: [], waiting_rustcs: [] (events this tick: 0)

@ehuss
Copy link
Contributor

ehuss commented Feb 23, 2021

@Arnavion Thanks for the info! I see what is wrong, and I'll post a fix shortly. There is protection for handling errors in this case, and I missed one ? that shouldn't be used.

@ehuss
Copy link
Contributor

ehuss commented Feb 24, 2021

Posted #9201 to resolve the hang error. I'm skeptical it addresses the issue originally raised in this issue, but it is hard to say without a specific reproduction.

bors added a commit that referenced this issue Feb 24, 2021
Fix hang on broken stderr.

If stderr is closed and cargo tries to print a status message such as "Compiling", cargo can get into a hung state. This is because the `DrainState::run` function would insert the job into the `active` queue, and then return an error without starting the job. Since the job isn't started, there is nothing to remove it from the `active` queue, and thus cargo hangs forever waiting for it to finish.

The solution is to move the call to `note_working_on` earlier before the job is placed into the active queue.

This addresses the issue noted in #8714 (comment).
@link2xt
Copy link
Author

link2xt commented Sep 1, 2023

I run cargo in Concourse CI now and sometimes it gets stuck without using CPU if nobody reads the logs. If I watch the log with fly -t <machine-name> watch -u <url> then it does not happen. Running the latest stable release, cargo 1.72.0. So the issue is probably not fixed. Have not investigated the backtraces because it all run in containers, not impossible to do this but not so easy. Starting to read the log once it is stuck does not help, have to kill the job and restart from scratch.

But overall maybe it makes sense to consistently go through all read/write operations that call read/write syscalls and make sure all of them have some sort of timeout. E.g. if cargo gets stuck trying to write to stdout for 120 seconds, it better crashes with an error. Same for these pipes with characters between processes, if the pipe is full for too long, better crash and go fix the size of the buffer.

@Eh2406
Copy link
Contributor

Eh2406 commented Sep 5, 2023

This is an old issue, if it is back to happening please open a new issue.

@link2xt
Copy link
Author

link2xt commented Sep 5, 2023

It still looks like the same issue to me: if nobody reads cargo output for a long time because the pipe where it writes to gets stuck for whatever reason (connection closed, running in CI etc.), cargo gets stuck without any CPU usage and does not get unstuck even when someone starts reading from the pipe.

This issue is not closed, so I don't think there is a need for a new issue like this one.

@epage epage added the S-triage Status: This issue is waiting on initial triage. label Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: bug S-triage Status: This issue is waiting on initial triage.
Projects
None yet
Development

No branches or pull requests

6 participants