The problem is highly random, and I have seen it on only one AArch64 server.
I reproduced the problem and debugged it with GDB; here are some observations (full log attached as gdb.txt):
##################### Backtrace of cargo process: #####################
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
__libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
26 ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) info threads
Id Target Id Frame
* 1 Thread 0xffffa2264ac0 (LWP 665544) "cargo" __libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
2 Thread 0xffff9dea8030 (LWP 666949) "cargo" futex_wait_cancelable (private=0, expected=0, futex_word=0xaaab143ff6dc) at ../sysdeps/nptl/futex-internal.h:183
3 Thread 0xfffeb5dce030 (LWP 671236) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb5dccfb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
4 Thread 0xfffeb0da6030 (LWP 673261) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb0da4fb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
5 Thread 0xfffeb05a2030 (LWP 673755) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb05a0fb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
6 Thread 0xfffeb03a1030 (LWP 673756) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb039ffb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
7 Thread 0xfffeb01a0030 (LWP 673852) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeb019efb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
8 Thread 0xfffeaff9f030 (LWP 673932) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeaff9dfb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
9 Thread 0xfffeafd9e030 (LWP 673992) "cargo" 0x0000ffffa2086128 in __GI___poll (fds=0xfffeafd9cfb0, nfds=2, timeout=<optimized out>)
at ../sysdeps/unix/sysv/linux/poll.c:41
(gdb) bt
#0 __libc_write (nbytes=1, buf=0xfffffcee5a7c, fd=<optimized out>) at ../sysdeps/unix/sysv/linux/write.c:26
#1 __libc_write (fd=<optimized out>, buf=0xfffffcee5a7c, nbytes=1) at ../sysdeps/unix/sysv/linux/write.c:24
#2 0x0000aaaad282df14 in std::sys::unix::fd::FileDesc::write () at library/std/src/sys/unix/fd.rs:146
#3 std::sys::unix::fs::File::write () at library/std/src/sys/unix/fs.rs:845
#4 <&std::fs::File as std::io::Write>::write () at library/std/src/fs.rs:680
#5 0x0000aaaad2813b34 in jobserver::imp::Client::release () at library/std/src/panicking.rs:541
#6 0x0000aaaad2814dd0 in <jobserver::Acquired as core::ops::drop::Drop>::drop () at library/std/src/panicking.rs:541
#7 0x0000aaaad21b9da0 in cargo::core::compiler::job_queue::DrainState::drain_the_queue () at library/std/src/panicking.rs:541
#8 0x0000aaaad21f9460 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once () at library/std/src/panicking.rs:541
#9 0x0000aaaad21fe4cc in crossbeam_utils::thread::scope () at library/std/src/panicking.rs:541
#10 0x0000aaaad21b8c9c in cargo::core::compiler::job_queue::JobQueue::execute () at library/std/src/panicking.rs:541
#11 0x0000aaaad2411b3c in cargo::core::compiler::context::Context::compile () at library/std/src/panicking.rs:541
#12 0x0000aaaad23692e8 in cargo::ops::cargo_compile::compile_ws () at library/std/src/panicking.rs:541
#13 0x0000aaaad23690f4 in cargo::ops::cargo_compile::compile () at library/std/src/panicking.rs:541
#14 0x0000aaaad23f44a0 in cargo::ops::cargo_test::run_tests () at library/std/src/panicking.rs:541
#15 0x0000aaaad208a1cc in cargo::commands::test::exec () at library/std/src/panicking.rs:541
#16 0x0000aaaad2081abc in cargo::cli::main () at library/std/src/panicking.rs:541
#17 0x0000aaaad208ab6c in cargo::main () at library/std/src/panicking.rs:541
#18 0x0000aaaad20d26f4 in std::sys_common::backtrace::__rust_begin_short_backtrace () at library/std/src/panicking.rs:541
#19 0x0000aaaad208c860 in std::rt::lang_start::{{closure}} () at library/std/src/panicking.rs:541
#20 0x0000aaaad283ca7c in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once ()
at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ops/function.rs:259
#21 std::panicking::try::do_call () at library/std/src/panicking.rs:401
#22 std::panicking::try () at library/std/src/panicking.rs:365
#23 std::panic::catch_unwind () at library/std/src/panic.rs:434
#24 std::rt::lang_start_internal () at library/std/src/rt.rs:34
#25 0x0000aaaad208c608 in main () at library/std/src/panicking.rs:541
(gdb) l
21 in ../sysdeps/unix/sysv/linux/write.c
(gdb)
############# Some rustc processes were also seen: ##########################
################## I checked the backtrace of some of them, all like: ###################
(gdb) attach 673941
Attaching to process 673941
[New LWP 673988]
[New LWP 674062]
[New LWP 674063]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
__pthread_clockjoin_ex (threadid=281473285418112, thread_return=0x0, clockid=0, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
145 pthread_join_common.c: No such file or directory.
(gdb) bt
#0 __pthread_clockjoin_ex (threadid=281473285418112, thread_return=0x0, clockid=0, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1 0x0000ffff9b63832c in std::sys::unix::thread::Thread::join () at library/std/src/sys/unix/thread.rs:177
#2 0x0000ffff9c2284bc in std::thread::JoinHandle<T>::join ()
from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#3 0x0000ffff9c241688 in rustc_interface::util::setup_callbacks_and_run_in_thread_pool_with_globals ()
from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#4 0x0000ffff9c295d3c in rustc_driver::RunCompiler::run ()
from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#5 0x0000ffff9c21fce8 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once ()
from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#6 0x0000ffff9c29bb0c in rustc_driver::main () from target:/usr/local/rust/toolchains/1.54.0-aarch64-unknown-linux-gnu/bin/../lib/librustc_driver-61d3bce3fadaadc1.so
#7 0x0000aaaaba0e0c30 in rustc_main::main ()
#8 0x0000aaaaba0e0be4 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#9 0x0000aaaaba0e0bfc in std::rt::lang_start::{{closure}} ()
#10 0x0000ffff9b62da04 in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once ()
at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/ops/function.rs:259
#11 std::panicking::try::do_call () at library/std/src/panicking.rs:401
#12 std::panicking::try () at library/std/src/panicking.rs:365
#13 std::panic::catch_unwind () at library/std/src/panic.rs:434
#14 std::rt::lang_start_internal () at library/std/src/rt.rs:34
#15 0x0000aaaaba0e0c5c in main ()
(gdb)
This looks like it's blocked in `write`, which suggests it's running into #9739. That issue can be summarized as: due to kernel behavior, Cargo (and `make`, and other jobserver clients) can deadlock under high load when there are a lot of in-memory pipes in the system. This should be fixed in upstream Linux itself at this point, but the fix will likely take a long time to propagate. Another fix is to increase the user pipe limits on your build machine, if you can.
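The failure mode can be sketched with a toy program (a hedged, Linux-only sketch; the hardcoded fcntl constants and page-size assumptions are mine, not from the issue). It shrinks a pipe to a single page, which is the capacity the kernel falls back to for new pipes once a user's aggregate pipe memory exceeds `fs.pipe-user-pages-soft`, then fills it. At that point, a blocking one-byte `write` — exactly what the jobserver token release in frame #5 of the backtrace does (`nbytes=1`) — would hang until some other process reads a token:

```python
import fcntl
import os

# Linux-only fcntl constants (only exposed by the fcntl module in
# newer Python versions, so hardcoded here).
F_SETPIPE_SZ = 1031
F_GETPIPE_SZ = 1032

r, w = os.pipe()

# Shrink the pipe toward one page, imitating the reduced capacity the
# kernel hands out when fs.pipe-user-pages-soft has been exceeded.
# The kernel rounds the request up to at least one page.
fcntl.fcntl(w, F_SETPIPE_SZ, 4096)
capacity = fcntl.fcntl(w, F_GETPIPE_SZ)

# Fill the pipe in non-blocking mode so we can see exactly when it is full.
os.set_blocking(w, False)
written = 0
while True:
    try:
        written += os.write(w, b"|" * 4096)
    except BlockingIOError:
        break

print(f"pipe capacity: {capacity} bytes, bytes written until full: {written}")
# A *blocking* write(fd, buf, 1) here -- the jobserver release -- would
# now hang until a reader drains the pipe, which is the observed deadlock.
os.close(r)
os.close(w)
```

With a full pipe and every `rustc` child waiting on Cargo, nobody ever reads, so the one-byte release never completes.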
I tried the Rust source code you shared in #9739; the buffer size it reported was "init buffer: 65536", and the file /proc/sys/fs/pipe-user-pages-soft contains "16384". So the pipe size does not seem to be very small.
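Note that `pipe-user-pages-soft` is an aggregate, per-user limit expressed in pages, not a per-pipe size, so a freshly created pipe reporting a 64 KiB buffer only shows the user was under the limit at that moment; under heavy parallel-build load the aggregate can still be exceeded, and pipes created after that point drop to a single page. A small sketch to put the limit in bytes (the fallback value 16384 is just the number reported above, used when the proc file is absent):

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
path = "/proc/sys/fs/pipe-user-pages-soft"

# The limit counts pages across ALL of one unprivileged user's pipes;
# once exceeded, newly created pipes get a single page of capacity.
if os.path.exists(path):
    with open(path) as f:
        soft_pages = int(f.read())
else:
    soft_pages = 16384  # value reported in this issue

total_bytes = soft_pages * PAGE_SIZE
print(f"soft limit: {soft_pages} pages = {total_bytes // (1 << 20)} MiB aggregate per user")
```

With 4 KiB pages, 16384 pages is 64 MiB shared across every pipe the build user owns, which a large parallel build with many jobserver pipes can plausibly approach.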
Yes, that kernel patch should fix the issue if it's the one I'm thinking of (I'm not sure what else would cause Cargo to block in `write` on the jobserver pipe). Running the program from #9739 may not be conclusive, because the bug you're seeing only shows up when the system is under heavy load and there are lots of pipes in the system. There was a lot of discussion, and several test programs, in #9739 as well, so it somewhat depends...
If you can increase /proc/sys/fs/pipe-user-pages-soft (or a similar setting) and that makes the hang go away, that would pinpoint the cause. There's nothing Cargo itself can do about this; it's kernel-level behavior we have no control over. You could also try `-j1`, which may avoid the hang, but that's a bit of a bummer too.
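If raising the limit turns out to help, one way to make it persistent across reboots is a sysctl drop-in file (a sketch; the file name is illustrative, and setting the value to 0 disables the per-user soft limit entirely, which trades the deadlock risk for unbounded pipe memory):

```ini
# /etc/sysctl.d/99-pipe-limits.conf (illustrative path)
# 0 disables the fs.pipe-user-pages-soft aggregate limit;
# apply with: sysctl --system
fs.pipe-user-pages-soft = 0
```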
Problem
`cargo test` randomly hangs in the Cloud Hypervisor AArch64 unit-test building stage. Here is a recorded failing job in our CI: https://cloud-hypervisor-jenkins.westus.cloudapp.azure.com/blue/organizations/jenkins/cloud-hypervisor/detail/PR-3236/4/pipeline/248/
cargo version: cargo 1.55.0 (32da73a 2021-08-23)
kernel version: 5.10.0