Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jobserver hangs due to full pipe #9739

Closed
jorisgio opened this issue Jul 28, 2021 · 32 comments
Closed

Jobserver hangs due to full pipe #9739

jorisgio opened this issue Jul 28, 2021 · 32 comments
Labels
A-jobserver Area: jobserver, concurrency, parallelism C-bug Category: bug

Comments

@jorisgio
Copy link

jorisgio commented Jul 28, 2021

Problem

When running cargo build on a machine with 128cores/256HT for CI running on debian with linux 5.8, builds regularly hang.
All rustc process and the cargo process each have a thread trying to write "|" to what i think is the jobserver pipe, but the pipe is full, so the build hangs.
Increasing manually the pipe buffer size in C via /proc/pid/fd/ unblocks everything and the build finishes

Steps
Not sure tbh, besides running big amount of jobs

Possible Solution(s)

Notes
Currently only reproduced on cargo 1.50, but trying to upgrade everything to 1.54 to check

here is one backtrace from cargo

#0  __libc_write (nbytes=1, buf=0x7ffdf74b850f, fd=6) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __libc_write (fd=6, buf=0x7ffdf74b850f, nbytes=1) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x000055add96cd5a6 in std::sys::unix::fd::FileDesc::write () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/sys/unix/fd.rs:146
#3  std::sys::unix::fs::File::write () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/sys/unix/fs.rs:845
#4  <&std::fs::File as std::io::Write>::write () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/fs.rs:679
#5  0x000055add969f5d7 in jobserver::imp::Client::release ()
#6  0x000055add96a0abd in <jobserver::Acquired as core::ops::drop::Drop>::drop ()
#7  0x000055add90134fe in cargo::core::compiler::job_queue::DrainState::drain_the_queue ()
#8  0x000055add8fcccd3 in std::panic::catch_unwind ()
#9  0x000055add8f3bafe in crossbeam_utils::thread::scope ()
#10 0x000055add9011856 in cargo::core::compiler::job_queue::JobQueue::execute ()
#11 0x000055add8ec3a15 in cargo::core::compiler::context::Context::compile ()
#12 0x000055add914bd01 in cargo::ops::cargo_compile::compile_ws ()
#13 0x000055add914ba5e in cargo::ops::cargo_compile::compile ()
#14 0x000055add8dbd67d in cargo::commands::build::exec ()
#15 0x000055add8d5f9f9 in cargo::cli::main ()
#16 0x000055add8dc7438 in cargo::main ()
#17 0x000055add8db6333 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#18 0x000055add8db6359 in std::rt::lang_start::{{closure}} ()
#19 0x000055add96df177 in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once ()
    at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/core/src/ops/function.rs:259
#20 std::panicking::try::do_call () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/panicking.rs:379
#21 std::panicking::try () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/panicking.rs:343
#22 std::panic::catch_unwind () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/panic.rs:396
#23 std::rt::lang_start_internal () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/rt.rs:51
#24 0x000055add8dc9c82 in main () 

And here is an example from rustc

(gdb) info thread
  Id   Target Id                                 Frame 
* 1    Thread 0x7f64866604c0 (LWP 85722) "rustc" 0x00007f648b639495 in __GI___pthread_timedjoin_ex (threadid=140069719504640, thread_return=0x0, abstime=0x0, 
    block=<optimized out>) at pthread_join_common.c:89
  2    Thread 0x7f6485dff700 (LWP 85806) "rustc" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7f646517f700 (LWP 90268) "rustc" futex_wait_cancelable (private=0, expected=0, futex_word=0x7f647b40d30c)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  4    Thread 0x7f6464bff700 (LWP 90269) "rustc" __libc_write (nbytes=1, buf=0x7f6464bfc247, fd=6) at ../sysdeps/unix/sysv/linux/write.c:26
(gdb) thread 4
[Switching to thread 4 (Thread 0x7f6464bff700 (LWP 90269))]
#0  __libc_write (nbytes=1, buf=0x7f6464bfc247, fd=6) at ../sysdeps/unix/sysv/linux/write.c:26
26      ../sysdeps/unix/sysv/linux/write.c: No such file or directory.
(gdb) bt
#0  __libc_write (nbytes=1, buf=0x7f6464bfc247, fd=6) at ../sysdeps/unix/sysv/linux/write.c:26
#1  __libc_write (fd=6, buf=0x7f6464bfc247, nbytes=1) at ../sysdeps/unix/sysv/linux/write.c:24
#2  0x00007f648b6e6a26 in std::sys::unix::fd::FileDesc::write () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/sys/unix/fd.rs:146
#3  std::sys::unix::fs::File::write () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/sys/unix/fs.rs:845
#4  <&std::fs::File as std::io::Write>::write () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/fs.rs:679
#5  0x00007f648f0739f9 in <jobserver::Acquired as core::ops::drop::Drop>::drop ()
   from /home/user/.rustup/toolchains/1.50.0-x86_64-unknown-linux-gnu/bin/../lib/librustc_driver-02bb148e88292f22.so
#6  0x00007f648e4aefa9 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
   from /home/user/.rustup/toolchains/1.50.0-x86_64-unknown-linux-gnu/bin/../lib/librustc_driver-02bb148e88292f22.so
#7  0x00007f648e4a9b32 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
   from /home/user/.rustup/toolchains/1.50.0-x86_64-unknown-linux-gnu/bin/../lib/librustc_driver-02bb148e88292f22.so
#8  0x00007f648b71256a in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once ()
    at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/alloc/src/boxed.rs:1328
#9  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once ()
    at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b/library/alloc/src/boxed.rs:1328
#10 std::sys::unix::thread::Thread::new::thread_start () at /rustc/cb75ad5db02783e8b0222fee363c5f63f7e2cf5b//library/std/src/sys/unix/thread.rs:71
#11 0x00007f648b637fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#12 0x00007f648b5574cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
@jorisgio jorisgio added the C-bug Category: bug label Jul 28, 2021
@alexcrichton
Copy link
Member

Oh dear, thanks for the report! This is definitely something we need to fix...

Do you know the capacity of the pipe by default on your system? (basically is it less than 256?) If the pipe capacity is greater than 256 I'm not sure how else this could be coming about, but if it's less than 256 then I can imagine this happening because Cargo at least will write tokens back to the pipe and then try to read them off again for later scheduling purposes. This would introduce a deadlock where Cargo, who can read tokens, is also waiting to write tokens.

Poking around it looks like we can use F_SETPIPE_SZ with fcntl on Linux to ensure that the pipe capacity is larger than the number of tokens that we want to write. I suspect that would be the fix for this (probably in the jobserver crate). That being said the first thing creating a jobserver does is to write everything into the pipe, so if that didn't block then this may be indicative of another bug where extra tokens are being added to the pipe by accident.

@jorisgio
Copy link
Author

jorisgio commented Jul 28, 2021

Actually you are right, i don't see how pipe can be full with only 256 elements, on linux it is allocated page by page , so minimum would be 4k, doc says it is supposed to be 16 pages, so 64k, but when i write a small c tool to call fctnl on proc file system, it reports 4096. Migh be debian, might the fact it runs on systemd-nspawn container.
I sigstopped everything, and used cat to dump the pipe buffer :
|||||||||||||||||||||||||||||||||||||+||||+||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
here is an example, it is stuck with only 105 chars in buffer. So it should not be full. Yet, increasing the pipe size with fcntl on proc filesystem allows the build to continue and finish, and i don't see how write can block if the pipe is not full. So my explaination is not totally accurate as it seems.

If you have any idea how to dig it further, i can try.

@alexcrichton
Copy link
Member

Oh dear, that sounds bad!

To confirm, you're not trying to use the unstable parallel rustc, right? Also, do you have any build scripts in play which use make or something like that which shares the jobserver with Rust? (AFAIK Rust only writes | but it looks like there's a + on the pipe so there may be something else in play).

Reading over the pipe(7) page again it reinforces that the only way to block in writing is if the pipe is full. Given that the jobserver was actually successfully created (which involves writing everything initially) it seems like the pipe will never be full...

When the system hangs, are you sure that the threads are all blocked? In that they're not like busy-spinning hogging CPU writing things to pipes?

Also, would you be able to share the project that reproduces this (perhaps just privately with me). I've got a 40/80 core system I can test on which I may be able to stress enough to reproduce this perhaps?

@jorisgio
Copy link
Author

jorisgio commented Jul 28, 2021

To confirm, you're not trying to use the unstable parallel rustc, right?

No, this is rust 1.50 stable release installed with rustup (quite old i know, i'm trying to upgrade stuffs to 1.54 but it takes time to reproduce).

Also, do you have any build scripts in play which use make or something like that which shares the jobserver with Rust?

a bash scripts calls make that just calls cargo build --release. I don't see the pipe in make process, but i didn't try to share anything at least if it is not done by default.
Should be noted that this setup has been working for 4 years, with almost no changes but CI it started happening when CI was migrated from docker to systemd nspawn and some intel 16 cores machines to amd epyc with 128 cores. So one or the other is the culprit, unless it is linux 4.19 -> 5.8. Hard to know since everything changed at once, but those are the major changes.

When the system hangs, are you sure that the threads are all blocked? In that they're not like busy-spinning hogging CPU writing things to pipes?

strace of rustc

strace: Process 105965 attached with 4 threads
[pid 105973] write(6, "|", 1 <unfinished ...>
[pid 105972] futex(0x7f5c64282d7c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 105971] futex(0x7f5c694df898, FUTEX_WAIT_PRIVATE, 4294967295, NULL <unfinished ...>
[pid 105965] futex(0x7f5c68fff9d0, FUTEX_WAIT, 105971, NULL

strace of cargo

user@buildbot-ci010-nspawn-023:~$ sudo strace -p 94692 -ff 
strace: Process 94692 attached with 8 threads
[pid 105964] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 105626] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 105552] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 105181] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 101411] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 101410] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 94696] futex(0x559cfd19de2c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 94692] write(6, "|", 1

Also, would you be able to share the project that reproduces this (perhaps just privately with me). I've got a 40/80 core system I can test on which I may be able to stress enough to reproduce this perhaps?

i will ask around

@alexcrichton
Copy link
Member

Oh no worries on 1.50 vs 1.54, I'm just trying to rule out some more esoteric possibilities to make sure this is as weird as I think it is. AFAIK nothing really substantive around this changed between 1.50 and 1.54 so upgrading will likely not show a difference.

My guess is that the issue here stems from the very-large amount of parallellism with 256 HT cores. You may be one of the first people to run Cargo/rustc on such parallelism!

Although if you have make calling Cargo that does introduce another actor in the system. Do you invoke make with any -j flags? Also, do you know if Cargo is configured to share the jobserver from make?

@jorisgio
Copy link
Author

I don't think so, make is not called with -j, just plain make, that builds rust code first then some ocaml code, just a wrapper, but i'll try to remove it make from the equation.
This is the (probably outdated) Cargo.toml configuration of the workspace. There is no config.toml, neither locally nor in user home.

[profile.release]
debug = true
debug-assertions = true
overflow-checks = true
panic = 'abort'
lto = 'thin'
opt-level = 3
codegen-units = 8

[profile.dev]
panic = 'abort'

And the only running make process does not share any file descriptor with its cargo child. I don't really see how to configure cargo to communicate with make job server. And the environment of cargo regarding to make is :
MAKEFLAGS=MAKELEVEL=1MAKE_TERMERR=/dev/pts/1MAKE_TERMOUT=/dev/pts/1MFLAGS=

@ehuss
Copy link
Contributor

ehuss commented Jul 28, 2021

FYI, + is the token used by make, and | is the token used by the Rust jobserver client.

@alexcrichton
Copy link
Member

Hm ok, so it sounds like Cargo is the one making the jobserver and it's not dealing with the parent make at all. My suspicion about + in the pipe being from make though seems to be confirmed by @ehuss, so I don't think that we can rule out make just yet. Cargo lets build scripts inherit the jobserver so any sub-make spawned by a build script also can use it. I suspect that's happening for at least one of your crates since + is in the pipe.

I'm not sure that make would cause an issue like this, but is it possible to remove it from the equation? It seems like there's a lot of variables in play here so just trying to slim down what we're looking at.

@jorisgio
Copy link
Author

Ah yes, sub-make is definitely possible, there are a lot of -sys crates. I'll see if i can play with bpftrace to try to find where + is coming.

@ehuss ehuss added the A-jobserver Area: jobserver, concurrency, parallelism label Jul 28, 2021
@jorisgio
Copy link
Author

jorisgio commented Jul 29, 2021

I know believe it is not an issue with cargo. So as you said, as long as pipe buffer is larger than core count, it should never deadlock. Except, it seems linux pipe buffer has some perculiarity. Apparently, it is a ring buffer of pages, with two paths :

  • one for short writes, that tries to append on the last non full memory page
  • one for large writes, that writes pages by pages or something like that
    https://elixir.bootlin.com/linux/latest/source/fs/pipe.c#L442
    The short path is only disabled if the pipe has O_DIRECT set, which does not seem to be the case. The problem is, the short path does not seem to "ring" on a single page, the ring buffer has page granularity for looping.
    Now, i confirmed that the doc is indeed correct, modern linux default pipe buffer capacity is 16 pages.
    Confirmed also on the host and containers that creating a pipe is 16 pages. But for some reason, the pipe of hanged build is only 4096 bytes, so one page. In this case, even if the beginning is free, and the buffer could only contain a single byte, the buffer will not loop, and no new page is allowed to be allocated, so write will block. I think this is what is happening.

But i ran the build on my personal machines, and the pipe buffer is 64k all along the build, also strace everything, and no call to fcntl to change the pipe size happened. Except on this ci systemd containers, it is 4k, and so far i have not being able to find what reduces the pipe buffer, bpftrace shows no call to the kernel pipe_fcntl function that would reduce the size. I suspect it is hitting some unpriviledged user ressource inside container, which has the effect to truncate the pipe capacity of existing pipe to minimum, or something like this https://elixir.bootlin.com/linux/latest/source/fs/pipe.c#L799.
The tldr is that something fishy is happening, and it does not seem to be inside cargo or rustc

@alexcrichton
Copy link
Member

Ooh the plot thickens!

So I can actually pretty easily reproduce the deadlock you're seeing with this:

use std::io;

const N: usize = 10_000;

fn main() {
    unsafe {
        let mut a = [0; 2];
        if libc::pipe(a.as_mut_ptr()) != 0 {
            panic!("err: {}", io::Error::last_os_error());
        }
        println!("init buffer: {}", libc::fcntl(a[1], libc::F_GETPIPE_SZ));
        println!("new buffer:  {}", libc::fcntl(a[1], libc::F_SETPIPE_SZ, 0));

        for _ in 0..80 {
            assert_eq!(libc::write(a[1], b"a".as_ptr() as *const _, 1), 1);
        }

        let threads = (0..80)
            .map(|_| {
                std::thread::spawn(move || {
                    for _ in 0..N {
                        let mut b = [0u8];
                        assert_eq!(libc::read(a[0], b.as_mut_ptr() as *mut _, 1), 1);
                        assert_eq!(libc::write(a[1], b"a".as_ptr() as *const _, 1), 1);
                    }
                })
            })
            .collect::<Vec<_>>();

        for t in threads {
            t.join().unwrap();
        }
    }
}

If the line with F_SETPIPE_SZ is commented out or if it's set to 8192 instead of 0 (which rounds to 4096) then this completes just fine.

I think that the fix here is probably to document this ring behavior with pages in the jobserver crate and to explicitly increase the pipe buffer if the initial pipe buffer is just one page?

@jorisgio
Copy link
Author

I think that the fix here is probably to document this ring behavior with pages in the jobserver crate and to explicitly increase the pipe buffer if the initial pipe buffer is just one page?

Indeed, i think it would be a good idea for linux systems. If it cannot increase the buffer due to user limit, it will probably fail, but then i guess it is much easier to understand what is going on while getting an error. As for me, i've increased /proc/sys/fs/pipe-user-pages-soft, so that right now all pipes stay at the default of 16 pages, and no build hanged since then.

alexcrichton added a commit to rust-lang/jobserver-rs that referenced this issue Jul 29, 2021
This commit attempts to address rust-lang/cargo#9739 by optionally
increasing the capacity of pipes created on Linux. This may be required
if the pipe initially starts out with only one page of buffer which
apparently means that normal jobserver usage might cause the pipe to
deadlock where all writers are blocked.
@alexcrichton
Copy link
Member

Oh were you unable to get the pipe capacity to increase using F_SETPIPE_SZ? If that's the case then I think I'd need to make a better error message for the jobserver crate to explain what's going on, or it may means that the jobserver crate just has to fail instead of trying to increase its capacity.

FWIW I posted rust-lang/jobserver-rs#34 which I thought might fix things, but if F_SETPIPE_SZ would fail for you then it wouldn't do anything other than just fail sooner.

@jorisgio
Copy link
Author

#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  int ret;
  size_t total_pipe_size = 0;
  int pipes = 0;
  int fds[2]; // leak fds who cares;
  while(1) {
    ret = pipe(fds);
    if (ret) {
      printf("cannot create pipe %s", strerror(errno));
      exit(1);
    }
    pipes ++;
    ret = fcntl(fds[0], F_SETPIPE_SZ, 1000000);
    if (ret <= 0) {
      printf("cannot increase pipe buffer for pipe %d after %lu : %s", pipes, total_pipe_size, strerror(errno));
      exit(1);
    }
    total_pipe_size += ret;
  }
  return 0;
}

I didn't fully reproduce the truncation of default value, but when the limit is reached, F_SETPIPE_SZ will return EPERM. So probably a good idea to have a nice error message here

@alexcrichton
Copy link
Member

Ah ok, so does that mean you can increase the size of some pipes, just not infinitely? (that makes sense because presumably the kernel is limited in resources).

To confirm, though, you can successfully increase the size of at least one pipe? If so that means my PR should fix the issue for you without having you to reconfigure the system limits.

@jorisgio
Copy link
Author

jorisgio commented Jul 29, 2021

Yes and Yes. Actually full example of what is going on :

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
  int ret;

  int fd = open("/proc/sys/fs/pipe-user-pages-soft", O_RDONLY);
  char buf[64];
  ret = read(fd, buf, 64);
  close(fd);
  buf[ret] = '\0';
  int limit = atoi(buf);

  int fds[2]; // leak fds who cares;
  for (int i = 0; i < limit / 16; ++i) {
    ret = pipe(fds);
    if (ret) {
      printf("cannot create pipe %s", strerror(errno));
      exit(1);
    }

    ret = fcntl(fds[0], F_GETPIPE_SZ);
    printf("pipe %d buffer size %d\n", i, ret);
  }
  ret = pipe(fds);
  if (ret) {
    printf("cannot create last pipe %s", strerror(errno));
    exit(1);
  }
  ret = fcntl(fds[0], F_GETPIPE_SZ);
  printf("last pipe buffer %d\n", ret);
  ret = fcntl(fds[0], F_SETPIPE_SZ, 16 * 4096);
  if (ret <= 0) {
    printf("cannot increase last pipe buffer : %s\n",
           strerror(errno));
    exit(1);
  }
  return 0;
}

after some limit you can still create pipes, but buffer is limited to one page, so creating pipe works, but increasing size returns error, but initially it works until this limit is reached. So your PR would basically just make cargo fail, not fix the issue, but i guess at least it can explain what is going on. It would also help in case default pipe size is one page, but i'm not sure there exists such a linux configuration. You probably need to recompile kernel.

@alexcrichton
Copy link
Member

Hm ok so to confirm I'm understanding, Linux, after enough pipes are created, reduces new pipes' default capacity from 16 pages to 1 page?

Is that how you were seeing this issue? Enough pipes were created such that new pipes defaulted to 1 page?

@jorisgio
Copy link
Author

Is that how you were seeing this issue? Enough pipes were created such that new pipes defaulted to 1 page?

Yes, this is the issue. So in this case, unless some pipes were destroyed in the tiny time between pipe creation and now, ncreasing pipe buffer will likely fail anyway.

@alexcrichton
Copy link
Member

Ok that's good to know, and means I probably won't land that PR on jobserver, or at least not exactly as-is.

Does pipe-user-pages-soft apply to the whole system or per-process? Do you know what was creating so many more pipes than before?

@jorisgio
Copy link
Author

jorisgio commented Jul 29, 2021

As far as i understand, it applies per user, or might be to all unpriviledged users, i'm not so sure.

Do you know what was creating so many more pipes than before?

Probably related to the fact this machine has unusual load. It is big servers running 25 containers running CI builds concurrently, not only rust. It compiles a lot of ocaml code, and ocaml is a "traditional" compiler in the sense it pipes output a lot, eg to external assembler, for preprocessor, etc. So it is not unexpected that it would create more than 1024 pipes concurrently.
Regarding to "than before", previously the CI was setup to run on 5 servers with 16 cores, while now it is a single server handling the same load, so that's the change. Same number of forks, 5 times lower pipe buffer

alexcrichton added a commit to rust-lang/jobserver-rs that referenced this issue Jul 29, 2021
This commit attempts to address rust-lang/cargo#9739 by optionally
increasing the capacity of pipes created on Linux. This may be required
if the pipe initially starts out with only one page of buffer which
apparently means that normal jobserver usage might cause the pipe to
deadlock where all writers are blocked.
@jorisgio
Copy link
Author

@alexcrichton thanks a lot for your help on this. Regarding your comment on your last commit we also discussed this internally and although I don't think it is per se a bug as it seems to be allowed by pipe semantic it seems to be a weird behavior of Linux pipes. Do you think we should raise that upstream?

@alexcrichton
Copy link
Member

Personally I am indeed quite surprised that my Rust program above deadlocks. That sort of feels like a bug in the kernel but I don't have the capacity myself to report upstream (or to know whether it's even reasonable to report upstream).

If y'all are willing though it's probably not a bad idea to report? My guess is that if Cargo is hitting this deadlock then make with high parallelism would also run the risk of triggering the deadlock (although not for sure since Cargo/Make's usage are slightly different)

@Hello71
Copy link

Hello71 commented Aug 2, 2021

checking /proc/x/stack, it is hanging at https://github.com/torvalds/linux/blob/master/fs/pipe.c#L560. checking comments, it looks like not only there needs to be space in the pipe, but also slots. at https://github.com/torvalds/linux/blob/master/fs/pipe.c#L1306, a one-page pipe has only one slot. this led me to test nthreads=2, which also hangs. checking blame of the pipe_write comment, it was added in torvalds/linux@a194dfe. it says:

We just abandon the preallocated slot if we get a copy error. Future
writes may continue it and a future read will eventually recycle it.

so i guess there is some race, causing the slot to be abandoned. normally this is not an issue, because some reader should always be queued. in this case, there is no reader. this is an incorrect use of pipes, see https://man7.org/linux/man-pages/man7/pipe.7.html:

Applications should not rely on a particular capacity: an application should be designed so that a reading process consumes data as soon as it is available, so that a writing process does not remain blocked.

nonetheless, considering the strong emphasis placed on kernel backwards compatibility, it could be worth reporting on the linux mailing list.

@alexcrichton
Copy link
Member

If you need any information from me about how Cargo uses the jobserver and/or pipes please let me know! Otherwise personally I'm afraid of just being berated for incorrectly using pipes so I don't think I'll try to post myself.

alexcrichton added a commit to alexcrichton/cargo that referenced this issue Aug 4, 2021
This should help mitigate rust-lang#9739 at least with a better error message as
opposed to deadlocking.
@Hello71
Copy link

Hello71 commented Aug 4, 2021

I posted on the Linux kernel mailing list: https://lore.kernel.org/lkml/1628086770.5rn8p04n6j.none@localhost/. Of potential note to Rust is my comment at the end: registering a SIGCHLD handler with SA_RESTART likely works around this issue in GNU make. Using a two-page pipe buffer also has a high probability of working around the issue, but has a problem: the one-page buffer size is only selected in the first place because the user has exceeded their pipe buffer quota. I believe in this circumstance, unless the user happens to close some pipes between the pipe creation and resizing, the resize will be denied.

@alexcrichton
Copy link
Member

Thanks for that! That also seems like a reasonable solution to me, yeah, and I agree that the "fix" in the jobserver crate to increase the capacity is unlikely to ever work, so the main purpose of that patch was mostly to get a better error message (or an error message at all).

bors added a commit that referenced this issue Aug 5, 2021
Bump to the latest jobserver dependency

This should help mitigate #9739 at least with a better error message as
opposed to deadlocking.
@Hello71
Copy link

Hello71 commented Aug 6, 2021

My fix for this issue has been commited upstream: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=46c4c9d1beb7f5b4cec4dd90e7728720583ee348. Assuming there are no issues, it should be in the final 5.14 release, as well as stable releases back to 4.4.

Since this issue only occurs when the user has a very high number of pipes open simultaneously (at least 1024), and it can be worked around by increasing the value of fs.pipe-user-pages-soft, I don't think any fix is necessary on the Rust side. Since jobserver seems to be applicable for arbitrary uses of tokening, not just starting processes, it would not be appropriate to register a SIGCHLD handler in jobserver; the handler would need to be installed in cargo and all other users of jobserver.

Strictly speaking, to avoid this bug, the buffer size need only be increased to at least two pages (which is what the kernel fix does). However, rust-lang/jobserver-rs@865b7e3 also works. Note that the code is still technically incorrect for non-Linux platforms, as there is no guaranteed pipe buffer size or behavior. In practice, I think a minimum buffer size of 1000 bytes is likely to function on all Unix-like platforms, barring basically-a-bug behavior like Linux's here.

A fully correct implementation would always have at least one thread either currently or imminently reading on the pipe. However, as I explained in my (very long) email, GNU make exhibits the same behavior in edge cases, so I don't think it's something to be overly concerned about.

@alexcrichton
Copy link
Member

Oh wow that's awesome, thanks for the update @Hello71!

@alexcrichton
Copy link
Member

I think this is probably sufficiently handled now, so I'm going to close this, but if there are more updates here please feel free to post them!

@Hello71
Copy link

Hello71 commented Aug 6, 2021

final note: https://lore.kernel.org/linux-fsdevel/CAHk-=wgE4XwB2E0=CYB8eqss6WB1+FXgVWZRsUUSWTOeq-kh8w@mail.gmail.com/ points out that even after the patch, you can still write 4097 bytes to a 8192 byte buffer and fail to write 1 more byte, but after the change, pipes should be able to hold at least 4096 bytes in all cases. so theoretically rust-lang/jobserver-rs@865b7e3 should take the requested size and then add one page (not 4096 bytes on some platforms!), but in practice i doubt people will try to run more than 4096 simultaneous jobs, and it might not work on other platforms (as linus says, posix only requires 512 bytes)

@ehuss
Copy link
Contributor

ehuss commented Aug 17, 2021

The change to have the jobserver resize the buffer was reverted in #9798 because it was causing some breakage (see rust-lang/rust#88091).

Alex outlined some alternatives here: rust-lang/rust#88091 (comment)

I'm not sure if it is worth re-opening this or if there are any realistic options.

@jorisgio
Copy link
Author

I am not sure what would be a proper options. As reported by the other person hitting pipe limit, there are different tradeoff in different situations, i believe linux changes are enough. And from my side of things, i think this issues that is easy to google should already be enough if someone hits the same corner case so i would say you guys can keep things as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-jobserver Area: jobserver, concurrency, parallelism C-bug Category: bug
Projects
None yet
Development

No branches or pull requests

4 participants