Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pantsd] Address pantsd-runner hang on Linux and re-enable integration test. #4407

Merged
merged 6 commits into from
Mar 31, 2017

Conversation

kwlzn
Copy link
Member

@kwlzn kwlzn commented Mar 31, 2017

Problem

Currently, on Linux the first thin client call to the daemon can deadlock just after the pantsd->fork->pantsd-runner workflow. Connecting to the process with gdb reveals a deadlock in the following stack in the post_fork drop of the CpuPool:

#0  0x00007f63f04c31bd in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f63f04c0ded in pthread_cond_signal@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00007f63d3cfa438 in notify_one ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/condvar.rs:52
No locals.
#3  notify_one () at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/condvar.rs:39
No locals.
#4  notify_one () at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sync/condvar.rs:208
No locals.
#5  std::thread::{{impl}}::unpark () at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/thread/mod.rs:633
No locals.
#6  0x00007f63d3c583d1 in crossbeam::sync::ms_queue::{{impl}}::push<futures_cpupool::Message> (self=<optimized out>, t=...)
    at /home/kwilson/.cache/pants/rust-toolchain/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.10/src/sync/ms_queue.rs:178
        guard = <optimized out>
        self = <optimized out>
#7  0x00007f63d3c588ed in futures_cpupool::{{impl}}::drop (self=<optimized out>)
    at /home/kwilson/.cache/pants/rust-toolchain/git/checkouts/futures-rs-a4f11d094efefb0a/f7e6bc8/futures-cpupool/src/lib.rs:236
        self = 0x37547a0
#8  0x00007f63d3be871c in engine::fs::{{impl}}::post_fork (self=0x3754778) at /home/kwilson/dev/pants/src/rust/engine/src/fs.rs:355
        self = 0x3754778
#9  0x00007f63d3be10e4 in engine::context::{{impl}}::post_fork (self=0x37545b0)
    at /home/kwilson/dev/pants/src/rust/engine/src/context.rs:93
        self = 0x37545b0
#10 0x00007f63d3c0de5a in {{closure}} (scheduler=<optimized out>) at /home/kwilson/dev/pants/src/rust/engine/src/lib.rs:275
        scheduler = 0x3740580
#11 with_scheduler<closure,()> (scheduler_ptr=<optimized out>, f=...) at /home/kwilson/dev/pants/src/rust/engine/src/lib.rs:584
        scheduler = 0x3740580
        scheduler_ptr = 0x3740580
#12 engine::scheduler_post_fork (scheduler_ptr=0x3740580) at /home/kwilson/dev/pants/src/rust/engine/src/lib.rs:274
        scheduler_ptr = 0x3740580
#13 0x00007f63d3c1be8c in _cffi_f_scheduler_post_fork (self=<optimized out>, arg0=0x35798f0) at src/cffi/native_engine.c:2234
        _save = 0x34a65a0
        x0 = 0x3740580
        datasize = <optimized out>
#14 0x00007f63f07b5a62 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0

This presents as a hang in the thin client, because the pailgun socket is left open in the pantsd-runner.

Solution

Add pre-fork hooks and tear down the CpuPool instances prior to forking and rebuilding them.

Result

Can no longer reproduce the hang.

@@ -112,7 +112,8 @@ impl Entry {
let state =
match self.node.clone() {
EntryKey::Valid(n) => {
let pool = context_factory.pool();
let pool_opt = context_factory.pool();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One line for this is probably fine.

Copy link
Member Author

@kwlzn kwlzn Mar 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not according to the compiler.

error: borrowed value does not live long enough
   --> src/rust/engine/src/graph.rs:115:88
    |
115 |             let pool = context_factory.pool().as_ref().expect("Uninitialized CpuPool!");
    |                        ---------------------- temporary value created here             ^ temporary value dropped here while still borrowed
...
122 |           },
    |           - temporary value needs to live until here
    |
    = note: consider using a `let` binding to increase its lifetime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Yea, makes sense. Thanks.

Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -130,6 +130,11 @@ def run(self):
"""Fork, daemonize and invoke self.post_fork_child() (via ProcessManager)."""
self.daemonize(write_pid=False)

def pre_fork(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, post_fork is called in two places... does this need to be as well?

@kwlzn kwlzn force-pushed the kwlzn/linux_pantsd_hang branch from 207f8d6 to 65e5697 Compare March 31, 2017 02:25
@kwlzn kwlzn merged commit 177deb9 into pantsbuild:master Mar 31, 2017
lenucksi pushed a commit to lenucksi/pants that referenced this pull request Apr 25, 2017
…n test. (pantsbuild#4407)

### Problem

Currently, on Linux the first thin client call to the daemon can deadlock just after the pantsd->fork->pantsd-runner workflow. Connecting to the process with `gdb` reveals a deadlock in the following stack in the `post_fork` `drop` of the `CpuPool`:

```
#0  0x00007f63f04c31bd in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
pantsbuild#1  0x00007f63f04c0ded in pthread_cond_signal@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
pantsbuild#2  0x00007f63d3cfa438 in notify_one ()
    at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys/unix/condvar.rs:52
No locals.
pantsbuild#3  notify_one () at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sys_common/condvar.rs:39
No locals.
pantsbuild#4  notify_one () at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/sync/condvar.rs:208
No locals.
pantsbuild#5  std::thread::{{impl}}::unpark () at /buildslave/rust-buildbot/slave/stable-dist-rustc-linux/build/src/libstd/thread/mod.rs:633
No locals.
pantsbuild#6  0x00007f63d3c583d1 in crossbeam::sync::ms_queue::{{impl}}::push<futures_cpupool::Message> (self=<optimized out>, t=...)
    at /home/kwilson/.cache/pants/rust-toolchain/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.2.10/src/sync/ms_queue.rs:178
        guard = <optimized out>
        self = <optimized out>
pantsbuild#7  0x00007f63d3c588ed in futures_cpupool::{{impl}}::drop (self=<optimized out>)
    at /home/kwilson/.cache/pants/rust-toolchain/git/checkouts/futures-rs-a4f11d094efefb0a/f7e6bc8/futures-cpupool/src/lib.rs:236
        self = 0x37547a0
pantsbuild#8  0x00007f63d3be871c in engine::fs::{{impl}}::post_fork (self=0x3754778) at /home/kwilson/dev/pants/src/rust/engine/src/fs.rs:355
        self = 0x3754778
pantsbuild#9  0x00007f63d3be10e4 in engine::context::{{impl}}::post_fork (self=0x37545b0)
    at /home/kwilson/dev/pants/src/rust/engine/src/context.rs:93
        self = 0x37545b0
pantsbuild#10 0x00007f63d3c0de5a in {{closure}} (scheduler=<optimized out>) at /home/kwilson/dev/pants/src/rust/engine/src/lib.rs:275
        scheduler = 0x3740580
pantsbuild#11 with_scheduler<closure,()> (scheduler_ptr=<optimized out>, f=...) at /home/kwilson/dev/pants/src/rust/engine/src/lib.rs:584
        scheduler = 0x3740580
        scheduler_ptr = 0x3740580
pantsbuild#12 engine::scheduler_post_fork (scheduler_ptr=0x3740580) at /home/kwilson/dev/pants/src/rust/engine/src/lib.rs:274
        scheduler_ptr = 0x3740580
pantsbuild#13 0x00007f63d3c1be8c in _cffi_f_scheduler_post_fork (self=<optimized out>, arg0=0x35798f0) at src/cffi/native_engine.c:2234
        _save = 0x34a65a0
        x0 = 0x3740580
        datasize = <optimized out>
pantsbuild#14 0x00007f63f07b5a62 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
```

This presents as a hang in the thin client, because the pailgun socket is left open in the pantsd-runner.

### Solution

Add pre-fork hooks and tear down the `CpuPool` instances prior to forking and rebuilding them.

### Result

Can no longer reproduce the hang.
asherf added a commit to asherf/pants that referenced this pull request Sep 11, 2021
https://python-poetry.org/history/#118---2021-08-19

Fixed an error with repository prioritization when specifying secondary repositories. (pantsbuild#4241)
Fixed the detection of the system environment when the setting virtualenvs.create is deactivated. (pantsbuild#4330, pantsbuild#4407)
Fixed the evaluation of relative path dependencies. (pantsbuild#4246)
Fixed environment detection for Python 3.10 environments. (pantsbuild#4387)
Fixed an error in the evaluation of in/not in markers (python-poetry/poetry-core#189)
@asherf asherf mentioned this pull request Sep 11, 2021
stuhood pushed a commit that referenced this pull request Sep 13, 2021
https://python-poetry.org/history/#118---2021-08-19

Fixed an error with repository prioritization when specifying secondary repositories. (#4241)
Fixed the detection of the system environment when the setting virtualenvs.create is deactivated. (#4330, #4407)
Fixed the evaluation of relative path dependencies. (#4246)
Fixed environment detection for Python 3.10 environments. (#4387)
Fixed an error in the evaluation of in/not in markers (python-poetry/poetry-core#189)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants