Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rust panic in remote::tests::wait_between_request_3_retry #6100

Closed
kwlzn opened this issue Jul 12, 2018 · 3 comments
Closed

rust panic in remote::tests::wait_between_request_3_retry #6100

kwlzn opened this issue Jul 12, 2018 · 3 comments
Assignees

Comments

@kwlzn
Copy link
Member

kwlzn commented Jul 12, 2018

on the heels of improved logging via #6060, I noted the following panic during test execution in travis today:

test remote::tests::successful_execution_after_four_getoperations ... ok
test remote::tests::wait_between_request_1_retry ... ok
test remote::tests::timeout_after_sufficiently_delayed_getoperations ... ok
thread 'remote::tests::wait_between_request_3_retry' panicked at 'called `Result::unwrap()` on an `Err` value: "Cancelled: \"Cancelled\""', libcore/result.rs:945:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:211
   3: std::panicking::default_hook
             at libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:463
   5: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:350
   6: rust_begin_unwind
             at libstd/panicking.rs:328
   7: core::panicking::panic_fmt
             at libcore/panicking.rs:71
   8: core::result::unwrap_failed
             at /checkout/src/libcore/macros.rs:26
   9: <core::result::Result<T, E>>::unwrap
             at /checkout/src/libcore/result.rs:782
  10: process_execution::remote::tests::wait_between_request_3_retry
             at process_execution/src/remote.rs:1425
  11: process_execution::__test::TESTS::{{closure}}
             at process_execution/src/remote.rs:1403
  12: core::ops::function::FnOnce::call_once
             at /checkout/src/libcore/ops/function.rs:223
  13: <F as alloc::boxed::FnBox<A>>::call_box
             at libtest/lib.rs:1451
             at /checkout/src/libcore/ops/function.rs:223
             at /checkout/src/liballoc/boxed.rs:638
  14: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:105
thread 'remote::tests::wait_between_request_3_retry' panicked at 'assertion failed: `(left == right)`
  left: `1`,
 right: `0`: Expected 1 more requests. Remaining expected responses:
(name: "gimme-foo" done: true response {type_url: "type.googleapis.com/build.bazel.remote.execution.v2.ExecuteResponse" value: "\n\005*\003foo"}, None)
Received requests:
("ExecuteRequest", action_digest {hash: "a0777a41a9341ca91def500e9723de783a1c6b212d9ce05b451dd6086c0bb421" size_bytes: 138}, Instant { tv_sec: 528, tv_nsec: 697905093 })
("GetOperationRequest", name: "gimme-foo", Instant { tv_sec: 529, tv_nsec: 198949745 })
("GetOperationRequest", name: "gimme-foo", Instant { tv_sec: 530, tv_nsec: 199953177 })
', testutil/mock/src/execution_server.rs:105:5
stack backtrace:
   0:     0x55d3af8635ae - std::sys::unix::backtrace::tracing::imp::unwind_backtrace::he2ad84a733ff5a63
                               at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1:     0x55d3af86c556 - std::sys_common::backtrace::print::hf6fb23a1794250a5
                               at libstd/sys_common/backtrace.rs:71
                               at libstd/sys_common/backtrace.rs:59
   2:     0x55d3af84f44b - std::panicking::default_hook::{{closure}}::ha69c5f3cb0459cff
                               at libstd/panicking.rs:211
   3:     0x55d3af84f123 - std::panicking::default_hook::hf08727ddbbbad4bb
                               at libstd/panicking.rs:227
   4:     0x55d3af84fa6e - std::panicking::rust_panic_with_hook::hd6baf0cfc62bef0f
                               at libstd/panicking.rs:463
   5:     0x55d3af84f60c - std::panicking::begin_panic_fmt::h067a7944e7e0f96e
                               at libstd/panicking.rs:350
   6:     0x55d3aef46a92 - <mock::execution_server::TestServer as core::ops::drop::Drop>::drop::h2113f323299764d7
                               at testutil/mock/src/execution_server.rs:105
   7:     0x55d3aed8ec30 - core::ptr::drop_in_place::hc515e2109c82fafb
                               at /checkout/src/libcore/ptr.rs:59
   8:     0x55d3aee0a157 - process_execution::remote::tests::wait_between_request_3_retry::h36e19785f6888b6a
                               at process_execution/src/remote.rs:1438
   9:     0x55d3aee36be9 - process_execution::__test::TESTS::{{closure}}::h3a36c589dcbe4962
                               at process_execution/src/remote.rs:1403
  10:     0x55d3aed82ead - core::ops::function::FnOnce::call_once::h92a2c7c1a47def5b
                               at /checkout/src/libcore/ops/function.rs:223
  11:     0x55d3aee85eee - <F as alloc::boxed::FnBox<A>>::call_box::h95acd7f5d5222eb8
                               at libtest/lib.rs:1451
                               at /checkout/src/libcore/ops/function.rs:223
                               at /checkout/src/liballoc/boxed.rs:638
  12:     0x55d3af8735d9 - __rust_maybe_catch_panic
                               at libpanic_unwind/lib.rs:105
  13:     0x55d3aee9a388 - std::sys_common::backtrace::__rust_begin_short_backtrace::h4351ab9fa3965f87
                               at /checkout/src/libstd/panicking.rs:289
                               at /checkout/src/libstd/panic.rs:374
                               at libtest/lib.rs:1406
                               at /checkout/src/libstd/sys_common/backtrace.rs:136
  14:     0x55d3aeeb1854 - std::panicking::try::do_call::hf99e9099bc1398a8
                               at /checkout/src/libstd/thread/mod.rs:409
                               at /checkout/src/libstd/panic.rs:305
                               at /checkout/src/libstd/panicking.rs:310
  15:     0x55d3af8735d9 - __rust_maybe_catch_panic
                               at libpanic_unwind/lib.rs:105
  16:     0x55d3aee9bdb6 - <F as alloc::boxed::FnBox<A>>::call_box::hfa3a2e2797accd88
                               at /checkout/src/libstd/panicking.rs:289
                               at /checkout/src/libstd/panic.rs:374
                               at /checkout/src/libstd/thread/mod.rs:408
                               at /checkout/src/liballoc/boxed.rs:638
  17:     0x55d3af86562a - std::sys_common::thread::start_thread::hd31f5c5fa3e47024
                               at /checkout/src/liballoc/boxed.rs:648
                               at libstd/sys_common/thread.rs:24
  18:     0x55d3af850085 - std::sys::unix::thread::Thread::new::thread_start::hb3e7825324ee0f62
                               at libstd/sys/unix/thread.rs:90
  19:     0x7f67a3ce6183 - start_thread
  20:     0x7f67a37fcffc - clone
  21:                0x0 - <unknown>
thread panicked while panicking. aborting.
error: process didn't exit successfully: `/home/travis/build/pantsbuild/pants/src/rust/engine/target/debug/deps/process_execution-1103f35cbf54936e  --nocapture` (signal: 4, SIGILL: illegal instruction)
Pants rust test failure
@kwlzn kwlzn added the remoting label Jul 12, 2018
@kwlzn
Copy link
Member Author

kwlzn commented Jul 12, 2018

cc @illicitonion

@illicitonion
Copy link
Contributor

Interesting... So the test is expecting to make four requests, and complete them within 5 seconds.
The server is seeing three requests at t=0, t=0.5, t=1.5, so shouldn't be timing out.
Then the client is seeing a cancelled response - this happens when the server drops its UnarySink before writing anything to it. This shouldn't be happening according to the code, but may be a weird timing effect of our whole "pretend a streaming RPC is unary RPC" thing. Accordingly, the client is failing.. Then when the server is dropped, it's failing its assertion that it got all four requests it was meant to see.

Some things we know for sure:

  1. The server hasn't been dropped. It gets dropped later, which is why we see the panicked while panicking. So it should still be serving.
  2. The client hasn't timed out - we'd see a message Exceeded time out... rather than Cancelled in that case.

We should probably be robust to to the server cancelling individual requests. Accordingly, we should change this code:

operations_client
.get()
.get_operation(&operation_request)
.map_err(rpcerror_to_string),
so that if it gets a Cancelled error, it translates that into a ExecutionError::NotFinished which will trigger the loop to benignly retry.

I imagine this looking like a new .or_else call before the .map_err call in that block, which matches to see if the error is an RpcFailure with status Cancelled, and if so, returns an Ok of an Operation with a name set to operation_request.take_name(), and otherwise returns the error back.

@illicitonion
Copy link
Contributor

We should also add a test which has the mock execution server not respond to a request, and verify that it is retried :)

@stuhood stuhood self-assigned this Jul 19, 2018
stuhood pushed a commit that referenced this issue Aug 15, 2018
### Problem

It's possible for an individual RPC to be cancelled while an entire Operation continues, but this is not currently handled. See the problem description on #6100.

### Solution

Recover from `RpcStatus::Cancelled` by retrying the same operation.

### Result

Without the Fixes #6100.
blorente pushed a commit to blorente/pants that referenced this issue Aug 15, 2018
### Problem

It's possible for an individual RPC to be cancelled while an entire Operation continues, but this is not currently handled. See the problem description on pantsbuild#6100.

### Solution

Recover from `RpcStatus::Cancelled` by retrying the same operation.

### Result

Without the Fixes pantsbuild#6100.
Eric-Arellano pushed a commit to Eric-Arellano/pants that referenced this issue Aug 18, 2018
commit f239399
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 22:34:25 2018 -0700

    Remove stray pdb import

commit 3babfb9
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 22:23:19 2018 -0700

    Add back Py3 tests that now work

commit a46b827
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 22:22:52 2018 -0700

    Fix backend python issues

commit 88fa18a
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 20:08:42 2018 -0700

    Fix backend project_info issues

commit f4d4a2d
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 19:54:39 2018 -0700

    Actually fix console separator issues

commit d0c83aa
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 19:38:39 2018 -0700

    Fix issues with backend native

commit 9a1b3b7
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 19:33:48 2018 -0700

    Fix most issues with backend jvm

commit 62bc2a4
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 14:47:42 2018 -0700

    Fix logging issue with passing object to format()

commit a4bf759
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 08:05:07 2018 -0700

    Decode test's execute_task() output

commit b81cd24
Author: Eric Arellano <earellano@foursquare.com>
Date:   Thu Aug 16 08:03:57 2018 -0700

    Fix issues with backend graph_info

commit 5d28b9a
Author: Eric Arellano <earellano@foursquare.com>
Date:   Sun Aug 12 23:13:06 2018 -0700

    Fix console_task's console_separator unicode issues

commit a8674f9
Author: Eric Arellano <earellano@foursquare.com>
Date:   Sun Aug 12 22:48:42 2018 -0700

    Fix backend codegen unicode vs bytes issues

commit c893f8c
Author: Eric Arellano <earellano@foursquare.com>
Date:   Sun Aug 12 22:47:04 2018 -0700

    Return unicode from stdout in pants integration test

commit f4ab07c
Author: Benjy Weinberger <benjyw@gmail.com>
Date:   Thu Aug 16 19:43:41 2018 -0600

    Make requirements on codegen products optional. (pantsbuild#6357)

    We declare product requirements in various places to force
    codegen to run before, e.g., dependency resolution.

    However currently these requirements are mandatory, so
    deregistering all codegen backends will cause the round engine
    to find no producers of these products, and fail.

    This commit changes these requirements to optional, so
    that if there are no codegen backends, we don't attempt
    to depend on their products.

    Also, minor unrelated documentation nits that weren't worth a
    separate pull request.

commit b0a2469
Author: Eric Arellano <ericarellano@me.com>
Date:   Thu Aug 16 12:08:29 2018 -0700

    Python 3 - fixes to get green contrib (pantsbuild#6340)

    All contrib folders should now be green on Python 2 and Python 3, excluding `googlejavaformat` and `python`.

    This fixes multiple issues causing Py3 failures + adds Py3 to Travis.

commit 070a5c5
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Wed Aug 15 19:30:15 2018 -0700

    Run clippy with nightly rust on CI (pantsbuild#6347)

    This enables all default lints, and a selection of pedantic ones.

    It also fixes all code to be clean with respect to those lints.

    Each commit is independently reviewable.

    It's a shame that the lint selecting can't be shared across creates in some way.

commit 8ce6297
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Tue Aug 14 20:04:02 2018 -0700

    Use --entry-point not -c when building pex (pantsbuild#6349)

    I deterministically get this error:

    RuntimeError: Ambiguous script specification pants matches multiple entry points:pants = pants.bin.pants_loader:main pants = pants.bin.pants_loader:main

    when building with -c.

    pantsbuild#6267 changed this behaviour,
    but seems broken - this commit allows me to successfully build pexes.

commit 30878d8
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Tue Aug 14 19:11:19 2018 -0700

    Fix formatting of store.rs (pantsbuild#6350)

    It looks like pantsbuild#6336 merged with bad formatting.

commit 5f2ad63
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Tue Aug 14 18:40:32 2018 -0700

    Allow pex download path to be overridden (pantsbuild#6348)

commit b7c607f
Author: Stu Hood <stuhood@twitter.com>
Date:   Tue Aug 14 17:40:34 2018 -0700

    Recover from cancelled remote execution RPCs (pantsbuild#6188)

    ### Problem

    It's possible for an individual RPC to be cancelled while an entire Operation continues, but this is not currently handled. See the problem description on pantsbuild#6100.

    ### Solution

    Recover from `RpcStatus::Cancelled` by retrying the same operation.

    ### Result

    Without the Fixes pantsbuild#6100.

commit a96001e
Author: Ity Kaul <ity@users.noreply.github.com>
Date:   Tue Aug 14 16:19:31 2018 -0700

    Download Directory recursively from remote CAS  (pantsbuild#6336)

    ### Problem

    We dont currently have the ability to download a `Directory`, and its contents, recursively.

    ### Solution

    Add `ensure_local_has_recursive_directory()` to be able to download  a `Directory`, and its contents, recursively.

commit e391cd1
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Tue Aug 14 16:11:22 2018 -0700

    Process execution: Create symlink to JDK on demand (pantsbuild#6346)

    This creates a symlink `.jdk` pointing at the passed JDK, if one was
    passed.

    This is a hack to allow for a remote execution platform to provide a
    pointer at its pre-installed JDK without needing to pollute absolute
    paths into the action definition.

    Long-term, we want the JDK to be part of the input directory for the
    action, but that's a lot of work, and we want to experiment with
    pre-installed JDKs in the mean time.

commit d79b7c3
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Mon Aug 13 21:26:08 2018 -0700

    Simplify ExecuteProcessRequest construction (pantsbuild#6345)

    Remove create_from_snapshot and create_with_empty_snapshot - they don't
    pull their weight.

    Allow default values for all defaultable values.

commit 33674ce
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Mon Aug 13 18:57:20 2018 -0700

    Use forked version of grpcio (pantsbuild#6344)

    This includes tikv/grpc-rs#211 which stops us
    seeing RpcFinished errors when the server responds very quickly to
    requests.

    Unfortunately, this will slow down building pants, as you'll need to
    checkout the git repository, but that should be a once-per-machine cost.

commit ccdf523
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Mon Aug 13 18:36:28 2018 -0700

    ci.sh uses positive rather than negative flags (pantsbuild#6342)

    This makes it much easier to reason about what's happening, and keeps
    .travis.yml more concise.

commit 68f4b59
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Mon Aug 13 16:12:44 2018 -0700

    Merge directories with identical files (pantsbuild#6343)

    This is useful for merging Snapshots for Targets if multiple targets own
    the same file. This should be rare, but does happen (e.g. multiple
    overlapping resource targets).

commit e189acd
Author: Daniel Wagner-Hall <dawagner@gmail.com>
Date:   Mon Aug 13 09:50:23 2018 -0700

    Set chunk size in process_executor (pantsbuild#6337)

commit 20d267f
Author: Paul <yaup@cs.washington.edu>
Date:   Sun Aug 12 14:41:26 2018 -0700

    added fullpath to fix path concat issue with files when not in git root (pantsbuild#6331)

    ### Problem
    As described in pantsbuild#6301 , when the git root and the build root is not the same, `changed` functionalities can be messed up because `git` returns path relative to the working directory, which produces bad paths when it is [concatenated with the git root](https://github.com/pantsbuild/pants/blob/c9f3af460a7504e082931a4b5a668b66f0cd4ed0/src/python/pants/scm/git.py#L161).

    ### Solution
    By adding the `--full-name` tag to [`git ls-files`](https://github.com/pantsbuild/pants/blob/c9f3af460a7504e082931a4b5a668b66f0cd4ed0/src/python/pants/scm/git.py#L178), the file concatenation is done properly

    ### Result
    Doing `./pants list --changed...` will concatenate file names correctly when it is not run from the git root.
CMLivingston pushed a commit to CMLivingston/pants that referenced this issue Aug 27, 2018
### Problem

It's possible for an individual RPC to be cancelled while an entire Operation continues, but this is not currently handled. See the problem description on pantsbuild#6100.

### Solution

Recover from `RpcStatus::Cancelled` by retrying the same operation.

### Result

Without the Fixes pantsbuild#6100.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants