-
-
Notifications
You must be signed in to change notification settings - Fork 644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recover from cancelled remote execution RPCs #6188
Recover from cancelled remote execution RPCs #6188
Conversation
@illicitonion : I'd like to add a test for this, but could use a pointer in the right direction. I was thinking to imitate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, apart from not having a test case yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test-wise, we're looking for the server to cancel the connection in this function:
pants/src/rust/engine/testutil/mock/src/execution_server.rs
Lines 159 to 183 in 6370c6c
fn send_next_operation_unary( | |
&self, | |
sink: grpcio::UnarySink<super::bazel_protos::operations::Operation>, | |
) { | |
match self | |
.mock_execution | |
.operation_responses | |
.lock() | |
.unwrap() | |
.pop_front() | |
{ | |
Some((op, duration)) => { | |
if let Some(d) = duration { | |
sleep(d); | |
} | |
sink.success(op.clone()); | |
} | |
None => { | |
sink.fail(grpcio::RpcStatus::new( | |
grpcio::RpcStatusCode::InvalidArgument, | |
Some("Did not expect further requests from client.".to_string()), | |
)); | |
} | |
} | |
} |
I believe that just dropping a UnarySink
without writing anything to it will perform a cancellation. But if not, we would need to explicitly send a cancelled status response.
To go about this, we'll want to add some sentinel to the operation_responses
queue:
pants/src/rust/engine/testutil/mock/src/execution_server.rs
Lines 19 to 21 in 6370c6c
operation_responses: | |
Arc<Mutex<VecDeque<(bazel_protos::operations::Operation, Option<Duration>)>>>, | |
} |
Maybe we want to make the type VecDeque<(Option<bazel_protos::operations::Operation>, Option<Duration>)>>>
and cancel on a None
. This is clear, but pretty intrusive just to cover one test case.
Maybe we want to just have a magic name
value on an Operation
or something to signify that we should cancel. This is less clear, but minimally invasive.
_ => {} | ||
} | ||
// Did not represent cancellation. | ||
Err(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could just be the RHS of the _
match clause?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would complicate things, because I'd have to take the err by value, which means I could not have a guard clause on that branch of the match.
148b366
to
33e53e6
Compare
Added a test in two commits: useful to review independently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great :) Thanks!
@@ -169,7 +190,17 @@ impl MockResponder { | |||
.unwrap() | |||
.pop_front() | |||
{ | |||
Some((op, duration)) => { | |||
Some(MockOperation { op: None, duration }) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would probably be clearer by splitting into two nested matches, rather than having everything be one top-level match
match ... {
Some(mock_operation) {
let MockOperation { op: maybe_op, duration: maybe_duration } = mock_operation;
if let Some(duration) = maybe_duration {
sleep(d);
}
if let Some(op) = maybe_op {
// send
} else {
drop(sink);
}
},
None => // fail
}
@@ -1780,7 +1790,10 @@ mod tests { | |||
}); | |||
response | |||
})); | |||
(operation, None) | |||
MockOperation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MockOperation::new(operation)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
### Problem It's possible for an individual RPC to be cancelled while an entire Operation continues, but this is not currently handled. See the problem description on pantsbuild#6100. ### Solution Recover from `RpcStatus::Cancelled` by retrying the same operation. ### Result Without the Fixes pantsbuild#6100.
commit f239399 Author: Eric Arellano <earellano@foursquare.com> Date: Thu Aug 16 22:34:25 2018 -0700 Remove stray pdb import commit 3babfb9 Author: Eric Arellano <earellano@foursquare.com> Date: Thu Aug 16 22:23:19 2018 -0700 Add back Py3 tests that now work commit a46b827 Author: Eric Arellano <earellano@foursquare.com> Date: Thu Aug 16 22:22:52 2018 -0700 Fix backend python issues commit 88fa18a Author: Eric Arellano <earellano@foursquare.com> Date: Thu Aug 16 20:08:42 2018 -0700 Fix backend project_info issues commit f4d4a2d Author: Eric Arellano <earellano@foursquare.com> Date: Thu Aug 16 19:54:39 2018 -0700 Actually fix console separator issues commit d0c83aa Author: Eric Arellano <earellano@foursquare.com> Date: Thu Aug 16 19:38:39 2018 -0700 Fix issues with backend native commit 9a1b3b7 Author: Eric Arellano <earellano@foursquare.com> Date: Thu Aug 16 19:33:48 2018 -0700 Fix most issues with backend jvm commit 62bc2a4 Author: Eric Arellano <earellano@foursquare.com> Date: Thu Aug 16 14:47:42 2018 -0700 Fix logging issue with passing object to format() commit a4bf759 Author: Eric Arellano <earellano@foursquare.com> Date: Thu Aug 16 08:05:07 2018 -0700 Decode test's execute_task() output commit b81cd24 Author: Eric Arellano <earellano@foursquare.com> Date: Thu Aug 16 08:03:57 2018 -0700 Fix issues with backend graph_info commit 5d28b9a Author: Eric Arellano <earellano@foursquare.com> Date: Sun Aug 12 23:13:06 2018 -0700 Fix console_task's console_separator unicode issues commit a8674f9 Author: Eric Arellano <earellano@foursquare.com> Date: Sun Aug 12 22:48:42 2018 -0700 Fix backend codegen unicode vs bytes issues commit c893f8c Author: Eric Arellano <earellano@foursquare.com> Date: Sun Aug 12 22:47:04 2018 -0700 Return unicode from stdout in pants integration test commit f4ab07c Author: Benjy Weinberger <benjyw@gmail.com> Date: Thu Aug 16 19:43:41 2018 -0600 Make requirements on codegen products optional. (pantsbuild#6357) We declare product requirements in various places to force codegen to run before, e.g., dependency resolution. However currently these requirements are mandatory, so deregistering all codegen backends will cause the round engine to find no producers of these products, and fail. This commit changes these requirements to optional, so that if there are no codegen backends, we don't attempt to depend on their products. Also, minor unrelated documentation nits that weren't worth a separate pull request. commit b0a2469 Author: Eric Arellano <ericarellano@me.com> Date: Thu Aug 16 12:08:29 2018 -0700 Python 3 - fixes to get green contrib (pantsbuild#6340) All contrib folders should now be green on Python 2 and Python 3, excluding `googlejavaformat` and `python`. This fixes multiple issues causing Py3 failures + adds Py3 to Travis. commit 070a5c5 Author: Daniel Wagner-Hall <dawagner@gmail.com> Date: Wed Aug 15 19:30:15 2018 -0700 Run clippy with nightly rust on CI (pantsbuild#6347) This enables all default lints, and a selection of pedantic ones. It also fixes all code to be clean with respect to those lints. Each commit is independently reviewable. It's a shame that the lint selecting can't be shared across creates in some way. commit 8ce6297 Author: Daniel Wagner-Hall <dawagner@gmail.com> Date: Tue Aug 14 20:04:02 2018 -0700 Use --entry-point not -c when building pex (pantsbuild#6349) I deterministically get this error: RuntimeError: Ambiguous script specification pants matches multiple entry points:pants = pants.bin.pants_loader:main pants = pants.bin.pants_loader:main when building with -c. pantsbuild#6267 changed this behaviour, but seems broken - this commit allows me to successfully build pexes. commit 30878d8 Author: Daniel Wagner-Hall <dawagner@gmail.com> Date: Tue Aug 14 19:11:19 2018 -0700 Fix formatting of store.rs (pantsbuild#6350) It looks like pantsbuild#6336 merged with bad formatting. commit 5f2ad63 Author: Daniel Wagner-Hall <dawagner@gmail.com> Date: Tue Aug 14 18:40:32 2018 -0700 Allow pex download path to be overridden (pantsbuild#6348) commit b7c607f Author: Stu Hood <stuhood@twitter.com> Date: Tue Aug 14 17:40:34 2018 -0700 Recover from cancelled remote execution RPCs (pantsbuild#6188) ### Problem It's possible for an individual RPC to be cancelled while an entire Operation continues, but this is not currently handled. See the problem description on pantsbuild#6100. ### Solution Recover from `RpcStatus::Cancelled` by retrying the same operation. ### Result Without the Fixes pantsbuild#6100. commit a96001e Author: Ity Kaul <ity@users.noreply.github.com> Date: Tue Aug 14 16:19:31 2018 -0700 Download Directory recursively from remote CAS (pantsbuild#6336) ### Problem We dont currently have the ability to download a `Directory`, and its contents, recursively. ### Solution Add `ensure_local_has_recursive_directory()` to be able to download a `Directory`, and its contents, recursively. commit e391cd1 Author: Daniel Wagner-Hall <dawagner@gmail.com> Date: Tue Aug 14 16:11:22 2018 -0700 Process execution: Create symlink to JDK on demand (pantsbuild#6346) This creates a symlink `.jdk` pointing at the passed JDK, if one was passed. This is a hack to allow for a remote execution platform to provide a pointer at its pre-installed JDK without needing to pollute absolute paths into the action definition. Long-term, we want the JDK to be part of the input directory for the action, but that's a lot of work, and we want to experiment with pre-installed JDKs in the mean time. commit d79b7c3 Author: Daniel Wagner-Hall <dawagner@gmail.com> Date: Mon Aug 13 21:26:08 2018 -0700 Simplify ExecuteProcessRequest construction (pantsbuild#6345) Remove create_from_snapshot and create_with_empty_snapshot - they don't pull their weight. Allow default values for all defaultable values. commit 33674ce Author: Daniel Wagner-Hall <dawagner@gmail.com> Date: Mon Aug 13 18:57:20 2018 -0700 Use forked version of grpcio (pantsbuild#6344) This includes tikv/grpc-rs#211 which stops us seeing RpcFinished errors when the server responds very quickly to requests. Unfortunately, this will slow down building pants, as you'll need to checkout the git repository, but that should be a once-per-machine cost. commit ccdf523 Author: Daniel Wagner-Hall <dawagner@gmail.com> Date: Mon Aug 13 18:36:28 2018 -0700 ci.sh uses positive rather than negative flags (pantsbuild#6342) This makes it much easier to reason about what's happening, and keeps .travis.yml more concise. commit 68f4b59 Author: Daniel Wagner-Hall <dawagner@gmail.com> Date: Mon Aug 13 16:12:44 2018 -0700 Merge directories with identical files (pantsbuild#6343) This is useful for merging Snapshots for Targets if multiple targets own the same file. This should be rare, but does happen (e.g. multiple overlapping resource targets). commit e189acd Author: Daniel Wagner-Hall <dawagner@gmail.com> Date: Mon Aug 13 09:50:23 2018 -0700 Set chunk size in process_executor (pantsbuild#6337) commit 20d267f Author: Paul <yaup@cs.washington.edu> Date: Sun Aug 12 14:41:26 2018 -0700 added fullpath to fix path concat issue with files when not in git root (pantsbuild#6331) ### Problem As described in pantsbuild#6301 , when the git root and the build root is not the same, `changed` functionalities can be messed up because `git` returns path relative to the working directory, which produces bad paths when it is [concatenated with the git root](https://github.com/pantsbuild/pants/blob/c9f3af460a7504e082931a4b5a668b66f0cd4ed0/src/python/pants/scm/git.py#L161). ### Solution By adding the `--full-name` tag to [`git ls-files`](https://github.com/pantsbuild/pants/blob/c9f3af460a7504e082931a4b5a668b66f0cd4ed0/src/python/pants/scm/git.py#L178), the file concatenation is done properly ### Result Doing `./pants list --changed...` will concatenate file names correctly when it is not run from the git root.
### Problem It's possible for an individual RPC to be cancelled while an entire Operation continues, but this is not currently handled. See the problem description on pantsbuild#6100. ### Solution Recover from `RpcStatus::Cancelled` by retrying the same operation. ### Result Without the Fixes pantsbuild#6100.
Problem
It's possible for an individual RPC to be cancelled while an entire Operation continues, but this is not currently handled. See the problem description on #6100.
Solution
Recover from
RpcStatus::Cancelled
by retrying the same operation.Result
Fixes #6100.