
Serverset: Lazily connect to a limited number of servers #8165

Merged

Conversation

illicitonion (Contributor):

This allows for a connection limit, and doesn't make any connections
until they're needed.

This may slightly slow down the first few requests, but means that we
won't proactively connect to a large number of servers.

We also now disconnect from a server when we see it fail (rather than merely no longer using it).

This makes the order of server connections slightly less predictable: previously we would strictly round-robin, whereas now we may skip some servers, or use one twice in a row, as we connect to or disconnect from a server.
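
For illustration only, a minimal self-contained sketch of the lazy, connection-limited pattern described above; all names (LazyServerset, connection_limit, report_failure) are hypothetical stand-ins rather than the serverset crate's API, and the "connection" is just a string placeholder:

    // Hypothetical sketch; not the real serverset crate. Assumes a non-empty server list.
    use std::collections::HashMap;

    struct LazyServerset {
      // Fixed list of server addresses.
      servers: Vec<String>,
      // server index -> placeholder "connection"; populated lazily.
      connections: HashMap<usize, String>,
      // Never hold more than this many connections at once.
      connection_limit: usize,
      // Wrapping round-robin cursor over `servers`.
      next_server: usize,
    }

    impl LazyServerset {
      fn new(servers: Vec<String>, connection_limit: usize) -> Self {
        LazyServerset { servers, connections: HashMap::new(), connection_limit, next_server: 0 }
      }

      // Hand out a connection, connecting only when needed and only up to the limit.
      fn next(&mut self) -> Option<&String> {
        let i = self.next_server % self.servers.len();
        self.next_server = self.next_server.wrapping_add(1);
        if !self.connections.contains_key(&i) && self.connections.len() < self.connection_limit {
          // Connect lazily, the first time this server is actually asked for.
          self.connections.insert(i, format!("conn-to-{}", self.servers[i]));
        }
        if self.connections.contains_key(&i) {
          return self.connections.get(&i);
        }
        // At the limit and this server isn't connected: reuse an existing connection,
        // which is why strict round-robin ordering is no longer guaranteed.
        self.connections.values().next()
      }

      // On an observed failure, drop the connection rather than just skipping the server.
      fn report_failure(&mut self, server_index: usize) {
        self.connections.remove(&server_index);
      }
    }

    fn main() {
      let mut set = LazyServerset::new(vec!["a:80".into(), "b:80".into(), "c:80".into()], 2);
      for _ in 0..5 {
        println!("{:?}", set.next());
      }
      set.report_failure(0);
    }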

@illicitonion (Contributor, Author):

Depends on #8164

@pierrechevalier83 (Contributor) left a comment:

Left a few comments. Overall, the change is good: it brings useful functionality by reducing the number of active connections. Some of my comments are really due to the structure of the pre-existing code, which makes the changeset harder to read. Feel free to ignore those if you feel it isn't this changeset's responsibility to address them. In other words, be as much or as little of a boyscout as you see fit 😉

src/rust/engine/serverset/src/lib.rs: two resolved review threads (not expanded)
  }

  fn is_connected(&self, server_index: usize) -> bool {
    for connection in &self.connections {
pierrechevalier83 (Contributor):

Nitpick: this is a "raw for loop" reimplementation of any.
I would find:

    self
      .connections
      .iter()
      .any(|connection| connection.server_index == server_index)

more elegant.

illicitonion (Contributor, Author):

Done
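
For reference, a hedged sketch of the helper with the suggestion applied, using the signature and field names from the excerpt above (a sketch, not necessarily the final code):

    fn is_connected(&self, server_index: usize) -> bool {
      // Same check as the raw loop, expressed with `any`.
      self
        .connections
        .iter()
        .any(|connection| connection.server_index == server_index)
    }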

@@ -220,65 +279,83 @@ impl<T: Clone + Send + Sync + 'static> Serverset<T> {
  pub fn next(&self) -> BoxFuture<(T, HealthReportToken), String> {
    let now = Instant::now();
    let mut inner = self.inner.lock();
    let server_count = inner.connections.len();

    let existing_connections = inner.connections.len();
pierrechevalier83 (Contributor):

Style:
I would probably move these two lines into their own well named function (in Inner) to reduce the cognitive overhead of parsing this function.
For instance, if inner.has_available_servers_to_connect_to() { ... } (better name suggestions welcome 😄 )

illicitonion (Contributor, Author):

Done
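
A later excerpt in this thread calls inner.can_make_more_connections(), so the check appears to have been extracted into a helper roughly like the following; the stand-in struct and the body are assumptions for illustration, not the crate's actual definitions:

    // Non-generic stand-in for the real Inner<T>; field names are assumptions.
    struct Connection;

    struct Inner {
      servers: Vec<String>,
      connections: Vec<Connection>,
      connection_limit: usize,
    }

    impl Inner {
      fn can_make_more_connections(&self) -> bool {
        // Under the configured limit, and at least one server is still unconnected.
        self.connections.len() < self.connection_limit
          && self.connections.len() < self.servers.len()
      }
    }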

inner.next = inner.next.wrapping_add(1);
let server = &inner.connections[i];
for _ in 0..inner.servers.len() {
let i = inner.next_server % inner.servers.len();
pierrechevalier83 (Contributor):

This would read better if the wrapping iterator logic was implemented on a substruct of inner:
inner.servers.iter().for_each(|server| ... )

illicitonion (Contributor, Author):

As above
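
For illustration, a self-contained sketch of the wrapping round-robin visit order that the loop above implements; indices_from is a hypothetical helper, not part of the crate:

    // Visit every index exactly once, starting at a wrapping cursor.
    fn indices_from(cursor: usize, len: usize) -> impl Iterator<Item = usize> {
      (0..len).map(move |offset| cursor.wrapping_add(offset) % len)
    }

    fn main() {
      // With 4 servers and the cursor at 6, the visit order is 2, 3, 0, 1.
      let order: Vec<usize> = indices_from(6, 4).collect();
      assert_eq!(order, vec![2, 3, 0, 1]);
      println!("{:?}", order);
    }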

// A healthy server! Use it!
return futures::future::ok((server.server.clone(), HealthReportToken { index: i }))
.to_boxed();
} // else cannot happen, or we would already have connected and returned.
}
// Unwrap is safe because if we hadn't populated earliest_future, we would already have returned.
pierrechevalier83 (Contributor):

I get this is not new in this changeset, so don't feel like you have to act on it, but it feels like instead of having comments like this one, where the onus is on the reader to verify all of this logic, avoiding early returns and having else branches to match the if branches would make this an actual no-brainer.

illicitonion (Contributor, Author):

Rephrased this in a more iterator-y way, but it still has an unwrap because the ordering of the checks we're doing makes it hard to encode the required information cleanly into the type system.

We want to say:

  1. Maybe connect (iterate to find a connectable)
  2. Else maybe reuse
  3. Else iterate to find what will next be connectable

If we were happy to unconditionally do the next finding in the maybe-connect stage, even if we may end up reusing a connection, we could phrase this as:

  1. Iterate over all servers, finding the next one we should try to connect to
  2. If we can make a connection, and the server is healthy, connect to it
  3. If nothing is connectable, and we have an open connection, re-use it
  4. Else reschedule for when something will be connectable

But because we don't want to do the iteration and time calculations in step 1 if step 2 is going to fail and we're going to use step 3, we can't just write this as one cute expression.
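
For illustration, a hedged sketch of that decision order with hypothetical types (Decision, choose); the real code returns BoxFutures and holds a lock, and, as noted above, it deliberately avoids computing all three inputs up front, which this simplified version does not capture:

    use std::time::Instant;

    // What `next()` decides to do, in the order described above.
    enum Decision {
      Connect(usize),   // index of a healthy, unconnected server to connect to
      Reuse(usize),     // index of an existing connection to reuse
      RetryAt(Instant), // when the earliest unhealthy server becomes connectable
    }

    fn choose(
      connectable: Option<usize>,
      reusable: Option<usize>,
      earliest_connectable_at: Option<Instant>,
    ) -> Decision {
      // 1. Maybe connect (iterate to find a connectable server).
      if let Some(i) = connectable {
        return Decision::Connect(i);
      }
      // 2. Else maybe reuse an open connection.
      if let Some(i) = reusable {
        return Decision::Reuse(i);
      }
      // 3. Else schedule a retry for when something will next be connectable.
      // The unwrap mirrors the invariant in the comment above: if nothing is
      // connectable or reusable, some server must have a recorded retry time.
      Decision::RetryAt(earliest_connectable_at.unwrap())
    }

    fn main() {
      match choose(None, Some(1), None) {
        Decision::Reuse(i) => println!("reusing connection {}", i),
        _ => unreachable!(),
      }
    }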

@illicitonion force-pushed the dwagnerhall/serverset-connections/wip2 branch from a3701da to d226b75 on August 12, 2019 15:30
@pierrechevalier83 (Contributor) left a comment:

Thanks for the refactoring. Looks good to me 😃

}
// Unwrap is safe because if we hadn't populated earliest_future, we would already have returned.
let instant = earliest_future.unwrap();
// Unwrap is safe because _some_ server must have an unhealthy_info or we would've already
pierrechevalier83 (Contributor):

Nice refactoring 😄 Reads much better 😄

@stuhood (Member) left a comment:

Thanks, looks good. Just nits.

It might be good to describe how the independent round-robining of the servers and connections works in a docstring somewhere.


struct Inner<T: Clone> {
  // Order is immutable through the lifetime of a Serverset, but contents of the Backend may change.
@stuhood (Member), Aug 12, 2019:

One thing that eases my mind is that this vec is always the same length, afaict. Would be good to mention that here, and indicate that that is why it is ok to hold indexes into this vec.

illicitonion (Contributor, Author):

Done

  pub(crate) next: usize,
  pub(crate) next_server: usize,

  // Points into the servers list; may change over time.
@stuhood (Member), Aug 12, 2019:

And vice-versa: the length of this vec might change, but we don't hold indexes into it (just a modded value).

illicitonion (Contributor, Author):

Done
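
Pulling the two replies together, a sketch of the invariants being documented; the types and comments below are assumptions for illustration, not the crate's actual definitions:

    use std::time::Instant;

    // Stand-ins for the real types.
    struct Backend<T> {
      server: T,
      // Present while the server is considered unhealthy.
      unhealthy_until: Option<Instant>,
    }

    struct Connection {
      server_index: usize,
    }

    struct Inner<T: Clone> {
      // Always the same length for the lifetime of the Serverset (only the
      // contents of each Backend change), so it is safe for `connections`
      // to hold indexes into it.
      servers: Vec<Backend<T>>,
      // May grow and shrink over time; nothing holds indexes into it. It is
      // only ever walked via a wrapping cursor reduced modulo its current length.
      connections: Vec<Connection>,
      // Round-robin cursors over `connections` and `servers` respectively.
      next: usize,
      next_server: usize,
    }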

if inner.can_make_more_connections() {
// Find a healthy server without a connection, connect to it.
if let Some(ret) = inner.connect() {
inner.next_connection = inner.next_connection.wrapping_add(1);
stuhood (Member):

It's not clear why we increment next_connection here... the connection will be added at the end of the list, but we don't know where the index is pointing. If this is necessary, I would add a comment; otherwise, maybe remove it to keep things clear?

illicitonion (Contributor, Author):

Done

@stuhood (Member) commented on Aug 14, 2019:

Hm. There appears to be a reproducible issue with test remoting here, which causes the v2-remote-unit tests to fail. I've cleared cache on it, and retried a few times. The shard ends up "terminated", but much too quickly (in about 2 minutes):

/home/travis/.travis/functions: line 524:  5266 Terminated              travis_jigger "${!}" "${timeout}" "${cmd[@]}"

I'll increase debug output a bit, and see whether I can remove the wait.

@stuhood (Member) commented on Aug 15, 2019:

The exposed error is:

Throw(Failed to execute process: Failed after 6 retries; last failure: Error attempting to upload digest Digest(Fingerprint<20e24a64653fc16455f152ef768932622d89ea6c0ef18a8e15c6f0ae7d25a30d>, 146741653): RpcFinished(Some(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded"), status_proto_bytes: None })))

Since we're not seeing it on master, I do expect that it is related. Will take a look today.

EDIT: I investigated this a bit, and AFAICT, the "deadline" we have set is 60 seconds, which makes hitting it 5 times fairly unlikely. But also, this particular failure is not reproducible. The truncation appears to be randomish, so I'm now suspecting either a panic or some other terminal error (sigsegv?) that is preventing us from getting a useful exit status due to the ci.py wrapper.
EDIT2: BUT, I was able to repro this error locally once with:

./build-support/bin/ci.py --bootstrap
./build-support/bin/ci.py --python-tests-v2 --remote-execution-enabled

EDIT3: But only able to repro it once.

@stuhood (Member) commented on Aug 16, 2019:

So after the above investigation, I'm back to thinking that the process is terminating abnormally, because I was able to get a "successful" (except for tests failing due to #8067) run of the unit test shard locally using the ci.py commands above.

@Eric-Arellano (Contributor) commented:

The exposed error is:

Throw(Failed to execute process: Failed after 6 retries; last failure: Error attempting to upload digest Digest(Fingerprint<20e24a64653fc16455f152ef768932622d89ea6c0ef18a8e15c6f0ae7d25a30d>, 146741653): RpcFinished(Some(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded"), status_proto_bytes: None })))

This is the exact error I have when remoting tests on my Ubuntu VM! The tests pass if everything is already cached, like util:strutil, but fail for anything that requires an upload, with this exact error. Specifically, this is the command I've been trying to run:

./pants --pants-config-files=pants.remote.ini --no-v1 --v2 --remote-oauth-bearer-token-path=<(gcloud auth application-default print-access-token | perl -p -e 'chomp if eof') --process-execution-speculation-strategy=none --remote-ca-certs-path=/snap/google-cloud-sdk/91/lib/third_party/grpc/_cython/_credentials/roots.pem  test tests/python/pants_test/backend/jvm/tasks/jvm_compile:jvm_compile

I'd feel much less crazy if this ends up being related. Solving this would also be a huge help in unblocking me on remoting the integration tests.

@stuhood (Member) commented on Aug 20, 2019:

This is the exact error I have when remoting tests on my Ubuntu VM! The tests pass if everything is already cached, like util:strutil, but fail for anything that requires an upload, with this exact error.

@Eric-Arellano: Hmmm. While I repro'd this once locally, I also had a (mostly) successful result locally. So I suspect that that portion of the error might just be spurious.

The truncation aspect of #8165 (comment) is the thing I need to investigate tomorrow.

@stuhood force-pushed the dwagnerhall/serverset-connections/wip2 branch 2 times, most recently from e846f05 to 38e012c, on August 20, 2019 22:25
@stuhood (Member) commented on Aug 20, 2019:

Ok, progress. It looks like the process running the tests is segfaulting.

EDIT: Dangit: didn't get enough sleep. Signal 11 is sigsegv, not sighup.

@stuhood force-pushed the dwagnerhall/serverset-connections/wip2 branch from 160ad91 to bbd0c39 on August 21, 2019 02:50
@stuhood (Member) commented on Aug 21, 2019:

bbd0c39 ended up being the reason for the segfault. Will figure that one out later.

@stuhood merged commit 70cb3c3 into pantsbuild:master on Aug 21, 2019
@stuhood deleted the dwagnerhall/serverset-connections/wip2 branch on August 21, 2019 04:07
@stuhood added this to the 1.19.x milestone on Aug 21, 2019
stuhood pushed a commit that referenced this pull request Aug 22, 2019