
Serverset: Lazily connect to a limited number of servers #8165

Merged

Conversation

illicitonion (Contributor):

This allows for a connection limit, and doesn't make any connections
until they're needed.

This may slightly slow down the first few requests, but means that we
won't proactively connect to a large number of servers.

We also now disconnect from a server when we see it fail (rather than merely no longer using it).

This makes the order of server connections slightly less predictable: previously we would strictly round-robin, whereas now we may skip some servers, or use one twice in a row, as we connect to or disconnect from a server.
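
For illustration only, a minimal self-contained sketch of the lazy, connection-limited pattern described above; all names (LazyServerset, connection_limit, report_failure) are hypothetical stand-ins rather than the serverset crate's API, and the "connection" is just a string placeholder:

    // Hypothetical sketch; not the real serverset crate. Assumes a non-empty server list.
    use std::collections::HashMap;

    struct LazyServerset {
      // Fixed list of server addresses.
      servers: Vec<String>,
      // server index -> placeholder "connection"; populated lazily.
      connections: HashMap<usize, String>,
      // Never hold more than this many connections at once.
      connection_limit: usize,
      // Wrapping round-robin cursor over `servers`.
      next_server: usize,
    }

    impl LazyServerset {
      fn new(servers: Vec<String>, connection_limit: usize) -> Self {
        LazyServerset { servers, connections: HashMap::new(), connection_limit, next_server: 0 }
      }

      // Hand out a connection, connecting only when needed and only up to the limit.
      fn next(&mut self) -> Option<&String> {
        let i = self.next_server % self.servers.len();
        self.next_server = self.next_server.wrapping_add(1);
        if !self.connections.contains_key(&i) && self.connections.len() < self.connection_limit {
          // Connect lazily, the first time this server is actually asked for.
          self.connections.insert(i, format!("conn-to-{}", self.servers[i]));
        }
        if self.connections.contains_key(&i) {
          return self.connections.get(&i);
        }
        // At the limit and this server isn't connected: reuse an existing connection,
        // which is why strict round-robin ordering is no longer guaranteed.
        self.connections.values().next()
      }

      // On an observed failure, drop the connection rather than just skipping the server.
      fn report_failure(&mut self, server_index: usize) {
        self.connections.remove(&server_index);
      }
    }

    fn main() {
      let mut set = LazyServerset::new(vec!["a:80".into(), "b:80".into(), "c:80".into()], 2);
      for _ in 0..5 {
        println!("{:?}", set.next());
      }
      set.report_failure(0);
    }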

@illicitonion (Contributor, Author):

Depends on #8164

@pierrechevalier83 (Contributor) left a comment:

Left a few comments. Overall, the change is good: it brings useful functionality by reducing the number of active connections. Some of my comments are really due to the structure of the pre-existing code, which makes the changeset harder to read. Feel free to ignore those if you feel it isn't this changeset's responsibility to address them. In other words, be as much or as little of a boyscout as you see fit 😉

src/rust/engine/serverset/src/lib.rs: two resolved review threads (not expanded)
  }

  fn is_connected(&self, server_index: usize) -> bool {
    for connection in &self.connections {
pierrechevalier83 (Contributor):

Nitpick: this is a "raw for loop" reimplementation of any.
I would find:

    self
      .connections
      .iter()
      .any(|connection| connection.server_index == server_index)

more elegant.

illicitonion (Contributor, Author):

Done
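
For reference, a hedged sketch of the helper with the suggestion applied, using the signature and field names from the excerpt above (a sketch, not necessarily the final code):

    fn is_connected(&self, server_index: usize) -> bool {
      // Same check as the raw loop, expressed with `any`.
      self
        .connections
        .iter()
        .any(|connection| connection.server_index == server_index)
    }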

@@ -220,65 +279,83 @@ impl<T: Clone + Send + Sync + 'static> Serverset<T> {
  pub fn next(&self) -> BoxFuture<(T, HealthReportToken), String> {
    let now = Instant::now();
    let mut inner = self.inner.lock();
    let server_count = inner.connections.len();

    let existing_connections = inner.connections.len();
pierrechevalier83 (Contributor):

Style:
I would probably move these two lines into their own well named function (in Inner) to reduce the cognitive overhead of parsing this function.
For instance, if inner.has_available_servers_to_connect_to() { ... } (better name suggestions welcome 😄 )

illicitonion (Contributor, Author):

Done
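
A later excerpt in this thread calls inner.can_make_more_connections(), so the check appears to have been extracted into a helper roughly like the following; the stand-in struct and the body are assumptions for illustration, not the crate's actual definitions:

    // Non-generic stand-in for the real Inner<T>; field names are assumptions.
    struct Connection;

    struct Inner {
      servers: Vec<String>,
      connections: Vec<Connection>,
      connection_limit: usize,
    }

    impl Inner {
      fn can_make_more_connections(&self) -> bool {
        // Under the configured limit, and at least one server is still unconnected.
        self.connections.len() < self.connection_limit
          && self.connections.len() < self.servers.len()
      }
    }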

inner.next = inner.next.wrapping_add(1);
let server = &inner.connections[i];
for _ in 0..inner.servers.len() {
let i = inner.next_server % inner.servers.len();
pierrechevalier83 (Contributor):

This would read better if the wrapping iterator logic was implemented on a substruct of inner:
inner.servers.iter().for_each(|server| ... )

illicitonion (Contributor, Author):

As above
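
For illustration, a self-contained sketch of the wrapping round-robin visit order that the loop above implements; indices_from is a hypothetical helper, not part of the crate:

    // Visit every index exactly once, starting at a wrapping cursor.
    fn indices_from(cursor: usize, len: usize) -> impl Iterator<Item = usize> {
      (0..len).map(move |offset| cursor.wrapping_add(offset) % len)
    }

    fn main() {
      // With 4 servers and the cursor at 6, the visit order is 2, 3, 0, 1.
      let order: Vec<usize> = indices_from(6, 4).collect();
      assert_eq!(order, vec![2, 3, 0, 1]);
      println!("{:?}", order);
    }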

// A healthy server! Use it!
return futures::future::ok((server.server.clone(), HealthReportToken { index: i }))
.to_boxed();
} // else cannot happen, or we would already have connected and returned.
}
// Unwrap is safe because if we hadn't populated earliest_future, we would already have returned.
pierrechevalier83 (Contributor):

I get this is not new in this changeset, so don't feel like you have to act on it, but it feels like instead of having comments like this one, where the onus is on the reader to verify all of this logic, avoiding early returns and having else branches to match the if branches would make this an actual no-brainer.

illicitonion (Contributor, Author):

Rephrased this in a more iterator-y way, but it still has an unwrap because the ordering of the checks we're doing makes it hard to encode the required information cleanly into the type system.

We want to say:

  1. Maybe connect (iterate to find a connectable)
  2. Else maybe reuse
  3. Else iterate to find what will next be connectable

If we were happy to unconditionally do the next finding in the maybe-connect stage, even if we may end up reusing a connection, we could phrase this as:

  1. Iterate over all servers, finding the next one we should try to connect to
  2. If we can make a connection, and the server is healthy, connect to it
  3. If nothing is connectable, and we have an open connection, re-use it
  4. Else reschedule for when something will be connectable

But because we don't want to do the iteration and time calculations in step 1 if step 2 is going to fail and we're going to use step 3, we can't just write this as one cute expression.
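
For illustration, a hedged sketch of that decision order with hypothetical types (Decision, choose); the real code returns BoxFutures and holds a lock, and, as noted above, it deliberately avoids computing all three inputs up front, which this simplified version does not capture:

    use std::time::Instant;

    // What `next()` decides to do, in the order described above.
    enum Decision {
      Connect(usize),   // index of a healthy, unconnected server to connect to
      Reuse(usize),     // index of an existing connection to reuse
      RetryAt(Instant), // when the earliest unhealthy server becomes connectable
    }

    fn choose(
      connectable: Option<usize>,
      reusable: Option<usize>,
      earliest_connectable_at: Option<Instant>,
    ) -> Decision {
      // 1. Maybe connect (iterate to find a connectable server).
      if let Some(i) = connectable {
        return Decision::Connect(i);
      }
      // 2. Else maybe reuse an open connection.
      if let Some(i) = reusable {
        return Decision::Reuse(i);
      }
      // 3. Else schedule a retry for when something will next be connectable.
      // The unwrap mirrors the invariant in the comment above: if nothing is
      // connectable or reusable, some server must have a recorded retry time.
      Decision::RetryAt(earliest_connectable_at.unwrap())
    }

    fn main() {
      match choose(None, Some(1), None) {
        Decision::Reuse(i) => println!("reusing connection {}", i),
        _ => unreachable!(),
      }
    }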

@illicitonion force-pushed the dwagnerhall/serverset-connections/wip2 branch from a3701da to d226b75 on August 12, 2019 15:30
@pierrechevalier83 (Contributor) left a comment:

Thanks for the refactoring. Looks good to me 😃

}
// Unwrap is safe because if we hadn't populated earliest_future, we would already have returned.
let instant = earliest_future.unwrap();
// Unwrap is safe because _some_ server must have an unhealthy_info or we would've already
pierrechevalier83 (Contributor):

Nice refactoring 😄 Reads much better 😄

@stuhood (Member) left a comment:

Thanks, looks good. Just nits.

It might be good to describe how the independent round-robining of the servers and connections works in a docstring somewhere.


struct Inner<T: Clone> {
  // Order is immutable through the lifetime of a Serverset, but contents of the Backend may change.
@stuhood (Member), Aug 12, 2019:

One thing that eases my mind is that this vec is always the same length, afaict. Would be good to mention that here, and indicate that that is why it is ok to hold indexes into this vec.

illicitonion (Contributor, Author):

Done

  pub(crate) next: usize,
  pub(crate) next_server: usize,

  // Points into the servers list; may change over time.
@stuhood (Member), Aug 12, 2019:

And vice-versa: the length of this vec might change, but we don't hold indexes into it (just a modded value).

illicitonion (Contributor, Author):

Done
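
Pulling the two replies together, a sketch of the invariants being documented; the types and comments below are assumptions for illustration, not the crate's actual definitions:

    use std::time::Instant;

    // Stand-ins for the real types.
    struct Backend<T> {
      server: T,
      // Present while the server is considered unhealthy.
      unhealthy_until: Option<Instant>,
    }

    struct Connection {
      server_index: usize,
    }

    struct Inner<T: Clone> {
      // Always the same length for the lifetime of the Serverset (only the
      // contents of each Backend change), so it is safe for `connections`
      // to hold indexes into it.
      servers: Vec<Backend<T>>,
      // May grow and shrink over time; nothing holds indexes into it. It is
      // only ever walked via a wrapping cursor reduced modulo its current length.
      connections: Vec<Connection>,
      // Round-robin cursors over `connections` and `servers` respectively.
      next: usize,
      next_server: usize,
    }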

if inner.can_make_more_connections() {
// Find a healthy server without a connection, connect to it.
if let Some(ret) = inner.connect() {
inner.next_connection = inner.next_connection.wrapping_add(1);
stuhood (Member):

It's not clear why we increment next_connection here... the connection will be added at the end of the list, but we don't know where the index is pointing. If this is necessary, I would add a comment; otherwise, maybe remove it to keep things clear?

illicitonion (Contributor, Author):

Done

@stuhood (Member) commented on Aug 14, 2019:

Hm. There appears to be a reproducible issue with test remoting here, which causes the v2-remote-unit tests to fail. I've cleared cache on it, and retried a few times. The shard ends up "terminated", but much too quickly (in about 2 minutes):

/home/travis/.travis/functions: line 524:  5266 Terminated              travis_jigger "${!}" "${timeout}" "${cmd[@]}"

I'll increase debug output a bit, and see whether I can remove the wait.

@stuhood (Member) commented on Aug 15, 2019:

The exposed error is:

Throw(Failed to execute process: Failed after 6 retries; last failure: Error attempting to upload digest Digest(Fingerprint<20e24a64653fc16455f152ef768932622d89ea6c0ef18a8e15c6f0ae7d25a30d>, 146741653): RpcFinished(Some(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded"), status_proto_bytes: None })))

Since we're not seeing it on master, I do expect that it is related. Will take a look today.

EDIT: I investigated this a bit, and AFAICT, the "deadline" we have set is 60 seconds, which makes hitting it 5 times fairly unlikely. But also, this particular failure is not reproducible. The truncation appears to be randomish, so I'm now suspecting either a panic or some other terminal error (sigsegv?) that is preventing us from getting a useful exit status due to the ci.py wrapper.
EDIT2: BUT, I was able to repro this error locally once with:

./build-support/bin/ci.py --bootstrap
./build-support/bin/ci.py --python-tests-v2 --remote-execution-enabled

EDIT3: But only able to repro it once.

@stuhood (Member) commented on Aug 16, 2019:

So after the above investigation, I'm back to thinking that the process is terminating abnormally, because I was able to get a "successful" (except for tests failing due to #8067) run of the unit test shard locally using the ci.py commands above.

@Eric-Arellano (Contributor) commented:

The exposed error is:

Throw(Failed to execute process: Failed after 6 retries; last failure: Error attempting to upload digest Digest(Fingerprint<20e24a64653fc16455f152ef768932622d89ea6c0ef18a8e15c6f0ae7d25a30d>, 146741653): RpcFinished(Some(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded"), status_proto_bytes: None })))

This is the exact error I have when remoting tests on my Ubuntu VM! The tests pass if everything is already cached, like util:strutil, but fail for anything that requires an upload, with this exact error. Specifically, this is the command I've been trying to run:

./pants --pants-config-files=pants.remote.ini --no-v1 --v2 --remote-oauth-bearer-token-path=<(gcloud auth application-default print-access-token | perl -p -e 'chomp if eof') --process-execution-speculation-strategy=none --remote-ca-certs-path=/snap/google-cloud-sdk/91/lib/third_party/grpc/_cython/_credentials/roots.pem  test tests/python/pants_test/backend/jvm/tasks/jvm_compile:jvm_compile

I'd feel much less crazy if this ends up being related. Solving this would also be a huge help in unblocking me on remoting the integration tests.

@stuhood (Member) commented on Aug 20, 2019:

This is the exact error I have when remoting tests on my Ubuntu VM! The tests pass if everything is already cached, like util:strutil, but fail for anything that requires an upload, with this exact error.

@Eric-Arellano: Hmmm. While I repro'd this once locally, I also had a (mostly) successful result locally. So I suspect that that portion of the error might just be spurious.

The truncation aspect of #8165 (comment) is the thing I need to investigate tomorrow.

@stuhood force-pushed the dwagnerhall/serverset-connections/wip2 branch 2 times, most recently from e846f05 to 38e012c, on August 20, 2019 22:25
@stuhood (Member) commented on Aug 20, 2019:

Ok, progress. It looks like the process running the tests is segfaulting.

EDIT: Dangit: didn't get enough sleep. Signal 11 is sigsegv, not sighup.

@stuhood force-pushed the dwagnerhall/serverset-connections/wip2 branch from 160ad91 to bbd0c39 on August 21, 2019 02:50
@stuhood (Member) commented on Aug 21, 2019:

bbd0c39 ended up being the reason for the segfault. Will figure that one out later.

@stuhood merged commit 70cb3c3 into pantsbuild:master on Aug 21, 2019
@stuhood deleted the dwagnerhall/serverset-connections/wip2 branch on August 21, 2019 04:07
@stuhood added this to the 1.19.x milestone on Aug 21, 2019
stuhood pushed a commit that referenced this pull request Aug 22, 2019