Serverset: Lazily connect to a limited number of servers #8165
Conversation
Depends on #8164
Left a few comments. Overall, the change is good. It brings useful functionality by reducing the number of active connections. Some of my comments are actually due to the structure of the pre-existing code, which makes the changeset harder to read. Feel free to ignore some of the comments if you feel it isn't really the responsibility of this changeset to address them. In other words, be as much or as little of a boyscout as you see fit 😉
src/rust/engine/serverset/src/lib.rs
Outdated
  fn is_connected(&self, server_index: usize) -> bool {
    for connection in &self.connections {
Nitpick: this is a "raw for loop" reimplementation of `any`.
I would find:
self
.connections
.iter()
.any(|connection| connection.server_index == server_index)
more elegant.
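To show the suggested refactor in isolation, here is a hedged, self-contained sketch; the `Connection` and `Inner` types below are trimmed stand-ins for the real ones in lib.rs, not the actual definitions:

```rust
// Minimal stand-ins (illustrative only, not the real serverset types).
struct Connection {
    server_index: usize,
}

struct Inner {
    connections: Vec<Connection>,
}

impl Inner {
    // The original "raw for loop" form.
    fn is_connected_loop(&self, server_index: usize) -> bool {
        for connection in &self.connections {
            if connection.server_index == server_index {
                return true;
            }
        }
        false
    }

    // The suggested iterator form: same behavior, more idiomatic.
    fn is_connected(&self, server_index: usize) -> bool {
        self
            .connections
            .iter()
            .any(|connection| connection.server_index == server_index)
    }
}

fn main() {
    let inner = Inner {
        connections: vec![
            Connection { server_index: 0 },
            Connection { server_index: 2 },
        ],
    };
    assert!(inner.is_connected(2));
    assert!(!inner.is_connected(1));
    assert_eq!(inner.is_connected_loop(2), inner.is_connected(2));
    println!("ok");
}
```

Both forms short-circuit on the first match, so the change is purely stylistic.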
Done
src/rust/engine/serverset/src/lib.rs
Outdated
@@ -220,65 +279,83 @@ impl<T: Clone + Send + Sync + 'static> Serverset<T> {
  pub fn next(&self) -> BoxFuture<(T, HealthReportToken), String> {
    let now = Instant::now();
    let mut inner = self.inner.lock();
    let server_count = inner.connections.len();

    let existing_connections = inner.connections.len();
Style:
I would probably move these two lines into their own well named function (in Inner) to reduce the cognitive overhead of parsing this function.
For instance, if inner.has_available_servers_to_connect_to() { ... }
(better name suggestions welcome 😄 )
Done
src/rust/engine/serverset/src/lib.rs
Outdated
  inner.next = inner.next.wrapping_add(1);
  let server = &inner.connections[i];
  for _ in 0..inner.servers.len() {
    let i = inner.next_server % inner.servers.len();
This would read better if the wrapping iterator logic was implemented on a substruct of inner:
inner.servers.iter().for_each(|server| ... )
As above
src/rust/engine/serverset/src/lib.rs
Outdated
        // A healthy server! Use it!
        return futures::future::ok((server.server.clone(), HealthReportToken { index: i }))
          .to_boxed();
      } // else cannot happen, or we would already have connected and returned.
    }
    // Unwrap is safe because if we hadn't populated earliest_future, we would already have returned.
I get this is not new in this changeset, so don't feel like you have to act on it, but it feels like instead of having comments like this one, where the onus is on the reader to verify all of this logic, avoiding early returns and having `else` branches to match the `if` branches would make this an actual no-brainer.
Rephrased this in a more iteratory way, but it still has an `unwrap` because the ordering of the checks we're doing makes it hard to encode the required information cleanly into the type system.
We want to say:
- Maybe connect (iterate to find a connectable)
- Else maybe reuse
- Else iterate to find what will next be connectable
If we were happy to unconditionally do the next finding in the maybe-connect stage, even if we may end up reusing a connection, we could phrase this as:
- Iterate over all servers, finding the next one we should try to connect to
- If we can make a connection, and the server is healthy, connect to it
- If nothing is connectable, and we have an open connection, re-use it
- Else reschedule for when something will be connectable
But because we don't want to do the iteration and time calculations in step 1 if step 2 is going to fail and we're going to use step 3, we can't just write this as one cute expression.
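As a rough illustration of that ordering (this is not the real lib.rs code; `Decision`, `Server`, and `pick` are invented names, and instants are modeled as plain integers), the connect / reuse / reschedule priority might be sketched as:

```rust
// Hypothetical, simplified model of the three-step ordering described above:
// 1) maybe connect to a healthy unconnected server,
// 2) else maybe reuse an existing connection,
// 3) else report when the earliest unhealthy server becomes connectable.

#[derive(Debug, PartialEq)]
enum Decision {
    Connect(usize),  // index of the server to connect to
    Reuse(usize),    // index of the existing connection to reuse
    WaitUntil(u64),  // earliest instant (as an integer) a server recovers
}

struct Server {
    connected: bool,
    healthy_at: Option<u64>, // None = healthy now; Some(t) = unhealthy until t
}

fn pick(servers: &[Server], now: u64) -> Decision {
    // Step 1: maybe connect -- find a healthy, unconnected server.
    if let Some(i) = servers
        .iter()
        .position(|s| !s.connected && s.healthy_at.map_or(true, |t| t <= now))
    {
        return Decision::Connect(i);
    }
    // Step 2: else maybe reuse an open connection.
    if let Some(i) = servers.iter().position(|s| s.connected) {
        return Decision::Reuse(i);
    }
    // Step 3: else find when something will next be connectable. The unwrap
    // mirrors the one in the PR: if nothing was connectable or connected,
    // every server must carry an unhealthy-until instant.
    let earliest = servers.iter().filter_map(|s| s.healthy_at).min().unwrap();
    Decision::WaitUntil(earliest)
}

fn main() {
    let servers = vec![
        Server { connected: false, healthy_at: Some(100) },
        Server { connected: true, healthy_at: None },
    ];
    // Server 0 is unhealthy until t=100, so at t=50 we reuse connection 1.
    assert_eq!(pick(&servers, 50), Decision::Reuse(1));
    // At t=100 server 0 has recovered and is unconnected: connect to it.
    assert_eq!(pick(&servers, 100), Decision::Connect(0));
    println!("ok");
}
```

Note how step 3's scan only runs when steps 1 and 2 have both failed, which is exactly the cost-ordering concern described above.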
Force-pushed a3701da to d226b75
Thanks for the refactoring. Looks good to me 😃
    }
    // Unwrap is safe because if we hadn't populated earliest_future, we would already have returned.
    let instant = earliest_future.unwrap();
    // Unwrap is safe because _some_ server must have an unhealthy_info or we would've already
Nice refactoring 😄 Reads much better 😄
Thanks, looks good. Just nits.
It might be good to describe how the independent round-robining of the servers and connections works in a docstring somewhere.
src/rust/engine/serverset/src/lib.rs
Outdated
struct Inner<T: Clone> {
  // Order is immutable through the lifetime of a Serverset, but contents of the Backend may change.
One thing that eases my mind is that this vec is always the same length, afaict. Would be good to mention that here, and indicate that that is why it is ok to hold indexes into this vec.
Done
src/rust/engine/serverset/src/lib.rs
Outdated
  pub(crate) next: usize,
  pub(crate) next_server: usize,

  // Points into the servers list; may change over time.
And vice-versa: the length of this vec might change, but we don't hold indexes into it (just a modded value).
Done
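A tiny sketch of the distinction the two comments above draw (all names here are illustrative): stored indexes are safe only into the fixed-length servers list, while the variable-length connections list is addressed by reducing a counter modulo its current length at the moment of use:

```rust
// Resolve a wrapping counter against the connections list's *current*
// length. The result is never stored, so growing or shrinking the list
// cannot invalidate anything.
fn current_connection(connections: &[usize], counter: usize) -> Option<usize> {
    if connections.is_empty() {
        None
    } else {
        Some(connections[counter % connections.len()])
    }
}

fn main() {
    let servers = ["s0", "s1", "s2"]; // fixed length: indexes are stable
    let mut connections = vec![0, 2]; // server indexes; length may change
    let counter = 3;

    // counter 3 against 2 connections -> slot 1 -> server index 2.
    assert_eq!(current_connection(&connections, counter), Some(2));

    // Dropping a connection doesn't invalidate the counter: the same
    // value simply resolves against the new length.
    connections.pop();
    assert_eq!(current_connection(&connections, counter), Some(0));
    println!("using {}", servers[0]);
}
```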
src/rust/engine/serverset/src/lib.rs
Outdated
    if inner.can_make_more_connections() {
      // Find a healthy server without a connection, connect to it.
      if let Some(ret) = inner.connect() {
        inner.next_connection = inner.next_connection.wrapping_add(1);
It's not clear why we increment `next_connection` here... the connection will be added at the end of the list, but we don't know where the index is pointing. If this is necessary, I would add a comment; otherwise, maybe remove it to keep things clear?
Done
Hm. There appears to be a reproducible issue with test remoting here, which causes the v2-remote-unit tests to fail. I've cleared cache on it, and retried a few times. The shard ends up "terminated", but much too quickly (in about 2 minutes):
I'll increase debug output a bit, and see whether I can remove the wait.
The exposed error is:
Since we're not seeing it on master, I do expect that it is related. Will take a look today. EDIT: I investigated this a bit, and AFAICT, the "deadline" we have set is 60 seconds, which makes hitting it 5 times fairly unlikely. But also, this particular failure is not reproducible. The truncation appears to be randomish, so I'm now suspecting either a panic or some other terminal error (sigsegv?) that is preventing us from getting a useful exit status due to the
EDIT3: But only able to repro it once.
So after the above investigation, I'm back to thinking that the process is terminating abnormally, because I was able to get a "successful" (except for tests failing due to #8067) run of the unit test shard locally using the
This is the exact error I have when remoting tests on my Ubuntu VM! The tests pass if everything is already cached, like
I'd feel much less crazy if this ends up being related. Solving this will be a huge help to unblocking me from remoting integration tests as well. |
@Eric-Arellano: Hmmm. While I repro'd this once locally, I also had a (mostly) successful result locally. So I suspect that that portion of the error might just be spurious. The truncation aspect of #8165 (comment) is the thing I need to investigate tomorrow.
This allows for a connection limit, and doesn't make any connections until they're needed. This may slightly slow down the first few requests, but means that we won't proactively connect to a large number of servers. We also now disconnect from the server when we see it fail (rather than just ceasing to use it). This makes the order of server connections slightly less guaranteed; before we would strictly round-robin, whereas now we may skip some servers or use them twice in a row when we connect/disconnect a new server.
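The behavior this commit message describes can be sketched with a hypothetical, synchronous miniature of the serverset; all names here are invented for illustration, and the real implementation is asynchronous and health-aware:

```rust
// A minimal model: connect lazily, cap connections at a limit, and drop a
// failing server's connection instead of merely skipping it.
struct Serverset {
    servers: Vec<String>,
    connections: Vec<usize>, // indexes into `servers`; length may change
    connection_limit: usize,
    next_server: usize,      // wrapping counter over the fixed servers list
    next_connection: usize,  // wrapping counter over live connections
}

impl Serverset {
    fn new(servers: Vec<String>, connection_limit: usize) -> Self {
        Serverset {
            servers,
            connections: Vec::new(),
            connection_limit,
            next_server: 0,
            next_connection: 0,
        }
    }

    fn is_connected(&self, server_index: usize) -> bool {
        self.connections.iter().any(|&i| i == server_index)
    }

    // Lazily connect to the next unconnected server while under the limit;
    // otherwise round-robin over the existing connections.
    fn next(&mut self) -> &str {
        if self.connections.len() < self.connection_limit {
            for _ in 0..self.servers.len() {
                let i = self.next_server % self.servers.len();
                self.next_server = self.next_server.wrapping_add(1);
                if !self.is_connected(i) {
                    self.connections.push(i);
                    return &self.servers[i];
                }
            }
        }
        let i = self.next_connection % self.connections.len();
        self.next_connection = self.next_connection.wrapping_add(1);
        &self.servers[self.connections[i]]
    }

    // On observed failure, disconnect entirely rather than just skipping.
    fn report_failure(&mut self, server_index: usize) {
        self.connections.retain(|&i| i != server_index);
    }
}

fn main() {
    let mut set = Serverset::new(vec!["a".into(), "b".into(), "c".into()], 2);
    assert_eq!(set.next(), "a"); // first request triggers the first connection
    assert_eq!(set.next(), "b"); // second connection, reaching the limit
    assert_eq!(set.next(), "a"); // at the limit: round-robin over connections
    set.report_failure(0);       // "a" fails: drop its connection
    assert_eq!(set.next(), "c"); // under the limit again: connect to "c"
    println!("ok");
}
```

The `main` trace shows the looser ordering the commit message mentions: after the failure, "c" is connected before "b" gets its second turn, which strict round-robin would not do.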
Force-pushed e846f05 to 38e012c
Ok, progress. It looks like the process running the tests is segfaulting. EDIT: Dangit: didn't get enough sleep. Signal 11 is sigsegv, not sighup.
Force-pushed 160ad91 to bbd0c39
bbd0c39 ended up being the reason for the segfault. Will figure that one out later.