Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor client RPC timeouts #14965
Refactor client RPC timeouts #14965
Changes from 2 commits
40e5e49
93800b6
22ca463
47546ba
135a7ff
370700f
5ac6590
500c939
40510ec
7e89b91
e061482
9325aae
ea02dd6
b4190ab
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to sanity check that the setting the deadline on the stream connection did not persist when the client was cached in the connection pool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these tests depend on timing differences of 16ms then they'll probably end up being pretty flakey in CI right? Or did I misunderstand the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocking query default timeout (in the test setup): 100ms
Long.Wait
returns after: 101msThe max jitter is added when calculating blocking query timeouts: 100ms + 100/16 = 116ms
So theoretically if max jitter is ever applied by the server and it returns in 117ms then the test will flake. I will make the
Long.Wait
exactly 100msThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side comment to consider in "the future":
Do we have a lot of timing dependent (possibly flaky) tests? If so, it might be worthwhile at some point to develop a mockable interface passed in via dependency injection so we can use real timing capabilities in the real code, but mocked versions (that can't flake) in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! We actually have an established mock in a few places like
consul/agent/grpc-external/services/peerstream/testing.go
Line 136 in 80e51ff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the past timing-dependent tests have needed error margins in the order of 500ms or more to not be very flaky in CI.
What happens if CI CPU is starved and so the goroutine handling the RPC isn't scheduled for 200ms and so doesn't "notice" the timeout, that would flake right?
Usually I'd treat tests like this that use actual non-deterministic goroutines and wall clock time as a very very rough smoke test that the timeout does anything at all rather than trying to assert anything very specific about exactly how long the timeout was. i.e. make it so that if you comment out the timeout code it always fails but maybe after like several seconds then you can use much larger margins.
All that said, I've been much further from out test suite than you for a long time @kisunji so if you have good reason to think this will be robust I defer to your greater knowledge! I just know that any sleep-style test with timing tolerances lower than multiple hundreds of milliseconds used to be a non-starter in CI 😆 !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is simply reverting a change from #11500
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was changed in #11500 but I reverted it back to
DefaultDialTimeout
to be consistent with other usages.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that narrowly defining this interface was better than giving
RPCInfo
aTimeout
method and forcing every type to return a timeout (in most cases is rpc_hold_timeout)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like
deadline
isn't initialized if we cannot cast to a BlockableQuery. Will that cause issues for thisSetReadDeadline(deadline)
that will receive a timestamp from the past?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A zero value for
SetReadDeadline
will set no deadline, which is intended for writes and I think an acceptable fallback for any other types of queries that don't implementBlockableQuery
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a hard time working out whether this is the same behaviour as before - did
RPCInfo
only get implemented for Reads? If not then it seems like this new code is now never setting a timeout on write requests where it was before.Separate issue: assuming it's the same, is there a reason we allow writes to take arbitrarily long? I could understand having a separate timeout for them because they might have to wait for Raft etc. but it seems like it would be ideal if we didn't have zero timeout on writes...
That said, (assuming the behaviour is same as before) it's better to do that in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I took a look at the original PR and it seems to apply
rpc_hold_timeout
to all RPCs (I conflatednet.Conn.Read
with "read" requests).I think the correct solution would be to rename
rpc_read_timeout
torpc_client_timeout
and have it apply the same read timeout to all requests (queries, writes, basically anything except blocking-queries).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@banks I understand the concern about writes. But given the fact that we had 7s timeouts on writes in the last few releases and the only issues raised by the community and customers were regarding non-blocking queries, I think it's safe to apply a larger (maybe 300s) timeout across the board. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can solve the issue of not having timeouts on writes first and if anything comes up where we find that write RPCs need to be tuned separately, maybe another config may be more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nice. Yeah I couldn't recall if this was everything. If the old behaviour was to timeout writes too then I'd def not want to remove that!
Yeah this sounds ideal to me for now. We can add more config if we really need to.