Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A80: gRPC Metrics for TCP connection #428
base: master
Are you sure you want to change the base?
A80: gRPC Metrics for TCP connection #428
Changes from 1 commit
dedfb16
ffaeb22
d413291
5b5ba3f
583e6b3
8aa21c1
9f8038c
59ab138
ce27a69
d239c39
0726f6e
3bfe76b
2ccf768
83ac908
2a11aea
b6dc6d9
0aceebe
052d5cf
7e5bc86
092fbc1
bd18940
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markdroth @yashykt I feel like there's an equivalent label already in use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a relatively expensive fan-out for a busy server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we discussed about this offline. Is there a way to support an opaque user-defined peer info string? In prod, by default we only record the cell information due to the fan-out concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to figure out the right format here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose URI form, such as
ipv4:1.2.3.4:567
. @ejona86 @dfawley, would that be a problem for Java and Go if we ever added these metrics in your languages?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be clarified more. For instance, is
ipv4:///1.2.3.4:567
also OK? What is the format foripv6
? What other schemes do we support? @ejona86 is concerned thatipv4:1.2.3.4:567
is not a URI we should support in the first place. So maybe this should just be specified fully as strictly:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yashykt how well does seconds here align with other metrics we have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are following OpenTelemetry's conventions mostly which advocates for seconds 's' as the unit for duration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of remarks on this:
getsockopt
+TCP_INFO
that yields a min_rtt in some cases? Even on linux, I'm not sure all TCP algorithms measure the minimum RTT.grpc.transport.tcp
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think min_rtt is general enough that we can include, I presume most congestion control tracks RTT or equivalent to keep track of the estimated BDP.
And I think this gRFC is supposed to cover the definition of the metric and not about the collection method. @yashykt Yes, in Linux we can get it via either
getsockopt(TCP_INFO)
orTCP_NLA_MIN_RTT
fromcmsg
when timestamping is enabled.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, and even if it weren't the case, there is probably no problem with omitting it.
OK, yes. My point is more that the definition needs to be precise enough so that different implementations measure the same thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if it's important for the correctness of the metrics, it makes sense to give more information on this metric. Maybe you could say that a valid implementation of this is through
getsockopt
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a summary of how the metrics are collected and linked some references. Let me know if anything is unclear. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is Fathom? A quick internet search yields https://dl.acm.org/doi/pdf/10.1145/3603269.3604815 which sounds related (but I have not read it)... Do you think anything that uses it is going to useful to anyone outside Google?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, and app-limitedness is also not a general property either. I think we should update to something like "Latest throughput measured of the TCP connection."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what spurious means in this context... maybe we could link to some write-up, or include some more text defining these things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not find an authoritative source we can link, but I think we can include what's on go/fathom-metrics i.e. "These are retransmissions that TCP later discovered unnecessary.". The definition is public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I will add that in the description.