Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A66: OpenTelemetry Metrics #380

Merged
merged 31 commits into from
Sep 28, 2023
Merged

A66: OpenTelemetry Metrics #380

merged 31 commits into from
Sep 28, 2023

Conversation

yashykt
Copy link
Member

@yashykt yashykt commented Jul 21, 2023

No description provided.

A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a great start!

Please let me know if you have any questions. Thanks!

A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
Copy link

@joybestourous joybestourous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for getting started on this!

A66-otel-stats.md Show resolved Hide resolved
A66-otel-stats.md Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
@akshayjshah
Copy link

I was a little surprised that this proposal doesn't reference the existing OpenTelemetry specification for RPC metrics, especially since that doc includes a section just for gRPC. Fully acknowledging that the OTel doc is still experimental, could this proposal adopt/extend OTel's semantic conventions rather than starting from scratch?

@yashykt
Copy link
Member Author

yashykt commented Jul 31, 2023

I was a little surprised that this proposal doesn't reference the existing OpenTelemetry specification for RPC metrics, especially since that doc includes a section just for gRPC. Fully acknowledging that the OTel doc is still experimental, could this proposal adopt/extend OTel's semantic conventions rather than starting from scratch?

@akshayjshah There were a bunch of conversations about this with the OTel folks, and the understanding was that it would be hard for different RPC systems to export a single set of metrics given the different nuances that each protocol would have. For example, gRPC semantics make a difference between attempts and calls. There are also differences between the various metrics exporting compressed/uncompressed sizes.

@akshayjshah
Copy link

@akshayjshah There were a bunch of conversations about this with the OTel folks, and the understanding was that it would be hard for different RPC systems to export a single set of metrics given the different nuances that each protocol would have.

I'm poking through issues on grpc/grpc and grpc/proposals, and I'm having a hard time finding these conversations. Were they in a place that's public, and do you have any links? There are lot of places to search, so I'm probably just looking in the wrong spot.

For example, gRPC semantics make a difference between attempts and calls. There are also differences between the various metrics exporting compressed/uncompressed sizes.

Every RPC (and REST) framework I'm aware of supports compression, so this doesn't strike me as gRPC-specific. Amending the OTel metrics schema to distinguish compressed from uncompressed sizes (and clarifying that protocol-specific framing shouldn't be counted) would be a nice upstream improvement. If gRPC's notion of "attempts" is unique, it seems reasonable to handle that as an additive set of metrics beyond what's in the OTel semantic conventions.

Many of the metrics proposed here - e.g., call latency - do have direct analogs in other systems. If the gRPC team feels that the metrics schema designed by the OpenTelemetry committers isn't appropriate, it would be nice to explain why in the Background and Related Proposals sections of this document. As written, this proposal suggests (by omission) that there's no prior art for collecting OpenTelemetry metrics from gRPC.

@yashykt
Copy link
Member Author

yashykt commented Aug 14, 2023

I'm poking through issues on grpc/grpc and grpc/proposals, and I'm having a hard time finding these conversations. Were they in a place that's public, and do you have any links? There are lot of places to search, so I'm probably just looking in the wrong spot.

No, these were on video conferences unfortunately. I'll tag @jsuereth as the OTel Metrics Conventions spokesperson.

If the gRPC team feels that the metrics schema designed by the OpenTelemetry committers isn't appropriate, it would be nice to explain why in the Background and Related Proposals sections of this document.

Thanks! Yeah, I'll update the rationale. I don't know why I skipped that part.

A66-otel-stats.md Outdated Show resolved Hide resolved
@yashykt
Copy link
Member Author

yashykt commented Aug 28, 2023

We initially added authority as an attribute to servers, but the value add is not clear at the moment, and there is a potential of opening the servers to attack since the clients can send arbitrary authority values. (We could potentially build out an API where servers provide the ability to register which authorities should be accepted by the stats plugin, but we punt this for the future, for when we get a more clear idea as to the benefit here.)

Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments.

A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
Comment on lines 84 to 85
* `grpc.target` : Canonicalized target URI used when creating gRPC Channel,
e.g. "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we decide this needs to have 3(+) slashes, or that it was fine to use the user's text if they leave out the authority, e.g. "dns:host:port", or that it doesn't matter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an authoritative document on this, but I think it'd be nice to use the same format across languages. Core uses the three slash format even without authority.

Copy link
Member

@murgatroid99 murgatroid99 Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming doc lists the DNS name format as dns:[//authority/]host[:port], which implies to me that the slashes should only be included if there is an authority.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but it prefixes that with Most gRPC implementations support the following URI schemes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is accepted by the library and what we consider "canonical" are two different things. Someone should make a gRFC for "canonical target URI strings" I guess

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that makes sense, or maybe this gRFC will be setting "precedence" for what canonical means

A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
// provide an implementation of a MeterProvider. If the passed in Meter Provider
// does not have the view configured for an individual metric turned on, the API
// call in this component will create a default view for that metric.
func DialOption(mo MetricsOptions) grpc.DialOption {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DialOptionOptions? 😆

Seriously though I think we'll want this to take a struct that contains MetricsOptions instead so that we can include tracing options in the future. Or a name besides MetricsOptions that includes tracing options as well. Probably just Options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I will scale this up to Options { metrics options, tracing options } (OpenCensus just had tracing options). Or should I change this and scale upward to not break users eventually (i.e. have an options struct with metrics options as a knob but no tracing options until tracing)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to not break users eventually

Since we're not going to 1.0 it until after it's all done, it doesn't really matter, but it's probably a good idea to figure out how it's all going to look (what knobs we know we need for all features) before we start implementing.

A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Show resolved Hide resolved
A66-otel-stats.md Show resolved Hide resolved
yashykt added a commit to grpc/grpc that referenced this pull request Sep 22, 2023
Implements method attribute filtering as defined by the gRFC
grpc/proposal#380
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Show resolved Hide resolved
Copy link
Contributor

@DNVindhya DNVindhya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Few minor comments / clarifications.

A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo precedence nit

Comment on lines 370 to 372
// Any implementation knobs (i.e. views, bounds) set in the passed in object
// take precedence over the API calls from the interface in this component
// (i.e. it will create default views for unset views).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this precedence logic is only mentioned in this Go docstring. Can you please mention this somewhere in top level API documentation or a section earlier in the document?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement should actually be removed since we no longer API to override that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that there is a statement up above that says - Users of the gRPC OpenTelemetry plugin will use the OTel SDK's MeterProvider to [control the views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view) and customize the metrics that will be exported.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What? I still call out into API. I thought we needed to since these metrics are unconditionally recorded? That docstring doesn't mention API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you use the OTel API, but there is no longer any knob to control which instrument gets created and recorded on. I think the statement "Any implementation knobs (i.e. views, bounds) set in the passed in object take precedence over the API calls from the interface in this component (i.e. it will create default views for unset views)." is just going to confuse people. The term "API" is pretty loaded in this context, and if we want to explain this we should probably mention the whole thing as stated above in the Language-Specific Details section.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to mention it, instead of mentioning precedence, we should use the word "customize", same as the OTel documentation at https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view. Let me know if you want to update the text here.

A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
OpenTelemetry plugin. Overall, the APIs should have the following capabilities -

* Allow installing multiple OpenTelemetry plugins.
* Implementations must provide an option to set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this mutate API? Would it be fine (e.g., Java) to just pass it to the constructor and then it's guaranteed to be set? I see that is the case in the Java section below, but that seems counter to this requirement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that sounds fine. Whether it is a builder API or just passed through the constructor seems like detail. Do you prefer some different wording here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to keep in mind here is that when we later add tracing and logging support, it is conceivably possible that a user may want to enable tracing or logging without enabling metrics. In that case, the API should not force them to specify a MeterProvider.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java takes in an OpenTelemetry object on which you, I believe, you can choose whether to set MeterProvider or not, so yeah, both constructor and builder patterns work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is a single unified API for the various OTel components. Gotcha. I didn't piece that together.

I think the way earlier discussion of "don't stabilize the API until Metric and Tracing are both implemented" applies. But this does seem fair for now.

If targetFilter is not set, target is recorded as is. */
public OpenTelemetryBuilder targetFilter(Predicate<String> targetFilter);

public OpenTelemetryModule build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't want simple "build" to install globally (although it's not clear if that is the intention here). I'm fine with fixing up this API afterward, as long as others are. The sketch is fair, but the spellings will probably morph a bit.

(Similarly, I don't think we'll use the method name openTelemetry() above.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds fine to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I avoided using global in the build method, because buildAndRegisterGlobal has different meaning wrt OpenTelemetry java. In OpenTelemetry it also sets the sdk instance invoking the instance as Global instance which I didn't want to inferred here as well.

I named method as openTelemetry() since we are taking an openTelemetry instance. How does openTelemetrySdk() or sdk() sound?

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!

OpenTelemetry plugin. Overall, the APIs should have the following capabilities -

* Allow installing multiple OpenTelemetry plugins.
* Implementations must provide an option to set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to keep in mind here is that when we later add tracing and logging support, it is conceivably possible that a user may want to enable tracing or logging without enabling metrics. In that case, the API should not force them to specify a MeterProvider.

A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
A66-otel-stats.md Outdated Show resolved Hide resolved
@yashykt
Copy link
Member Author

yashykt commented Sep 28, 2023

Thanks for all the reviews!

@kingcb11
Copy link

LGTM modulo precedence nit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.