Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metrics prototype scenario #146
Metrics prototype scenario #146
Changes from 9 commits
f0f993b
0660044
1e9eaf1
c04673d
11c07e2
39dee6d
caa8bde
5dfb0e9
9346bae
93290f7
51bf6f5
9ba2d1e
e69b88a
dd40330
9e1666c
415c794
f4a6f5d
71fdfdd
992e0ab
eaa6a10
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This point (and the next point) beg the question:
Should we make ONE of the use cases be "hide OTEL behind another library to help users take advantage of OTEL telemetry-unfiication concepts, like Baggage + Resource". This could be Micrometer, OpenCensus, whatever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we think this type of offline historical reporting is a good primary use case for the metrics API? Although I can envision a metrics API doing it I'd guess it is a better fit for a standard transaction database where there are stronger guarantees about data consistency and richer data, but potentially worse performance/availability. I think of metrics being focused on high availability and low latency which is more oriented towards diagnostics/live monitoring/alerting where the grocery would be looking for signs like:
Of course if I am looking at it with too narrow a lense then this example might be accomplishing exactly what it intends, expanding my understanding of what scenarios a metrics API is intended to support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see (1) and (2) as good use-cases for monitoring using a metrics API, but maybe not (3). Although the example feels like it fell out of a textbook, you could re-imagine the store as a Message-Queue consumer processing orders in a horizontally scalable store. Can we ask another form of query: "how many stores were in operation at a given time?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be useful in your tables to show "resource" with sub-tables for the components coming from resource (host name, process_id I assume)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "Process CPU Usage" really an HTTP server concern? I feel like this may be a general instrumentation concern, and the HTTP metrics should really be focused only on things HTTP libraries do. CPU consumption is more of a process-wide concern.
I'd suggest using
Active HTTP Connections
as the pull metric from the HTTP library.For Server Room Temperature, I love what it's trying to do, but I'm having trouble buying it's part of an http library. Maybe move it into its own instrumentation component?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed CPU since itself is a complex topic and not the focus on this OTEP/prototype.
I've extracted temperature/humidity to a separate lib to make the scenario more realistic/reasonable.
For Active HTTP Connections, I don't know if we want to do it by reporting "total received - total processed" or doing it differently (see the discussion here). Given this discussion might take extra time, probably leave it out for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing about CPU is that it's a perfect example of an asynchronous instrument in the draft API. We want these to be recorded though callbacks, because it's expensive. This means the value is returned in cumulative form, not in delta form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will ask about how vendors can "enrich" a data point. I assume we have some "Processors" or extension points for vendors to enrich or alter data points (what is allowed/disallowed is also a discussion).
Separately, it would be good to clearly state who is responsible for providing the "Resource" labels? Is it an Exporter or a Processor, or Auto-instrumentation, or ??? Is this data available from beginning or added at end of the pipeline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC - This is done in Exporters for Trace. If a vendor isn't directly supporting
Resouce
(or where their notion ofResource
doesn't fully align with OTLP), the vendor can liftResource
labels onto the trace in the export method. We actually plan to do this for instrumentation library (although I don't think we do it today).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are more like the SDK design question and implementation question. Probably not putting too much info in this doc since we're trying to cover the scenario/scope here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is something we want in initial stage, but i'd like to add a requirement of same metric being reported with different interval, potentially with diff. dimensions.
For example, the app owner wants to see HTTP Server Duration metric exported every 1 second with only HttpStatusCode dimension, and HTTP Server Duration metric exported every 30 seconds with dimensions {hostname, HTTP Method, Host, Status Code, Client Type}. The former is typically used for near-real-time dashboards, and the latter for more permanent storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your example - I guess normally people would only report the 1 second one from the SDK pre-aggregation, and rely on the metrics backend to aggregate the 30 seconds one (and daily/weekly/monthly summary if there is a need).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we need to create multiple "pipelines" so that each pipeline can be configured individually. This implies IMHO that measurements reported from API needs to be routed to each individual pipelines.
Currently, I think this is an issue with our use of a "Default" provider in Library X, if Library X does not take a dependency on the SDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this example asks for three counters, because it seems possible to achieve with two instruments: a count of received requests and a histogram of response durations (i.e., seems to call for either a view or a 3rd instrument).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it might affect the semantic convention open-telemetry/opentelemetry-specification#1378 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Reply: open-telemetry/opentelemetry-specification#1378 (comment))