Skeleton for the remaining metrics instruments #1617
Changes from all commits
fe8f744
aa2fc82
d108a98
1d28064
db01229
a3e036f
aff8396
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,6 +34,18 @@ Table of Contents | |
* [CounterFunc](#counterfunc) | ||
* [CounterFunc creation](#counterfunc-creation) | ||
* [CounterFunc operations](#counterfunc-operations) | ||
* [GaugeFunc](#gaugefunc) | ||
* [GaugeFunc creation](#gaugefunc-creation) | ||
* [GaugeFunc operations](#gaugefunc-operations) | ||
* [Histogram](#histogram) | ||
* [Histogram creation](#histogram-creation) | ||
* [Histogram operations](#histogram-operations) | ||
* [UpDownCounter](#updowncounter) | ||
* [UpDownCounter creation](#updowncounter-creation) | ||
* [UpDownCounter operations](#updowncounter-operations) | ||
* [UpDownCounterFunc](#updowncounterfunc) | ||
* [UpDownCounterFunc creation](#updowncounterfunc-creation) | ||
* [UpDownCounterFunc operations](#updowncounterfunc-operations) | ||
* [Measurement](#measurement) | ||
|
||
</details> | ||
|
@@ -56,12 +68,16 @@ the metrics API: | |
| | ||
+-- Meter(name='io.opentelemetry.runtime', version='1.0.0') | ||
| | | ||
| +-- Instrument<GaugeFunc, int>(name='cpython.gc', attributes=['generation'], unit='kB') | ||
| | | ||
| +-- instruments... | ||
| | ||
+-- Meter(name='io.opentelemetry.contrib.mongodb.client', version='2.3.0') | ||
| | ||
+-- Instrument<Counter, int>(name='client.exception', attributes=['type'], unit='1') | ||
| | ||
+-- Instrument<Histogram, double>(name='client.duration', attributes=['net.peer.host', 'net.peer.port'], unit='ms') | ||
| | ||
+-- instruments... | ||
|
||
+-- MeterProvider(custom) | ||
|
@@ -425,6 +441,104 @@ var obCaesiumOscillates = meter.CreateCounterFunc<UInt64>("caesium_oscillates", | |
provided by the `callback`, which is registered during the [CounterFunc | ||
creation](#counterfunc-creation). | ||
|
||
### Histogram | ||
|
||
`Histogram` is a synchronous Instrument which can be used to report arbitrary | ||
values that are likely to be statistically meaningful. It is intended for | ||
statistics such as histograms, summaries, and percentiles. | ||
|
||
Example uses for `Histogram`: | ||
|
||
* the request duration | ||
* the size of the response payload | ||
|
||
#### Histogram creation | ||
|
||
TODO | ||
|
||
#### Histogram operations | ||
|
||
##### Record | ||
|
||
TODO | ||
|
||
### GaugeFunc | ||
|
||
`GaugeFunc` is an asynchronous Instrument which reports non-additive value(s) | ||
(_e.g. the room temperature - it makes no sense to report the temperature value | ||
from multiple rooms and sum them up_) when the instrument is being observed. | ||
|
||
Note: if the values are additive (_e.g. the process heap size - it makes sense | ||
to report the heap size from multiple processes and sum them up, so we get the | ||
total heap usage_), use [CounterFunc](#counterfunc) or | ||
[UpDownCounterFunc](#updowncounterfunc). | ||
|
||
Example uses for `GaugeFunc`: | ||
|
||
* the current room temperature | ||
* the CPU fan speed | ||
|
||
#### GaugeFunc creation | ||
|
||
TODO | ||
|
||
#### GaugeFunc operations | ||
|
||
`GaugeFunc` is only intended for asynchronous scenarios. The only operation is | ||
provided by the `callback`, which is registered during the [GaugeFunc | ||
creation](#gaugefunc-creation). | ||
|
||
### UpDownCounter | ||
|
||
`UpDownCounter` is a synchronous Instrument which supports increments and | ||
decrements. | ||
|
||
Note: if the value grows | ||
[monotonically](https://wikipedia.org/wiki/Monotonic_function), use | ||
[Counter](#counter) instead. | ||
Review comment:
+1 to all these call-outs on when to use other instruments.
|
||
Example uses for `UpDownCounter`: | ||
|
||
* the number of active requests | ||
* the number of items in a queue | ||
Review comment:
As an example, let's say I have an app that runs on a bunch of machines, and each machine has a bunch of queues with incoming customer traffic (I'm guessing this is a common situation). I know that if any of these queues gets particularly large, that is a bad sign, implying high latency and maybe failure in the near future, because my service is receiving requests faster than it can process them. In my app I create an UpDownCounter and use the queue name as a label. Then in my monitoring dashboard I want to set up alerts and graphs that show me the maximum queue size (or maybe the 99th percentile of queue size) across my fleet. Can I do that, or should I have chosen a different instrument for this use case?
I think the underlying question may go more to the data model, but without answering it I don't know what to expect from the API. The data model refers to data streams having aggregate functions that work across both temporal and spatial dimensions. Does that mean the aggregation is the same no matter which dimension I am aggregating across? In this example I picked UpDownCounter because it is summing over time, but across the queue dimension I don't want a sum; I want a maximum or a percentile. It wasn't clear if there was any way to express this.
I've probably got various follow-up questions, but I need to figure this one out first : )
Review comment:
Here is my take:
Review comment:
Thanks Reiley! This leads me to a few follow-ups:
Review comment:
There can be performance implications of using The consumer of a stream of non-monotonic deltas is able to calculate a smoothed rate over windows of time, but has to start reading and keeping state itself from the beginning of time in order to extract actual queue sizes in these examples. The reason users will sometimes prefer I occasionally talk about a "Stateless" exporter configuration for OTLP exporters and OTel SDKs that outputs deltas instead of cumulative sums, thus allowing the SDK to release memory. It makes sense to have a stateless option for all the instruments except
Review comment:
@jmacd - Thanks! Your answer makes me wonder a few things:
These scenarios are places where ideally the SDK does not pay the cost of maintaining the state up-front, but it can be enabled on demand. However, since it is impossible to recover the cumulative state of an UpDownCounter, the SDK is forced to either track everything just in case or fail any request to start tracking on demand. If the developer had instead used an instrument where they provided the size of the queue directly rather than its delta (sync sum, sync Gauge, async sum, async Gauge), there would be no problem enabling these metrics on demand. As the developer, I'd expect it to be just as easy for me to record the total queue size as it is to record the delta in queue size, so if one of these options gives me more capabilities than the other, then picking UpDownCounter sounds like it would be a worse choice and I'll never want to pick it.
(Also, I still think the follow-up questions in my 2nd post on this thread remain unanswered. Apologies if I am being dense and this was intended to answer them, as apparently I didn't understand that connection.)
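The state problem discussed in this thread can be illustrated with a minimal Python sketch. It uses made-up delta values and plain lists rather than any real OpenTelemetry API; the names `deltas`, `sizes`, `late_sizes`, and `late_absolute` are hypothetical. It only shows why a consumer that starts reading a stream of non-monotonic deltas late cannot recover the actual queue size, while a consumer of absolute sizes can.

```python
from itertools import accumulate

# Hypothetical queue-size changes, as they might be reported via UpDownCounter Add.
deltas = [+3, +2, -1, -4, +5]

# A consumer that has seen every delta can rebuild the queue size by summing them.
sizes = list(accumulate(deltas))

# A consumer that starts observing after the first two deltas sums only the rest.
# Its running total is off by the unseen starting size and can even go negative.
late_sizes = list(accumulate(deltas[2:]))

# If the absolute queue size had been reported instead, joining late is harmless:
# each observation is already the true value.
late_absolute = sizes[2:]

print(sizes)          # full history of queue sizes
print(late_sizes)     # late consumer's reconstruction from deltas
print(late_absolute)  # late consumer's view of absolute sizes
```

In this sketch the late reconstruction differs from the true sizes by a constant offset (the queue size at the moment observation started), which is exactly the state the consumer never saw.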
|
||
#### UpDownCounter creation | ||
|
||
TODO | ||
|
||
#### UpDownCounter operations | ||
|
||
##### Add | ||
|
||
|
||
TODO | ||
|
||
### UpDownCounterFunc | ||
|
||
`UpDownCounterFunc` is an asynchronous Instrument which reports additive | ||
value(s) (_e.g. the process heap size - it makes sense to report the heap size | ||
from multiple processes and sum them up, so we get the total heap usage_) when | ||
the instrument is being observed. | ||
|
||
Note: if the value grows | ||
[monotonically](https://wikipedia.org/wiki/Monotonic_function), use | ||
[CounterFunc](#counterfunc) instead; if the value is non-additive, use | ||
[GaugeFunc](#gaugefunc) instead. | ||
|
||
Example uses for `UpDownCounterFunc`: | ||
|
||
* the process heap size | ||
* the approximate number of items in a lock-free circular buffer | ||
|
||
#### UpDownCounterFunc creation | ||
|
||
TODO | ||
|
||
#### UpDownCounterFunc operations | ||
|
||
`UpDownCounterFunc` is only intended for asynchronous scenarios. The only operation is | ||
provided by the `callback`, which is registered during the [UpDownCounterFunc | ||
creation](#updowncounterfunc-creation). | ||
|
||
## Measurement | ||
|
||
A `Measurement` represents a data point reported via the metrics API to the SDK. | ||
|
Review comment:
Why is a Gauge not needed?
`Gauge` is a synchronous Instrument which reports non-additive value(s). The example would be to report the room temperature when I finished an expensive operation. I have "context" as I'm associating my operation to the room temperature. If I only used GaugeFunc with a callback, I cannot provide the correct timing context as well as operation context.
Then, we have a Distribution. Am I supposed to use that instead?
Review comment:
Would you explain why the room temperature is associated with operation context? Or consider finding another scenario.
Review comment:
What about changing room temperature to CPU temperature? I want to record the CPU temperature after I run a set of benchmark tests.
Review comment:
Do you? I think this is kind of an artificial use case. If you really want that, you can save the value and return it in the func for the moment.
Also probably at that point you want the distribution of temperatures after multiple runs.
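The "save the value and return it in the func" workaround could look roughly like the following Python sketch. It is not the API proposed in this PR: `LastValueHolder`, `run_benchmarks`, and the temperature readings are all hypothetical, and the `callback` method merely stands in for whatever function a GaugeFunc would be registered with.

```python
import threading


class LastValueHolder:
    """Hypothetical helper: keeps the most recent reading for a callback to report."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None

    def set(self, value):
        # Called synchronously by application code when a reading is taken.
        with self._lock:
            self._value = value

    def callback(self):
        # Stand-in for the function an asynchronous gauge would invoke on collection.
        with self._lock:
            return self._value


cpu_temperature = LastValueHolder()


def run_benchmarks(read_temperature):
    # ... the benchmark itself would run here ...
    cpu_temperature.set(read_temperature())


# Two benchmark runs happen before the next collection (readings are made up).
run_benchmarks(lambda: 71.5)
run_benchmarks(lambda: 73.0)

observed = cpu_temperature.callback()
print(observed)
```

Only the latest reading survives until the callback runs, so values from earlier runs within the same collection period are lost, and the reported value carries the collection time rather than the time of the operation.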
Review comment:
I'm struggling to understand why this scenario and/or pattern is artificial or unwanted. The pattern involves doing an operation (where I have my context) and wanting to report a value (that is not additive).
i.e.
Are we saying I should use the sync Distribution instrument for these?
This is true. However, the current design requires the callback to report for the same time. This presents a problem when trying to report multiple operations from one (async) collection period.
Review comment:
I will admit that I have trouble finding many real-life examples of a synchronous gauge. The most common use of the term, I think, comes from reading the speedometer of a car. This reading is instantaneous; you didn't make any request to your dashboard.
For real-life examples, I'm thinking about reading an electrical meter. You are an energy company and need to send people or RPCs around to each meter to get a reading of the current usage. As this request makes its way into the meter, it has context: which part of the grid it's on, where the substation is, who's taking the measurement, and so on. When it reaches the meter, it might see Voltage (a true Gauge) and kWh used (UpDownSum) and want to Record them.
I've introduced a conflict: we're talking about adding a synchronous Gauge (to set Voltage), but we have not been talking about adding a synchronous UpDownSum instrument for synchronously making cumulative observations (discussed as long ago as open-telemetry/oteps#88), and I hope that we don't.
So why is it important to offer a synchronous true Gauge? Maybe because of historical precedent. However, if OTLP is going to embrace its Non-Monotonic Cumulative Sum point, which is often confused with Gauge, then we had better have a good answer as to why we're not introducing a Synchronous Non-Monotonic Cumulative Sum instrument. This would be the instrument used by the meter reader at my house, since I have solar panels. 😀