Skeleton for the remaining metrics instruments #1617

Merged
merged 7 commits into from
Apr 26, 2021
114 changes: 114 additions & 0 deletions specification/metrics/new_api.md
@@ -34,6 +34,18 @@ Table of Contents
* [CounterFunc](#counterfunc)
* [CounterFunc creation](#counterfunc-creation)
* [CounterFunc operations](#counterfunc-operations)
* [GaugeFunc](#gaugefunc)
* [GaugeFunc creation](#gaugefunc-creation)
* [GaugeFunc operations](#gaugefunc-operations)
* [Histogram](#histogram)
* [Histogram creation](#histogram-creation)
* [Histogram operations](#histogram-operations)
* [UpDownCounter](#updowncounter)
* [UpDownCounter creation](#updowncounter-creation)
* [UpDownCounter operations](#updowncounter-operations)
* [UpDownCounterFunc](#updowncounterfunc)
* [UpDownCounterFunc creation](#updowncounterfunc-creation)
* [UpDownCounterFunc operations](#updowncounterfunc-operations)
* [Measurement](#measurement)

</details>
@@ -56,12 +68,16 @@ the metrics API:
|
+-- Meter(name='io.opentelemetry.runtime', version='1.0.0')
| |
| +-- Instrument<GaugeFunc, int>(name='cpython.gc', attributes=['generation'], unit='kB')
| |
| +-- instruments...
|
+-- Meter(name='io.opentelemetry.contrib.mongodb.client', version='2.3.0')
|
+-- Instrument<Counter, int>(name='client.exception', attributes=['type'], unit='1')
|
+-- Instrument<Histogram, double>(name='client.duration', attributes=['net.peer.host', 'net.peer.port'], unit='ms')
|
+-- instruments...

+-- MeterProvider(custom)
@@ -425,6 +441,104 @@ var obCaesiumOscillates = meter.CreateCounterFunc<UInt64>("caesium_oscillates",
provided by the `callback`, which is registered during the [CounterFunc
creation](#counterfunc-creation).

### Histogram

`Histogram` is a synchronous Instrument which can be used to report arbitrary
values that are likely to be statistically meaningful. It is intended for
statistics such as histograms, summaries, and percentiles.

Example uses for `Histogram`:

* the request duration
* the size of the response payload
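
Creation and `Record` are still marked TODO in this change, so the following is only a minimal Python sketch of the idea, not the API this specification will define. `SketchHistogram`, its `record` method, and the bucket boundaries are hypothetical names and values chosen for illustration. It shows how recorded request durations could feed bucket counts and simple summary statistics.

```python
from bisect import bisect_left


class SketchHistogram:
    """Hypothetical synchronous instrument; not the OpenTelemetry API."""

    def __init__(self, name, unit, boundaries):
        self.name = name
        self.unit = unit
        self.boundaries = sorted(boundaries)
        # One bucket per boundary plus an overflow bucket at the end.
        self.bucket_counts = [0] * (len(self.boundaries) + 1)
        self.count = 0
        self.total = 0

    def record(self, value):
        # Bucket i holds values <= boundaries[i]; the last bucket holds the rest.
        self.bucket_counts[bisect_left(self.boundaries, value)] += 1
        self.count += 1
        self.total += value


# Made-up request durations in milliseconds.
duration = SketchHistogram("client.duration", "ms", boundaries=[10, 100, 1000])
for ms in (3, 42, 42, 250, 5000):
    duration.record(ms)
```

Keeping counts per bucket rather than every raw value is what lets a backend estimate percentiles later without storing each measurement.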

#### Histogram creation

TODO

#### Histogram operations

##### Record

TODO

### GaugeFunc

`GaugeFunc` is an asynchronous Instrument which reports non-additive value(s)
Contributor

@victlu Apr 16, 2021

Why is a Gauge not needed?

Gauge is a synchronous Instrument which reports non-additive value(s)

The example would be to report the room temperature when I finished an expensive operation. I have "context" as I'm associating my operation to the room temperature. If I only used GaugeFunc with a callback, I cannot provide the correct timing context as well as operation context.

Then, we have a Distribution. Am I supposed to use that instead?

Member Author

Would you explain why the room temperature is associated with operation context? Or consider finding another scenario.

Contributor

What about changing room temperature to CPU temperature? I want to record the CPU temperature after I run a set of benchmark tests.

Member

Do you? I think this is kind of an artificial use case. If you really want that, you can save the value and return it in the func for the moment.

Also, at that point you probably want the distribution of temperatures after multiple runs.

Contributor

I'm struggling to understand why this scenario and/or pattern is artificial or unwanted. The pattern involves doing an operation (where I have my context) and I want to report a value (that is not additive).

i.e.

  • Report my CPU Temperature (value: Temp) after a benchmark test (context: test_name)
  • After processing an incoming request (context: Type of request) I want to report a status code (value: 1=success/2=failure/#=etc.)

Are we saying I should use the sync Distribution instrument for these?

> If you really want that you can save the value and return it in the func for the moment.

This is true. However, the current design requires the callback to report all of its values for the same point in time. This presents a problem when trying to report multiple operations from one (async) collection period.

Contributor

I will admit that I have trouble finding many real-life examples of a synchronous gauge. The most common use of the term, I think, comes from reading the speedometer of a car. This reading is instantaneous; you didn't make any request to your dashboard.

For real-life examples, I'm thinking about reading an electrical meter. You are an energy company and need to send people or RPCs around to each meter to get a reading of the current usage. As this request makes its way into the meter, it has context: which part of the grid it's on, where the substation is, who's taking the measurement, and so on. When it reaches the meter, it might see Voltage (a true Gauge) and kWh used (UpDownSum) and want to Record them.

I've introduced a conflict--we're talking about adding a synchronous Gauge (to set Voltage) but we have not been talking about adding a synchronous UpDownSum instrument, for synchronously making cumulative observations (discussed as long ago as open-telemetry/oteps#88), and I hope that we don't.

So why is it important to offer a synchronous true Gauge? Maybe because of historical precedent. However, if OTLP is going to embrace its Non-Monotonic Cumulative Sum point, which is often confused with Gauge, then we had better have a good answer as to why we're not introducing a Synchronous Non-Monotonic Cumulative Sum instrument. This would be the instrument used by the meter reader at my house, since I have solar panels. 😀

(_e.g. the room temperature - it makes no sense to report the temperature value
from multiple rooms and sum them up_) when the instrument is being observed.

Note: if the values are additive (_e.g. the process heap size - it makes sense
to report the heap size from multiple processes and sum them up, so we get the
total heap usage_), use [CounterFunc](#counterfunc) or
[UpDownCounterFunc](#updowncounterfunc).

Example uses for `GaugeFunc`:

* the current room temperature
* the CPU fan speed

#### GaugeFunc creation

TODO

#### GaugeFunc operations

`GaugeFunc` is only intended for asynchronous scenarios. The only operation is
provided by the `callback`, which is registered during the [GaugeFunc
creation](#gaugefunc-creation).
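
As a rough illustration of the callback model described above (creation is still TODO in this change, so none of this is the specified API), here is a minimal Python sketch. `SketchGaugeFunc`, `observe`, and `read_fan_speeds` are hypothetical names, and the fan speeds are made-up values standing in for a real sensor read.

```python
class SketchGaugeFunc:
    """Hypothetical asynchronous instrument; not the OpenTelemetry API."""

    def __init__(self, name, unit, callback):
        self.name = name
        self.unit = unit
        self._callback = callback  # registered once, at creation time

    def observe(self):
        # Invoked by the (hypothetical) SDK when the instrument is observed.
        # The callback yields (value, attributes) pairs.
        return list(self._callback())


def read_fan_speeds():
    # Stand-in for a real sensor read; the values are invented.
    yield 1200, {"fan": "cpu"}
    yield 800, {"fan": "case"}


fan_speed = SketchGaugeFunc("fan.speed", "rpm", read_fan_speeds)
observations = fan_speed.observe()
```

Application code never calls a record-style method here: values are pulled only when the instrument is observed, which is why the callback is the only operation.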

### UpDownCounter

`UpDownCounter` is a synchronous Instrument which supports increments and
decrements.

Note: if the value grows
Contributor

+1 to all these call-outs when to use other instruments.

[monotonically](https://wikipedia.org/wiki/Monotonic_function), use
[Counter](#counter) instead.

Example uses for `UpDownCounter`:

* the number of active requests
* the number of items in a queue
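
The `Add` operation is still TODO in this change, so the following Python sketch only illustrates the increment/decrement idea under assumed names. `SketchUpDownCounter`, `add`, and `value` are hypothetical and not part of the specified API.

```python
from collections import defaultdict


class SketchUpDownCounter:
    """Hypothetical synchronous instrument; not the OpenTelemetry API."""

    def __init__(self, name):
        self.name = name
        self._sums = defaultdict(int)

    @staticmethod
    def _key(attributes):
        # Attribute dicts are not hashable, so use a sorted tuple as the key.
        return tuple(sorted((attributes or {}).items()))

    def add(self, delta, attributes=None):
        # Unlike a monotonic counter, negative deltas are allowed.
        self._sums[self._key(attributes)] += delta

    def value(self, attributes=None):
        return self._sums[self._key(attributes)]


queue_items = SketchUpDownCounter("queue.items")
queue_items.add(3, {"queue": "orders"})   # three items enqueued
queue_items.add(-1, {"queue": "orders"})  # one item dequeued
queue_items.add(2, {"queue": "emails"})
```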


As an example, let's say I have an app that runs on a bunch of machines, and each machine has a bunch of queues with incoming customer traffic (I'm guessing this is a common situation). I know that if any of these queues gets particularly large, that is a bad sign, implying high latency and maybe failure in the near future because my service is receiving requests faster than it can process them. In my app I create an UpDownCounter and use the queue name as a label. Then in my monitoring dashboard I want to set up alerts and graphs that show me the maximum queue size (or maybe the 99th percentile of queue size) across my fleet. Can I do that, or should I have chosen a different instrument for this use case?

I think the underlying question may go more to the data model, but without answering it I don't know what to expect from the API. The data model refers to data streams having aggregate functions that work across both temporal and spatial dimensions. Does that mean the aggregation is the same no matter which dimension I am aggregating across? In this example I picked UpDownCounter because it is summing over time but across the queue dimension I don't want a sum, I want a maximum or a percentile. It wasn't clear if there was any way to express this.

I've probably got various follow up questions, but I need to figure this one out first : )

Member Author

Here is my take:

  1. UpDownCounter is the proper instrument since "number of items" by nature allows Sum. The "number of items" can be added across temporal and spatial dimensions. Although depending on the scenario one might prefer other aggregations (e.g. "the total number of items in all queues" might make more sense if the queues are homogeneous, and will make less sense if each queue is serving a totally different purpose).
  2. My answer would be yes - one should be able to customize the consumption experience and do alerting based on maximum queue size (or 99th-percentile queue size) without having to use a different instrument.
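
To make the two points above concrete, here is a small Python sketch of the arithmetic only. How such an aggregation would actually be configured is not defined in this PR; the machine names, queue sizes, threshold, and the `aggregate` helper are all hypothetical.

```python
# Hypothetical per-machine queue sizes, keyed by (machine, queue) labels.
queue_sizes = {
    ("machine-a", "orders"): 12,
    ("machine-b", "orders"): 7,
    ("machine-c", "orders"): 240,
}

ALERT_THRESHOLD = 100  # made-up alert level for a single queue


def aggregate(points, how):
    """Collapse all label sets into one value using the chosen function."""
    return how(points.values())


total_items = aggregate(queue_sizes, sum)    # all queues added together
largest_queue = aggregate(queue_sizes, max)  # worst single queue
should_alert = largest_queue > ALERT_THRESHOLD
```

The same data points support both views: the sum is meaningful because item counts are additive, while the maximum is what an alert on a single overloaded queue would use.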


Thanks Reiley! This leads me to a few follow ups:

  1. My understanding of the data model spec is that there is a single aggregation function for data streams across spatial and time dimensions whereas your answer implies there can be different aggregation functions for different dimensions. Am I misreading the spec or if not how would this configuration work?
  2. I would guess that it is rare to find a scenario where summing is the desirable spatial re-aggregation function for every label dimension. For example summing over a machine name label or a process instance label implies all the alert levels would have to be reconfigured whenever the number of machines/processes in the distributed system changes. Does that match your expectations or am I not considering the right scenarios?
  3. Following (2), if the default configuration for an UpDownCounter is that it always sums when aggregated over any label and nearly all scenarios have labels that should not be summed over, this implies nearly everyone will need to change the configuration away from the default for their UpDownCounters. If GaugeFunc and UpDownCounterFunc only differ based on their default spatial aggregation behavior and most use-cases won't be able to use those defaults then for practical purposes is it fair to say that UpDownCounterFunc and GaugeFunc don't have any differences that are likely to impact the typical scenarios that would make use of them? (I'm not implying that redundancy automatically needs to be eliminated, I just want to understand what are the practical differences, if any)

Contributor

There can be performance implications of using UpDownCounter in the way described here. Although we think of UpDownCounter as a way to encode non-monotonic deltas semantically, they are almost always used with in-process aggregation, thus exported in cumulative form.

The consumer of a stream of non-monotonic deltas is able to calculate a smoothed rate over windows of time, but has to start reading and keeping state itself from the beginning of time in order to extract actual queue sizes in these examples.

The reason users will sometimes prefer UpDownCounter to its asynchronous form is that the Metric SDK can be configured to compute virtual queue sizes along multiple sets of dimensions, whereas an asynchronous callback will likely only be able to report on actual queue sizes.

I occasionally talk about a "Stateless" exporter configuration for OTLP exporters and OTel SDKs that outputs deltas instead of cumulative sums, thus allowing the SDK to release memory. It makes sense to have a stateless option for all the instruments except UpDownCounter because of the preceding argument--they're almost always useful aggregated from the beginning of time, and it SHOULD be the caller's responsibility if possible to maintain that state. Still, I'd like it to be possible for a stateless SDK to output UpDownCounter aggregations with delta temporality, provided the consumer can maintain lifetime state on behalf of the process.
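
The state argument above can be illustrated with a few lines of Python. This is only a sketch with made-up deltas, not SDK behavior: it shows that a consumer of delta values recovers the real queue size only if it has summed every delta since the process started.

```python
from itertools import accumulate

# Hypothetical add() deltas reported since process start.
deltas = [+5, -2, +4, -6, +3]

# A consumer that kept state from the beginning sees the true queue sizes.
cumulative = list(accumulate(deltas))

# A consumer that starts listening after the first two deltas cannot
# recover them; its running sum is off by the missed amount.
late_view = list(accumulate(deltas[2:]))
missed = sum(deltas[:2])
```

Here `cumulative` ends at 4 while `late_view` ends at 1 and even dips below zero, which is not a possible queue size. The constant offset `missed` is exactly the state that somebody, SDK or consumer, has to have kept.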


@jmacd - Thanks! Your answer makes me wonder a few things:

  1. Your answer felt detailed, but our current guidance/decision tree does not contain that level of detail. Are you worried that users applying a simple decision tree such as "Queue size is good to monitor with UpDownCounter" will have a bad outcome, or do you think they will still be reasonably served by the simple advice?

  2. Thinking more about the requirement for the SDK to maintain state, this worries me about recommending people to use UpDownCounter in scenarios like this. I know that the common/simple scenario for monitoring involves enabling the SDK at process startup and having it track state for the lifetime of the service but if possible I'd like the API to support other scenarios such as:

  • Ops personnel use a monitoring dashboard to reconfigure collection policies dynamically. New configuration is transmitted over the network to a monitored app which applies it without restarting.
  • Ad-hoc tools similar to perfmon on Windows or top on Linux want to dynamically start listening to a metric for a period of time, then stop.

These scenarios are places where ideally the SDK does not pay the cost of maintaining the state up-front, but it can be enabled on demand. However, since it is impossible to recover the cumulative state of an UpDownCounter, the SDK is forced to either track everything just in case or fail any request to start tracking on demand. However, if the developer had used an instrument where they provided the size of the queue directly rather than its delta (sync sum, sync Gauge, async sum, async Gauge), there would be no problem enabling these metrics on demand. As the developer, I'd expect it is just as easy for me to record total queue size as it is to record the delta in queue size, so if one of these options gives me more capabilities than the other, then picking UpDownCounter sounds like it would be a worse choice and I'll never want to pick it.

(Also I still think the follow up questions in my 2nd post on this thread remain unanswered, apologies if I am being dense and this was intended to answer them as apparently I didn't understand that connection)


#### UpDownCounter creation

TODO

#### UpDownCounter operations

##### Add
reyang marked this conversation as resolved.

TODO

### UpDownCounterFunc

`UpDownCounterFunc` is an asynchronous Instrument which reports additive
value(s) (_e.g. the process heap size - it makes sense to report the heap size
from multiple processes and sum them up, so we get the total heap usage_) when
the instrument is being observed.

Note: if the value grows
[monotonically](https://wikipedia.org/wiki/Monotonic_function), use
[CounterFunc](#counterfunc) instead; if the value is non-additive, use
[GaugeFunc](#gaugefunc) instead.

Example uses for `UpDownCounterFunc`:

* the process heap size
* the approximate number of items in a lock-free circular buffer

#### UpDownCounterFunc creation

TODO

#### UpDownCounterFunc operations

`UpDownCounterFunc` is only intended for asynchronous scenarios. The only operation is
provided by the `callback`, which is registered during the [UpDownCounterFunc
creation](#updowncounterfunc-creation).
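
To show what "additive" means for the values such a callback reports, here is a short Python sketch. The function name, process IDs, and heap sizes are hypothetical; creation is still TODO in this change, so this is not the specified API.

```python
def observe_heap_size(pid, heap_bytes_by_pid):
    """Hypothetical callback body: report one process's current heap size."""
    return heap_bytes_by_pid[pid], {"pid": pid}


# Made-up heap sizes, in bytes, for two processes.
heap_bytes_by_pid = {101: 64_000_000, 102: 48_000_000}

observations = [observe_heap_size(pid, heap_bytes_by_pid) for pid in heap_bytes_by_pid]

# Because heap sizes are additive, summing across processes is meaningful.
total_heap = sum(value for value, _ in observations)
```

Summing room temperatures in the same way would produce a number with no meaning, which is the distinction between this instrument and `GaugeFunc`.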

## Measurement

A `Measurement` represents a data point reported via the metrics API to the SDK.