Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skeleton for the remaining metrics instruments #1617

Merged
merged 7 commits into from
Apr 26, 2021
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
85 changes: 85 additions & 0 deletions specification/metrics/new_api.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,15 @@ Table of Contents
* [CounterFunc](#counterfunc)
* [CounterFunc creation](#counterfunc-creation)
* [CounterFunc operations](#counterfunc-operations)
* [Gauge](#gauge)
Copy link
Member

@bogdandrutu bogdandrutu Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents here:
I would avoid mixing increment/decrement operations which make sense only for a "sum", with "Gauge" (set operation). So I would call this an UpDownCounter that has only add/dec (or just add but allows negative values) operations, but no set. Also I've failed to get an answer for a use-case of a set operation for a sync instrument, maybe worth documenting the use-case or add a link if I missed that.

In my mind we can have:

  1. Counter + CounterFunc (callback based version of the Counter, however we decide to name it) - the result is a monotonic sum
  2. UpDownCounter + UpDownCounterFunc (callback based version of the UpDownCounter, however we decide to name it) - the result is a non-monotonic sum
  3. Sync instrument for reporting values that "grouping" makes the most sense for them. We can call it Distribution (the name is the result of the grouping, which does not fit very well for an Instrument, consider Grouper, ValueRecorder etc.). For the moment no clear use-case for an async version of this.
  4. Helper sync instrument for reporting "durations". Call it Timer? For the moment no clear use-case for an async version of this.
  5. Async instrument to report "independent" values (e.g. room temperature). Call it Gauge (ValueObserver)?

From data-model perspective it is important to distinguish between a non-monotonic sum and a "gauge" (a.k.a value that we don't know how to re-aggregate/merge). The example I had in mind is if users report "cpu usage" as a metric with one label "state" with possible values ["used", "free", "inactive"] and we want to compute a new "memory usage" with no labels, excluding "state=free" we will be able to do that because we know summing the value from state=used + state=inactive is correct, if this was a gauge we cannot do that automatically, we need to ask the user what aggregation to apply when merging points.

I do understand that this requirement leaks some internal details into the API, but without user telling us via a "contract" that this represents a non-monotonic sum, we cannot infer that information.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed most parts of this comment, the only thing remaining (I believe) is the Timer.
I think Timer is not a blocker, it is more like a "nice to have" layer on top of distribution.

The PR is updated, please take another look.

Copy link
Member

@jonatan-ivanov jonatan-ivanov Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, a Gauge is naturally async and it has a set method. A synchronous scenario for a Gauge could be having a synchronous process and recording the last processed something in a gauge.

e.g.:

public updateCar(long id) {
    carRepository.update(id, ...);
    lastUpddatedCar.set(id);
}

In micrometer, a Gauge is always async but it can give you the backing state object that you can update:

AtomicLong lastUpdatedCar = registry.gauge("lastUpddatedCar", new AtomicLong(-1));
lastUpddatedCar.set(id);

Copy link
Member

@jonatan-ivanov jonatan-ivanov Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could somebody please help me to understand the nature of UpDownCounter? I suppose it will not publish a rate but it will publish the value just like a Gauge would, right?

If so, is it really needed?
There are two examples in the docs: "the number of active requests" and "the number of items in a queue". To me, both seem like a "Gauge-scenario", more than that the first one is an example for synchronously updating a Gauge in synchronous environments.

Let me give you an example for both:

AtomicInteger activeRequests = registry.gauge("numberGauge", new AtomicInteger(0));

public Response process(Request request) {
    activeRequests.increment();
    ...
    return response;
}

and

Queue<String> queue = new BlockingQueue();
Gauge.builder("queueSize", queue, Queue::size).register(registry);

or

Queue<String> queue = registry.gauge("queueSize", tags, new BlockingQueue<>(), Queue::size);

My ultimate question is, do we need all of these? Can't we just get along with a single Gauge like the above that can cover all of these scenarios (UpDownCounter, UpDownCounterFunc, Gauge, GaugeFunc)?

Copy link
Member Author

@reyang reyang Apr 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed with @jonatan-ivanov offline, here goes the summary why we need UpDownCounter (the name can be changed, but we do need such thing as a separate instrument):

  1. We need a sync instrument (which has access to the context/baggage) which could take delta value AND is not monotonic. We cannot squeeze this into an async instrument since we do not want to have any delta value in async instrument (we want async instrument to take absolute value instead).
  2. This instrument by nature (since it takes delta) would allow values to be added across dimensions (take the heap size of each worker process id, you will get the total heap size of all worker processes).
  3. This instrument does not fit Counter or Distribution.
  4. Calling it Gauge is probably a bad idea, since Gauge doesn't imply additive property.

@aabmass brought up an idea that we might be able to achieve this by having two Counters #1617 (comment). I want to see more feedback to his comment.

Copy link

@noahfalk noahfalk Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I'd like the decision tree we discussed (and user story of "What instrument to pick") to be highlighted in one spot

Can I propose we go further and attempt to write out what the user documentation text on the OTel website will be? Hopefully it should feel like the instruments and names we are choosing make the story we tell in the documentation concise, consistent and easily understood. If we can't make it sound that way I'd take it as a warning sign. I'm glad to take the first stab at this or collaborate with someone else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reyang this is 💯

* [Gauge creation](#gauge-creation)
* [Gauge operations](#gauge-operations)
* [GaugeFunc](#gaugefunc)
* [GaugeFunc creation](#gaugefunc-creation)
* [GaugeFunc operations](#gaugefunc-operations)
* [Distribution](#distribution)
reyang marked this conversation as resolved.
Show resolved Hide resolved
* [Distribution creation](#distribution-creation)
* [Distribution operations](#distribution-operations)
* [Measurement](#measurement)

</details>
Expand All @@ -56,12 +65,16 @@ the metrics API:
|
+-- Meter(name='io.opentelemetry.runtime', version='1.0.0')
| |
| +-- Instrument<GaugeFunc, int>(name='cpython.gc', attributes=['generation'], unit='kB')
| |
| +-- instruments...
|
+-- Meter(name='io.opentelemetry.contrib.mongodb.client', version='2.3.0')
|
+-- Instrument<Counter, int>(name='client.exception', attributes=['type'], unit='1')
|
+-- Instrument<Distribution, double>(name='client.duration', attributes=['net.peer.host', 'net.peer.port'], unit='ms')
|
+-- instruments...

+-- MeterProvider(custom)
Expand Down Expand Up @@ -425,6 +438,78 @@ var obCaesiumOscillates = meter.CreateCounterFunc<UInt64>("caesium_oscillates",
provided by the `callback`, which is registered during the [CounterFunc
creation](#counterfunc-creation).

### Gauge
Copy link
Member Author

@reyang reyang Apr 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've talked about the name "gauge" in the SIG meetings. Here are my notes:

  1. "Gauge" is a widely used, yet very confusing term, there are many metrics systems using the word "gauge", but each of them has slightly different meaning (e.g. Prometheus Gauge is different from Micrometer Gauge).
  2. We've already used the term "gauge" in the Data Model Specification, and it is for timeseries model rather than event model.

Copy link
Contributor

@victlu victlu Apr 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just throwing this out here... I do not hold a strong opinion on this naming...

Wiki have these definitions for gauge

  • is a device used to make measurements or in order to **display** certain dimensional information.
  • a device that displays the measurement of a monitored system by the use of a needle or pointer that moves along a calibrated scale

In the physical world, these gauges always include an actual display for immediate use. In our world, that is not exactly how it works. Thus, maybe we are dealing with just the "sensor" part without the display?

Wiki has definition for sensor

  • In the broadest definition, a sensor is a device, module, machine, or subsystem whose purpose is to detect events or changes in its environment and send the information to other electronics, frequently a computer processor. A sensor is always used with other electronics.

Wiki has a list of sensors. They include terms like *phone, *monitor, *meter, *sensor, *detector, *gauge, *scope, counter, etc...

I'm thinking maybe a sensor is more accurate to what our code does.

Copy link
Member Author

@reyang reyang Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anyone else wants to explore the "sensor" name please comment here.

I personally think "sensor" is a very broadly used term, and it is not found from the famous metrics systems, so I would avoid introducing it to OpenTelemetry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked up "instrument" and one of the definitions is "measuring device used to gauge". I've come to terms with "gauge" in the sense that was mentioned in last week's SIG meeting--it's an instrument of last resort, when either a Histogram or Counter are for some reason not appropriate, meaning for cases where you do not know the aggregation being applied.

"gauge" is a great term for cases where there is not a more specific instrument, and it's a great term for when there's not a better term. "Gauge" has several meanings in English--inspecific. Gauge instruments have many kinds of aggregation--inspecific.


`Gauge` is a synchronous Instrument which can be used for values that go up
and/or down. It is intended for monitoring things with natural upper bounds.

Note: never gauge something you can count with [Counter](#counter)!
reyang marked this conversation as resolved.
Show resolved Hide resolved

Example uses for `Gauge`:

* the number of active requests
* the number of items in a queue
reyang marked this conversation as resolved.
Show resolved Hide resolved

#### Gauge creation

TODO

#### Gauge operations

##### Add
reyang marked this conversation as resolved.
Show resolved Hide resolved

TODO

##### Set
reyang marked this conversation as resolved.
Show resolved Hide resolved

TODO

### GaugeFunc

`GaugeFunc` is an asynchronous Instrument which reports value(s) when the
instrument is being observed.

Note: if the value grows
[monotonically](https://wikipedia.org/wiki/Monotonic_function), use
[CounterFunc](#counterfunc) instead.

Example uses for `GaugeFunc`:

* the current temperature
* the process heap size
* the approximate number of items in a lock-free circular buffer

#### GaugeFunc creation

TODO

#### GaugeFunc operations

`GaugeFunc` is only intended for asynchronous scenario. The only operation is
provided by the `callback`, which is registered during the [GaugeFunc
creation](#gaugefunc-creation).

### Distribution

`Distribution` is a synchronous Instrument which can be used to report arbitrary
values. It is intended for statistics such as histograms, summaries, and
percentile.

Example uses for `Distribution`:

* the request duration
* the size of the response payload

#### Distribution creation

TODO

#### Distribution operations

##### Record

TODO

## Measurement

A `Measurement` represents a data point reported via the metrics API to the SDK.
Expand Down