Clarify metrics monotonicity (open-telemetry#1995)
* Add supplementary doc about monotonicity

* rewrap

* reword the doc based on feedback

* explain how monotonicity could help reset detection

* improve the flow

* improve example

* minor fix

Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
reyang and jmacd authored Oct 13, 2021
1 parent 16d5690 commit d7b82cd
Showing 2 changed files with 80 additions and 5 deletions.
10 changes: 5 additions & 5 deletions specification/metrics/api.md
@@ -700,8 +700,8 @@ operation is provided by the `callback`, which is registered during the
`UpDownCounter` is a [synchronous Instrument](#synchronous-instrument) which
supports increments and decrements.

-Note: if the value grows
-[monotonically](https://wikipedia.org/wiki/Monotonic_function), use
+Note: if the value is
+[monotonically](https://wikipedia.org/wiki/Monotonic_function) increasing, use
[Counter](#counter) instead.

Example uses for `UpDownCounter`:
@@ -844,8 +844,8 @@ process heap size - it makes sense to report the heap size from multiple
processes and sum them up, so we get the total heap usage_) when the instrument
is being observed.

-Note: if the value grows
-[monotonically](https://wikipedia.org/wiki/Monotonic_function), use
+Note: if the value is
+[monotonically](https://wikipedia.org/wiki/Monotonic_function) increasing, use
[Asynchronous Counter](#asynchronous-counter) instead; if the value is
non-additive, use [Asynchronous Gauge](#asynchronous-gauge) instead.

@@ -886,7 +886,7 @@ The `callback` function is responsible for reporting the
observed. [OpenTelemetry API](../overview.md#api) authors SHOULD define whether
this callback function needs to be reentrant safe / thread safe or not.

-Note: Unlike [UpDownCounter.Add()](#add) which takes the increment/delta value,
+Note: Unlike [UpDownCounter.Add()](#add-1) which takes the increment/delta value,
the callback function reports the absolute value of the Asynchronous
UpDownCounter. To determine the reported rate the Asynchronous UpDownCounter is
changing, the difference between successive measurements is used.
75 changes: 75 additions & 0 deletions specification/metrics/supplementary-guidelines.md
@@ -9,6 +9,8 @@ Table of Contents:
* [Guidelines for instrumentation library
authors](#guidelines-for-instrumentation-library-authors)
* [Instrument selection](#instrument-selection)
* [Additive property](#additive-property)
* [Monotonicity property](#monotonicity-property)
* [Semantic convention](#semantic-convention)
* [Guidelines for SDK authors](#guidelines-for-sdk-authors)
* [Aggregation temporality](#aggregation-temporality)
@@ -62,6 +64,79 @@ Here is one way of choosing the correct instrument:
* If the value is NOT monotonically increasing - use an [Asynchronous
UpDownCounter](./api.md#asynchronous-updowncounter).

### Additive property

### Monotonicity property

In the OpenTelemetry Metrics [Data Model](./datamodel.md) and [API](./api.md)
specifications, the word `monotonic` has been used frequently.

It is important to understand that different
[Instruments](#instrument-selection) handle monotonicity differently.

Let's take an example with a network driver using a [Counter](./api.md#counter)
to record the total number of bytes received:

* During the time range (T<sub>0</sub>, T<sub>1</sub>]:
  * no network packet has been received
* During the time range (T<sub>1</sub>, T<sub>2</sub>]:
  * received a packet with `30` bytes - `Counter.Add(30)`
  * received a packet with `200` bytes - `Counter.Add(200)`
  * received a packet with `50` bytes - `Counter.Add(50)`
* During the time range (T<sub>2</sub>, T<sub>3</sub>]:
  * received a packet with `100` bytes - `Counter.Add(100)`

You can see that the total increment during (T<sub>0</sub>, T<sub>1</sub>] is
`0`, the total increment during (T<sub>1</sub>, T<sub>2</sub>] is `280` (`30 +
200 + 50`), the total increment during (T<sub>2</sub>, T<sub>3</sub>] is `100`,
and the total increment during (T<sub>0</sub>, T<sub>3</sub>] is `380` (`0 +
280 + 100`). All the increments are non-negative, in other words, the **sum is
monotonically increasing**.
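
The arithmetic above can be replayed with a minimal Python sketch. `SketchCounter` and its `collect_delta` method are hypothetical stand-ins written for this illustration and are not part of the OpenTelemetry API; the sketch only assumes that a Counter rejects negative increments and that increments are summed per time range.

```python
class SketchCounter:
    """Hypothetical stand-in for a Counter; not the OpenTelemetry API."""

    def __init__(self):
        self._pending = 0  # increments recorded since the last collection

    def add(self, increment):
        # A Counter only accepts non-negative increments.
        if increment < 0:
            raise ValueError("Counter increments must be non-negative")
        self._pending += increment

    def collect_delta(self):
        # Return the total increment for the time range that just ended
        # and start a new range.
        delta, self._pending = self._pending, 0
        return delta


counter = SketchCounter()
deltas = []

# (T0, T1]: no network packet has been received
deltas.append(counter.collect_delta())

# (T1, T2]: three packets
for size in (30, 200, 50):
    counter.add(size)
deltas.append(counter.collect_delta())

# (T2, T3]: one packet
counter.add(100)
deltas.append(counter.collect_delta())

# Total increment during (T0, T3]
total = sum(deltas)

# Running sum over the ranges; it never decreases because every delta is >= 0.
running = []
acc = 0
for d in deltas:
    acc += d
    running.append(acc)
```

Because every per-range delta is non-negative, the running sum can only stay flat or grow, which is what "monotonically increasing" means here.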

Note that it is inaccurate to say "the total bytes received by T<sub>3</sub> is
`380`", because there might be network packets received by the driver before we
started to observe it (e.g. before the last operating system reboot). The
accurate way is to say "the total bytes received during (T<sub>0</sub>,
T<sub>3</sub>] is `380`". In a nutshell, the count represents a **rate** which
is associated with a time range.

This monotonicity property is important because it gives the downstream systems
additional hints so they can handle the data in a better way. Imagine we report
the total number of bytes received in a cumulative sum data stream:

* At T<sub>n</sub>, we reported `3,896,473,820`.
* At T<sub>n+1</sub>, we reported `4,294,967,293`.
* At T<sub>n+2</sub>, we reported `1,800,372`.

The backend system could tell that there was an integer overflow or a system
restart during (T<sub>n+1</sub>, T<sub>n+2</sub>], so it has a chance to "fix"
the data.
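
As a rough illustration of how a backend might use this hint, the Python sketch below flags any point in a cumulative stream that is lower than its predecessor. The function names are hypothetical, and the "fix" shown (treating the first value after a reset as the whole delta) is only one common heuristic, assumed here for illustration; the specification does not prescribe it.

```python
def detect_resets(points):
    """Return the indexes where a cumulative value dropped below its predecessor."""
    return [i for i in range(1, len(points)) if points[i] < points[i - 1]]


def deltas_with_reset_handling(points):
    """Convert cumulative points to per-interval deltas.

    Assumption: after a reset the stream restarted from zero, so the
    first value after the reset is taken as the delta for that interval.
    """
    deltas = []
    for prev, cur in zip(points, points[1:]):
        deltas.append(cur - prev if cur >= prev else cur)
    return deltas


# Values reported at T_n, T_n+1, T_n+2 in the example above.
reported = [3_896_473_820, 4_294_967_293, 1_800_372]

reset_indexes = detect_resets(reported)
deltas = deltas_with_reset_handling(reported)
```

Without the monotonicity guarantee, the drop at T<sub>n+2</sub> would be indistinguishable from a legitimate decrease, and the backend would have no basis for applying such a correction.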

Let's take another example with a process using an [Asynchronous
Counter](./api.md#asynchronous-counter) to report the total page faults of the
process:

The page faults are managed by the operating system, and the process could
retrieve the number of page faults via some system APIs.

* At T<sub>0</sub>:
  * the process started
  * the process didn't ask the operating system to report the page faults
* At T<sub>1</sub>:
  * the operating system reported `1000` page faults for the process
* At T<sub>2</sub>:
  * the process didn't ask the operating system to report the page faults
* At T<sub>3</sub>:
  * the operating system reported `1050` page faults for the process
* At T<sub>4</sub>:
  * the operating system reported `1200` page faults for the process

You can see that the number being reported is the absolute value rather than
increments, and the value is monotonically increasing.

If we need to calculate "how many page faults have been introduced during
(T<sub>3</sub>, T<sub>4</sub>]", we need to apply subtraction `1200 - 1050 =
150`.
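
The same subtraction can be applied to every pair of successive observations. The Python sketch below is only an illustration: the list of `(timestamp, value)` pairs and the `deltas_between` helper are hypothetical, and timestamps at which nothing was observed (T<sub>0</sub> and T<sub>2</sub>) are simply absent from the list.

```python
# Absolute page-fault counts reported by the callback, keyed by timestamp label.
observations = [("T1", 1000), ("T3", 1050), ("T4", 1200)]


def deltas_between(observations):
    """Map each (previous, current] time range to the increase over that range."""
    result = {}
    for (t_prev, v_prev), (t_cur, v_cur) in zip(observations, observations[1:]):
        result[(t_prev, t_cur)] = v_cur - v_prev
    return result


page_fault_deltas = deltas_between(observations)
```

Note that the range (T<sub>1</sub>, T<sub>3</sub>] spans the unobserved T<sub>2</sub>, so its delta covers both intervals; the absolute values alone cannot say how the increase was split between them.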

### Semantic convention

Once you decided [which instrument(s) to be used](#instrument-selection), you