Supplementary Guidelines

Note: this document is NOT a spec. It is provided to support the Metrics API and SDK specifications, and it does NOT add any extra requirements to the existing specifications.

Guidelines for instrumentation library authors

Instrument selection

The Instruments are part of the Metrics API. They allow Measurements to be recorded synchronously or asynchronously.

Choosing the correct instrument is important, because:

  • It helps the library to achieve better efficiency. For example, if we want to report room temperature to Prometheus, we want to consider using an Asynchronous Gauge rather than periodically polling the sensor, so that we only access the sensor when a scrape happens.
  • It makes consumption easier for the user of the library. For example, if we want to report HTTP server request latency, we want to consider a Histogram, so most of the users can get a reasonable experience (e.g. default buckets, min/max) by simply enabling the metrics stream, rather than doing extra configuration.
  • It clarifies the semantics of the metrics stream, so the consumers have a better understanding of the results. For example, if we want to report the process heap size, by using an Asynchronous UpDownCounter rather than an Asynchronous Gauge, we've made it explicit that the consumer can add up the numbers across all processes to get the "total heap size".

Here is one way of choosing the correct instrument:

  • I want to count something (by recording a delta value):
    • If the value is monotonically increasing (the delta value is always non-negative) - use a Counter.
    • If the value is NOT monotonically increasing (the delta value can be positive, negative or zero) - use an UpDownCounter.
  • I want to record or time something, and the statistics about this thing are likely to be meaningful - use a Histogram.
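  • I want to measure something (by reporting an absolute value):
    • If the measurement values are non-additive - use an Asynchronous Gauge.
    • If the measurement values are additive:
      • If the value is monotonically increasing - use an Asynchronous Counter.
      • If the value is NOT monotonically increasing - use an Asynchronous UpDownCounter.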
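The decision list above can be sketched as a small helper. This is only an illustration: the function name and its boolean parameters are hypothetical, and the branch for absolute values is an assumption based on the additive property table and the heap size example in this document (non-additive values map to an Asynchronous Gauge, additive values to an Asynchronous Counter or Asynchronous UpDownCounter).

```python
def choose_instrument(reports_absolute_value: bool,
                      additive: bool = True,
                      monotonic: bool = True,
                      wants_statistics: bool = False) -> str:
    """Return an instrument name following the decision list (hypothetical helper)."""
    if wants_statistics:
        # Recording or timing something whose statistics are meaningful.
        return "Histogram"
    if not reports_absolute_value:
        # Counting by recording delta values.
        return "Counter" if monotonic else "UpDownCounter"
    # Measuring by reporting absolute values.
    if not additive:
        return "Asynchronous Gauge"
    return "Asynchronous Counter" if monotonic else "Asynchronous UpDownCounter"
```

For example, the process heap size is an absolute, additive, non-monotonic value, so `choose_instrument(True, additive=True, monotonic=False)` returns "Asynchronous UpDownCounter".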

Additive property

In OpenTelemetry a Measurement encapsulates a value and a set of Attributes. Depending on the nature of the measurements, they can be additive, non-additive or somewhere in the middle. Here are some examples:

  • The server temperature is non-additive. The temperatures in the table below add up to 226.2, but this value has no practical meaning.

    | Host Name | Temperature (F) |
    | --------- | --------------- |
    | MachineA  | 58.8            |
    | MachineB  | 86.1            |
    | MachineC  | 81.3            |
  • The mass of planets is additive, the value 1.18e25 (3.30e23 + 6.42e23 + 4.87e24 + 5.97e24) means the combined mass of terrestrial planets in the solar system.

    | Planet Name | Mass (kg) |
    | ----------- | --------- |
    | Mercury     | 3.30e23   |
    | Mars        | 6.42e23   |
    | Venus       | 4.87e24   |
    | Earth       | 5.97e24   |
  • The voltage of battery cells can be added up if the batteries are connected in series. However, if the batteries are connected in parallel, it makes no sense to add up the voltage values anymore.

In OpenTelemetry, each Instrument implies whether it is additive or not.

| Instrument                 | Additive     |
| -------------------------- | ------------ |
| Counter                    | additive     |
| UpDownCounter              | additive     |
| Histogram                  | mixed [1]    |
| Asynchronous Gauge         | non-additive |
| Asynchronous Counter       | additive     |
| Asynchronous UpDownCounter | additive     |

[1] The Histogram bucket counts are additive if the buckets are the same, and the sum is additive, but the min and max are non-additive.
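To make the "mixed" nature of a Histogram concrete, here is a minimal sketch of merging two histogram points. The `HistogramPoint` type and the `merge` function are hypothetical names used for illustration only, and the sketch assumes both points use the same bucket boundaries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HistogramPoint:
    bucket_counts: List[int]  # one count per bucket, identical boundaries assumed
    total: float              # sum of all recorded values
    minimum: float
    maximum: float


def merge(a: HistogramPoint, b: HistogramPoint) -> HistogramPoint:
    """Combine two points: counts and sum are added, min and max are not."""
    if len(a.bucket_counts) != len(b.bucket_counts):
        raise ValueError("bucket layouts differ, so the counts are not additive")
    return HistogramPoint(
        bucket_counts=[x + y for x, y in zip(a.bucket_counts, b.bucket_counts)],
        total=a.total + b.total,
        minimum=min(a.minimum, b.minimum),  # non-additive: take the smaller
        maximum=max(a.maximum, b.maximum),  # non-additive: take the larger
    )
```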

Numeric type selection

For Instruments which take increments and/or decrements as the input (e.g. Counter and UpDownCounter), the underlying numeric types (e.g., signed integer, unsigned integer, double) have a direct impact on the dynamic range, precision, and how the data is interpreted. Typically, integers are precise but have limited dynamic range, and might see overflow/underflow. The IEEE-754 double-precision floating-point format has a wide dynamic range, at the cost of precision.

Integer

Let's take an example: a 16-bit signed integer is used to count the committed transactions in a database, reported as cumulative sum every 15 seconds:

  • During (T0, T1], we reported 70.
  • During (T0, T2], we reported 115.
  • During (T0, T3], we reported 116.
  • During (T0, T4], we reported 128.
  • During (T0, T5], we reported 128.
  • During (T0, T6], we reported 173.
  • ...
  • During (T0, Tn+1], we reported 1,872.
  • During (Tn+2, Tn+3], we reported 35.
  • During (Tn+2, Tn+4], we reported 76.

In the above case, a backend system could tell that there was likely a system restart (because the start time has changed from T0 to Tn+2) during (Tn+1, Tn+2], so it has a chance to adjust the data to:

  • (T0, Tn+3] : 1,907 (1,872 + 35).
  • (T0, Tn+4] : 1,948 (1,872 + 76).

Imagine we keep the database running:

  • During (T0, Tm+1], we reported 32,758.
  • During (T0, Tm+2], we reported 32,762.
  • During (T0, Tm+3], we reported -32,738.
  • During (T0, Tm+4], we reported -32,712.

In the above case, the backend system could tell that there was an integer overflow during (Tm+2, Tm+3] (because the start time remains the same as before, and the value becomes negative), so it has a chance to adjust the data to:

  • (T0, Tm+3] : 32,798 (32,762 + 36).
  • (T0, Tm+4] : 32,824 (32,762 + 62).

As we can see in this example, even with the limitation of 16-bit integer, we can count the database transactions with high fidelity, without having to worry about information loss caused by integer overflows.
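The two adjustments above (restart and overflow) can be sketched as one routine. This is an illustrative sketch rather than how any particular backend works: `adjust_cumulative` is a hypothetical name, each point is assumed to be a `(start_time, reported_value)` pair, and the overflow case assumes the counter wrapped at most once between two reports.

```python
from typing import Iterable, List, Tuple


def adjust_cumulative(points: Iterable[Tuple[str, int]], bits: int = 16) -> List[int]:
    """Rebuild an ever-growing total from reported cumulative values."""
    modulus = 1 << bits
    total = 0
    prev_start = None
    prev_raw = 0
    adjusted = []
    for start, raw in points:
        if start != prev_start:
            # The start time changed: the counter began again from zero.
            delta = raw
        else:
            # Same start time: a modular difference recovers a single wrap.
            delta = (raw - prev_raw) % modulus
        total += delta
        adjusted.append(total)
        prev_start, prev_raw = start, raw
    return adjusted


restart = adjust_cumulative([("T0", 1872), ("Tn+2", 35), ("Tn+2", 76)])
overflow = adjust_cumulative(
    [("T0", 32758), ("T0", 32762), ("T0", -32738), ("T0", -32712)]
)
```

With the numbers from the example, `restart` ends with 1,907 and 1,948, and `overflow` ends with 32,798 and 32,824.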

It is important to understand that we are handling counter reset and integer overflow/underflow based on the assumption that we've picked the proper dynamic range and reporting frequency. Imagine we use the same 16-bit signed integer to count the transactions in a data center (which could have thousands if not millions of transactions per second): we wouldn't be able to tell if -32,738 was a result of 32,762 + 36 or 32,762 + 65,572 or even 32,762 + 131,108 if we report the data every 15 seconds. In this situation, either using a wider integer type (e.g. 32-bit integer) or increasing the reporting frequency (e.g. every microsecond, if we can afford the cost) would help.
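The ambiguity can be checked directly. The sketch below uses a hypothetical helper `wrap_int16` that emulates two's-complement wrap-around of a 16-bit signed integer; all three candidate totals from the paragraph above collapse to the same reported value.

```python
def wrap_int16(value: int) -> int:
    """Emulate storing an arbitrary integer in a 16-bit signed integer."""
    return (value + 2**15) % 2**16 - 2**15


candidates = [32_762 + 36, 32_762 + 65_572, 32_762 + 131_108]
wrapped = [wrap_int16(v) for v in candidates]
```

Every entry of `wrapped` is -32,738, so the receiver cannot distinguish the three cases from the reported value alone.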

Float

Let's take an example: an IEEE-754 double precision floating point is used to count the number of positrons detected by an alpha magnetic spectrometer. Each time a positron is detected, the spectrometer will invoke counter.Add(1), and the result is reported as cumulative sum every 1 second:

  • During (T0, T1], we reported 131,108.
  • During (T0, T2], we reported 375,463.
  • During (T0, T3], we reported 832,019.
  • During (T0, T4], we reported 1,257,308.
  • During (T0, T5], we reported 1,860,103.
  • ...
  • During (T0, Tn+1], we reported 9,007,199,254,325,789.
  • During (T0, Tn+2], we reported 9,007,199,254,740,992.
  • During (T0, Tn+3], we reported 9,007,199,254,740,992.

In the above case, the counter stopped increasing at some point between Tn+1 and Tn+2 because the IEEE-754 double counter is "saturated": 9,007,199,254,740,992 + 1 results in 9,007,199,254,740,992, so the number stops growing.

Note: in ECMAScript 6 the number 9,007,199,254,740,991 (2 ^ 53 - 1) is known as Number.MAX_SAFE_INTEGER, which is the maximum integer that can be exactly represented as an IEEE-754 double precision number, and whose IEEE-754 representation cannot be the result of rounding any other integer to fit the IEEE-754 representation.
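The saturation can be reproduced with a few lines of Python, whose `float` is an IEEE-754 double on common platforms. The variable names are only illustrative; the loop stands in for repeated `counter.Add(1)` calls.

```python
float_counter = float(2**53 - 2)  # two increments below the saturation point
int_counter = 2**53 - 2           # the same count kept as an integer

for _ in range(5):
    float_counter += 1.0  # stops growing once it reaches 2**53
    int_counter += 1      # keeps growing

saturated = float(2**53) + 1.0 == float(2**53)
```

After the loop, `float_counter` is stuck at 9,007,199,254,740,992 while `int_counter` has moved on to 9,007,199,254,740,995.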

In addition to the "saturation" issue, we should also understand that IEEE-754 double supports special values such as infinities and NaN, as well as subnormal numbers. For example, 1.0E308 + 1.0E308 would result in +Inf (positive infinity). Certain metrics backends might have trouble handling these values.
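A short check of these edge cases in Python, assuming the platform `float` is an IEEE-754 double. It shows the overflow to positive infinity from the example, and, separately, a subnormal number, which is a nonzero value smaller than the smallest normal double.

```python
import math
import sys

overflowed = 1.0e308 + 1.0e308         # exceeds the largest finite double
smallest_normal = sys.float_info.min   # smallest positive normal double
subnormal = smallest_normal / 2        # still nonzero, but below the normal range

is_infinite = math.isinf(overflowed)
is_subnormal = 0.0 < subnormal < smallest_normal
```

An exporter that wants to be defensive could test values with `math.isfinite` before sending them to a backend that is known to reject such values.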

Monotonicity property

In the OpenTelemetry Metrics Data Model and API specifications, the word monotonic has been used frequently.

It is important to understand that different Instruments handle monotonicity differently.

Let's take an example with a network driver using a Counter to record the total number of bytes received:

  • During the time range (T0, T1]:
    • no network packet has been received
  • During the time range (T1, T2]:
    • received a packet with 30 bytes - Counter.Add(30)
    • received a packet with 200 bytes - Counter.Add(200)
    • received a packet with 50 bytes - Counter.Add(50)
  • During the time range (T2, T3]:
    • received a packet with 100 bytes - Counter.Add(100)

You can see that the total increment during (T0, T1] is 0, the total increment during (T1, T2] is 280 (30 + 200 + 50), the total increment during (T2, T3] is 100, and the total increment during (T0, T3] is 380 (0 + 280 + 100). All the increments are non-negative, in other words, the sum is monotonically increasing.
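The arithmetic above, written out as a sketch. The dictionary layout is just a convenient way to hold the example data and is not part of any API.

```python
from itertools import accumulate

# Counter.Add() arguments grouped by the time range in which they happened.
adds_by_interval = {
    ("T0", "T1"): [],
    ("T1", "T2"): [30, 200, 50],
    ("T2", "T3"): [100],
}

increments = {interval: sum(adds) for interval, adds in adds_by_interval.items()}
running_sums = list(accumulate(increments.values()))
is_monotonic = all(a <= b for a, b in zip(running_sums, running_sums[1:]))
```

`running_sums` holds the sum over (T0, T1], (T0, T2] and (T0, T3], and it never decreases because every increment is non-negative.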

Note that it is inaccurate to say "the total bytes received by T3 is 380", because there might be network packets received by the driver before we started to observe it (e.g. before the last operating system reboot). The accurate way is to say "the total bytes received during (T0, T3] is 380". In a nutshell, the count represents a rate which is associated with a time range.

This monotonicity property is important because it gives the downstream systems additional hints so they can handle the data in a better way. Imagine we report the total number of bytes received in a cumulative sum data stream:

  • At Tn, we reported 3,896,473,820.
  • At Tn+1, we reported 4,294,967,293.
  • At Tn+2, we reported 1,800,372.

The backend system could tell that there was an integer overflow or a system restart during (Tn+1, Tn+2], so it has a chance to "fix" the data. Refer to the Integer example under Numeric type selection for more information about integer overflow.
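A sketch of how a receiver might reason about such a drop. The function name is hypothetical, and the 32-bit unsigned width is an assumption made for this illustration (the example values are just below 2^32); the sketch only lists the candidate interpretations and does not pick one.

```python
def candidate_increments(previous: int, current: int, bits: int = 32) -> dict:
    """Possible increments between two points of a monotonic cumulative sum."""
    if current >= previous:
        return {"normal": current - previous}
    # A monotonic sum went down: either the integer wrapped or the source restarted.
    return {
        "overflow": (current - previous) % (1 << bits),  # assumes a single wrap
        "restart": current,                              # counting began again at zero
    }


first_step = candidate_increments(3_896_473_820, 4_294_967_293)
second_step = candidate_increments(4_294_967_293, 1_800_372)
```

Knowing that the stream is monotonic is what allows the decrease to be read as a reset or overflow instead of a legitimate negative change.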

Let's take another example with a process using an Asynchronous Counter to report the total page faults of the process:

The page faults are managed by the operating system, and the process could retrieve the number of page faults via some system APIs.

  • At T0:
    • the process started
    • the process didn't ask the operating system to report the page faults
  • At T1:
    • the operating system reported 1000 page faults for the process
  • At T2:
    • the process didn't ask the operating system to report the page faults
  • At T3:
    • the operating system reported 1050 page faults for the process
  • At T4:
    • the operating system reported 1200 page faults for the process

You can see that the number being reported is the absolute value rather than increments, and the value is monotonically increasing.

If we need to calculate "how many page faults have been introduced during (T3, T4]", we need to apply subtraction 1200 - 1050 = 150.
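The same subtraction applies between any two consecutive observations. A minimal sketch using the values from the example (no observation exists at T0 or T2 because the process did not ask for one):

```python
# (time, absolute page fault count) for the moments when a value was reported.
observations = [("T1", 1000), ("T3", 1050), ("T4", 1200)]

deltas = {
    (t_prev, t_cur): cur - prev
    for (t_prev, prev), (t_cur, cur) in zip(observations, observations[1:])
}
```

`deltas` maps each time range to the number of page faults introduced in it, for example 150 for (T3, T4].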

Semantic convention

Once you have decided which instrument(s) to use, you will need to decide the names for the instruments and attributes.

It is highly recommended that you align with the OpenTelemetry Semantic Conventions, rather than inventing your own semantics.

Guidelines for SDK authors

Aggregation temporality

Synchronous example

The OpenTelemetry Metrics Data Model and SDK are designed to support both Cumulative and Delta Temporality. It is important to understand that temporality will impact how the SDK could manage memory usage. Let's take the following HTTP requests example:

  • During the time range (T0, T1]:
    • verb = GET, status = 200, duration = 50 (ms)
    • verb = GET, status = 200, duration = 100 (ms)
    • verb = GET, status = 500, duration = 1 (ms)
  • During the time range (T1, T2]:
    • no HTTP request has been received
  • During the time range (T2, T3]:
    • verb = GET, status = 500, duration = 5 (ms)
    • verb = GET, status = 500, duration = 2 (ms)
  • During the time range (T3, T4]:
    • verb = GET, status = 200, duration = 100 (ms)
  • During the time range (T4, T5]:
    • verb = GET, status = 200, duration = 100 (ms)
    • verb = GET, status = 200, duration = 30 (ms)
    • verb = GET, status = 200, duration = 50 (ms)

Note that in the following examples, Delta aggregation temporality is discussed before Cumulative aggregation temporality because synchronous Counter and UpDownCounter measurements are input to the API with specified Delta aggregation temporality.

Synchronous example: Delta aggregation temporality

Let's imagine we export the metrics as Histogram, and to simplify the story we will only have one histogram bucket (-Inf, +Inf):

If we export the metrics using Delta Temporality:

  • (T0, T1]
    • attributes: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 1, min: 1 (ms), max: 1 (ms)
  • (T1, T2]
    • nothing since we don't have any Measurement received
  • (T2, T3]
    • attributes: {verb = GET, status = 500}, count: 2, min: 2 (ms), max: 5 (ms)
  • (T3, T4]
    • attributes: {verb = GET, status = 200}, count: 1, min: 100 (ms), max: 100 (ms)
  • (T4, T5]
    • attributes: {verb = GET, status = 200}, count: 3, min: 30 (ms), max: 100 (ms)

You can see that the SDK only needs to track what has happened after the latest collection/export cycle. For example, when the SDK started to process measurements in (T1, T2], it can completely forget about what has happened during (T0, T1].
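A minimal sketch of this behavior, assuming a single (-Inf, +Inf) bucket as in the example. `DeltaHistogram` is a hypothetical class, not an SDK type; the point is that `collect()` hands over the current state and starts from empty storage.

```python
class DeltaHistogram:
    """Tracks count/min/max per attribute set since the last collection."""

    def __init__(self):
        self._points = {}

    def record(self, duration_ms, **attributes):
        key = tuple(sorted(attributes.items()))
        point = self._points.get(key)
        if point is None:
            self._points[key] = {"count": 1, "min": duration_ms, "max": duration_ms}
        else:
            point["count"] += 1
            point["min"] = min(point["min"], duration_ms)
            point["max"] = max(point["max"], duration_ms)

    def collect(self):
        # Export what was seen, then forget it entirely.
        exported, self._points = self._points, {}
        return exported


histogram = DeltaHistogram()
histogram.record(50, verb="GET", status=200)
histogram.record(100, verb="GET", status=200)
histogram.record(1, verb="GET", status=500)
t1_export = histogram.collect()  # covers (T0, T1]
t2_export = histogram.collect()  # covers (T1, T2], nothing was recorded
```

Memory use after each collection depends only on the attribute sets seen in the current cycle.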

Synchronous example: Cumulative aggregation temporality

If we export the metrics using Cumulative Temporality:

  • (T0, T1]
    • attributes: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 1, min: 1 (ms), max: 1 (ms)
  • (T0, T2]
    • attributes: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 1, min: 1 (ms), max: 1 (ms)
  • (T0, T3]
    • attributes: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 3, min: 1 (ms), max: 5 (ms)
  • (T0, T4]
    • attributes: {verb = GET, status = 200}, count: 3, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 3, min: 1 (ms), max: 5 (ms)
  • (T0, T5]
    • attributes: {verb = GET, status = 200}, count: 6, min: 30 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 3, min: 1 (ms), max: 5 (ms)

You can see that we are performing Delta->Cumulative conversion, and the SDK has to track what has happened prior to the latest collection/export cycle. In the worst case, the SDK will have to remember what has happened since the beginning of the process.
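A sketch of the Delta->Cumulative conversion for this example. `merge_into` is a hypothetical helper, and the per-interval inputs are the delta values of the HTTP example written out as literals. The running state is never cleared, which is exactly the memory cost being discussed.

```python
OK, ERR = ("GET", 200), ("GET", 500)

# Per-interval (delta) aggregates for (T0, T1] through (T4, T5].
delta_exports = [
    {OK: {"count": 2, "min": 50, "max": 100}, ERR: {"count": 1, "min": 1, "max": 1}},
    {},
    {ERR: {"count": 2, "min": 2, "max": 5}},
    {OK: {"count": 1, "min": 100, "max": 100}},
    {OK: {"count": 3, "min": 30, "max": 100}},
]


def merge_into(cumulative, delta):
    """Fold one delta export into the long-lived cumulative state."""
    for key, point in delta.items():
        seen = cumulative.get(key)
        if seen is None:
            cumulative[key] = dict(point)
        else:
            seen["count"] += point["count"]
            seen["min"] = min(seen["min"], point["min"])
            seen["max"] = max(seen["max"], point["max"])
    return cumulative


cumulative = {}
snapshots = []
for delta in delta_exports:
    merge_into(cumulative, delta)
    snapshots.append({key: dict(point) for key, point in cumulative.items()})
```

Each entry of `snapshots` corresponds to one cumulative export, from (T0, T1] to (T0, T5].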

Imagine if we have a long running service and we collect metrics with 7 attributes and each attribute can have 30 different values. We might eventually end up having to remember the complete set of all 21,870,000,000 permutations! This cardinality explosion is a well-known challenge in the metrics space.
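The figure follows from multiplying the number of possible values across the attributes. A one-line check (variable names are illustrative):

```python
attribute_count = 7
values_per_attribute = 30

# Every attribute independently takes one of 30 values.
distinct_attribute_sets = values_per_attribute ** attribute_count
```

Adding just one more attribute with 30 values would multiply the result by 30 again.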

Making it even worse, if we export the permutations even if there are no recent updates, the export batch could become huge and will be very costly. For example, do we really need/want to export the same thing for (T0, T2] in the above case?

So here are some suggestions that we encourage SDK implementers to consider:

  • You want to control the memory usage rather than allow it to grow indefinitely / unbounded - regardless of what aggregation temporality is being used.
  • You want to improve the memory efficiency by being able to forget about things that are no longer needed.
  • You probably don't want to keep exporting the same thing over and over again if there are no updates. You might want to consider Resets and Gaps. For example, if a Cumulative metrics stream hasn't received any updates for a long period of time, would it be okay to reset the start time?

Asynchronous example

In the above case, we have Measurements reported by a Histogram Instrument. What if we collect measurements from an Asynchronous Counter?

The following example shows the number of page faults of each process since it started:

  • During the time range (T0, T1]:
    • pid = 1001, #PF = 50
    • pid = 1002, #PF = 30
  • During the time range (T1, T2]:
    • pid = 1001, #PF = 53
    • pid = 1002, #PF = 38
  • During the time range (T2, T3]:
    • pid = 1001, #PF = 56
    • pid = 1002, #PF = 42
  • During the time range (T3, T4]:
    • pid = 1001, #PF = 60
    • pid = 1002, #PF = 47
  • During the time range (T4, T5]:
    • process 1001 died, process 1003 started
    • pid = 1002, #PF = 53
    • pid = 1003, #PF = 5
  • During the time range (T5, T6]:
    • A new process 1001 started
    • pid = 1001, #PF = 10
    • pid = 1002, #PF = 57
    • pid = 1003, #PF = 8

Note that in the following examples, Cumulative aggregation temporality is discussed before Delta aggregation temporality because asynchronous Counter and UpDownCounter measurements are input to the API with specified Cumulative aggregation temporality.

Asynchronous example: Cumulative temporality

If we export the metrics using Cumulative Temporality:

  • (T0, T1]
    • attributes: {pid = 1001}, sum: 50
    • attributes: {pid = 1002}, sum: 30
  • (T0, T2]
    • attributes: {pid = 1001}, sum: 53
    • attributes: {pid = 1002}, sum: 38
  • (T0, T3]
    • attributes: {pid = 1001}, sum: 56
    • attributes: {pid = 1002}, sum: 42
  • (T0, T4]
    • attributes: {pid = 1001}, sum: 60
    • attributes: {pid = 1002}, sum: 47
  • (T0, T5]
    • attributes: {pid = 1002}, sum: 53
  • (T4, T5]
    • attributes: {pid = 1003}, sum: 5
  • (T5, T6]
    • attributes: {pid = 1001}, sum: 10
  • (T0, T6]
    • attributes: {pid = 1002}, sum: 57
  • (T4, T6]
    • attributes: {pid = 1003}, sum: 8

The behavior in the first four periods is quite straightforward - we just take the data being reported from the asynchronous instruments and send them.

The data model prescribes several valid behaviors at T5 and T6 in this case, where one stream dies and another starts. The Resets and Gaps section describes how start timestamps and staleness markers can be used to increase the receiver's understanding of these events.

Consider whether the SDK maintains a start timestamp for each individual stream, or just one per process. In this example, where a process can die and restart, it starts counting page faults from zero. In this case, the valid behaviors at T5 and T6 are:

  1. If all streams in the process share a start time, and the SDK is not required to remember all past streams: the restarted stream begins again from a zero sum and carries the start time of the process. Receivers with reset detection are able to calculate a correct rate (except for frequent restarts relative to the collection interval), however the precise time of a reset will be unknown.
  2. If the SDK maintains per-stream start times, it provides the previous callback time as the start time, as this time is before the occurrence of any events which are measured during the subsequent callback. This makes the first observation in a stream more useful for diagnostics, as downstream consumers can perform overlap detection or duplicate suppression and do not require reset detection in this case.
  3. Independent of the above treatments, the SDK can add a staleness marker to indicate the start of a gap in the stream when an observed process dies, by remembering which streams have previously reported but are not currently reporting. If per-stream start timestamps are used, staleness markers can be issued to precisely start a gap in the stream and permit forgetting streams that have stopped reporting.

It's OK to ignore the options to use per-stream start timestamps and staleness markers. The first course of action above requires no additional memory or code to achieve and is correct in terms of the data model.

Asynchronous example: Delta temporality

If we export the metrics using Delta Temporality:

  • (T0, T1]
    • attributes: {pid = 1001}, delta: 50
    • attributes: {pid = 1002}, delta: 30
  • (T1, T2]
    • attributes: {pid = 1001}, delta: 3
    • attributes: {pid = 1002}, delta: 8
  • (T2, T3]
    • attributes: {pid = 1001}, delta: 3
    • attributes: {pid = 1002}, delta: 4
  • (T3, T4]
    • attributes: {pid = 1001}, delta: 4
    • attributes: {pid = 1002}, delta: 5
  • (T4, T5]
    • attributes: {pid = 1002}, delta: 6
    • attributes: {pid = 1003}, delta: 5
  • (T5, T6]
    • attributes: {pid = 1001}, delta: 10
    • attributes: {pid = 1002}, delta: 4
    • attributes: {pid = 1003}, delta: 3

You can see that we are performing Cumulative->Delta conversion, and it requires us to remember the last value of every single permutation we've encountered so far, because if we don't, we won't be able to calculate the delta value using current value - last value. And as you can tell, this is super expensive.
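A minimal sketch of such a converter, using the page fault example. `CumulativeToDelta` is a hypothetical class. Two simplifications are assumptions of this sketch rather than rules from the data model: a stream that did not report in the previous cycle is forgotten, and a value lower than the last one is treated as a restart from zero.

```python
class CumulativeToDelta:
    """Remembers the last cumulative value of each stream to compute deltas."""

    def __init__(self):
        self._last = {}

    def convert(self, observations):
        deltas = {}
        for key, value in observations.items():
            last = self._last.get(key)
            if last is None or value < last:
                # New (or restarted) stream: the whole value is the delta.
                deltas[key] = value
            else:
                deltas[key] = value - last
        # Keep only the streams that reported in this cycle.
        self._last = dict(observations)
        return deltas


converter = CumulativeToDelta()
page_faults = [
    {1001: 50, 1002: 30},
    {1001: 53, 1002: 38},
    {1001: 56, 1002: 42},
    {1001: 60, 1002: 47},
    {1002: 53, 1003: 5},
    {1001: 10, 1002: 57, 1003: 8},
]
delta_exports = [converter.convert(observed) for observed in page_faults]
```

The `_last` dictionary is the state the paragraph above refers to: one remembered value per attribute set that is still reporting.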

Making it more interesting, if we have min/max value, it is mathematically impossible to reliably deduce the Delta temporality from Cumulative temporality. For example:

  • If the maximum value is 10 during (T0, T2] and the maximum value is 20 during (T0, T3], we know that the maximum value during (T2, T3] must be 20.
  • If the maximum value is 20 during (T0, T2] and the maximum value is also 20 during (T0, T3], we wouldn't know what the maximum value is during (T2, T3], unless we know that there is no value (count = 0).
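The second case can be demonstrated with two made-up measurement histories (the values are invented purely for illustration). Both produce identical cumulative maxima, yet their maxima for the last interval differ, so the delta cannot be recovered from the cumulative values.

```python
def cumulative_max(intervals):
    """Maximum over all values recorded in the given intervals."""
    return max(value for values in intervals for value in values)


# Values recorded during (T0, T2] and (T2, T3] in two different scenarios.
history_a = [[20, 7], [5]]
history_b = [[20, 7], [20]]

same_cumulative = (
    cumulative_max(history_a[:1]) == cumulative_max(history_b[:1])
    and cumulative_max(history_a) == cumulative_max(history_b)
)
last_interval_max = (max(history_a[1]), max(history_b[1]))
```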

So here are some suggestions that we encourage SDK implementers to consider:

  • If you have to do Cumulative->Delta conversion, and you encountered min/max, rather than drop the data on the floor, you might want to convert them to something useful - e.g. Gauge.

Asynchronous example: attribute removal in a view

Suppose the metrics in the asynchronous example above are exported through a view configured to remove the pid attribute, leaving a count of page faults. For each metric stream, two measurements are produced covering the same interval of time, which the SDK is expected to aggregate before producing the output.

The data model specifies to use the "natural merge" function, in this case meaning to add the current point values together because they are Sum data points. The expected output is, still in Cumulative Temporality:

  • (T0, T1]
    • attributes: {}, sum: 80
  • (T0, T2]
    • attributes: {}, sum: 91
  • (T0, T3]
    • attributes: {}, sum: 98
  • (T0, T4]
    • attributes: {}, sum: 107
  • (T0, T5]
    • attributes: {}, sum: 58
  • (T0, T6]
    • attributes: {}, sum: 75
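These sums can be reproduced by adding up the per-pid observations of each collection. A small sketch with the example data (the dictionary layout is only for illustration):

```python
# Cumulative page faults observed per pid at each collection time.
observations = {
    "T1": {1001: 50, 1002: 30},
    "T2": {1001: 53, 1002: 38},
    "T3": {1001: 56, 1002: 42},
    "T4": {1001: 60, 1002: 47},
    "T5": {1002: 53, 1003: 5},
    "T6": {1001: 10, 1002: 57, 1003: 8},
}

# Removing the pid attribute merges all points of a collection into one sum.
view_sums = {time: sum(by_pid.values()) for time, by_pid in observations.items()}
```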

As discussed in the asynchronous cumulative temporality example above, there are various treatments available for detecting resets. Even if the first course is taken, which means doing nothing, a receiver that follows the data model's rules for unknown start time and inserting true start times will calculate a correct rate in this case. The "58" received at T5 resets the stream - the change from "107" to "58" will register as a gap and rate calculations will resume correctly at T6. The rules for reset handling are provided so that the unknown portion of "58" that was already reflected in the "107" at T4 is not double-counted at T5 in the reset.
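A simplified receiver-side sketch of this behavior. It is not the complete rule set of the data model: the hypothetical function below only treats a decrease as a reset, reports no increment for that point, and resumes at the next one.

```python
def increments_with_reset_detection(values):
    """Per-step increments of a cumulative sum; None marks a gap."""
    increments = []
    previous = None
    for value in values:
        if previous is None or value < previous:
            # First point or a reset: the true increment is unknown.
            increments.append(None)
        else:
            increments.append(value - previous)
        previous = value
    return increments


view_sums = [80, 91, 98, 107, 58, 75]
steps = increments_with_reset_detection(view_sums)
```

The `None` at T5 is the gap, and the increment at T6 is computed against the "58" and not against the "107".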

If the option to use per-stream start timestamps is taken above, it lightens the duties of the receiver, making it possible to monitor gaps precisely and detect overlapping streams. When per-stream state is available, the SDK has several approaches available for calculating Views in the presence of attributes that stop reporting and then reset some time later:

  1. By remembering the cumulative value for all streams across the lifetime of the process, the cumulative sum will be correct despite attributes that come and go. The SDK has to detect per-stream resets itself in this case, otherwise the View will be calculated incorrectly.
  2. When the cost of remembering all stream attributes becomes too high, reset the View and all its state, give it a new start timestamp, and let the caller see a gap in the stream.

When considering this matter, note also that the metrics API has a recommendation for each asynchronous instrument: User code is recommended not to provide more than one Measurement with the same attributes in a single callback. Consider whether user error in this regard will affect the correctness of the view. When maintaining per-stream state for the purpose of View correctness, SDK authors may want to consider detecting when the user makes duplicate measurements. Without checking for duplicate measurements, Views may be calculated incorrectly.
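One possible way to detect duplicates, sketched under the assumption that a callback's output is available as a list of `(value, attributes)` pairs. The function name and the input shape are hypothetical and do not correspond to a specific SDK API.

```python
def find_duplicate_attribute_sets(measurements):
    """Return the attribute sets that appear more than once in one callback."""
    seen = set()
    duplicates = set()
    for _value, attributes in measurements:
        key = frozenset(attributes.items())
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


callback_output = [
    (50, {"pid": 1001}),
    (30, {"pid": 1002}),
    (51, {"pid": 1001}),  # same attributes as the first measurement
]
duplicates = find_duplicate_attribute_sets(callback_output)
```

What to do with a detected duplicate (keep the last value, drop both, report an error) is left open here.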

Memory management

Memory management is a wide topic. Here we will only cover some of the most important things for the OpenTelemetry SDK.

Choose a better design so the SDK has fewer things to remember, and avoid keeping things in memory unless it is truly necessary. One good example is the aggregation temporality.

Design a better memory layout, so the storage is efficient and accessing the storage can be fast. This is normally specific to the target programming language and platform. For example, aligning the memory to the CPU cache line, keeping the hot memories close to each other, keeping the memory close to the hardware (e.g. non-paged pool, NUMA).

Pre-allocate and pool the memory, so the SDK doesn't have to allocate memory on-the-fly. This is especially useful to language runtimes that have garbage collectors, as it ensures the hot path in the code won't trigger garbage collection.

Limit the memory usage, and handle critical memory conditions. The general expectation is that a telemetry SDK should not fail the application. This can be done via some cardinality-capping algorithm - e.g. start to combine/drop some data points when the SDK hits the memory limit, and provide a mechanism to report the data loss.
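One possible shape of a cardinality cap, as a minimal sketch. `CappedSumAggregator`, its fields and the combine-into-overflow policy are hypothetical choices for illustration; a real SDK may cap by memory size, evict old streams, or drop data instead.

```python
class CappedSumAggregator:
    """Sums values per attribute set, but tracks at most `max_streams` sets."""

    def __init__(self, max_streams: int):
        self._max_streams = max_streams
        self.sums = {}
        self.overflow_sum = 0    # combined value of measurements over the cap
        self.overflow_count = 0  # how many measurements lost their attributes

    def add(self, value, **attributes):
        key = tuple(sorted(attributes.items()))
        if key in self.sums or len(self.sums) < self._max_streams:
            self.sums[key] = self.sums.get(key, 0) + value
        else:
            # Cap reached: combine into one overflow bucket and record the loss.
            self.overflow_sum += value
            self.overflow_count += 1


aggregator = CappedSumAggregator(max_streams=2)
aggregator.add(1, pid=1001)
aggregator.add(1, pid=1002)
aggregator.add(1, pid=1003)  # third attribute set, over the cap
aggregator.add(2, pid=1001)  # existing attribute set, still accepted
```

Memory stays bounded by `max_streams`, the total across all measurements is preserved, and `overflow_count` gives the application owner a way to see that attribute detail was lost.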

Provide configurations to the application owner. The answer to "what is an efficient memory usage" ultimately depends on the goal of the application owner. For example, the application owners might want to spend more memory in order to keep more permutations of metrics attributes, or they might want to use memory aggressively for certain attributes that are important, and keep a conservative limit for attributes that are less important.