Note: this document is NOT a spec. It is provided to support the Metrics API and SDK specifications, and it does NOT add any extra requirements to the existing specifications.
The Instruments are part of the Metrics API. They allow Measurements to be recorded synchronously or asynchronously.
Choosing the correct instrument is important, because:
- It helps the library achieve better efficiency. For example, if we want to report room temperature to Prometheus, we should consider using an Asynchronous Gauge rather than periodically polling the sensor, so that we only access the sensor when a scrape happens.
- It makes consumption easier for the users of the library. For example, if we want to report HTTP server request latency, we should consider a Histogram, so most users get a reasonable experience (e.g. default buckets, min/max) by simply enabling the metrics stream, without extra configuration.
- It clarifies the semantics of the metrics stream, so consumers have a better understanding of the results. For example, if we want to report the process heap size, using an Asynchronous UpDownCounter rather than an Asynchronous Gauge makes it explicit that the consumer can add up the numbers across all processes to get the "total heap size".
Here is one way of choosing the correct instrument:
- I want to count something (by recording a delta value):
  - If the value is monotonically increasing (the delta value is always non-negative) - use a Counter.
  - If the value is NOT monotonically increasing (the delta value can be positive, negative or zero) - use an UpDownCounter.
- I want to record or time something, and the statistics about this thing are likely to be meaningful - use a Histogram.
- I want to measure something (by reporting an absolute value):
  - If the measurement values are non-additive, use an Asynchronous Gauge.
  - If the measurement values are additive:
    - If the value is monotonically increasing - use an Asynchronous Counter.
    - If the value is NOT monotonically increasing - use an Asynchronous UpDownCounter.
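The decision list above can be sketched as a small lookup function. This is only an illustration: `choose_instrument` and its parameter names are hypothetical and are not part of any OpenTelemetry API.

```python
def choose_instrument(intent, monotonic=False, additive=True):
    """Return the instrument suggested by the decision list.

    intent: "count" (record deltas), "record" (statistics are meaningful),
            or "measure" (report absolute values).
    """
    if intent == "count":
        return "Counter" if monotonic else "UpDownCounter"
    if intent == "record":
        return "Histogram"
    if intent == "measure":
        if not additive:
            return "Asynchronous Gauge"
        if monotonic:
            return "Asynchronous Counter"
        return "Asynchronous UpDownCounter"
    raise ValueError(f"unknown intent: {intent!r}")


# Room temperature: an absolute, non-additive value.
temperature_instrument = choose_instrument("measure", additive=False)
# Process heap size: absolute, additive, can go up and down.
heap_instrument = choose_instrument("measure", monotonic=False, additive=True)
```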
In OpenTelemetry a Measurement encapsulates a value and a set of Attributes. Depending on the nature of the measurements, they can be additive, non-additive or somewhere in the middle. Here are some examples:
- The server temperature is non-additive. The temperatures in the table below add up to `226.2`, but this value has no practical meaning.

  Host Name | Temperature (F) |
  ---|---|
  MachineA | 58.8 |
  MachineB | 86.1 |
  MachineC | 81.3 |

- The mass of planets is additive; the value `1.18e25` (`3.30e23 + 6.42e23 + 4.87e24 + 5.97e24`) means the combined mass of terrestrial planets in the solar system.

  Planet Name | Mass (kg) |
  ---|---|
  Mercury | 3.30e23 |
  Mars | 6.42e23 |
  Venus | 4.87e24 |
  Earth | 5.97e24 |

- The voltage of battery cells can be added up if the batteries are connected in series. However, if the batteries are connected in parallel, it makes no sense to add up the voltage values anymore.
In OpenTelemetry, each Instrument implies whether it is additive or not.
Instrument | Additive |
---|---|
Counter | additive |
UpDownCounter | additive |
Histogram | mixed<sup>1</sup> |
Asynchronous Gauge | non-additive |
Asynchronous Counter | additive |
Asynchronous UpDownCounter | additive |
1: The Histogram bucket counts are additive if the bucket boundaries are the same, and the sum is additive, but the min and max are non-additive.
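To make the "mixed" case concrete, here is a minimal sketch of merging two histogram points. The `HistogramPoint` type is hypothetical, and the sketch assumes both points use identical bucket boundaries: counts and sums are added, while min and max are combined with `min()` and `max()` rather than added.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HistogramPoint:
    bucket_counts: List[int]  # one count per bucket, same boundaries assumed
    total: float              # sum of all recorded values
    minimum: float
    maximum: float


def merge(a: HistogramPoint, b: HistogramPoint) -> HistogramPoint:
    """Merge two points that share the same bucket boundaries."""
    if len(a.bucket_counts) != len(b.bucket_counts):
        raise ValueError("bucket layouts differ; counts are not additive")
    return HistogramPoint(
        bucket_counts=[x + y for x, y in zip(a.bucket_counts, b.bucket_counts)],
        total=a.total + b.total,              # additive
        minimum=min(a.minimum, b.minimum),    # non-additive
        maximum=max(a.maximum, b.maximum),    # non-additive
    )


merged = merge(
    HistogramPoint([2], 150.0, 50.0, 100.0),
    HistogramPoint([1], 100.0, 100.0, 100.0),
)
```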
For Instruments which take increments and/or decrements as the input (e.g. Counter and UpDownCounter), the underlying numeric types (e.g., signed integer, unsigned integer, double) have a direct impact on the dynamic range, precision, and how the data is interpreted. Typically, integers are precise but have limited dynamic range, and might see overflow/underflow. The IEEE-754 double-precision floating-point format has a wide dynamic range of numeric values at the cost of precision.
Let's take an example: a 16-bit signed integer is used to count the committed transactions in a database, reported as cumulative sum every 15 seconds:
- During (T0, T1], we reported `70`.
- During (T0, T2], we reported `115`.
- During (T0, T3], we reported `116`.
- During (T0, T4], we reported `128`.
- During (T0, T5], we reported `128`.
- During (T0, T6], we reported `173`.
- ...
- During (T0, Tn+1], we reported `1,872`.
- During (Tn+2, Tn+3], we reported `35`.
- During (Tn+2, Tn+4], we reported `76`.
In the above case, a backend system could tell that there was likely a system restart (because the start time has changed from T0 to Tn+2) during (Tn+1, Tn+2], so it has a chance to adjust the data to:
- (T0, Tn+3]: `1,907` (`1,872 + 35`).
- (T0, Tn+4]: `1,948` (`1,872 + 76`).
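The adjustment described above can be sketched as follows. This is a simplified illustration of what a backend might do, not a prescribed algorithm; `stitch_restarts` is a hypothetical helper that treats any change of start time as a restart and carries the last reported value forward.

```python
def stitch_restarts(points):
    """Re-base cumulative reports onto the first start time.

    points: list of (start_time, value) tuples in reporting order.
    """
    adjusted = []
    offset = 0
    prev_start, prev_value = None, 0
    for start, value in points:
        if prev_start is not None and start != prev_start:
            # The start time changed: assume a restart and keep what
            # was counted before it.
            offset += prev_value
        adjusted.append(offset + value)
        prev_start, prev_value = start, value
    return adjusted


reports = [("T0", 1872), ("Tn+2", 35), ("Tn+2", 76)]
stitched = stitch_restarts(reports)
```

Note that anything counted between the last report before the restart and the restart itself is lost; the sketch cannot recover it.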
Imagine we keep the database running:
- During (T0, Tm+1], we reported `32,758`.
- During (T0, Tm+2], we reported `32,762`.
- During (T0, Tm+3], we reported `-32,738`.
- During (T0, Tm+4], we reported `-32,712`.
In the above case, the backend system could tell that there was an integer overflow during (Tm+2, Tm+3] (because the start time remains the same as before, and the value becomes negative), so it has a chance to adjust the data to:
- (T0, Tm+3]: `32,798` (`32,762 + 36`).
- (T0, Tm+4]: `32,824` (`32,762 + 62`).
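One way a backend could perform this overflow adjustment is to compute each step modulo 2^16. The sketch below uses a hypothetical `unwrap_int16` helper and assumes the true increase between two consecutive reports is always smaller than 65,536.

```python
INT16_MODULUS = 1 << 16  # 65,536 distinct 16-bit values


def unwrap_int16(reports):
    """Recover a non-decreasing series from wrapped 16-bit signed reports."""
    unwrapped = []
    total = None
    previous = None
    for value in reports:
        if previous is None:
            total = value
        else:
            # The modulo maps a wrapped (negative) difference back to the
            # small positive increase that caused it.
            total += (value - previous) % INT16_MODULUS
        unwrapped.append(total)
        previous = value
    return unwrapped


recovered = unwrap_int16([32_758, 32_762, -32_738, -32_712])
```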
As we can see in this example, even with the limitation of 16-bit integer, we can count the database transactions with high fidelity, without having to worry about information loss caused by integer overflows.
It is important to understand that we are handling counter reset and integer overflow/underflow based on the assumption that we've picked the proper dynamic range and reporting frequency. Imagine if we use the same 16-bit signed integer to count the transactions in a data center (which could have thousands if not millions of transactions per second), we wouldn't be able to tell if `-32,738` was a result of `32,762 + 36` or `32,762 + 65,572` or even `32,762 + 131,108` if we report the data every 15 seconds. In this situation, either using a larger number (e.g. 32-bit integer) or increasing the reporting frequency (e.g. every microsecond, if we can afford the cost) would help.
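The ambiguity is easy to verify. The sketch below uses a hypothetical `wrap_int16` helper that reinterprets an arbitrary integer as a 16-bit two's-complement value, and shows that all three candidate totals collapse to the same reported number.

```python
def wrap_int16(value: int) -> int:
    """Interpret an integer as a 16-bit two's-complement signed value."""
    return (value + 0x8000) % 0x10000 - 0x8000


candidates = [32_762 + 36, 32_762 + 65_572, 32_762 + 131_108]
wrapped = [wrap_int16(total) for total in candidates]
```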
Let's take an example: an IEEE-754 double precision floating point is used to count the number of positrons detected by an alpha magnetic spectrometer. Each time a positron is detected, the spectrometer will invoke `counter.Add(1)`, and the result is reported as a cumulative sum every 1 second:
- During (T0, T1], we reported `131,108`.
- During (T0, T2], we reported `375,463`.
- During (T0, T3], we reported `832,019`.
- During (T0, T4], we reported `1,257,308`.
- During (T0, T5], we reported `1,860,103`.
- ...
- During (T0, Tn+1], we reported `9,007,199,254,325,789`.
- During (T0, Tn+2], we reported `9,007,199,254,740,992`.
- During (T0, Tn+3], we reported `9,007,199,254,740,992`.
In the above case, the counter stopped increasing at some point between Tn+1 and Tn+2, because the IEEE-754 double counter is "saturated": `9,007,199,254,740,992 + 1` will result in `9,007,199,254,740,992`, so the number stopped growing.
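The saturation can be reproduced with any IEEE-754 double. In the sketch below, `add_one` is a hypothetical stand-in for `counter.Add(1)` on a counter that stores its sum as a double (Python's `float`).

```python
LIMIT = float(2 ** 53)  # 9,007,199,254,740,992


def add_one(total: float) -> float:
    """Mimic counter.Add(1) on a sum stored as an IEEE-754 double."""
    return total + 1.0


just_below = add_one(LIMIT - 1.0)  # still exact: lands on 2^53
stuck = add_one(LIMIT)             # 2^53 + 1 is not representable, rounds back
```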
Note: in ECMAScript 6 the number `9,007,199,254,740,991` (`2 ^ 53 - 1`) is known as `Number.MAX_SAFE_INTEGER`, which is the maximum integer that can be exactly represented as an IEEE-754 double precision number, and whose IEEE-754 representation cannot be the result of rounding any other integer to fit the IEEE-754 representation.
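In addition to the "saturation" issue, we should also understand that IEEE-754 double supports special values such as positive and negative infinity and NaN, as well as subnormal numbers (values too close to zero to be stored with full precision). For example, `1.0E308 + 1.0E308` would result in `+Inf` (positive infinity). Certain metrics backends might have trouble handling infinities, NaN, or subnormal numbers.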
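The following sketch shows both edge cases with Python floats, which are IEEE-754 doubles. The `is_exportable` guard is hypothetical; it only illustrates one check an exporter could apply before sending a value to a backend that rejects non-finite numbers.

```python
import math
import sys

overflowed = 1.0e308 + 1.0e308         # exceeds the largest finite double
smallest_normal = sys.float_info.min   # about 2.2e-308
subnormal = smallest_normal / 2        # nonzero, but below the normal range


def is_exportable(value: float) -> bool:
    """Hypothetical guard: accept only finite values (no +/-Inf, no NaN)."""
    return math.isfinite(value)
```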
In the OpenTelemetry Metrics Data Model and API specifications, the word `monotonic` has been used frequently.
It is important to understand that different Instruments handle monotonicity differently.
Let's take an example with a network driver using a Counter to record the total number of bytes received:
- During the time range (T0, T1]:
  - no network packet has been received
- During the time range (T1, T2]:
  - received a packet with `30` bytes - `Counter.Add(30)`
  - received a packet with `200` bytes - `Counter.Add(200)`
  - received a packet with `50` bytes - `Counter.Add(50)`
- During the time range (T2, T3]:
  - received a packet with `100` bytes - `Counter.Add(100)`
You can see that the total increment during (T0, T1] is `0`, the total increment during (T1, T2] is `280` (`30 + 200 + 50`), the total increment during (T2, T3] is `100`, and the total increment during (T0, T3] is `380` (`0 + 280 + 100`). All the increments are non-negative, in other words, the sum is monotonically increasing.
Note that it is inaccurate to say "the total bytes received by T3 is `380`", because there might be network packets received by the driver before we started to observe it (e.g. before the last operating system reboot). The accurate way is to say "the total bytes received during (T0, T3] is `380`". In a nutshell, the count represents a rate which is associated with a time range.
This monotonicity property is important because it gives the downstream systems additional hints so they can handle the data in a better way. Imagine we report the total number of bytes received in a cumulative sum data stream:
- At Tn, we reported `3,896,473,820`.
- At Tn+1, we reported `4,294,967,293`.
- At Tn+2, we reported `1,800,372`.
The backend system could tell that there was an integer overflow or a system restart during (Tn+1, Tn+2], so it has a chance to "fix" the data. Refer to additive property for more information about integer overflow.
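Because the stream is declared monotonic, any decrease between two consecutive cumulative points is a signal in itself. A minimal sketch of that check follows; `find_drops` is a hypothetical helper, and deciding whether a drop was an overflow or a restart needs extra information such as the start time.

```python
def find_drops(values):
    """Return the indexes at which a monotonic cumulative series decreased."""
    return [
        index
        for index in range(1, len(values))
        if values[index] < values[index - 1]
    ]


bytes_received = [3_896_473_820, 4_294_967_293, 1_800_372]
drops = find_drops(bytes_received)
```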
Let's take another example with a process using an Asynchronous Counter to report the total page faults of the process:
The page faults are managed by the operating system, and the process could retrieve the number of page faults via some system APIs.
- At T0:
  - the process started
  - the process didn't ask the operating system to report the page faults
- At T1:
  - the operating system reported `1000` page faults for the process
- At T2:
  - the process didn't ask the operating system to report the page faults
- At T3:
  - the operating system reported `1050` page faults for the process
- At T4:
  - the operating system reported `1200` page faults for the process
You can see that the number being reported is the absolute value rather than increments, and the value is monotonically increasing.
If we need to calculate "how many page faults have been introduced during (T3, T4]", we need to apply subtraction `1200 - 1050 = 150`.
Once you have decided which instrument(s) to use, you will need to decide the names for the instruments and attributes.
It is highly recommended that you align with the OpenTelemetry Semantic Conventions, rather than inventing your own semantics.
The OpenTelemetry Metrics Data Model and SDK are designed to support both Cumulative and Delta Temporality. It is important to understand that temporality will impact how the SDK could manage memory usage. Let's take the following HTTP requests example:
- During the time range (T0, T1]:
  - verb = `GET`, status = `200`, duration = `50 (ms)`
  - verb = `GET`, status = `200`, duration = `100 (ms)`
  - verb = `GET`, status = `500`, duration = `1 (ms)`
- During the time range (T1, T2]:
  - no HTTP request has been received
- During the time range (T2, T3]:
  - verb = `GET`, status = `500`, duration = `5 (ms)`
  - verb = `GET`, status = `500`, duration = `2 (ms)`
- During the time range (T3, T4]:
  - verb = `GET`, status = `200`, duration = `100 (ms)`
- During the time range (T4, T5]:
  - verb = `GET`, status = `200`, duration = `100 (ms)`
  - verb = `GET`, status = `200`, duration = `30 (ms)`
  - verb = `GET`, status = `200`, duration = `50 (ms)`
Note that in the following examples, Delta aggregation temporality is discussed before Cumulative aggregation temporality because synchronous Counter and UpDownCounter measurements are input to the API with specified Delta aggregation temporality.
Let's imagine we export the metrics as Histogram, and to simplify the story we will only have one histogram bucket `(-Inf, +Inf)`:
If we export the metrics using Delta Temporality:
- (T0, T1]
  - attributes: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: `100 (ms)`
  - attributes: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max: `1 (ms)`
- (T1, T2]
  - nothing since we don't have any Measurement received
- (T2, T3]
  - attributes: {verb = `GET`, status = `500`}, count: `2`, min: `2 (ms)`, max: `5 (ms)`
- (T3, T4]
  - attributes: {verb = `GET`, status = `200`}, count: `1`, min: `100 (ms)`, max: `100 (ms)`
- (T4, T5]
  - attributes: {verb = `GET`, status = `200`}, count: `3`, min: `30 (ms)`, max: `100 (ms)`
You can see that the SDK only needs to track what has happened after the latest collection/export cycle. For example, when the SDK started to process measurements in (T1, T2], it can completely forget about what has happened during (T0, T1].
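A minimal sketch of this stateless behavior, using the HTTP requests from the example, is shown below. The data layout and the `aggregate_delta` function are hypothetical and are not taken from any SDK; the point is that each cycle is aggregated from scratch and nothing is carried over.

```python
REQUESTS = {
    "(T0, T1]": [("GET", 200, 50), ("GET", 200, 100), ("GET", 500, 1)],
    "(T1, T2]": [],
    "(T2, T3]": [("GET", 500, 5), ("GET", 500, 2)],
    "(T3, T4]": [("GET", 200, 100)],
    "(T4, T5]": [("GET", 200, 100), ("GET", 200, 30), ("GET", 200, 50)],
}


def aggregate_delta(measurements):
    """Aggregate one collection cycle; no state survives this call."""
    points = {}
    for verb, status, duration in measurements:
        point = points.setdefault(
            (verb, status), {"count": 0, "min": duration, "max": duration}
        )
        point["count"] += 1
        point["min"] = min(point["min"], duration)
        point["max"] = max(point["max"], duration)
    return points


delta_export = {
    interval: aggregate_delta(measurements)
    for interval, measurements in REQUESTS.items()
}
```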
If we export the metrics using Cumulative Temporality:
- (T0, T1]
  - attributes: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: `100 (ms)`
  - attributes: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max: `1 (ms)`
- (T0, T2]
  - attributes: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: `100 (ms)`
  - attributes: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max: `1 (ms)`
- (T0, T3]
  - attributes: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: `100 (ms)`
  - attributes: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max: `5 (ms)`
- (T0, T4]
  - attributes: {verb = `GET`, status = `200`}, count: `3`, min: `50 (ms)`, max: `100 (ms)`
  - attributes: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max: `5 (ms)`
- (T0, T5]
  - attributes: {verb = `GET`, status = `200`}, count: `6`, min: `30 (ms)`, max: `100 (ms)`
  - attributes: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max: `5 (ms)`
You can see that we are performing Delta->Cumulative conversion, and the SDK has to track what has happened prior to the latest collection/export cycle. In the worst case, the SDK will have to remember what has happened since the beginning of the process.
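The same HTTP requests can be run through a sketch of a cumulative aggregator to see where the memory goes. `CumulativeAggregator` is a hypothetical class for illustration, not an SDK type; its `points` dictionary keeps one entry per attribute set for the whole lifetime of the process.

```python
import copy

CYCLES = [
    [("GET", 200, 50), ("GET", 200, 100), ("GET", 500, 1)],    # (T0, T1]
    [],                                                         # (T1, T2]
    [("GET", 500, 5), ("GET", 500, 2)],                         # (T2, T3]
    [("GET", 200, 100)],                                        # (T3, T4]
    [("GET", 200, 100), ("GET", 200, 30), ("GET", 200, 50)],    # (T4, T5]
]


class CumulativeAggregator:
    def __init__(self):
        # One entry per attribute set ever seen; never shrinks in this sketch.
        self.points = {}

    def record(self, verb, status, duration):
        point = self.points.setdefault(
            (verb, status), {"count": 0, "min": duration, "max": duration}
        )
        point["count"] += 1
        point["min"] = min(point["min"], duration)
        point["max"] = max(point["max"], duration)

    def collect(self):
        """Export everything accumulated since T0, updated or not."""
        return copy.deepcopy(self.points)


aggregator = CumulativeAggregator()
cumulative_export = []
for cycle in CYCLES:
    for measurement in cycle:
        aggregator.record(*measurement)
    cumulative_export.append(aggregator.collect())
```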
Imagine if we have a long running service and we collect metrics with 7 attributes and each attribute can have 30 different values. We might eventually end up having to remember the complete set of all `21,870,000,000` permutations!
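The figure follows directly from multiplying the number of possible values once per attribute, as this short check shows (the variable names are only illustrative):

```python
attribute_count = 7
values_per_attribute = 30

# Each attribute independently takes one of 30 values: 30 * 30 * ... (7 times).
distinct_attribute_sets = values_per_attribute ** attribute_count
```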
This cardinality explosion is a well-known challenge in the metrics space.
Making it even worse, if we export the permutations even if there are no recent updates, the export batch could become huge and will be very costly. For example, do we really need/want to export the same thing for (T0, T2] in the above case?
So here are some suggestions that we encourage SDK implementers to consider:
- You want to control the memory usage rather than allow it to grow indefinitely / unbounded - regardless of what aggregation temporality is being used.
- You want to improve the memory efficiency by being able to forget about things that are no longer needed.
- You probably don't want to keep exporting the same thing over and over again if there are no updates. You might want to consider Resets and Gaps. For example, if a Cumulative metrics stream hasn't received any updates for a long period of time, would it be okay to reset the start time?
In the above case, we have Measurements reported by a Histogram Instrument. What if we collect measurements from an Asynchronous Counter?
The following example shows the number of page faults of each process since it started:
- During the time range (T0, T1]:
  - pid = `1001`, #PF = `50`
  - pid = `1002`, #PF = `30`
- During the time range (T1, T2]:
  - pid = `1001`, #PF = `53`
  - pid = `1002`, #PF = `38`
- During the time range (T2, T3]:
  - pid = `1001`, #PF = `56`
  - pid = `1002`, #PF = `42`
- During the time range (T3, T4]:
  - pid = `1001`, #PF = `60`
  - pid = `1002`, #PF = `47`
- During the time range (T4, T5]:
  - process 1001 died, process 1003 started
  - pid = `1002`, #PF = `53`
  - pid = `1003`, #PF = `5`
- During the time range (T5, T6]:
  - A new process 1001 started
  - pid = `1001`, #PF = `10`
  - pid = `1002`, #PF = `57`
  - pid = `1003`, #PF = `8`
Note that in the following examples, Cumulative aggregation temporality is discussed before Delta aggregation temporality because asynchronous Counter and UpDownCounter measurements are input to the API with specified Cumulative aggregation temporality.
If we export the metrics using Cumulative Temporality:
- (T0, T1]
  - attributes: {pid = `1001`}, sum: `50`
  - attributes: {pid = `1002`}, sum: `30`
- (T0, T2]
  - attributes: {pid = `1001`}, sum: `53`
  - attributes: {pid = `1002`}, sum: `38`
- (T0, T3]
  - attributes: {pid = `1001`}, sum: `56`
  - attributes: {pid = `1002`}, sum: `42`
- (T0, T4]
  - attributes: {pid = `1001`}, sum: `60`
  - attributes: {pid = `1002`}, sum: `47`
- (T0, T5]
  - attributes: {pid = `1002`}, sum: `53`
- (T4, T5]
  - attributes: {pid = `1003`}, sum: `5`
- (T5, T6]
  - attributes: {pid = `1001`}, sum: `10`
- (T0, T6]
  - attributes: {pid = `1002`}, sum: `57`
- (T4, T6]
  - attributes: {pid = `1003`}, sum: `8`
The behavior in the first four periods is quite straightforward - we just take the data being reported from the asynchronous instruments and send them.
The data model prescribes several valid behaviors at T5 and T6 in this case, where one stream dies and another starts. The Resets and Gaps section describes how start timestamps and staleness markers can be used to increase the receiver's understanding of these events.
Consider whether the SDK maintains individual timestamps for the individual stream, or just one per process. In this example, where a process can die and restart, it starts counting page faults from zero. In this case, the valid behaviors at T5 and T6 are:
- If all streams in the process share a start time, and the SDK is not required to remember all past streams: the thread restarts with zero sum, and the start time of the process. Receivers with reset detection are able to calculate a correct rate (except for frequent restarts relative to the collection interval), however the precise time of a reset will be unknown.
- If the SDK maintains per-stream start times, it provides the previous callback time as the start time, as this time is before the occurrence of any events which are measured during the subsequent callback. This makes the first observation in a stream more useful for diagnostics, as downstream consumers can perform overlap detection or duplicate suppression and do not require reset detection in this case.
- Independent of above treatments, the SDK can add a staleness marker to indicate the start of a gap in the stream when one thread dies by remembering which streams have previously reported but are not currently reporting. If per-stream start timestamps are used, staleness markers can be issued to precisely start a gap in the stream and permit forgetting streams that have stopped reporting.
It's OK to ignore the options to use per-stream start timestamps and staleness markers. The first course of action above requires no additional memory or code to achieve and is correct in terms of the data model.
If we export the metrics using Delta Temporality:
- (T0, T1]
  - attributes: {pid = `1001`}, delta: `50`
  - attributes: {pid = `1002`}, delta: `30`
- (T1, T2]
  - attributes: {pid = `1001`}, delta: `3`
  - attributes: {pid = `1002`}, delta: `8`
- (T2, T3]
  - attributes: {pid = `1001`}, delta: `3`
  - attributes: {pid = `1002`}, delta: `4`
- (T3, T4]
  - attributes: {pid = `1001`}, delta: `4`
  - attributes: {pid = `1002`}, delta: `5`
- (T4, T5]
  - attributes: {pid = `1002`}, delta: `6`
  - attributes: {pid = `1003`}, delta: `5`
- (T5, T6]
  - attributes: {pid = `1001`}, delta: `10`
  - attributes: {pid = `1002`}, delta: `4`
  - attributes: {pid = `1003`}, delta: `3`
You can see that we are performing Cumulative->Delta conversion, and it requires us to remember the last value of every single permutation we've encountered so far, because if we don't, we won't be able to calculate the delta value using `current value - last value`. And as you can tell, this is super expensive.
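Here is a minimal sketch of such a conversion for the page fault example. `CumulativeToDelta` is a hypothetical class, and it makes one simplifying assumption that is labeled in the code: a stream that does not report in a cycle is forgotten, so a reused pid starts again from zero. That reproduces the values listed above, but a stream that merely skips one cycle would be treated as new, which a real implementation may not want.

```python
class CumulativeToDelta:
    def __init__(self):
        # Last cumulative value per attribute set: the memory cost of
        # the conversion.
        self.last = {}

    def convert(self, observations):
        """observations: {pid: cumulative page faults} for one cycle."""
        deltas = {
            pid: current - self.last.get(pid, 0)
            for pid, current in observations.items()
        }
        # Assumption: forget streams that did not report in this cycle.
        self.last = dict(observations)
        return deltas


converter = CumulativeToDelta()
cycles = [
    {1001: 50, 1002: 30},
    {1001: 53, 1002: 38},
    {1001: 56, 1002: 42},
    {1001: 60, 1002: 47},
    {1002: 53, 1003: 5},            # process 1001 died, 1003 started
    {1001: 10, 1002: 57, 1003: 8},  # a new process 1001 started
]
deltas = [converter.convert(cycle) for cycle in cycles]
```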
Making it more interesting, if we have min/max value, it is mathematically impossible to reliably deduce the Delta temporality from Cumulative temporality. For example:
- If the maximum value is 10 during (T0, T2] and the maximum value is 20 during (T0, T3], we know that the maximum value during (T2, T3] must be 20.
- If the maximum value is 20 during (T0, T2] and the maximum value is also 20 during (T0, T3], we wouldn't know what the maximum value is during (T2, T3], unless we know that there is no value (count = 0).
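The loss of information can be demonstrated by going in the other direction. In the sketch below, the hypothetical `cumulative_max` helper turns per-interval maxima into running maxima, and two different delta histories end up with identical cumulative output, so no converter could tell them apart.

```python
from itertools import accumulate


def cumulative_max(delta_maxima):
    """Running maximum over a sequence of per-interval maxima."""
    return list(accumulate(delta_maxima, max))


history_a = [20, 5]    # the second interval peaked at 5
history_b = [20, 20]   # the second interval peaked at 20
same_output = cumulative_max(history_a) == cumulative_max(history_b)

# When the running maximum grows, the new maximum must belong to the
# latest interval, so that case is recoverable.
growing = cumulative_max([10, 20])
```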
So here are some suggestions that we encourage SDK implementers to consider:
- If you have to do Cumulative->Delta conversion, and you encountered min/max, rather than drop the data on the floor, you might want to convert them to something useful - e.g. Gauge.
Suppose the metrics in the asynchronous example above are exported through a view configured to remove the `pid` attribute, leaving a count of page faults. For each metric stream, two measurements are produced covering the same interval of time, which the SDK is expected to aggregate before producing the output.
The data model specifies to use the "natural merge" function, in this case meaning to add the current point values together because they are `Sum` data points. The expected output is, still in Cumulative Temporality:
- (T0, T1]
  - dimensions: {}, sum: `80`
- (T0, T2]
  - dimensions: {}, sum: `91`
- (T0, T3]
  - dimensions: {}, sum: `98`
- (T0, T4]
  - dimensions: {}, sum: `107`
- (T0, T5]
  - dimensions: {}, sum: `58`
- (T0, T6]
  - dimensions: {}, sum: `75`
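The sums above can be reproduced with a small sketch of attribute removal followed by the merge. `drop_attributes` is a hypothetical helper rather than an SDK function; it keeps only the attribute keys named in `keep` and adds together the points that become indistinguishable.

```python
from collections import defaultdict


def drop_attributes(points, keep=()):
    """Merge Sum points after removing every attribute not listed in keep.

    points: iterable of (attributes_dict, value) pairs for one cycle.
    """
    merged = defaultdict(int)
    for attributes, value in points:
        key = tuple(sorted(
            (name, val) for name, val in attributes.items() if name in keep
        ))
        merged[key] += value  # Sum data points merge by addition
    return dict(merged)


cycles = [
    [({"pid": 1001}, 50), ({"pid": 1002}, 30)],
    [({"pid": 1001}, 53), ({"pid": 1002}, 38)],
    [({"pid": 1001}, 56), ({"pid": 1002}, 42)],
    [({"pid": 1001}, 60), ({"pid": 1002}, 47)],
    [({"pid": 1002}, 53), ({"pid": 1003}, 5)],
    [({"pid": 1001}, 10), ({"pid": 1002}, 57), ({"pid": 1003}, 8)],
]
view_output = [drop_attributes(cycle)[()] for cycle in cycles]
```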
As discussed in the asynchronous cumulative temporality example above, there are various treatments available for detecting resets. Even if the first course is taken, which means doing nothing, a receiver that follows the data model's rules for unknown start time and inserting true start times will calculate a correct rate in this case. The "58" received at T5 resets the stream - the change from "107" to "58" will register as a gap and rate calculations will resume correctly at T6. The rules for reset handling are provided so that the unknown portion of "58" that was already reflected in the "107" at T4 is not double-counted at T5 in the reset.
If the option to use per-stream start timestamps is taken above, it lightens the duties of the receiver, making it possible to monitor gaps precisely and detect overlapping streams. When per-stream state is available, the SDK has several approaches for calculating Views available in the presence of attributes that stop reporting and then reset some time later:
- By remembering the cumulative value for all streams across the lifetime of the process, the cumulative sum will be correct despite `attributes` that come and go. The SDK has to detect per-stream resets itself in this case, otherwise the View will be calculated incorrectly.
- When the cost of remembering all streams `attributes` becomes too high, reset the View and all its state, give it a new start timestamp, and let the caller see a gap in the stream.
When considering this matter, note also that the metrics API has a recommendation for each asynchronous instrument: User code is recommended not to provide more than one `Measurement` with the same `attributes` in a single callback. Consider whether the impact of user error in this regard will impact the correctness of the view. When maintaining per-stream state for the purpose of View correctness, SDK authors may want to consider detecting when the user makes duplicate measurements. Without checking for duplicate measurements, Views may be calculated incorrectly.
Memory management is a wide topic. Here we will only cover some of the most important things for the OpenTelemetry SDK.
Choose a better design so the SDK has fewer things to remember, and avoid keeping things in memory unless it is really necessary. One good example is the aggregation temporality.
Design a better memory layout, so the storage is efficient and accessing the storage can be fast. This is normally specific to the target programming language and platform. For example, aligning the memory to the CPU cache line, keeping the hot memories close to each other, keeping the memory close to the hardware (e.g. non-paged pool, NUMA).
Pre-allocate and pool the memory, so the SDK doesn't have to allocate memory on-the-fly. This is especially useful to language runtimes that have garbage collectors, as it ensures the hot path in the code won't trigger garbage collection.
Limit the memory usage, and handle critical memory condition. The general expectation is that a telemetry SDK should not fail the application. This can be done via some cardinality-capping algorithm - e.g. start to combine/drop some data points when the SDK hits the memory limit, and provide a mechanism to report the data loss.
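As one possible shape for such a cap, the sketch below folds every attribute set beyond a fixed limit into a single overflow stream and counts how many measurements were folded. The class name, the `overflow` attribute and the counting policy are all assumptions made for illustration; they are not taken from the specification.

```python
class CappedSumStorage:
    # Hypothetical attribute set that absorbs everything over the limit.
    OVERFLOW_KEY = (("overflow", True),)

    def __init__(self, max_streams):
        self.max_streams = max_streams
        self.sums = {}
        self.overflowed_measurements = 0  # lets the SDK report the data loss

    def add(self, attributes, value):
        key = tuple(sorted(attributes.items()))
        if key not in self.sums and len(self.sums) >= self.max_streams:
            # Limit reached: combine into the overflow stream instead of
            # allocating a new entry.
            key = self.OVERFLOW_KEY
            self.overflowed_measurements += 1
        self.sums[key] = self.sums.get(key, 0) + value


storage = CappedSumStorage(max_streams=2)
storage.add({"pid": 1001}, 5)
storage.add({"pid": 1002}, 3)
storage.add({"pid": 1003}, 7)   # over the limit, folded into overflow
storage.add({"pid": 1001}, 1)   # existing stream, still tracked exactly
storage.add({"pid": 1004}, 2)   # over the limit, folded into overflow
```

With this policy the storage holds at most `max_streams` regular entries plus the overflow entry, so memory stays bounded while the total sum across all streams is preserved.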
Provide configurations to the application owner. The answer to "what is an efficient memory usage" ultimately depends on the goal of the application owner. For example, the application owners might want to spend more memory in order to keep more permutations of metrics attributes, or they might want to use memory aggressively for certain attributes that are important, and keep a conservative limit for attributes that are less important.