diff --git a/specification/metrics/supplementary-guidelines.md b/specification/metrics/supplementary-guidelines.md
new file mode 100644
index 00000000000..6e528704286
--- /dev/null
+++ b/specification/metrics/supplementary-guidelines.md
@@ -0,0 +1,251 @@

# Supplementary Guidelines

Note: this document is NOT a spec. It is provided to support the Metrics
[API](./api.md) and [SDK](./sdk.md) specifications, and it does NOT add any
extra requirements to the existing specifications.

Table of Contents:

* [Guidelines for instrumentation library
  authors](#guidelines-for-instrumentation-library-authors)
* [Guidelines for SDK authors](#guidelines-for-sdk-authors)
  * [Aggregation temporality](#aggregation-temporality)
  * [Memory management](#memory-management)

## Guidelines for instrumentation library authors

TBD

## Guidelines for SDK authors

### Aggregation temporality

The OpenTelemetry Metrics [Data Model](./datamodel.md) and [SDK](./sdk.md) are
designed to support both Cumulative and Delta
[Temporality](./datamodel.md#temporality). It is important to understand how
the choice of temporality affects the amount of memory the SDK has to use.
Let's take the following HTTP request example:

* During the time range (T0, T1]:
  * verb = `GET`, status = `200`, duration = `50 (ms)`
  * verb = `GET`, status = `200`, duration = `100 (ms)`
  * verb = `GET`, status = `500`, duration = `1 (ms)`
* During the time range (T1, T2]:
  * no HTTP request has been received
* During the time range (T2, T3]:
  * verb = `GET`, status = `500`, duration = `5 (ms)`
  * verb = `GET`, status = `500`, duration = `2 (ms)`
* During the time range (T3, T4]:
  * verb = `GET`, status = `200`, duration = `100 (ms)`
* During the time range (T4, T5]:
  * verb = `GET`, status = `200`, duration = `100 (ms)`
  * verb = `GET`, status = `200`, duration = `30 (ms)`
  * verb = `GET`, status = `200`, duration = `50 (ms)`

Let's imagine we export the metrics as a [Histogram](./datamodel.md#histogram),
and to simplify the story we will only have one histogram bucket
`(-Inf, +Inf)`.

If we export the metrics using **Delta Temporality**:

* (T0, T1]
  * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`,
    max: `100 (ms)`
  * dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`,
    max: `1 (ms)`
* (T1, T2]
  * nothing, since no Measurement was received
* (T2, T3]
  * dimensions: {verb = `GET`, status = `500`}, count: `2`, min: `2 (ms)`,
    max: `5 (ms)`
* (T3, T4]
  * dimensions: {verb = `GET`, status = `200`}, count: `1`, min: `100 (ms)`,
    max: `100 (ms)`
* (T4, T5]
  * dimensions: {verb = `GET`, status = `200`}, count: `3`, min: `30 (ms)`,
    max: `100 (ms)`

You can see that the SDK **only needs to track what has happened after the
latest collection/export cycle**. For example, when the SDK starts to process
measurements in (T1, T2], it can completely forget about what happened during
(T0, T1].
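To make this concrete, here is a minimal sketch of a delta-style aggregator.
This is not part of the specification; all names are hypothetical, and the
single-bucket histogram only tracks count/min/max as in the example above:

```python
from dataclasses import dataclass


@dataclass
class HistogramPoint:
    count: int = 0
    min: float = float("+inf")
    max: float = float("-inf")


class DeltaAggregator:
    """Hypothetical delta aggregator: state covers one interval only."""

    def __init__(self):
        self._points = {}  # dimensions -> HistogramPoint

    def record(self, dimensions, value):
        point = self._points.setdefault(dimensions, HistogramPoint())
        point.count += 1
        point.min = min(point.min, value)
        point.max = max(point.max, value)

    def collect(self):
        # Hand the accumulated interval to the exporter and start fresh;
        # nothing recorded before this call needs to be remembered.
        points, self._points = self._points, {}
        return points


agg = DeltaAggregator()
agg.record(("GET", "200"), 50)
agg.record(("GET", "200"), 100)
agg.record(("GET", "500"), 1)
print(agg.collect())  # the (T0, T1] data; the aggregator is now empty
```

The key point is the swap in `collect`: after each export cycle the aggregator
starts from an empty state, so memory usage is bounded by the number of
dimension permutations seen within a single interval.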
If we export the metrics using **Cumulative Temporality**:

* (T0, T1]
  * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`,
    max: `100 (ms)`
  * dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`,
    max: `1 (ms)`
* (T0, T2]
  * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`,
    max: `100 (ms)`
  * dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`,
    max: `1 (ms)`
* (T0, T3]
  * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`,
    max: `100 (ms)`
  * dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`,
    max: `5 (ms)`
* (T0, T4]
  * dimensions: {verb = `GET`, status = `200`}, count: `3`, min: `50 (ms)`,
    max: `100 (ms)`
  * dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`,
    max: `5 (ms)`
* (T0, T5]
  * dimensions: {verb = `GET`, status = `200`}, count: `6`, min: `30 (ms)`,
    max: `100 (ms)`
  * dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`,
    max: `5 (ms)`

You can see that we are performing Delta->Cumulative conversion, and the SDK
**has to track what has happened prior to the latest collection/export
cycle**; in the worst case, the SDK **will have to remember what has happened
since the beginning of the process**.

Imagine we have a long-running service and we collect metrics with 7
dimensions, where each dimension can have 30 different values. We might
eventually end up having to remember the complete set of all
`21,870,000,000` permutations! This **cardinality explosion** is a well-known
challenge in the metrics space.
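For contrast with the delta sketch above, here is an equally hypothetical
cumulative aggregator. Its `collect` only takes a snapshot and never clears
the state, so the map can keep growing for as long as the process lives:

```python
class CumulativeAggregator:
    """Hypothetical cumulative aggregator: state lives forever."""

    def __init__(self):
        self._points = {}  # dimensions -> [count, min, max]; never cleared

    def record(self, dimensions, value):
        point = self._points.setdefault(
            dimensions, [0, float("+inf"), float("-inf")])
        point[0] += 1
        point[1] = min(point[1], value)
        point[2] = max(point[2], value)

    def collect(self):
        # Snapshot only; every permutation ever seen stays in memory.
        return dict(self._points)


# With 7 dimensions of 30 values each, the map could eventually hold
# 30 ** 7 entries:
print(30 ** 7)  # 21870000000
```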
Making things even worse, if we export every permutation even when there are
no recent updates, the export batch could become huge and very costly. For
example, do we really need/want to export the same thing for (T0, T2] in the
above case?

So here are some suggestions that we encourage SDK implementers to consider:

* You want to control the memory usage rather than allow it to grow
  indefinitely/unbounded, regardless of which aggregation temporality is
  being used.
* You want to improve memory efficiency by being able to **forget about
  things that are no longer needed**.
* You probably don't want to keep exporting the same thing over and over
  again, if there are no updates. You might want to consider [Resets and
  Gaps](./datamodel.md#resets-and-gaps). For example, if a Cumulative metrics
  stream hasn't received any updates for a long period of time, would it be
  okay to reset the start time?

In the above case, we have Measurements reported by a [Histogram
Instrument](./api.md#histogram). What if we collect measurements from an
[Asynchronous Counter](./api.md#asynchronous-counter)?

The following example shows the number of [page
faults](https://en.wikipedia.org/wiki/Page_fault) of each thread since the
thread started:

* During the time range (T0, T1]:
  * pid = `1001`, tid = `1`, #PF = `50`
  * pid = `1001`, tid = `2`, #PF = `30`
* During the time range (T1, T2]:
  * pid = `1001`, tid = `1`, #PF = `53`
  * pid = `1001`, tid = `2`, #PF = `38`
* During the time range (T2, T3]:
  * pid = `1001`, tid = `1`, #PF = `56`
  * pid = `1001`, tid = `2`, #PF = `42`
* During the time range (T3, T4]:
  * pid = `1001`, tid = `1`, #PF = `60`
  * pid = `1001`, tid = `2`, #PF = `47`
* During the time range (T4, T5]:
  * thread 1 died, thread 3 started
  * pid = `1001`, tid = `2`, #PF = `53`
  * pid = `1001`, tid = `3`, #PF = `5`

If we export the metrics using **Cumulative Temporality**:

* (T0, T1]
  * dimensions: {pid = `1001`, tid = `1`}, sum: `50`
  * dimensions: {pid = `1001`, tid = `2`}, sum: `30`
* (T0, T2]
  * dimensions: {pid = `1001`, tid = `1`}, sum: `53`
  * dimensions: {pid = `1001`, tid = `2`}, sum: `38`
* (T0, T3]
  * dimensions: {pid = `1001`, tid = `1`}, sum: `56`
  * dimensions: {pid = `1001`, tid = `2`}, sum: `42`
* (T0, T4]
  * dimensions: {pid = `1001`, tid = `1`}, sum: `60`
  * dimensions: {pid = `1001`, tid = `2`}, sum: `47`
* (T0, T5]
  * dimensions: {pid = `1001`, tid = `2`}, sum: `53`
  * dimensions: {pid = `1001`, tid = `3`}, sum: `5`

This is quite straightforward: we just take the data reported by the
asynchronous instrument and send it. We might want to consider whether
[Resets and Gaps](./datamodel.md#resets-and-gaps) should be used to denote
the end of a metric stream - e.g. thread 1 died, the thread ID might be
reused by the operating system, and we probably don't want to confuse the
metrics backend.

If we export the metrics using **Delta Temporality**:

* (T0, T1]
  * dimensions: {pid = `1001`, tid = `1`}, delta: `50`
  * dimensions: {pid = `1001`, tid = `2`}, delta: `30`
* (T1, T2]
  * dimensions: {pid = `1001`, tid = `1`}, delta: `3`
  * dimensions: {pid = `1001`, tid = `2`}, delta: `8`
* (T2, T3]
  * dimensions: {pid = `1001`, tid = `1`}, delta: `3`
  * dimensions: {pid = `1001`, tid = `2`}, delta: `4`
* (T3, T4]
  * dimensions: {pid = `1001`, tid = `1`}, delta: `4`
  * dimensions: {pid = `1001`, tid = `2`}, delta: `5`
* (T4, T5]
  * dimensions: {pid = `1001`, tid = `2`}, delta: `6`
  * dimensions: {pid = `1001`, tid = `3`}, delta: `5`

You can see that we are performing Cumulative->Delta conversion, which
requires us to remember the last value of **every single permutation we've
encountered so far** - if we don't, we can't calculate the delta value using
`current value - last value`. As you can tell, this is very expensive.
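Here is a minimal sketch of what such a conversion entails (hypothetical
code, not an API from the specification); note that the `_last` map can only
grow:

```python
class CumulativeToDeltaConverter:
    """Hypothetical converter: remembers the last cumulative value of
    every dimension permutation ever observed."""

    def __init__(self):
        self._last = {}  # dimensions -> last cumulative sum; grows forever

    def convert(self, dimensions, cumulative_sum):
        # delta = current value - last value; a permutation seen for the
        # first time is treated as starting from zero.
        delta = cumulative_sum - self._last.get(dimensions, 0)
        self._last[dimensions] = cumulative_sum
        return delta


converter = CumulativeToDeltaConverter()
print(converter.convert(("1001", "1"), 50))  # (T0, T1] -> 50
print(converter.convert(("1001", "1"), 53))  # (T1, T2] -> 3
```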
Making things more interesting, if we have min/max values, it is
**mathematically impossible** to reliably deduce Delta temporality from
Cumulative temporality. For example:

* If the maximum value is 10 during (T0, T2] and the maximum value is 20
  during (T0, T3], we know that the maximum value during (T2, T3] must be 20.
* If the maximum value is 20 during (T0, T2] and the maximum value is also 20
  during (T0, T3], we don't know what the maximum value is during (T2, T3],
  unless we know that there is no value (count = 0).

So here are some suggestions that we encourage SDK implementers to consider:

* You probably don't want to encourage your users to do Cumulative->Delta
  conversion; in fact, you might want to discourage them from doing it.
* If you have to do Cumulative->Delta conversion and you encounter min/max,
  rather than dropping the data on the floor, you might want to convert it to
  something useful - e.g. a [Gauge](./datamodel.md#gauge).

### Memory management

Memory management is a wide topic; here we will only cover some of the things
that matter most for an OpenTelemetry SDK.

**Choose a better design so the SDK has fewer things to memorize.** Avoid
keeping things in memory unless there is a real need to. One good example is
the [aggregation temporality](#aggregation-temporality) discussion above.

**Design a better memory layout**, so the storage is efficient and access to
it is fast. This is normally specific to the target programming language and
platform - for example, aligning the memory to the CPU cache line, keeping
hot memory close together, or keeping the memory close to the hardware (e.g.
the non-paged pool,
[NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access)).

**Pre-allocate and pool the memory**, so the SDK doesn't have to allocate
memory on the fly. This is especially useful for language runtimes that have
garbage collectors, as it ensures the hot path in the code won't trigger
garbage collection.

**Limit the memory usage, and handle critical memory conditions.** The
general expectation is that a telemetry SDK should not fail the application.
This can be done via a dimension-capping algorithm - e.g. start to
combine/drop data points when the SDK hits the memory limit, and provide a
mechanism to report the data loss (see the sketch at the end of this
section).

**Provide configuration to the application owner.** The answer to _"what is
an efficient memory usage"_ ultimately depends on the goals of the
application owner. For example, application owners might want to spend more
memory in order to keep more permutations of metrics dimensions, or they
might want to use memory aggressively for certain dimensions that are
important, and keep a conservative limit for dimensions that are less
important.
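As an illustration of the dimension-capping idea above, here is one possible
(hypothetical) approach: once a configurable series limit is reached, fold
new permutations into a single overflow series and count the affected data
points so the loss can be reported:

```python
OVERFLOW_DIMENSIONS = ("otel.overflow",)  # hypothetical overflow series key


class CappedStorage:
    """Hypothetical dimension-capping storage: once the configured limit
    is reached, new permutations are folded into one overflow series."""

    def __init__(self, max_series=2000):  # limit exposed as configuration
        self._max_series = max_series
        self._sums = {}
        self.overflow_points = 0  # reported so users can see the data loss

    def record(self, dimensions, value):
        if dimensions not in self._sums and len(self._sums) >= self._max_series:
            dimensions = OVERFLOW_DIMENSIONS
            self.overflow_points += 1
        self._sums[dimensions] = self._sums.get(dimensions, 0) + value


storage = CappedStorage(max_series=1)
storage.record(("GET", "200"), 50)
storage.record(("GET", "500"), 1)  # folded into the overflow series
print(storage.overflow_points)     # 1
```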