-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skeleton for the remaining metrics instruments #1617
Skeleton for the remaining metrics instruments #1617
Conversation
specification/metrics/new_api.md
Outdated
@@ -425,6 +438,78 @@ var obCaesiumOscillates = meter.CreateCounterFunc<UInt64>("caesium_oscillates", | |||
provided by the `callback`, which is registered during the [CounterFunc | |||
creation](#counterfunc-creation). | |||
|
|||
### Gauge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've talked about the name "gauge" in the SIG meetings. Here are my notes:
- "Gauge" is a widely used, yet very confusing term, there are many metrics systems using the word "gauge", but each of them has slightly different meaning (e.g. Prometheus Gauge is different from Micrometer Gauge).
- We've already used the term "gauge" in the Data Model Specification, and it is for timeseries model rather than event model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just throwing this out here... I do not hold a strong opinion on this naming...
Wiki have these definitions for gauge
is a device used to make measurements or in order to **display** certain dimensional information
.a device that displays the measurement of a monitored system by the use of a needle or pointer that moves along a calibrated scale
In the physical world, these gauges always include an actual display for immediate use. In our world, that is not exactly how it works. Thus, maybe we are dealing with just the "sensor" part without the display?
Wiki has definition for sensor
In the broadest definition, a sensor is a device, module, machine, or subsystem whose purpose is to detect events or changes in its environment and send the information to other electronics, frequently a computer processor. A sensor is always used with other electronics.
Wiki has a list of sensors. They include terms like *phone, *monitor, *meter, *sensor, *detector, *gauge, *scope, counter, etc...
I'm thinking maybe a sensor is more accurate to what our code does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If anyone else wants to explore the "sensor" name please comment here.
I personally think "sensor" is a very broadly used term, and it is not found from the famous metrics systems, so I would avoid introducing it to OpenTelemetry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked up "instrument" and one of the definitions is "measuring device used to gauge". I've come to terms with "gauge" in the sense that was mentioned in last week's SIG meeting--it's an instrument of last resort, when either a Histogram or Counter are for some reason not appropriate, meaning for cases where you do not know the aggregation being applied.
"gauge" is a great term for cases where there is not a more specific instrument, and it's a great term for when there's not a better term. "Gauge" has several meanings in English--inspecific. Gauge instruments have many kinds of aggregation--inspecific.
specification/metrics/new_api.md
Outdated
@@ -34,6 +34,15 @@ Table of Contents | |||
* [CounterFunc](#counterfunc) | |||
* [CounterFunc creation](#counterfunc-creation) | |||
* [CounterFunc operations](#counterfunc-operations) | |||
* [Gauge](#gauge) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 2 cents here:
I would avoid mixing increment/decrement operations which make sense only for a "sum", with "Gauge" (set operation). So I would call this an UpDownCounter
that has only add/dec (or just add
but allows negative values) operations, but no set
. Also I've failed to get an answer for a use-case of a set
operation for a sync instrument, maybe worth documenting the use-case or add a link if I missed that.
In my mind we can have:
- Counter + CounterFunc (callback based version of the Counter, however we decide to name it) - the result is a monotonic sum
- UpDownCounter + UpDownCounterFunc (callback based version of the UpDownCounter, however we decide to name it) - the result is a non-monotonic sum
- Sync instrument for reporting values that "grouping" makes the most sense for them. We can call it
Distribution
(the name is the result of the grouping, which does not fit very well for an Instrument, consider Grouper, ValueRecorder etc.). For the moment no clear use-case for an async version of this. - Helper sync instrument for reporting "durations". Call it
Timer
? For the moment no clear use-case for an async version of this. - Async instrument to report "independent" values (e.g. room temperature). Call it
Gauge
(ValueObserver)?
From data-model perspective it is important to distinguish between a non-monotonic sum and a "gauge" (a.k.a value that we don't know how to re-aggregate/merge). The example I had in mind is if users report "cpu usage" as a metric with one label "state" with possible values ["used", "free", "inactive"] and we want to compute a new "memory usage" with no labels, excluding "state=free" we will be able to do that because we know summing the value from state=used + state=inactive is correct, if this was a gauge we cannot do that automatically, we need to ask the user what aggregation to apply when merging points.
I do understand that this requirement leaks some internal details into the API, but without user telling us via a "contract" that this represents a non-monotonic sum, we cannot infer that information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've addressed most parts of this comment, the only thing remaining (I believe) is the Timer.
I think Timer is not a blocker, it is more like a "nice to have" layer on top of distribution.
The PR is updated, please take another look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, a Gauge
is naturally async and it has a set
method. A synchronous scenario for a Gauge
could be having a synchronous process and recording the last processed something
in a gauge.
e.g.:
public updateCar(long id) {
carRepository.update(id, ...);
lastUpddatedCar.set(id);
}
In micrometer, a Gauge
is always async but it can give you the backing state object that you can update:
AtomicLong lastUpdatedCar = registry.gauge("lastUpddatedCar", new AtomicLong(-1));
lastUpddatedCar.set(id);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could somebody please help me to understand the nature of UpDownCounter
? I suppose it will not publish a rate but it will publish the value just like a Gauge
would, right?
If so, is it really needed?
There are two examples in the docs: "the number of active requests" and "the number of items in a queue". To me, both seem like a "Gauge
-scenario", more than that the first one is an example for synchronously updating a Gauge
in synchronous environments.
Let me give you an example for both:
AtomicInteger activeRequests = registry.gauge("numberGauge", new AtomicInteger(0));
public Response process(Request request) {
activeRequests.increment();
...
return response;
}
and
Queue<String> queue = new BlockingQueue();
Gauge.builder("queueSize", queue, Queue::size).register(registry);
or
Queue<String> queue = registry.gauge("queueSize", tags, new BlockingQueue<>(), Queue::size);
My ultimate question is, do we need all of these? Can't we just get along with a single Gauge like the above that can cover all of these scenarios (UpDownCounter
, UpDownCounterFunc
, Gauge
, GaugeFunc
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed with @jonatan-ivanov offline, here goes the summary why we need UpDownCounter
(the name can be changed, but we do need such thing as a separate instrument):
- We need a sync instrument (which has access to the context/baggage) which could take delta value AND is not monotonic. We cannot squeeze this into an async instrument since we do not want to have any delta value in async instrument (we want async instrument to take absolute value instead).
- This instrument by nature (since it takes delta) would allow values to be added across dimensions (take the heap size of each worker process id, you will get the total heap size of all worker processes).
- This instrument does not fit Counter or Distribution.
- Calling it Gauge is probably a bad idea, since Gauge doesn't imply additive property.
@aabmass brought up an idea that we might be able to achieve this by having two Counters #1617 (comment). I want to see more feedback to his comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'd like the decision tree we discussed (and user story of "What instrument to pick") to be highlighted in one spot
Can I propose we go further and attempt to write out what the user documentation text on the OTel website will be? Hopefully it should feel like the instruments and names we are choosing make the story we tell in the documentation concise, consistent and easily understood. If we can't make it sound that way I'd take it as a warning sign. I'm glad to take the first stab at this or collaborate with someone else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reyang this is 💯
Just my opinion, but rather than |
@jkwatson I think there is an open item for discussion about renaming the |
I'll send separate PRs about this based on the options we've discussed yesterday. Update: PR #1645 is created as a follow up. |
|
||
### GaugeFunc | ||
|
||
`GaugeFunc` is an asynchronous Instrument which reports non-additive value(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is a Gauge not needed?
Gauge
is a synchronous Instrument which reports non-additive values(s)
The example would be to report the room temperature when I finished an expensive operation. I have "context" as I'm associating my operation to the room temperature. If I only used GaugeFunc with a callback, I cannot provide the correct timing context as well as operation context.
Then, we have a Distribution. Am I suppose to use that instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you explain why the room temperature is associated with operation context? Or consider finding another scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about changing room temperature to CPU temperature. I want to record the CPU temperature after I run a set of benchmark tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you? I think this is kind of an artificial use-case. If you really want that you can save the value and return it in the func for the moment.
Also probably at that point you want the distribution of temperatures after multiple runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm struggling to understand why this scenario and/or pattern is artificial or unwanted. The pattern involves doing an operation (where I have my context) and I want to report a value (that is not additive).
i.e.
- Report my CPU Temperature (value: Temp) after a benchmark test (context: test_name)
- After processing an incoming request (context: Type of request) I want to report a status code (value: 1=success/2=failure/#=etc.)
Are we saying I should use the sync Distribution instrument for these?
If you really want that you can save the value and return it in the func for the moment.
This is true. However, current design requires the callback to report for same time. This presents a problem trying to report multiple operations from one (async) collection period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will admit that I have trouble finding many real-life examples of a synchronous gauge. The most common use of the term, I think comes from reading the speedometer of a car. This reading is instantaneous, you didn't make any request to your dashboard.
For real-life examples, I'm thinking about reading an electrical meter. You are an energy company and need to send people or RPCs around to each meter to get a reading of the current usage. As this request makes it's way into the meter, it has context: which part of the grid it's on, where's the substation, who's taking the measurement, and so on. When it reaches the meter, it might see Voltage (a true Gauge) and kWh used (UpDownSum) and want to Record them.
I've introduced a conflict--we're talking about adding a synchronous Gauge (to set Voltage) but we have not been not talking about adding a synchronous UpDownSum instrument, for synchronously making cumulative observations (discussed as long ago as open-telemetry/oteps#88), and I hope that we don't.
So why is it important to offer synchronous true Gauge? Maybe because of historical precedent. However, it OTLP is going to embrace its Non-Monotonic Cumulative Sum point, which is often confused with Gauge, then we better have a good answer as to why we're not introducing a Synchronous Non-Monotonic Cumulative Sum instrument. This would be the instrument used by the meter reader at my house, since I have solar panels. 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approve with two caveats:
- I don't like Histogram of "Func" names, but don't have an alternative to propose you haven't already considered or that doesn't violate some other principle for the API.
- I'd like the decision tree we discussed (and user story of "What instrument to pick") to be highlighted in one spot
`UpDownCounter` is a synchronous Instrument which supports increments and | ||
decrements. | ||
|
||
Note: if the value grows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to all these call-outs when to use other instruments.
Example uses for `UpDownCounter`: | ||
|
||
* the number of active requests | ||
* the number of items in a queue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an example, lets say I have an app runs on a bunch of machines and each machine has a bunch of queues with incoming customer traffic (I'm guessing this is a common situation). I know that if any of these queues gets particularly large that is a bad sign, implying high latency and maybe failure in the near future because my service is receiving requests faster than it can process them. In my app I create an UpDownCounter and use the queue name as a label. Then in my monitoring dashboard I want to set up alerts and graphs that show me the maximum queue size (or maybe the 99 percentile of queue size) across my fleet. Can I do that or should I have chosen a different instrument for this use case?
I think the underlying question may go more to the data model, but without answering it I don't know what to expect from the API. The data model refers to data streams having aggregate functions that work across both temporal and spatial dimensions. Does that mean the aggregation is the same no matter which dimension I am aggregating across? In this example I picked UpDownCounter because it is summing over time but across the queue dimension I don't want a sum, I want a maximum or a percentile. It wasn't clear if there was any way to express this.
I've probably got various follow up questions, but I need to figure this one out first : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is my take:
UpDownCounter
is the proper instrument since "number of items" by nature allows Sum. The "number of items" can be added across temporal and spatial dimensions. Although depending on the scenario one might prefer other aggregations (e.g. "the total number of items in all queues" might make more sense if the queues are homogeneous, and will make less sense if each queue is serving a totally different purpose).- My answer would be yes - one should be able to customize the consumption experience and do alerting based on maximum queue size (or 99 percentile queue size) without having to use a different instrument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Reiley! This leads me to a few follow ups:
- My understanding of the data model spec is that there is a single aggregation function for data streams across spatial and time dimensions whereas your answer implies there can be different aggregation functions for different dimensions. Am I misreading the spec or if not how would this configuration work?
- I would guess that it is rare to find a scenario where summing is the desirable spatial re-aggregation function for every label dimension. For example summing over a machine name label or a process instance label implies all the alert levels would have to be reconfigured whenever the number of machines/processes in the distributed system changes. Does that match your expectations or am I not considering the right scenarios?
- Following (2), if the default configuration for an UpDownCounter is that it always sums when aggregated over any label and nearly all scenarios have labels that should not be summed over, this implies nearly everyone will need to change the configuration away from the default for their UpDownCounters. If GaugeFunc and UpDownCounterFunc only differ based on their default spatial aggregation behavior and most use-cases won't be able to use those defaults then for practical purposes is it fair to say that UpDownCounterFunc and GaugeFunc don't have any differences that are likely to impact the typical scenarios that would make use of them? (I'm not implying that redundancy automatically needs to be eliminated, I just want to understand what are the practical differences, if any)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be performance implications of using UpDownCounter
in the way described here. Although we think of UpDownCounter
as a way to encode non-monotonic deltas semantically, they are almost always used with in-process aggregation, thus exported in cumulative form.
The consumer of a stream of non-monotonic deltas is able to calculate a smoothed rate over windows of time, but has to start reading and keeping state itself from beginning of time in order to extract actual queue sizes in these examples.
The reason users will sometimes prefer UpDownCounter
to its asynchronous form is that the Metric SDK can be configured to compute virtual queue sizes along multiple sets of dimensions, whereas an asynchronous callback will likely only be able to report on actual queue sizes.
I occasionally talk about a "Stateless" exporter configuration for OTLP exporters and OTel SDKs that outputs deltas instead of cumulative sums, thus allowing the SDK to release memory. It makes sense to have a stateless option for all the instruments except UpDownCounter
because of the preceding argument--they're almost always useful aggregated from the beginning of time, and it SHOULD be the caller's responsibility if possible to maintain that state. Still, I'd like it to be possible for a stateless SDK to output UpDownCounter
aggregations with delta temporality, provided the consumer can maintain lifetime state on behalf of the process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmacd - Thanks! Your answer makes me wonder a few things:
-
Your answer felt detailed, but our current guidance/decision tree does not contain that level of detail. Are you worried that users applying a simple decision tree such as "Queue size is good to monitor with UpDownCounter" will have a bad outcome or you think the they will still be reasonably served by the simple advice?
-
Thinking more about the requirement for the SDK to maintain state, this worries me about recommending people to use UpDownCounter in scenarios like this. I know that the common/simple scenario for monitoring involves enabling the SDK at process startup and having it track state for the lifetime of the service but if possible I'd like the API to support other scenarios such as:
- Ops personel use a monitoring dashboard to reconfigure collection policies dynamically. New configuration is transmitted over the network to a monitored app which applies it without restarting.
- Ad-hoc tools similar to perfmon on Windows or top on Linux want to dynamically start listening to a metric for a period of time, then stop.
These scenarios are places where ideally the SDK does not pay the cost of maintaining the state up-front, but it can enabled on demand. However since it is impossible to recover the cummulative state of an UpDownCounter the SDK is forced to either track everything just in case or fail any request to start tracking on demand. However if the developer had used an instrument where they provided the size of the queue directly rather than its delta (sync sum, sync Gauge, async sum, async Gauge) there would be no problem enabling these metrics on demand. As the developer I'd expect it is just as easy for me to record total queue size as it is to record the delta in queue size so if one of these options gives me more capabilities than the other then picking UpDownCounter sounds like it would be a worse choice and I'll never want to pick it.
(Also I still think the follow up questions in my 2nd post on this thread remain unanswered, apologies if I am being dense and this was intended to answer them as apparently I didn't understand that connection)
I'll admit initially I wasn't a fan of ValueRecorder but it has grown on me while "Distribution" feels worse over time (and Reiley using 'Distribution' in the first place may be my fault, I used that name in the .NET metric prototype code). Our other names fit the naming pattern of being an instrument (a Counter counts, a Gauge gauges) whereas "Distribution" doesn't describe an instrument. I agree that ValueRecorder is quite generic, but framed correctly that might be exactly what we need to tell a good story. For example the docs might say something like:
|
I'm not sure I like I believe our goal is to choose instrument names that sound like instruments. This may be a stretch, but I want to write that "a Histogram is a statistical instrument for measuring a distribution". |
IMO, it is a stretch : ) If I read that sentence in the OTel docs this is what goes through my head: "I've never thought of Histogram this way before and it doesn't feel like an intuitive definition, but I guess the OTel folks are making their own novel re-definition of the term and I'll do my best to roll with it" |
We've discussed this PR during the 04/22/2021 Metrics API/SDK Spec SIG Meeting. Here goes the conclusion:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with the @reyang comment lgtm
Policy https://github.com/open-telemetry/opentelemetry-specification/blob/main/CONTRIBUTING.md#how-to-get-your-pr-merged is satisfied except the time from the last significant change. I will merge tomorrow evening PT if no objections will be posted. |
I'm OK with the PR, but I do still have significant concerns that UpDownCounter's requirements to maintain state forever make it incompatible with scenarios where the monitoring user wants to capture data ad-hoc or do dynamic reconfiguration. If nothing changed on that aspect of the design (or in my understanding of the issue) my guess is that .NET would implement the instrument but never recommend using it. (#1617 (comment)) |
* update toc * add the skeleton * improve wording * have separate instruments for additive/non-additive async scenario * rollback the observe/report change * rename Distribution to Histogram based on the SIG meeting discussion Co-authored-by: Sergey Kanzhelev <S.Kanzhelev@live.com>
Related to #1578 and #1590.
Changes
So far we've made good progress on the
Counter
andCounterFunc
APIs in #1578 and #1590. This PR tries to propose the names for the remaining instruments. Note that I intentionally left severalTODO
s just to help us focusing on the skeleton, once this PR got merged, I will send a follow-up PR to fill in the real meat.When deciding which Instrument to use, here goes the decision tree:
Counter
UpDownCounter
GaugeFunc
CounterFunc
UpDownCounterFunc
Histogram
Histogram
(orTimer
if we have it as a specialized version ofHistogram
)Another way of making decision on Instrument:
Histogram
Counter
UpDownCounter
CounterFunc
UpDownCounterFunc
GaugeFunc
After this PR, the only outstanding issues for the metrics API spec would be:
Related oteps OTEP146