Add telemetry:span/3 #57
Comments
I agree, that's the most useful measurement when creating spans - I'd say that anything else could go through metadata if needed. Or, one could always emit For this to become useful, we also need an event consumer which works with this convention by default - I assume that it's |
Hi everybody! I'm just catching up on all the latest discussion, but conceptually I really like this -- it makes for a good entrypoint for the basic measurements, and provides some conventional wisdom around event names.
I would agree no, as it simplifies the metrics definitions if you can differentiate between successes and errors at the event. |
Even if this helper function isn't used, I like having a standard to follow, especially with the distinct |
Correct. We also need a way to auto-discover spans which is the other conversation. |
@tsloughter once again we are in a pickle because we have kind of standardized on start/stop by using it in plug, phoenix, absinthe, etc. At the same time, I don't think we should call it start_span/stop_span because the events in isolation can also be used. For example, I can do |
Ah, ok. I don't think |
I was going to suggest that we always have a stop event, but I was trying to work out for myself how we even expect library authors to be able to send measurements in either the stop or error events, given how variable scoping works in try blocks:

```elixir
def work do
  try do
    bytes = 0

    if :rand.uniform(2) == 1 do
      raise "Boom"
    end

    bytes = 100
    IO.inspect(bytes)
  catch
    kind, reason ->
      IO.inspect({kind, reason})
      IO.inspect(__STACKTRACE__)
      # bytes not in scope
  after
    # bytes not in scope
    :ok
  end
end
```

Am I missing something about how folks are planning to use these events as a transport for measurements? |
(sorry for the duped comments - GH was acting weird and saying it didn't post so I tried it again) |
I agree with @GregMefford that we |
To be clear, my point is that in case of errors we won’t be able to emit the same metadata as in case of success for stop events. For example, Ecto includes the number of returned results and this number would be unavailable in case of stop emitted after errors. We could use zero, but that would skew numbers. In a nutshell, my concern is how to always emit stop events without introducing inconsistencies on the metadata, which may skew the overall aggregated data. It is a technical problem which I don’t know how to solve. |
Could we avoid the hard decisions about which single choice is more confusing if we called
That'd support strong guarantees around the shape of the measurements and metadata on each event. Zooming back out, I'd much appreciate this API also supporting measurement control, e.g. for row count. |
@josevalim I'm wondering whether you'd even be able to emit the measurements and metadata on success, as shown in the code snippet in my previous comment. You'd want to emit the stop event with those measurements. The way we handle that with tracing today is that we can add metadata for a span anytime before the span is finished. I think the technical solution here is to let the library emit the measurements as a separate event. This proposal also solves the issue with inconsistent aggregations you mentioned, because then it's up to the library to decide whether to emit the metrics event on error or not. For example, if there was an error while fetching from the DB in one out of 10 queries which normally return 100 rows, what's the minimum/p99 number of rows returned by those query attempts? The library needs to decide whether the failed attempts count as having returned zero rows, or just don't count as attempts at all (I think you're saying that today, Ecto doesn't emit metrics at all on errors). With tracing though, you need to |
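For illustration, here's a rough Elixir sketch of that separate-measurements-event idea. The `[:my_lib, :query]` prefix and the `:results` event name are made up for this example; this isn't any library's actual instrumentation:

```elixir
defmodule MyLib.Query do
  # Hypothetical sketch: measurements are emitted in their own event from
  # inside the try body, where `rows` is still in scope; the stop event only
  # carries the duration and is emitted whether or not the query succeeded.
  def query(sql) do
    start = System.monotonic_time()
    :telemetry.execute([:my_lib, :query, :start], %{system_time: System.system_time()}, %{sql: sql})

    try do
      rows = run(sql)
      :telemetry.execute([:my_lib, :query, :results], %{row_count: length(rows)}, %{sql: sql})
      rows
    after
      # On errors no :results event was sent, so row counts are never skewed.
      :telemetry.execute([:my_lib, :query, :stop], %{duration: System.monotonic_time() - start}, %{sql: sql})
    end
  end

  # Stub standing in for the real work.
  defp run(_sql), do: [%{id: 1}, %{id: 2}]
end
```

Tracers would still attach to start/stop, while metrics handlers could attach only to the measurements event and never see skewed row counts from failed attempts.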
I would love for there to be a better option proposed, but one way to work around the breaking change on the Phoenix side would be to keep emitting the metrics in the current |
I don't think we need to emit the stop event on errors.
So it is very feasible technically to have metadata in the stop event. I am really worried that now we are proposing for any "span" to have at least 4 different events: start, stop, error, and one extra event for the metadata. Telemetry is relatively fast but if we are not careful, those will eventually slow things down. To me it sounds much simpler to say a trace ends either with a stop event or an error event. |
Ok, that works for me. 👍 What do you think about @hauleth's proposal of just putting error metadata in the stop event? |
I think it has the same issues. Sometimes stop will have the error metadata and sometimes it won't. We also have the issue of distinguishing between "errors" and "exceptions". For example, an HTTP client may return an error tuple rather than raising. |
Yes I agree - returning an error from an API is very different than raising an exception and I think the |
I'm a little concerned with conflating the two. As to the proposal @josevalim put out, I feel like this pattern covers the majority of use cases. This discussion is leaning towards prescribing patterns that only suit tracing and makes an assumption that OTel will be the one and only way that people get metrics. Currently, those are all tied to the stop event. The only question I have is: shouldn't that reraise? |
It will definitely re-raise. Any suggestions on another name to use instead of error? |
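A minimal sketch of what the re-raise could look like, assuming the helper wraps the user-supplied function in try/catch; :erlang.raise/3 rethrows with the original class, reason and stacktrace:

```elixir
defmodule SpanSketch do
  # Not the proposed implementation, just the re-raise mechanics.
  def call(fun) when is_function(fun, 0) do
    try do
      fun.()
    catch
      kind, reason ->
        stacktrace = __STACKTRACE__
        # ... the error/exception event would be emitted here ...
        :erlang.raise(kind, reason, stacktrace)
    end
  end
end
```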
There is currently a "status" field on a span. There is a long discussion about status codes, severity, etc here open-telemetry/opentelemetry-specification#306 which resolves to punt until 0.4 of the spec. |
I've seen the usage of "error" in OpenTracing, but that has a specific meaning in those frameworks. Our languages and idioms use that word in a particular way. I'm fine with bringing outside influences into our reporters where those authors have specific knowledge. For the average library author, is it useful to put that burden on them, as well? How would a user of a library interpret seeing that event in the documentation and looking at source code? Are there any arguments against using |
@bryannaegele why do you think it's mostly suited for OTel for metrics? |
My argument against
Not suited, just that we're framing discussions around the standard events in the context of tracing which is a very advanced concept to the average user. Tracing isn't even a standard practice at most large corporations. I just want to ensure we keep the number of events as small as possible (3 tops) and approach adding support for tracing on top of the existing metrics support we already have. |
Can we settle the last loose end of the discussion on naming conventions: I think the issue with the |
I am happy with "fail" but I believe for all reasons stated previously, our convention should be:
PS: unfortunately I could not attend the meeting yesterday, so if I missed something I apologize. |
I'm fine with "fail".
What constitutes an error? Any return that isn't a value, ok, or ok tuple? In the case where we're intercepting an exception and reraising, we have access to the stacktrace. If we're going to raise, we don't have access to the stacktrace at that point, correct? |
"throw, error or exit" in this case refers to the
Yes.
We are never going to raise ourselves. Either the user-supplied function raises or nothing is raised. |
I made a separate issue to talk specifically about guidance around timestamp tracking and monotonic times: |
Hi everyone! We started getting some PRs that follow the spec here. There is only one problem. Some of them use “time” for the system time. Others use “start_time”. Shall we pick one? I think “foo.bar.start.start_time” has a bit of repetition, so we should rather pick “system_time” or just “time” for the measurement. What are your picks? |
I could get behind |
Will it be left in native units? If not it should have a prefix like |
It will be in native, yes, so let’s go with “system_time”. I have also updated this issue description with the latest “spec”, so in case anyone wants to send a PR for this feature, it will be very appreciated! |
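For anyone writing a handler against this, here is a sketch of converting the native-unit measurements; the [:my_app, :worker] event prefix is made up for illustration:

```elixir
defmodule DurationHandler do
  # Sketch only; [:my_app, :worker] is a hypothetical event prefix.
  def attach do
    :telemetry.attach_many(
      "duration-logger",
      [[:my_app, :worker, :start], [:my_app, :worker, :stop]],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([_, _, :start], %{system_time: system_time}, _metadata, _config) do
    # system_time is in native units; convert before building a DateTime.
    us = System.convert_time_unit(system_time, :native, :microsecond)
    IO.puts("started at #{DateTime.from_unix!(us, :microsecond)}")
  end

  def handle_event([_, _, :stop], %{duration: duration}, _metadata, _config) do
    # duration is also in native units; each reporter converts to what it needs.
    IO.puts("took #{System.convert_time_unit(duration, :native, :millisecond)} ms")
  end
end
```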
As an aside, I haven't seen the practical usage of that string API and wouldn't mind seeing it deprecated. Users have told me they find it confusing and it doesn't result in a name that obeys guidelines for most metric related tools, so the user is required to add the |
@josevalim Should we update the "spec" in the description to be more explicit about the duration being calculated via monotonic time differences? |
Thought I'd mention it one more time. Something like wts that keeps monotonic and offset might be better. With it you have a timestamp that is both usable for duration and for system time, it just happens to be a tuple instead of an integer. Telemetry can then just add functions for dealing with them, |
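Something like this, if I understand the idea (a rough sketch; "wts" here is just a {monotonic, offset} tuple and these helper names are hypothetical, not an existing Telemetry API):

```elixir
defmodule WTS do
  # Capture both pieces at once.
  def new, do: {System.monotonic_time(), System.time_offset()}

  # System time is the monotonic time plus the time offset (native units).
  def to_system_time({mono, offset}), do: mono + offset

  # Durations only need the monotonic components.
  def duration({stop_mono, _}, {start_mono, _}), do: stop_mono - start_mono
end
```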
@tsloughter I didn't think |
@bryannaegele aaah, a |
I would say that all durations/times should have format of |
Before we get past "the point of no return," I'd like to reconsider whether error is the right name for that event. I'll bring it up in tomorrow's Observability WG meeting for feedback. |
As we decided in the Observability WG meeting today, we will keep the same behavior discussed here, but rename the error event to exception. |
Where did we land regarding the convention for the "failure" / "exception" event? The issue summary says |
We are going with exception. Redix must be fixed. It was probably written when things were still in flux. |
Redix was fixed: whatyouhide/redix#167 |
Closing in favor of PR. |
This is a spin-off from the discussion in #56. @hauleth mentioned providing `telemetry:span/3` and I replied that I don't think we can come up with a generally useful API and that we should have a separate discussion about it. So instead of me being grumpy, I thought I would open up said discussion and send a proposal. This will emit three events:
- `EventPrefix ++ [start]` with measurement `#{system_time := erlang:system_time()}` and `StartMetadata`
- `EventPrefix ++ [exception]` with measurement `#{duration := erlang:monotonic_time()}` and `#{kind := throw | error | exit, reason := term(), stacktrace := list()}` as metadata
- `EventPrefix ++ [stop]` with measurement `#{duration := erlang:monotonic_time()}` and `StopMetadata`
The only limitation with this approach is that we can't configure measurements, but I would say configuring metadata is good enough and, despite my initial comment, this should cover a good amount of scenarios.
Shall we ship with this approach?
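For reference, a minimal Elixir sketch of a helper following this spec, assuming the wrapped function returns the result together with StopMetadata (the exact contract is still up for discussion here):

```elixir
defmodule Telemetry.SpanSketch do
  # Not the actual implementation, just an illustration of the three events.
  def span(event_prefix, start_metadata, fun) do
    start_time = System.monotonic_time()
    :telemetry.execute(event_prefix ++ [:start], %{system_time: System.system_time()}, start_metadata)

    try do
      {result, stop_metadata} = fun.()
      duration = System.monotonic_time() - start_time
      :telemetry.execute(event_prefix ++ [:stop], %{duration: duration}, stop_metadata)
      result
    catch
      kind, reason ->
        duration = System.monotonic_time() - start_time

        :telemetry.execute(event_prefix ++ [:exception], %{duration: duration}, %{
          kind: kind,
          reason: reason,
          stacktrace: __STACKTRACE__
        })

        :erlang.raise(kind, reason, __STACKTRACE__)
    end
  end
end
```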