
Add telemetry:span/3 #57

Closed
josevalim opened this issue Jan 15, 2020 · 67 comments

@josevalim (Contributor) commented Jan 15, 2020

This is a spin-off from the discussion in #56. @hauleth mentioned providing telemetry:span/3 and I replied that I don't think we can come up with a generally useful API and that we should have a separate discussion about it. So instead of me being grumpy, I thought I would open up said discussion and send a proposal:

telemetry:span(EventPrefix, StartMetadata, (() -> {Result, StopMetadata})) :: Result

This will emit three events:

  • EventPrefix ++ [start] with measurement #{system_time := erlang:system_time()} and StartMetadata
  • EventPrefix ++ [exception] with measurement #{duration := erlang:monotonic_time()} and #{kind := throw | error | exit, reason := term(), stacktrace := list()} as metadata
  • EventPrefix ++ [stop] with measurement #{duration := erlang:monotonic_time()} and StopMetadata

The only limitation with this approach is that we can't configure measurements, but I would say configuring metadata is good enough, and despite my initial comment, this should cover a good amount of scenarios.

Shall we ship with this approach?
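
For illustration, here is a rough sketch (in Elixir, with placeholder names) of how such a wrapper could emit the three events. This is only a sketch of the proposed semantics, not the final implementation, and it assumes duration is reported as a monotonic-time delta:

  def span(event_prefix, start_metadata, fun) do
    start_mono = System.monotonic_time()
    :telemetry.execute(event_prefix ++ [:start], %{system_time: System.system_time()}, start_metadata)

    try do
      # the instrumented function is expected to return {result, stop_metadata}
      {result, stop_metadata} = fun.()
      duration = System.monotonic_time() - start_mono
      :telemetry.execute(event_prefix ++ [:stop], %{duration: duration}, stop_metadata)
      result
    catch
      kind, reason ->
        duration = System.monotonic_time() - start_mono
        metadata = %{kind: kind, reason: reason, stacktrace: __STACKTRACE__}
        :telemetry.execute(event_prefix ++ [:exception], %{duration: duration}, metadata)
        # re-raise so the caller still sees the original failure
        :erlang.raise(kind, reason, __STACKTRACE__)
    end
  end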

@arkgil (Contributor) commented Jan 15, 2020

I would say configuring metadata is good enough

I agree, that's the most useful measurement when creating spans - I'd say that anything else could go through metadata if needed. Or, one could always emit start, stop, error manually.

For this to become useful, we also need an event consumer which works with this convention by default - I assume that it's Telemetry.Metrics.span from beam-telemetry/telemetry_metrics#53, right?

@mcrumm commented Jan 15, 2020

Hi everybody! I'm just catching up on all the latest discussion, but conceptually I really like this -- it makes for a good entrypoint for the basic measurements, and provides some conventional wisdom around event names.

should stop be emitted in case of error?

I would agree no, as it simplifies the metrics definitions if you can differentiate between successes and errors at the event.

@binaryseed (Contributor)

Even if this helper function isn't used, I like having a standard to follow, especially with the distinct error event.

@josevalim (Contributor, Author)

For this to become useful, we also need an event consumer which works with this convention by default - I assume that it's Telemetry.Metrics.span from beam-telemetry/telemetry_metrics#53, right?

Correct. We also need a way to auto-discover spans which is the other conversation.

@tsloughter (Contributor)

start_span would be safer than start. Or is the EventPrefix supposed to include something about this being a span event?

@josevalim (Contributor, Author)

@tsloughter once again we are in a pickle because we have kind of standardized on start/stop by using it in Plug, Phoenix, Absinthe, etc.

At the same time, I don't think we should call it start_span/stop_span because the events in isolation can also be used. For example, I can do counter([:my_app, :request, :start]), while counter([:my_app, :request, :start_span]) would be a bit weird.

@tsloughter (Contributor)

Ah, ok. I do think counter([..., start_span]) makes sense: it's a counter that counts how many times a span (operation) is started :). But if start is already in common usage then it isn't worth changing.

@GregMefford

I was going to suggest that we always have start and stop events, with the error event being optional and not carrying the same metrics as the stop event. It's more like an annotation that an error occurred during the span: the emitting library would still have "control" of the process and might do some more work before it officially sends the stop event and "returns control" to the parent span.


I was trying to work out for myself how we even expect library authors to be able to send metrics in either the stop or error events. Imagine we want to calculate how many bytes were transferred in some operation that might crash, so we follow the recommended try/catch/after structure:

  def work do
    try do
      bytes = 0

      if :rand.uniform(2) == 1 do
        raise "Boom"
      end

      bytes = 100
      IO.inspect(bytes)
    catch
      kind, reason ->
        IO.inspect({kind, reason})
        IO.inspect(__STACKTRACE__)
        # bytes is not in scope here
    after
      # bytes is not in scope here either
      :ok
    end
  end

Am I missing something about how folks are planning to use these events as a transport for measurements?

@GregMefford

(sorry for the duped comments - GH was acting weird and saying it didn't post so I tried it again)

@hauleth (Contributor) commented Jan 19, 2020

I agree with @GregMefford that the error event should be independent from start and stop. IMHO it makes more sense. Alternatively, the stop event could carry the error in its metadata when one happened, instead of a completely separate event.

@josevalim (Contributor, Author)

To be clear, my point is that in case of errors we won’t be able to emit the same metadata as in case of success for stop events. For example, Ecto includes the number of returned results and this number would be unavailable in case of stop emitted after errors. We could use zero, but that would skew numbers.

In a nutshell, my concern is how to always emit stop events without introducing inconsistencies on the metadata, which may skew the overall aggregated data.

It is a technical problem which I don’t know how to solve.

@garthk commented Jan 20, 2020

Could we avoid the hard decisions about which single choice is more confusing if we called :telemetry.execute/4 more than once at the tail end?

  • start at the head
  • stop and stop_normal at the tail when we return a value
  • stop and stop_exception at the tail when we crash

That'd support strong guarantees around the shape of the measurements and metadata on start and stop_normal, for telemetry consumers needing such. It'd also let those interested only in duration and status simply hook up start and stop.

Zooming back out, I'd much appreciate this API also supporting measurement control e.g. for row count.

@GregMefford

@josevalim I'm wondering whether you'd even be able to emit the measurements and metadata on success, as shown in the code snippet in my previous comment. You'd want to emit the stop tracing hook in the after block, but you don't have access to the metrics from inside the main function body at that point, unless I'm missing something.

The way we handle that with tracing today is that we can add metadata for a span anytime before the stop, so if we want to attach some metrics or other information (for example, number of rows returned from the database), we do it from the main function body where they're in-scope and then call stop in the after block.

I think the technical solution in :telemetry framework we're talking about would be to simply not have metrics on the same events that are used for tracing, but rather use separate metrics events using whatever naming convention makes sense for the library (e.g. phoenix.endpoint.call with the metrics/metadata that are currently returned in phoenix.endpoint.stop, or firing a separate event for each particular metric like Hackney does). This is of course a breaking change for Phoenix because they're currently on the stop event, but I don't see an easy way around that if we want to have this naming convention for tracing and also attach metrics to it. You could still emit a duration because it can be easily calculated by the time delta from the start to stop, regardless of what happens in the try...end block, but not things like count of DB rows returned. It wouldn't be a breaking change for Ecto, though, because their metric suffix is :query anyway, so they'd just need to add :start and :stop events to the existing metrics for tracing.

This proposal also solves the issue with inconsistent aggregations you mentioned, because then it's up to the library to decide whether to emit the metrics event on error or not. For example, if there was an error while fetching from the DB in one out of 10 queries which normally returns 100 rows, what's the minimum/p99 number of rows returned by those query attempts? The library needs to decide whether the failed attempts count as having returned zero rows, or just don't count as attempts at all (I think you're saying that today, Ecto doesn't emit metrics at all on errors). With tracing though, you need to stop what you started or else it messes up the whole stack, so you don't get to choose not to account for that attempt from a tracing standpoint, but if you have separate metrics events, you can choose whether to fire it or not in case of errors.

@GregMefford

I would love for there to be a better option proposed, but one way to work around the breaking change on the Phoenix side would be to keep emitting the metrics in the current phoenix.endpoint.stop event, but add an additional set of events specifically for tracing, like phoenix.tracing.{endpoint,controller,view}.{start,stop}

@josevalim (Contributor, Author)

You'd want to emit the stop tracing hook in the after block

I don't think we need to emit the stop event on after. If you need to include what happens in after (which is not very common in the first place), you can always wrap it in another try/catch. For example, your code could be written like this, where actual_work is the work function you provided above.

  def work do
    try do
      instrument_start(0)
      actual_work()
    catch
      _, _ ->
        instrument_error(error_metadata)
        instrument_stop(no_metadata_here_sad_face)
    else
      result ->
        instrument_stop(result)
    end
  end

So it is very feasible technically to have metadata in the stop event. It is not an actual limitation.

I am really worried that now we are proposing for any "span" to have at least 4 different events: start, stop, error, and one extra event for the metadata. Telemetry is relatively fast but if we are not careful, those will eventually slow things down. To me it sounds much simpler to say a trace ends either with stop or error, for the reasons below:

  1. we reduce the amount of required events
  2. as shown above, emitting metadata on stop does not reduce the quality of the spans
  3. having to listen to both stop and error makes it slightly harder for reporters, but I would rather push the complexity to reporters (which are maintained by dozens of people) than to telemetry users (of which there are many more, and whose number we want to grow)

@GregMefford

Ok, that works for me. 👍

What do you think about @hauleth's proposal of just putting error metadata in the :stop event instead of having a separate :error event? I think it could be nice to only have to think about two types of events, but it might be easier to just pattern match on the :error atom instead of matching on the specific shape of the metadata. I think I'm leaning more toward having the :error event replace the :stop event when there's a crash. This "makes sense" to me because exceptions break the normal flow of execution, which "explains" why you get something else instead of a :stop.

@josevalim (Contributor, Author)

What do you think about @hauleth's proposal of just putting error metadata in the :stop event instead of having a separate :error event?

I think it has the same issues. Sometimes stop will have :error metadata, sometimes it won't, and when it has the :error metadata, it won't have the usual success metadata either.

We also have the issue of distinguishing between "errors" and "exceptions". For example, an HTTP client may return {:error, _}, so that would be a :stop and not an :error. So I think we should keep :error exclusively for "this blew up and you won't have the information you usually would - you won't have a stop either". Does that make sense? It is meant to be triggered when we indeed can't submit the stop.
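
To make that distinction concrete, here is a hedged sketch assuming the proposed span helper; HTTPClient, the event prefix, and the metadata keys are made up for illustration. Because an error tuple is a normal return value, this still ends with a :stop event:

  def request(url) do
    :telemetry.span([:my_app, :request], %{url: url}, fn ->
      case HTTPClient.get(url) do
        {:ok, resp} = ok -> {ok, %{status: resp.status}}
        # an error tuple is still a normal return, so :stop is emitted here too
        {:error, reason} = err -> {err, %{error: reason}}
      end
    end)
  end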

@GregMefford

Yes I agree - returning an error from an API is very different than raising an exception and I think the :error event should only be for the latter.

@bryannaegele (Contributor)

I'm a little concerned with conflating :error with exceptions. I could be proven wrong, but I feel like this could cause confusion since that atom is an idiomatic normal return value.

As to the proposal @josevalim put out, I feel like this pattern covers the majority of use cases. This discussion is leaning towards prescribing patterns that only suit tracing and make an assumption that OTel will be the one and only way that people get metrics. Currently, those are all tied to the :stop event which this pattern covers.

The only question I have is: shouldn't that reraise?

@josevalim (Contributor, Author)

It will definitely re-raise. Any suggestions on another name for :error? :error actually means only one failure type in Erlang. Do span technologies or OTel use a specific name for errors/failures/etc.?

@tsloughter (Contributor)

There is currently a "status" field on a span. There is a long discussion about status codes, severity, etc. here: open-telemetry/opentelemetry-specification#306, which resolves to punt until 0.4 of the spec.

@bryannaegele (Contributor)

I've seen the usage of "error" in OpenTracing, but that has a specific meaning in those frameworks. Our languages and idioms use that word in a particular way. I'm fine with bringing outside influences into our reporters where those authors have specific knowledge. For the average library author, is it useful to put that burden on them, as well? How would a user of a library interpret seeing that event in the documentation and looking at source code?

Are there any arguments against using :exception which doesn't seem to have any potential for confusion?

@josevalim (Contributor, Author)

exception has a meaning in Elixir and therefore can be potentially confusing too. We could use fail or abort?

@arkgil (Contributor) commented Jan 20, 2020

fail sounds good, as it's none of the failure types in Erlang so it can cover all of them. abort to me looks more like the operation was aborted deliberately.


This discussion is leaning towards prescribing patterns that only suit tracing and make an assumption that OTel will be the one and only way that people get metrics

@bryannaegele why do you think it's mostly suited for OTel for metrics?

@bryannaegele (Contributor)

My argument against error is that it has a specific meaning and we don't conflate exceptions with errors. An exception event emitted in a catch would feel natural to me.

@bryannaegele why do you think it's mostly suited for OTel for metrics?

Not suited, just that we're framing discussions around the standard events in the context of tracing which is a very advanced concept to the average user. Tracing isn't even a standard practice at most large corporations.

I just want to ensure we keep the number of events as small as possible (3 tops) and approach adding support for tracing on top of the existing metrics support we already have.

@GregMefford

Can we settle the last loose end of the discussion on naming conventions: start, stop and (optionally) fail?

I think the issue with the exception name is that it may not always result from an actual Elixir Exception, and abort looks intentional rather than the result of a crash.

@josevalim (Contributor, Author) commented Feb 14, 2020

I am happy with "fail", but I believe, for all the reasons stated previously, our convention should be:

  • A "start" is emitted at the beginning of every span
  • Then ONE OF "stop" and "fail" will be emitted:
    • "stop" is emitted if the function did not throw, error or exit
    • "fail" is emitted if the function did throw, error or exit
  • If the process crashes due to a link, then none of "stop" or "fail" may be emitted
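
For illustration, a consumer following this convention might look like the sketch below. The [:my_app, :work] prefix and the "fail" spelling are placeholders for whatever we settle on, not an existing API:

  defmodule MyApp.SpanHandler do
    def attach do
      events = [
        [:my_app, :work, :start],
        [:my_app, :work, :stop],
        [:my_app, :work, :fail]
      ]

      :telemetry.attach_many("my-app-span-handler", events, &__MODULE__.handle_event/4, nil)
    end

    # start carries system_time; stop/fail carry duration (native units)
    def handle_event([:my_app, :work, :start], _measurements, _meta, _config), do: :ok

    def handle_event([:my_app, :work, :stop], %{duration: duration}, _meta, _config) do
      IO.inspect(duration, label: "span duration")
    end

    def handle_event([:my_app, :work, :fail], _measurements, %{kind: kind, reason: reason}, _config) do
      IO.inspect({kind, reason}, label: "span failed")
    end
  end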

PS: unfortunately I could not attend the meeting yesterday, so if I missed something I apologize.

@bryannaegele (Contributor)

I'm fine with "fail".

"fail" is emitted if the function did throw, error or exit

What constitutes an error? Any return that isn't a value, ok, or ok tuple?

In the case where we're intercepting an exception and reraising, we have access to the stacktrace. If we're going to raise, we don't have access to the stacktrace at that point, correct?

@josevalim (Contributor, Author)

"throw, error or exit" in this case refers to the throw, error, and exit atoms you can handle inside the catch part of a try. I.e. they are non-local returns.

In the case where we're intercepting an exception and reraising, we have access to the stacktrace.

Yes.

If we're going to raise, we don't have access to the stacktrace at that point, correct?

We are never going to raise ourselves. Either the user-supplied function raises or nothing is raised.
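
For reference, a sketch of the three classes a catch clause can see (do_work/0 is just a placeholder); a function that returns {:error, _} never lands here:

  try do
    do_work()
  catch
    :throw, value -> {:caught, :throw, value}    # throw(value)
    :error, reason -> {:caught, :error, reason}  # raise/1 or :erlang.error/1
    :exit, reason -> {:caught, :exit, reason}    # exit(reason)
  end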

@binaryseed (Contributor)

I made a separate issue to talk specifically about guidance around timestamp tracking and monotonic times.

@josevalim (Contributor, Author)

Hi everyone! We started getting some PRs that follow the spec here. There is only one problem. Some of them use “time” for the system time. Others use “start_time”. Shall we pick one? I think “foo.bar.start.start_time” has a bit of repetition. So we should rather pick “system_time” or just “time” for the measurement. What are your picks?

@binaryseed (Contributor)

I could get behind system_time. It has a nice direct symmetry with functions in Elixir & Erlang, so it is easy to determine the exact meaning of the measurement by looking up the docs.

@tsloughter (Contributor)

Will it be left in native units? If not, it should have a suffix like system_time_nanosecs.

@josevalim (Contributor, Author)

It will be in native, yes, so let’s go with “system_time”. I have also updated this issue description with the latest “spec”, so in case anyone wants to send a PR for this feature, it will be very appreciated!

@bryannaegele (Contributor)

system_time is my vote. @tsloughter the spec would be that it should always be native units, and we'll have control of that in this proposal.

As an aside, I haven't seen the practical usage of that string API and wouldn't mind seeing it deprecated. Users have told me they find it confusing and it doesn't result in a name that obeys guidelines for most metric related tools, so the user is required to add the event_name regardless.

@binaryseed (Contributor)

@josevalim Should we update the "spec" in the description to be more explicit about the duration being calculated via monotonic_time? I think that's non-obvious to most folks w/o experience in this stuff...

@tsloughter (Contributor)

Thought I'd mention it one more time. Something like wts that keeps monotonic and offset might be better.

With it you have a timestamp that is both usable for duration and for system time, it just happens to be a tuple instead of an integer.

Then system_time/monotonic_time can both be dropped and just timestamp be given which is a 2-tuple of the monotonic time and the offset.

Telemetry can then just add functions for dealing with them, e.g. telemetry:duration(Timestamp1, Timestamp2), telemetry:to_system_time(Timestamp), etc.
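
A rough sketch of that idea, with hypothetical helpers (not an existing telemetry API): carry the monotonic time together with the time offset, and derive both duration and system time from the pair:

  defmodule SpanTimestamp do
    # a timestamp is {monotonic_time, time_offset}, both in native units
    def stamp, do: {System.monotonic_time(), System.time_offset()}

    # duration between two timestamps, in native units
    def duration({mono1, _offset1}, {mono2, _offset2}), do: mono2 - mono1

    # system-time equivalent of a timestamp
    def to_system_time({mono, offset}), do: mono + offset
  end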

@bryannaegele (Contributor)

@tsloughter I didn't think monotonic_time was a thing anymore as a measurement that gets emitted. It's just used in the wrapper for calculating the duration.

@tsloughter (Contributor)

@bryannaegele aaah, a duration is reported. I see, nevermind :)

@hauleth (Contributor) commented Mar 9, 2020

I would say that all durations/times should have the format non_neg_integer() | {non_neg_integer(), erlang:time_unit()}, where a raw integer means that it uses the native time unit. There could be a telemetry:to_unit(Time, Unit) function that would allow conversion if someone needed that.
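
For what it's worth, a consumer can already do that conversion with the standard library; a telemetry:to_unit/2 could be a thin wrapper over this (sketch only, the value is made up):

  # duration as received in the :stop measurements, in native units
  duration_native = 123_456
  duration_ms = System.convert_time_unit(duration_native, :native, :millisecond)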

@GregMefford

Before we get past "the point of no return," I'd like to reconsider whether failure is the right word for the abnormal exit condition as opposed to the normal stop event. There are lots of kinds of failures in libraries and it seems like exception is a better term to use for something exceptional like a process crash, as opposed to a failure like {:error, :no_connection} when you try to make an API call that failed for some reason. This came up as some feedback as we started to integrate this convention in a few libraries. I just want to make sure we align on the right path before we merge a bunch of PRs within our small circle and regret it when we get broader community feedback.

I'll bring it up in tomorrow's Observability WG meeting for feedback.

@GregMefford

As we decided in the Observability WG meeting today, we will keep the same behavior discussed here, but rename the :failure event to :exception to make it more clear that this is specifically a case where the normal code path didn't execute to the normal :stop event, and it should always have a stacktrace associated.

@binaryseed (Contributor)

Where did we land regarding the convention for the "failure" / "exception" event?

The issue summary says exception is its own event, but the recent implementation over in Redix uses stop for both cases...

@josevalim (Contributor, Author)

We are going with exception. Redix must be fixed. It was probably written when things were still in flux.

@whatyouhide (Contributor)

Redix was fixed: whatyouhide/redix#167

@josevalim (Contributor, Author)

Closing in favor of PR.
