
Add telemetry:span/3 #57

Closed
josevalim opened this issue Jan 15, 2020 · 67 comments

@josevalim (Contributor) commented Jan 15, 2020

This is a spin-off from the discussion in #56. @hauleth mentioned providing telemetry:span/3 and I replied that I don't think we can come up with a generally useful API and that we should have a separate discussion about it. So instead of me being grumpy, I thought I would open up said discussion and send a proposal:

telemetry:span(EventPrefix, StartMetadata, (() -> {Result, StopMetadata})) :: Result

This will emit three events:

  • EventPrefix ++ [start] with measurement #{system_time := erlang:system_time()} and StartMetadata
  • EventPrefix ++ [exception] with measurement #{duration := erlang:monotonic_time()} and #{kind := throw | error | exit, reason := term(), stacktrace := list()} as metadata
  • EventPrefix ++ [stop] with measurement #{duration := erlang:monotonic_time()} and StopMetadata

The only limitation with this approach is that we can't configure measurements, but I would say configuring metadata is good enough, and despite my initial comment, this should cover a good amount of scenarios.

Shall we ship with this approach?
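
For illustration, here is a rough sketch (in Elixir, with placeholder names) of how such a wrapper could emit the three events. This is only a sketch of the proposed semantics, not the final implementation, and it assumes duration is reported as a monotonic-time delta:

  def span(event_prefix, start_metadata, fun) do
    start_mono = System.monotonic_time()
    :telemetry.execute(event_prefix ++ [:start], %{system_time: System.system_time()}, start_metadata)

    try do
      # the instrumented function is expected to return {result, stop_metadata}
      {result, stop_metadata} = fun.()
      duration = System.monotonic_time() - start_mono
      :telemetry.execute(event_prefix ++ [:stop], %{duration: duration}, stop_metadata)
      result
    catch
      kind, reason ->
        duration = System.monotonic_time() - start_mono
        metadata = %{kind: kind, reason: reason, stacktrace: __STACKTRACE__}
        :telemetry.execute(event_prefix ++ [:exception], %{duration: duration}, metadata)
        # re-raise so the caller still sees the original failure
        :erlang.raise(kind, reason, __STACKTRACE__)
    end
  end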

@arkgil (Contributor) commented Jan 15, 2020

I would say configuring metadata is good enough

I agree, that's the most useful measurement when creating spans - I'd say that anything else could go through metadata if needed. Or, one could always emit start, stop, error manually.

For this to become useful, we also need an event consumer which works with this convention by default - I assume that it's Telemetry.Metrics.span from beam-telemetry/telemetry_metrics#53, right?

@mcrumm commented Jan 15, 2020

Hi everybody! I'm just catching up on all the latest discussion, but conceptually I really like this -- it makes for a good entrypoint for the basic measurements, and provides some conventional wisdom around event names.

should stop be emitted in case of error?

I would agree no, as it simplifies the metrics definitions if you can differentiate between successes and errors at the event.

@binaryseed (Contributor)

Even if this helper function isn't used, I like having a standard to follow, especially with the distinct error event.

@josevalim (Contributor, Author)

For this to become useful, we also need an event consumer which works with this convention by default - I assume that it's Telemetry.Metrics.span from beam-telemetry/telemetry_metrics#53, right?

Correct. We also need a way to auto-discover spans which is the other conversation.

@tsloughter (Contributor)

start_span would be safer than start. Or is the EventPrefix supposed to include something about this being a span event?

@josevalim (Contributor, Author)

@tsloughter once again we are in a pickle because we have kind of standardized on start/stop by using it in Plug, Phoenix, Absinthe, etc.

At the same time, I don't think we should call it start_span/stop_span because the events in isolation can also be used. For example, I can do counter([:my_app, :request, :start]), while counter([:my_app, :request, :start_span]) would be a bit weird.

@tsloughter (Contributor)

Ah, ok. I do think counter([..., start_span]) makes sense: it's a counter that counts how many times a span (operation) is started :). But if start is already in common usage then it isn't worth changing.

@GregMefford

I was going to suggest that we always have start and stop events, with the error event being optional and not carrying the same metrics as the stop event. It's more like an annotation that an error occurred during the span: the emitting library would still have "control" of the process and might do some more work before it officially sends the stop event and "returns control" to the parent span.


I was trying to work out for myself how we even expect library authors to be able to send metrics in either the stop or error events. Imagine we want to calculate how many bytes were transferred in some operation that might crash, so we follow the recommended try/catch/after structure:

  def work do
    try do
      bytes = 0

      if :rand.uniform(2) == 1 do
        raise "Boom"
      end

      bytes = 100
      IO.inspect(bytes)
    catch
      kind, reason ->
        IO.inspect({kind, reason})
        IO.inspect(__STACKTRACE__)
        # bytes is not in scope here
    after
      # bytes is not in scope here either
      :ok
    end
  end

Am I missing something about how folks are planning to use these events as a transport for measurements?

@GregMefford

(sorry for the duped comments - GH was acting weird and saying it didn't post so I tried it again)

@hauleth (Contributor) commented Jan 19, 2020

I agree with @GregMefford that the error event should be independent from start and stop. IMHO it makes more sense. Alternatively, the stop event could carry the error in its metadata when one happened, instead of a completely separate event.

@josevalim (Contributor, Author)

To be clear, my point is that in case of errors we won’t be able to emit the same metadata as in case of success for stop events. For example, Ecto includes the number of returned results and this number would be unavailable in case of stop emitted after errors. We could use zero, but that would skew numbers.

In a nutshell, my concern is how to always emit stop events without introducing inconsistencies on the metadata, which may skew the overall aggregated data.

It is a technical problem which I don’t know how to solve.

@garthk commented Jan 20, 2020

Could we avoid the hard decisions about which single choice is more confusing if we called :telemetry.execute/4 more than once at the tail end?

  • start at the head
  • stop and stop_normal at the tail when we return a value
  • stop and stop_exception at the tail when we crash

That'd support strong guarantees around the shape of the measurements and metadata on start and stop_normal, for telemetry consumers needing such. It'd also let those interested only in duration and status simply hook up start and stop.

Zooming back out, I'd much appreciate this API also supporting measurement control e.g. for row count.

@GregMefford

@josevalim I'm wondering whether you'd even be able to emit the measurements and metadata on success, as shown in the code snippet in my previous comment. You'd want to emit the stop tracing hook in the after block, but you don't have access to the metrics from inside the main function body at that point, unless I'm missing something.

The way we handle that with tracing today is that we can add metadata for a span anytime before the stop, so if we want to attach some metrics or other information (for example, number of rows returned from the database), we do it from the main function body where they're in-scope and then call stop in the after block.

I think the technical solution in :telemetry framework we're talking about would be to simply not have metrics on the same events that are used for tracing, but rather use separate metrics events using whatever naming convention makes sense for the library (e.g. phoenix.endpoint.call with the metrics/metadata that are currently returned in phoenix.endpoint.stop, or firing a separate event for each particular metric like Hackney does). This is of course a breaking change for Phoenix because they're currently on the stop event, but I don't see an easy way around that if we want to have this naming convention for tracing and also attach metrics to it. You could still emit a duration because it can be easily calculated by the time delta from the start to stop, regardless of what happens in the try...end block, but not things like count of DB rows returned. It wouldn't be a breaking change for Ecto, though, because their metric suffix is :query anyway, so they'd just need to add :start and :stop events to the existing metrics for tracing.

This proposal also solves the issue with inconsistent aggregations you mentioned, because then it's up to the library to decide whether to emit the metrics event on error or not. For example, if there was an error while fetching from the DB in one out of 10 queries which normally returns 100 rows, what's the minimum/p99 number of rows returned by those query attempts? The library needs to decide whether the failed attempts count as having returned zero rows, or just don't count as attempts at all (I think you're saying that today, Ecto doesn't emit metrics at all on errors). With tracing though, you need to stop what you started or else it messes up the whole stack, so you don't get to choose not to account for that attempt from a tracing standpoint, but if you have separate metrics events, you can choose whether to fire it or not in case of errors.

@GregMefford

I would love for there to be a better option proposed, but one way to work around the breaking change on the Phoenix side would be to keep emitting the metrics in the current phoenix.endpoint.stop event, but add an additional set of events specifically for tracing, like phoenix.tracing.{endpoint,controller,view}.{start,stop}

@josevalim (Contributor, Author)

You'd want to emit the stop tracing hook in the after block

I don't think we need to emit the stop event on after. If you need to include what happens in after (which is not very common in the first place), you can always wrap it in another try/catch. For example, your code could be written like this, where actual_work is the work function you provided above.

  def work do
    try do
      instrument_start(0)
      actual_work()
    catch
      _, _ ->
        instrument_error(error_metadata)
        instrument_stop(no_metadata_here_sad_face)
    else
      result ->
        instrument_stop(result)
    end
  end

So it is very feasible technically to have metadata in the stop event. It is not an actual limitation.

I am really worried that now we are proposing for any "span" to have at least 4 different events: start, stop, error, and one extra event for the metadata. Telemetry is relatively fast but if we are not careful, those will eventually slow things down. To me it sounds much simpler to say a trace ends either with stop or error, for the reasons below:

  1. we reduce the amount of required events
  2. as shown above, emitting metadata on stop does not reduce the quality of the spans
  3. having to listen to both stop and error makes it slightly harder for reporters, but I would rather push the complexity to reporters (which are maintained by dozens of people) than to telemetry users (of which there are many more, and whose number we want to grow)

@GregMefford

Ok, that works for me. 👍

What do you think about @hauleth's proposal of just putting error metadata in the :stop event instead of having a separate :error event? I think it could be nice to only have to think about two types of events, but it might be easier to just pattern match on the :error atom instead of matching on the specific shape of the metadata. I think I'm leaning more toward having the :error event replace the :stop event when there's a crash. This "makes sense" to me because exceptions break the normal flow of execution, which "explains" why you get something else instead of a :stop.

@josevalim (Contributor, Author)

What do you think about @hauleth's proposal of just putting error metadata in the :stop event instead of having a separate :error event?

I think it has the same issues. Sometimes stop will have :error metadata, sometimes it won't, and when it has the :error metadata, it won't have the usual success metadata either.

We also have the issue of distinguishing between "errors" and "exceptions". For example, an HTTP client may return {:error, _}, so that would be a :stop and not an :error. So I think we should keep :error exclusively for "this blew up and you won't have the information you usually would - you won't have a stop either". Does that make sense? It is meant to be triggered when we indeed can't submit the stop.
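
To make that distinction concrete, here is a hedged sketch assuming the proposed span helper; HTTPClient, the event prefix, and the metadata keys are made up for illustration. Because an error tuple is a normal return value, this still ends with a :stop event:

  def request(url) do
    :telemetry.span([:my_app, :request], %{url: url}, fn ->
      case HTTPClient.get(url) do
        {:ok, resp} = ok -> {ok, %{status: resp.status}}
        # an error tuple is still a normal return, so :stop is emitted here too
        {:error, reason} = err -> {err, %{error: reason}}
      end
    end)
  end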

@GregMefford

Yes I agree - returning an error from an API is very different than raising an exception and I think the :error event should only be for the latter.

@bryannaegele (Contributor)

I'm a little concerned with conflating :error with exceptions. I could be proven wrong, but I feel like this could cause confusion since that atom is an idiomatic normal return value.

As to the proposal @josevalim put out, I feel like this pattern covers the majority of use cases. This discussion is leaning towards prescribing patterns that only suit tracing and make an assumption that OTel will be the one and only way that people get metrics. Currently, those are all tied to the :stop event which this pattern covers.

The only question I have is: shouldn't that reraise?

@josevalim (Contributor, Author)

It will definitely re-raise. Any suggestions on another name for :error? :error actually means only one failure type in Erlang. Do span technologies or OTel use a specific name for errors/failures/etc.?

@tsloughter (Contributor)

There is currently a "status" field on a span. There is a long discussion about status codes, severity, etc. here: open-telemetry/opentelemetry-specification#306, which resolves to punt until 0.4 of the spec.

@bryannaegele (Contributor)

I've seen the usage of "error" in OpenTracing, but that has a specific meaning in those frameworks. Our languages and idioms use that word in a particular way. I'm fine with bringing outside influences into our reporters where those authors have specific knowledge. For the average library author, is it useful to put that burden on them, as well? How would a user of a library interpret seeing that event in the documentation and looking at source code?

Are there any arguments against using :exception which doesn't seem to have any potential for confusion?

@josevalim (Contributor, Author)

exception has a meaning in Elixir and therefore can be potentially confusing too. We could use fail or abort?

@arkgil (Contributor) commented Jan 20, 2020

fail sounds good, as it's none of the failure types in Erlang so it can cover all of them. abort to me looks more like the operation was aborted deliberately.


This discussion is leaning towards prescribing patterns that only suit tracing and make an assumption that OTel will be the one and only way that people get metrics

@bryannaegele why do you think it's mostly suited for OTel for metrics?

@bryannaegele (Contributor)

My argument against error is that it has a specific meaning and we don't conflate exceptions with errors. An exception event emitted in a catch would feel natural to me.

@bryannaegele why do you think it's mostly suited for OTel for metrics?

Not suited, just that we're framing discussions around the standard events in the context of tracing which is a very advanced concept to the average user. Tracing isn't even a standard practice at most large corporations.

I just want to ensure we keep the number of events as small as possible (3 tops) and approach adding support for tracing on top of the existing metrics support we already have.

@GregMefford

Can we settle the last loose end of the discussion on naming conventions: start, stop and (optionally) fail?

I think the issue with the exception name is that it may not always result from an actual Elixir Exception, and abort looks intentional rather than the result of a crash.

@josevalim (Contributor, Author) commented Feb 14, 2020

I am happy with "fail", but I believe, for all the reasons stated previously, our convention should be:

  • A "start" is emitted at the beginning of every span
  • Then ONE OF "stop" and "fail" will be emitted:
    • "stop" is emitted if the function did not throw, error or exit
    • "fail" is emitted if the function did throw, error or exit
  • If the process crashes due to a link, then none of "stop" or "fail" may be emitted
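
For illustration, a consumer following this convention might look like the sketch below. The [:my_app, :work] prefix and the "fail" spelling are placeholders for whatever we settle on, not an existing API:

  defmodule MyApp.SpanHandler do
    def attach do
      events = [
        [:my_app, :work, :start],
        [:my_app, :work, :stop],
        [:my_app, :work, :fail]
      ]

      :telemetry.attach_many("my-app-span-handler", events, &__MODULE__.handle_event/4, nil)
    end

    # start carries system_time; stop/fail carry duration (native units)
    def handle_event([:my_app, :work, :start], _measurements, _meta, _config), do: :ok

    def handle_event([:my_app, :work, :stop], %{duration: duration}, _meta, _config) do
      IO.inspect(duration, label: "span duration")
    end

    def handle_event([:my_app, :work, :fail], _measurements, %{kind: kind, reason: reason}, _config) do
      IO.inspect({kind, reason}, label: "span failed")
    end
  end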

PS: unfortunately I could not attend the meeting yesterday, so if I missed something I apologize.

@bryannaegele (Contributor)

I'm fine with "fail".

"fail" is emitted if the function did throw, error or exit

What constitutes an error? Any return that isn't a value, ok, or ok tuple?

In the case where we're intercepting an exception and reraising, we have access to the stacktrace. If we're going to raise, we don't have access to the stacktrace at that point, correct?

@josevalim (Contributor, Author)

"throw, error or exit" in this case refers to the throw, error, and exit atoms you can handle inside the catch part of a try. I.e. they are non-local returns.

In the case where we're intercepting an exception and reraising, we have access to the stacktrace.

Yes.

If we're going to raise, we don't have access to the stacktrace at that point, correct?

We are never going to raise ourselves. Either the user-supplied function raises or nothing is raised.
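
For reference, a sketch of the three classes a catch clause can see (do_work/0 is just a placeholder); a function that returns {:error, _} never lands here:

  try do
    do_work()
  catch
    :throw, value -> {:caught, :throw, value}    # throw(value)
    :error, reason -> {:caught, :error, reason}  # raise/1 or :erlang.error/1
    :exit, reason -> {:caught, :exit, reason}    # exit(reason)
  end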

@binaryseed (Contributor)

I made a separate issue to talk specifically about guidance around timestamp tracking and monotonic times.

@josevalim (Contributor, Author)

Hi everyone! We started getting some PRs that follow the spec here. There is only one problem. Some of them use “time” for the system time. Others use “start_time”. Shall we pick one? I think “foo.bar.start.start_time” has a bit of repetition. So we should rather pick “system_time” or just “time” for the measurement. What are your picks?

@binaryseed (Contributor)

I could get behind system_time. It has a nice direct symmetry with functions in Elixir & Erlang, so it is easy to determine the exact meaning of the measurement by looking up the docs.

@tsloughter (Contributor)

Will it be left in native units? If not, it should have a suffix like system_time_nanosecs.

@josevalim (Contributor, Author)

It will be in native, yes, so let’s go with “system_time”. I have also updated this issue description with the latest “spec”, so in case anyone wants to send a PR for this feature, it will be very appreciated!

@bryannaegele (Contributor)

system_time is my vote. @tsloughter the spec would be that it should always be native units, and we'll have control of that in this proposal.

As an aside, I haven't seen the practical usage of that string API and wouldn't mind seeing it deprecated. Users have told me they find it confusing and it doesn't result in a name that obeys guidelines for most metric related tools, so the user is required to add the event_name regardless.

@binaryseed (Contributor)

@josevalim Should we update the "spec" in the description to be more explicit about the duration being calculated via monotonic_time? I think that's non-obvious to most folks w/o experience in this stuff...

@tsloughter (Contributor)

Thought I'd mention it one more time. Something like wts that keeps monotonic and offset might be better.

With it you have a timestamp that is both usable for duration and for system time, it just happens to be a tuple instead of an integer.

Then system_time/monotonic_time can both be dropped and just timestamp be given which is a 2-tuple of the monotonic time and the offset.

Telemetry can then just add functions for dealing with them, e.g. telemetry:duration(Timestamp1, Timestamp2), telemetry:to_system_time(Timestamp), etc.
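
A rough sketch of that idea, with hypothetical helpers (not an existing telemetry API): carry the monotonic time together with the time offset, and derive both duration and system time from the pair:

  defmodule SpanTimestamp do
    # a timestamp is {monotonic_time, time_offset}, both in native units
    def stamp, do: {System.monotonic_time(), System.time_offset()}

    # duration between two timestamps, in native units
    def duration({mono1, _offset1}, {mono2, _offset2}), do: mono2 - mono1

    # system-time equivalent of a timestamp
    def to_system_time({mono, offset}), do: mono + offset
  end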

@bryannaegele (Contributor)

@tsloughter I didn't think monotonic_time was a thing anymore as a measurement that gets emitted. It's just used in the wrapper for calculating the duration.

@tsloughter (Contributor)

@bryannaegele aaah, a duration is reported. I see, nevermind :)

@hauleth (Contributor) commented Mar 9, 2020

I would say that all durations/times should have the format non_neg_integer() | {non_neg_integer(), erlang:time_unit()}, where a raw integer means that it uses the native time unit. There could be a telemetry:to_unit(Time, Unit) function that would allow conversion if someone needed that.
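
For what it's worth, a consumer can already do that conversion with the standard library; a telemetry:to_unit/2 could be a thin wrapper over this (sketch only, the value is made up):

  # duration as received in the :stop measurements, in native units
  duration_native = 123_456
  duration_ms = System.convert_time_unit(duration_native, :native, :millisecond)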

@GregMefford

Before we get past "the point of no return," I'd like to reconsider whether failure is the right word for the abnormal exit condition as opposed to the normal stop event. There are lots of kinds of failures in libraries and it seems like exception is a better term to use for something exceptional like a process crash, as opposed to a failure like {:error, :no_connection} when you try to make an API call that failed for some reason. This came up as some feedback as we started to integrate this convention in a few libraries. I just want to make sure we align on the right path before we merge a bunch of PRs within our small circle and regret it when we get broader community feedback.

I'll bring it up in tomorrow's Observability WG meeting for feedback.

@GregMefford

As we decided in the Observability WG meeting today, we will keep the same behavior discussed here, but rename the :failure event to :exception to make it more clear that this is specifically a case where the normal code path didn't execute to the normal :stop event, and it should always have a stacktrace associated.

@binaryseed (Contributor)

Where did we land regarding the convention for the "failure" / "exception" event?

The issue summary says exception is its own event, but the recent implementation over in Redix uses stop for both cases...

@josevalim (Contributor, Author)

We are going with exception. Redix must be fixed. It was probably written when things were still in flux.

@whatyouhide (Contributor)

Redix was fixed: whatyouhide/redix#167

@josevalim (Contributor, Author)

Closing in favor of PR.
