Introduce trace/computed_trace metrics #46
Comments
Thanks for getting the conversation started. ❤️ Background: To hopefully save 1000 words, I've put together a picture so we're on the same page about the data model of what we're talking about: Each box is a "span" whose size is defined by the time at which it starts and stops.
|
Thanks @GregMefford for the follow up and the fantastic image! Replies below.

`trace`

Yes! Note though that I am not yet convinced on something like this:

```elixir
metrics = Enum.concat([
  Phoenix.telemetry_metrics(),
  Ecto.telemetry_metrics(),
  :brod.telemetry_metrics()
])
```

I understand the rationale (which is to give users something that works with zero configuration), but I am struggling to balance this with giving users control over exactly which data and metadata they want to send, and with what precision. We are planning to solve this for now with docs and by making future Phoenix apps generate a pre-built structure out of the box. Regarding the function name, I am fine with either computed_trace
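As a concrete illustration of the zero-configuration idea above, here is a minimal sketch of how such a combined metrics list could be consumed. The `telemetry_metrics/0` functions are hypothetical, so the metrics below are written out by hand; `Telemetry.Metrics.ConsoleReporter` is the reporter bundled with the library, and the metric names are only examples.

```elixir
defmodule MyApp.Telemetry do
  use Supervisor
  import Telemetry.Metrics

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    # In the zero-configuration world this list would come from calls like
    # Phoenix.telemetry_metrics(); here a couple of metrics are spelled out by hand.
    metrics = [
      summary("phoenix.endpoint.stop.duration", unit: {:native, :millisecond}),
      summary("my_app.repo.query.total_time", unit: {:native, :millisecond})
    ]

    children = [
      # Prints each metric to the console as events arrive.
      {Telemetry.Metrics.ConsoleReporter, metrics: metrics}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```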
I understand it. I can't speak for Absinthe, but maybe it is hard to trace simply because the traces were not added? Regarding Ecto though, the query event is always a "sink"; there is no inner span that can help there, because it comes from a low-level library called DBConnection. I am also not sure if we want to leak the implementation details: if in the next version we decide to |
I think that is probably the case. The difficulty that I've heard people mention is that the declarative DSL they use does a lot of work at compile-time, so it's difficult for a user to insert meaningful hooks to add tracing and see what is using up the CPU time across the query process. I only bring it up because maybe it's a similar situation to Ecto's.
I think Ecto should not worry about leaking DBConnection details, but if DBConnection (or Postgrex, etc.) is the library doing the meaningful work, it should also trace itself, and Ecto would only trace things at the level it is responsible for. Here is an example using made-up module/function names, because I'm not familiar with how it actually works. My point is that what I want these traces to tell me is how it really works (at some useful level of abstraction) and how my CPU is spending its time during a request, with as few un-accounted-for gaps as possible. If you want to change the internal implementation to parallelize things, that's fine, and I probably won't mind if I see the change in my APM tool when I update the library. If I have monitoring dashboards that are looking at details like DBConnection checkout latency, then I will have to make sure I'm paying attention to changes at that level anyway, because I have no other way to make sure my mental model of the system is accurate in order to monitor and alert on useful metrics. |
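The inline example referenced above appears to have been lost from this comment; here is a rough reconstruction of the layering being described, with made-up module and event names: the outer library (Ecto-like, around `Repo.all`) and the inner library (DBConnection-like) each emit their own start/stop events, so a tracer can nest the inner span inside the outer one.

```elixir
defmodule EctoRepoSketch do
  # Made-up names: the outer library spans only the work it owns
  # (planning/casting around a Repo.all-style call).
  def all(queryable) do
    start = System.monotonic_time()
    :telemetry.execute([:ecto, :repo, :all, :start], %{system_time: System.system_time()}, %{queryable: queryable})

    result = DBConnectionSketch.execute(queryable)

    :telemetry.execute(
      [:ecto, :repo, :all, :stop],
      %{duration: System.monotonic_time() - start},
      %{queryable: queryable}
    )

    result
  end
end

defmodule DBConnectionSketch do
  # The lower-level library does the same for the part it is responsible for
  # (checkout and query execution), producing the nested inner span.
  def execute(queryable) do
    start = System.monotonic_time()
    :telemetry.execute([:db_connection, :execute, :start], %{system_time: System.system_time()}, %{})

    result = {:ok, queryable} # placeholder for the real database round-trip

    :telemetry.execute(
      [:db_connection, :execute, :stop],
      %{duration: System.monotonic_time() - start},
      %{}
    )

    result
  end
end
```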
That clarifies a lot, thanks @GregMefford. Regarding the proposal, should we just go ahead with span/1 then? Is it good enough? Or should we wait a bit? I still need to think about how we would make the above become a practice in Ecto (and how to make it in a backwards compatible way, which is the hardest part). Adding outer spans (the Ecto.Repo.all above) is relatively easy but breaking the inner parts will be a bit harder. |
Not sure if this thread is primary to the discussion, but I'm curious why people think Absinthe is hard to trace. I've added Telemetry hooks into the library itself, and we fire off Telemetry events at the start & end of various actions... This code is in [...]:

```elixir
[:absinthe, :resolve, :field, :start],
[:absinthe, :resolve, :field],
[:absinthe, :execute, :operation, :start],
[:absinthe, :execute, :operation]
```

One thing I've observed is that some people assume that their tracing data should wind up looking like their intuition of what's happening (in Absinthe, without knowing how it's implemented, this looks very much like the "tree" of the GraphQL query itself). However, what we are tracing is the code. Execution of an Absinthe GraphQL query doesn't resemble a tree - it's basically a struct that is mapped over many times in Phases... We should embrace that we are tracing the code execution - it's the whole point; we need to see exactly how it is behaving to know how to improve it. |
Hi @binaryseed, this is great news, thanks for your work! Maybe that will fully address the problem. My only concern is that we are planning to standardize on start+stop, so maybe we should change Absinthe while it is still in beta? So it would be this:
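The event list that followed here seems to have been dropped; extrapolating from the list above and the start+stop convention, it would presumably be something like:

```elixir
[:absinthe, :resolve, :field, :start],
[:absinthe, :resolve, :field, :stop],
[:absinthe, :execute, :operation, :start],
[:absinthe, :execute, :operation, :stop]
```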
Wdyt?
Duly noted! |
Oh that's good news @binaryseed! I think the issue is that people were trying to do it from outside Absinthe itself, which IMO makes less sense than doing it from inside the library, but people don't generally feel like it's in their purview to change the library in order to trace their app code. 😅 Regarding
Yes, that's the point I was also trying to get across - it's sort of different from "official metrics," the purpose is to show what is really happening, not a model of what is happening. |
Absinthe PR sent here: absinthe-graphql/absinthe#782 Regarding supporting span/2, I am not sure. Because if we add special cases, then |
@josevalim if the returned data structure defining the span is the same (i.e. in both cases it contains a complete name of a "start" and "stop" events), then I think it's okay to support this scenario. |
So maybe we can allow start_event and stop_event options. So we support it
if someone really needs to, but we don’t make it first class.
|
But the first argument is still gonna be there, which might be a bit confusing:

```elixir
span([:library, :call], stop_event: [:library, :call, :done])
```

I have a feeling that all metrics definitions in the library are a bit too complex because of the many options we have, and I'd like to avoid that in this case if possible. But I understand that if we want to make |
On the other hand... if the API is just to always specify the full name of the start/stop using these Telemetry.Metrics structs, then there’s no need for the start/stop convention, right? So in that case, we would simply have a |
We are assuming that you will have to frequently specify both `start_event` and `stop_event`. One of the goals of |
The benefit of having a standardized [...] The value is that it's a simple and standard API that isn't vendor- or use-case-specific, and could provide a zero-configuration way for users to just "measure everything" if that's what they want. It sounds like you're not sold on that idea, but I'm not clear on what the alternative docs-based approach looks like - would app developers need to copy/paste all the metrics definitions into each app they write, with the default being no instrumentation? To me, that sounds like a lot of maintenance work (e.g. when new metrics are added) when I expect that most people would end up just copying the whole list from the docs anyway. |
Note that NewRelic is all in on OpenTelemetry: https://blog.newrelic.com/product-news/observability-open-instrumentation-opentelemetry/ I know the chances are low that there won't be other tracing implementations for Elixir, but at least vendors are gathering behind OTel. I'm not sure what this actually means for Telemetry.Metrics and telemetry but, counter to what I had thought even yesterday, it might be better to standardize on the OpenTelemetry API as the "Elixir tracing API". I'm not sure what that would look like; I think it is beneficial to be able to output events, have the user only optionally include OTel, and have events from telemetry turned into spans. But any "API" I'm now thinking should be the OTel API. |
@GregMefford good point. Good news is that what you are expecting and what I am expecting are orthogonal. In this case we can definitely have our cake and eat it too: we can standardize as much as possible and still allow libraries to return their own spans. So I would still advocate for span/1 and hide the more complex API behind options. @tsloughter my concern with depending on specific tools directly is that they will undoubtedly change. The whole purpose of telemetry is exactly to be small, so we can isolate ourselves from external factors. I know where telemetry will be in 10 years, because it is just 3 modules that do not have to change; I can't say the same about OT (and it is not that I am afraid it will disappear, it is just that it will change). Also, it seems the current telemetry events can already be used for tracing, so I would propose to improve what we can in this area. |
Another thought: I realized the use of events like [...]. The way spans work in opentelemetry/census-erlang is that there is only one current span per process. When you finish a span, you are finishing the current span. Otherwise you would need to keep the span context in a variable to pass to finish manually. I bring it up because I see likely confusion if it appears you can, for example, stop the last started Ecto span with |
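To illustrate the "one current span per process" model described here, a minimal sketch (not any real library's API) that keeps the current span as the top of a per-process stack, so finishing always closes the most recently started span:

```elixir
defmodule SpanStack do
  # Illustrative only: the "current" span is the head of a stack stored in the
  # process dictionary, so finish_span/0 can never target an arbitrary span
  # such as "the last started Ecto span".
  @key :__span_stack__

  def start_span(name) do
    span = %{name: name, started_at: System.monotonic_time()}
    Process.put(@key, [span | Process.get(@key, [])])
    :ok
  end

  def finish_span do
    case Process.get(@key, []) do
      [] ->
        {:error, :no_current_span}

      [current | rest] ->
        Process.put(@key, rest)
        {:ok, Map.put(current, :duration, System.monotonic_time() - current.started_at)}
    end
  end
end
```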
I understand the desire for the simple telemetry library for emitting events and handling them. But I'm starting to think again about the overlap and duplication of work that is likely to occur, or already has in some cases -- right now mainly in the area of reporting metrics, which will have one implementation for Telemetry.Metrics and one for OpenTelemetry. |
@tsloughter well, could this be solved by a telemetry_metrics backend for OpenTelemetry? This way instead of telling people to implement a bunch of backends... we just tell them to use OpenTelemetry? We can even settle on OpenTelemetry for reporting altogether (without metrics) as long as collecting and reporting are two separate things. I just want the libraries to be able to rely on a small, almost immutable dependency as a building block. |
@tsloughter Is this a rule of OpenTelemetry as a whole, or just an implementation detail? For example, in my New Relic agent, a process itself is a span, and any number of spans can be created in that process, and they will nest themselves according to how they were called. This nesting structure is what gives spans their descriptive power, since their nesting mirrors the call stack structure of the code. It doesn't seem like we'd want to elevate a restriction like that into |
@binaryseed it may have just been confusing how I wrote it. By "current span" I mean the top of the stack, the one that will be popped and finished when [...]. I forget if this is a requirement. It is consistent with how most of the implementations work, since most create a "scope" of some sort, execute a body of code within that scope, and when that scope ends it finishes the span. You can create as many other spans within the body of that scope, but they get their own scope as well. |
For example in Java:

```java
// 4. Create a scoped span, a scoped span will automatically end when closed.
// It implements AutoClosable, so it'll be closed when the try block ends.
try (Scope scope = tracer.spanBuilder("main").startScopedSpan()) {
  System.out.println("About to do some busy work...");
  for (int i = 0; i < 10; i++) {
    doWork(i);
  }
}
```

Or what we did in opencensus-elixir:

```elixir
with_child_span "traced" do
  :YOUR_CODE_HERE
end
```

For opentelemetry-elixir this will change to be consistent with the opentelemetry API:

```elixir
start_span "traced" do
  :YOUR_CODE_HERE
end
```
|
There's a crucial piece of Tracing as a whole that I haven't seen mentioned - the trace context needs a mechanism for being passed outside the app. The most common case is to pass the trace context when making an external HTTP request, which enables connecting together a Distributed Trace. Perhaps it belongs somewhere else (probably in [...]). Each "type" of external call might have a different mechanism: HTTP has headers, but message queues, database queries, etc. all have different mechanisms for passing metadata alongside the main message. I imagine something like this, which could be as simple as a telemetry callback that returns a value:

```elixir
defmodule HttpClient do
  def request(url, body, headers) do
    extra_headers =
      :telemetry.decorate([:httpoison, :request, :headers], %{url: url, body: body, headers: headers})

    validate(extra_headers)
    do_request(url, body, headers ++ extra_headers)
  end
end
```

In the New Relic agent, for example, I'd attach to that, and inside the function I'd determine if there's a current trace; if there is, I'd generate and return a header that encodes the trace context. (Right now this is manual instrumentation that the user must do at every HTTP call site.) That's a very rough sketch, very open to other ideas, but I'm curious what folks think |
@binaryseed I agree, this needs to be addressed. I'm going through this pain at work right now. Are y'all planning to adopt the coming standard trace header? https://w3c.github.io/trace-context. This would make things very simple for everyone if we could just have the clients adopt the standard versus having to support vendor-specific sets of headers. |
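For reference, the linked trace-context spec boils down to a single `traceparent` header. A small sketch of generating one (the random IDs here are only for illustration; in a real tracer they would come from the current span context):

```elixir
defmodule TraceParent do
  # Builds a W3C trace-context `traceparent` header:
  # version "00", a 16-byte trace id, an 8-byte parent (span) id,
  # and a flags byte ("01" = sampled), all lowercase hex.
  def header(trace_id \\ :crypto.strong_rand_bytes(16),
             parent_id \\ :crypto.strong_rand_bytes(8),
             sampled? \\ true) do
    flags = if sampled?, do: "01", else: "00"
    {"traceparent", Enum.join(["00", hex(trace_id), hex(parent_id), flags], "-")}
  end

  defp hex(bytes), do: Base.encode16(bytes, case: :lower)
end

# TraceParent.header() returns something shaped like:
# {"traceparent", "00-<32 hex chars>-<16 hex chars>-01"}
```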
@binaryseed that's interesting. Let's open up another discussion for that on the |
I agree that context propagation is a critical part of tracing, and it would be wonderful to have hooks in the clients for it instead of having to wrap around them in client code or in an integration library. I haven't used it myself, but I've heard from co-workers that Tesla's middleware makes this very simple to do, for example. I think the goal of Mint is to be as simple as possible, so maybe this framework would belong in some higher-level abstraction like Mojito or something, or just an extension API that you could easily plug into OpenTelemetry via configuration instead of wrapping it manually. It's not clear to me how this would make sense to include in Telemetry. To me, that's getting closer to the point where Telemetry should just be replaced with OpenTelemetry. @josevalim my understanding is that OpenTelemetry will standardize on the trace-context spec linked above, which is the result of converging the various vendor-specific protocols that all basically worked the same way but weren't compatible due to naming differences, etc. Something else to keep in mind with context propagation is that you may not always want to propagate trace context from callers or to downstream calls (maybe you consider it proprietary and don't want to expose your trace IDs, don't want to risk visibility into your internal infrastructure, etc.). So any integrations should make it possible to opt in or opt out of tracing on particular calls as needed. |
@GregMefford very good point. In that case it is probably best to have convenient APIs for specifying those in Mint, Hackney, etc. |
Yes, opentelemetry is standardizing on trace-context. Clients having the ability to set functions that are called and return headers to include in requests would be useful. Another possibility is to standardize on an internal context propagation which clients can then be configured to propagate based on the type of client they are or implement their own, of course. So mint, hackney, etc would have a configuration boolean for propagation (likely need more complicated options, like based on path/host/whatever), and when true they call the context's serialization function for the "http" format. |
I opened an issue for discussion of context propagation (aka mutable events) beam-telemetry/telemetry#48 |
After the meeting today, my proposal is to go ahead with |
I'm thinking about the goal of having a single library to point to for users' needs, and I don't think this is going to work out. I like the idea of [...]. Context propagation is part of that, but not the only issue. One example is OpenTelemetry [...]. But basically I want to be able to have UIs like the LightStep one, where the circles aren't necessarily nodes but processes in nodes, and the latency is the time calls are taking from the other service that called it. It is all kind of hand-wavey right now, and I don't know that it'll actually end up being useful, but it is what was most recent in my mind for shit an instrumented library might want to do (and that we'll likely encourage) that they'll have to use the OpenTelemetry API directly for. The reasons against OpenTelemetry as the "blessed" way that we point everyone to are completely right, so I think we must instead simply not expect to be able to provide users with a suggestion of a single library they can use for everything. |
As for context propagation, I saw a great demo of it in .NET at KubeCon, which was also covered in this blog post: https://devblogs.microsoft.com/aspnet/improvements-in-net-core-3-0-for-troubleshooting-and-monitoring-distributed-apps/ Part of the post is explicitly about OpenTelemetry, but if you look at the section "Passing additional context", the context, including [...], is propagated across services. Breaking out context and propagation to its own library is an option. |
@tsloughter maybe I am too naive... but it seems like everything you described above could be implemented on top of telemetry? The user could say:
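The snippet that followed "The user could say:" appears to have been dropped. Presumably it was something along these lines (hypothetical: telemetry has no such function, it is the API being imagined in this comment):

```elixir
:telemetry.set_context(%{tracer_name: MyLibrary, service: "my_library_worker"})
```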
and the open telemetry handler for that library would be able to access whatever was given in set_context. Or am I missing an important requirement? |
Do you mean "user" as in the end user who is developing the final release that is deployed? If so, then, no. I'm saying there are cases that the library developer needs to do it. Like for setting the library's name for the tracer used within it or setting the name of the service in a gen_server's |
I see. I think we would be totally fine with a separate context propagation
library. Or adding it to telemetry. But as far as I know we haven’t reached
a consensus yet on how it should be implemented?
|
In an effort to wrap up this thread/issue, let's try to agree on the next steps:
I agree with @tsloughter that this simple API is unlikely to be the only library that we recommend folks use in their apps and libraries (or else OTel likely wouldn't be as complicated as it is), but I think it should be something simple that they could use as a minimal baseline recommendation. If they find that these simple APIs don't meet the needs of their users, then they can reach for OTel instead or in addition. If you agree with all these points, please react with 👍 so that we know you have no further concerns. If you want to clarify something, please quote-reply and we can iron out the remaining details. |
We should look at which libraries have adopted the
|
Agreed.
This has two major issues:
Given those two, I would actually vote to keep the start/stop convention. |
Ah right! |
I asked this in the last meeting minutes, but I should raise it here for more public comment. How do we want to handle exception events? Is the previous OT convention of [...]? Should |
That is actually a pretty tricky question, because you need to decide whether to send an error telemetry event, raise an actual BEAM exception, or both. In the case of an actual exception, does the library have to guarantee that it has already fired a [...]? I think an |
The exception aspect is what I was trying to figure out. The original thought (possibly flawed) is that we could listen for an [...]. So touching back on the previous comment I made, my proposal would be a standard convention of:
|
Unfortunately none of stop or error are guaranteed to be emitted because the process can always terminate due to a broken link. The only way to catch those would be by having a separate process as a monitor but that may be too much overhead. So given the lack of guarantees, I would do this:
It may be a bit on the loose end but those are the minimum guarantees that we can provide. Thoughts? |
I think if we want this to work without having to manually pass around lots of tracing context, we have to have a guarantee that every start event is eventually followed by a stop event. For example: [...]
The way that we deal with this problem in Spandex (and I imagine most of the other APM tracing integrations do the same) is to encourage developers to always wrap the traced code in a [...]. The alternative I see is that you have to hang on to the context created when a library fires a |
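For concreteness, here is one way the wrapping pattern being discussed could look (names are illustrative; nothing like this existed in telemetry at the time): the start event fires up front, an error event fires from `rescue` when the wrapped code raises, and the stop event is emitted from `after`, so it is sent whenever the process itself survives.

```elixir
defmodule SpanHelper do
  # Illustrative sketch: guarantees a :stop for every :start emitted in this
  # process, with an :error event in between when the wrapped code raises.
  def span(event_prefix, metadata, fun) do
    start = System.monotonic_time()
    :telemetry.execute(event_prefix ++ [:start], %{system_time: System.system_time()}, metadata)

    try do
      fun.()
    rescue
      exception ->
        :telemetry.execute(
          event_prefix ++ [:error],
          %{duration: System.monotonic_time() - start},
          Map.put(metadata, :error, exception)
        )

        reraise exception, __STACKTRACE__
    after
      :telemetry.execute(
        event_prefix ++ [:stop],
        %{duration: System.monotonic_time() - start},
        metadata
      )
    end
  end
end

# Usage sketch:
#   SpanHelper.span([:my_app, :worker, :job], %{id: 123}, fn -> do_work() end)
```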
One thing with exceptions is that with the OTP 21+ [...]. That gets you access to the process & whatever data is inside the exception. Helpfully in |
Good call. So if we want to enforce stop, then we have two options to recommend to users:
I would prefer to recommend 1 due to simplicity. But can the tooling make any use of the error event? For example, would it be able to say that a span failed or similar? |
Yes, the idea would be that the error event would always come between a particular start/stop pair. I was asking myself whether the library author would still need to think about where they want to send an |
Should Telemetry include a wrapper function that does the |
This is where macros and anonymous functions are nice. In OTel for Erlang we offer a function version of [...]:

```elixir
start_span "span-1" do
  ... stuff...
end
```
|
Still think this is getting out of scope for what telemetry should be -- a simple events firer and subscriber. |
I lean towards your conclusion, @tristansloughter. Is it worth adding a new library under the telemetry namespace to keep a generic set of macros? Or do you think we should leave that to the libraries? Part of me thinks we should define these conventions for portability.
|
Yeah, my opinion is that if we want to recommend that people use Telemetry instead of using OTel directly, we should provide a convenience function/macro to make it much easier to do the right thing (i.e. library authors don't have to learn the hard way that they're not guaranteeing the stop event). If Telemetry is only to be a simple metrics/events-firing system, then I think it won't be suited for tracing support, and we'd need to recommend that people use OTel directly (at which point, they should probably use it for metrics as well). |
IMO this is hard because the measurements in start are different from those in stop. You have at least duration and different time values. We could standardize those, but it wouldn't be enough. For example, in Plug/Phoenix, you want to include the updated connection, status code, and whatnot on stop. So I agree with the later conclusions. I think the best for now is to add |
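To illustrate why a generic wrapper can't fill in the measurements by itself, here is a sketch in the spirit of the Plug/Phoenix example above (event names and payloads are illustrative only):

```elixir
defmodule MyApp.InstrumentedPlug do
  # Illustrative only: the :start and :stop events naturally carry different
  # measurements and metadata, which is what a library-agnostic helper
  # cannot know how to populate.
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    start_time = System.monotonic_time()

    :telemetry.execute(
      [:my_app, :endpoint, :start],
      %{system_time: System.system_time()},
      %{conn: conn}
    )

    conn = send_resp(conn, 200, "ok") # stands in for the real pipeline

    :telemetry.execute(
      [:my_app, :endpoint, :stop],
      %{duration: System.monotonic_time() - start_time},
      %{conn: conn, status: conn.status}
    )

    conn
  end
end
```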
Ah right, I forgot about various metadata being attached to the events. I’m 👍 on just defining the conventions then, and having docs about how libraries should ensure that spans are balanced via |
I have created a new issue here that reduces the scope of the discussion so we can finally add span/2: #53 |
@arkgil @bryannaegele @GregMefford @tsloughter
Tracing has been a hot topic lately, especially regarding what the conventions should be in telemetry. Although tracing would not be used by some reporters, it will likely be used by OpenCensus, APMs, Spandex and others. This is also a good opportunity to standardize on tracing conventions (which we would have to document on `trace`). Therefore I propose two new functions to be added to Telemetry.Metrics: `trace` and `computed_trace`.

`trace`

Trace is used when there are two telemetry events: `foo.bar.start` and `foo.bar.stop`. The metric will be defined with the event prefix:
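The definition that followed seems to have been dropped; it would presumably have looked something like this (hypothetical, since the function was never added in this form):

```elixir
# Hypothetical API: a span derived from the foo.bar.start / foo.bar.stop event pair.
trace("foo.bar", tags: [:some_tag])
```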
`computed_trace`

A computed trace is a single event with a duration that will be retrofitted as a trace. This would be useful for projects like Ecto. The metric will receive the name of a full metric + measurement with the duration:
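Again, the example seems to have been dropped; presumably something like (equally hypothetical):

```elixir
# Hypothetical API: a single event whose duration measurement is retrofitted into a span.
computed_trace("my_app.repo.query.total_time", tags: [:source])
```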
I am not a big fan of computed trace, because it almost sounds like it is a good thing, but computed traces are almost always a hack, so it would be nice to have a name that better reflects that. `retrofit_trace` sounds way too horrible, though.
Thoughts?