-
Notifications
You must be signed in to change notification settings - Fork 166
RFC for "zero-touch telemetry" requirements #5
Conversation
sigh... I got a new laptop last weekend and need to fix my Git config (re the CLA failure). Will do... time passes... done. |
As mentioned on Gitter I support the activity if it covers autoinjection and auto instrumentation. |
@@ -0,0 +1,57 @@ | |||
# (Open) Telemetry Without Manual Instrumentation | |||
|
|||
_Cross-language requirements for automated approaches to extracting portable telemetry data with zero source code modification._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also add injection of tracers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AloisReitbauer can you be more specific about the requirement here? I can interpret "injection of tracers" in various ways and don't want to misunderstand/misrepresent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below on specifying the trace at program startup and OpenTel being able to use this parameter to set - and load - the trace at runtime.
* Licensing must be permissive (e.g., ASL / BSD) | ||
* Packaging must allow vendors to “wrap” or repackage the portable (OpenTelemetry) library into a single asset that’s delivered to customers | ||
* That is, vendors do not want to require users to comprehend both an OpenTel package and a vendor-specific package | ||
* Explicit, whitebox OpenTelemetry instrumentation must interoperate with the “automatic” / zero-source-code-modification / blackbox instrumentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the far biggest issue right for automatic instrumentation. Unfortunately, it is a bit more complicated than just managing spans. The biggest problem we have is double instrumentation i.e. the same action is instrumented with blackbox and whitebox instrumentation - as you call it.
Detection and disabling is one thing, but maybe one instrumentation augments the information of another. How do you want to handle these situations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not planning to make augmentation a requirement for our first take on this problem. I'll add it as a nice-to-have, though, as it certainly would be... well, nice to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine with me. I think we need to get started on this. As we see more real-world use cases of double instrumentation we will also better understand how to solve the issue. Currently, there is quite some research going on at Dynatrace to figure out how to make this work best.
Components in OpenTelemetry might already solve a big part of the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that all makes perfect sense and I agree wholeheartedly with
Components in OpenTelemetry might already solve a big part of the problem.
"Components" (or "plugins" or whatever we call them) are a crucial abstraction, especially when mixing and matching various forms of instrumentation.
|
||
### Nice-to-have properties | ||
|
||
* Run-time integration (vs compile-time integration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For us this is a must have
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put this here for two reasons:
- Some other folks on the original GitHub issue (or maybe it was the google doc?) were suggesting we drop the requirement to make v1 easier to build, and
- More fundamentally, in languages like Go it's unclear how this would work without forking the compiler. So it felt odd listing this as a requirement
Certainly runtime integration is preferable, all other things being equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are discussing this with the Go team right now. Happy to bring them in.
I also believe it, there would be an easy way for dynamic tracer loading using an environment variable for the tracer to load and the OpenTel library to load this tracer. We need to define how situations are handled when a tracer is compiled in.
Generally, I would propose to only compile the code with a noop tracer and provider the actual tracer to use as a parameter.
Going a bit deep into the weeds here, but I want to clarify my point and what I would like this group to look at.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helpful, thanks for clarifying. Speaking only for myself, I feel uneasy mandating that Tracers are loaded via environment variables only, though I certainly acknowledge the benefits of structuring things that way in certain scenarios.
Re this point:
We are discussing this with the Go team right now. Happy to bring them in.
I think that might be premature, at least for this high-level multi-language RFC. For Go specifically, it would be fantastic to get them involved (and I imagine the various Googlers involved with the project can help with that, too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not mean only, but there should be a well-defined way to load via environment variables.
|
||
## Trade-offs and mitigations | ||
|
||
Approaching a problem this language-specific at the cross-language altitude is intrinsically challenging since "different languages are different" – e.g., in Go there is no way to perform the kind of runtime interpositioning that's possible in Python, Ruby, or even Java. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely true. Dynatrace has runtime instrumentation for Go :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a dig on Dynatrace, but I'm sure it doesn't work as well or as flexibly as it does in Java... so I would stand by the "that's possible in" bit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No offense was taken :-). I believe we agree on the problem. We - meaning all tracing tool providers - need to invest a lot of magic to make this work. We often use undocumented features or even (security) loopholes of platforms to make this work.
* Licensing must be permissive (e.g., ASL / BSD) | ||
* Packaging must allow vendors to “wrap” or repackage the portable (OpenTelemetry) library into a single asset that’s delivered to customers | ||
* That is, vendors do not want to require users to comprehend both an OpenTel package and a vendor-specific package | ||
* Explicit, whitebox OpenTelemetry instrumentation must interoperate with the “automatic” / zero-source-code-modification / blackbox instrumentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding interop, it'll be helpful to clarify the versioning/compatibility story open-telemetry/opentelemetry-specification#115.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reyang no argument on this at all... open-telemetry/opentelemetry-specification#115 is unresolved at the moment, so difficult to incorporate here, but certainly the versioning story (and "component"-ization story as @AloisReitbauer has brought up here) is/are a big part of the challenge for interoperable whitebox+blackbox instrumentation.
* Packaging must allow vendors to “wrap” or repackage the portable (OpenTelemetry) library into a single asset that’s delivered to customers | ||
* That is, vendors do not want to require users to comprehend both an OpenTel package and a vendor-specific package | ||
* Explicit, whitebox OpenTelemetry instrumentation must interoperate with the “automatic” / zero-source-code-modification / blackbox instrumentation. | ||
* If the blackbox instrumentation starts a Span, whitebox instrumentation must be able to discover it as the active Span (and vice versa) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the whitebox instrumentation has multiple tracers, how to define the "active span"?
Probably we need a way for the blackbox instrumentation to discover the tracers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reyang I don't know what you mean by "multiple traces" – do you mean "multiple tracers"? If so, I was of the understanding (feel free to correct me 😉) that the "multiple tracer" scenario is more of a "tracer multiplexer" where there would still be only a single active span.
Probably we need a way for the blackbox instrumentation to discover the tracers?
Almost certainly, yes, though IMO that's "an implementation detail," albeit an important one. I.e., I wouldn't call it a "requirement", just something that we'll need to do in order to satisfy the requirements (if that distinction makes sense).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks. (and you're right, I've made a typo, it is "multiple tracers", I've updated the comment.
* Explicit, whitebox OpenTelemetry instrumentation must interoperate with the “automatic” / zero-source-code-modification / blackbox instrumentation. | ||
* If the blackbox instrumentation starts a Span, whitebox instrumentation must be able to discover it as the active Span (and vice versa) | ||
* Relatedly, there also must be a way to discover and avoid potential conflicts/overlap/redundancy between explicit whitebox instrumentation and blackbox instrumentation of the same libraries/packages | ||
* That is, if a developer has already added the “official” OpenTelemetry plugin for, say, gRPC, then when the blackbox instrumentation effort adds gRPC support, it should *not* “double-instrument” it and create a mess of extra spans/etc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the developer took whitebox approach and configured a specific exporter/propagator? Should the blackbox instrumentation modify/tweak it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what is most likely going to happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is naive of me, but my hope (??) was that both the blackbox and whitebox OpenTelemetry code would be portable and non-vendor-specific. I.e., that the choice of exporter/propagator should not be relevant here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally - yes. Our initial research of double instrumentation and potential issues indicates that this is anyways a problem that needs to be solved in backend analytics.
Hi, I'm here to say on behalf of @honeycombio that we're interested in contributing the Beelines (e.g. for NodeJS Express) to OpenTelemetry, which allow for the most common language frameworks automatic hooking. It's not quite "zero touch" but it's also "add one library dependency" rather than add manual spans everywhere. Is this something you'd like me to file a separate RFC for, or should we also discuss it here? |
@lizthegrey this is great to hear! Is there a technical writeup (can just be bullets / etc, nothing formal) describing the design goals for Beelines? Would be particularly interested to understand how they would translate to a Span-oriented worldview. I think the various conversations would be easier to navigate if that was its own thread (separate from this RFC, I mean), but that's just me. |
Hi folks... I'm inclined to merge this in the next 48h if there are no more blocking concerns. To be clear, this RFC can continue to evolve and be PR'd against. I just don't want to leave it in this pending state forever. (cc @AloisReitbauer @reyang @lizthegrey who've all made comments) |
No strong objections to the existing agent/bytecode approach, and looking forward to see how it evolves... I'll start a separate PR for talking about |
I already wanted to ask when we are going to merge. I am fine with it. |
Yeah, I have a reminder set to merge tomorrow AM (PT) barring any additional concerns. |
#7 -- here's RFC 0003 for people interested in the alternate and complementary approach to instrumentation we propose :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the great work!
### Requirements | ||
|
||
Without further ado, here are a set of requirements for “official” OpenTelemetry efforts to accomplish zero-source-code-modification instrumentation (i.e., “OpenTelemetry agents”) in any given language: | ||
* No _manual_ source code modifications allowed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be relaxed a bit. In some languages it may not be possible to attach instrumentation without some minimal O(1) initialization code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll loosen it a bit. For languages (e.g., Java) where there is a lot of prior art for truly zero-touch instrumentation, I think that should still be our goal, though.
... There are certainly a lot of legacy JARs out there that can only be instrumented from the outside in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(PTAL)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to merge this, but will happily debate this point further in subsequent PRs to the RFC itself.
I wrote up a proposal of how auto-instrumentation could work for OpenTelemetry Ruby here. It covers the distribution, detection and installation of instrumentation, and while it's pretty specific to Ruby, some of the ideas might be applicable to other languages. |
This RFC is a light edit of "Zero-Code-Modification OpenTelemetry Instrumentation: (i.e., "Agent") A Cross-Language Spec" (a google doc). Also see open-telemetry/community#87.