-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial POC version of url-based sampler #66
Conversation
url-sampler/src/test/java/io/opentelemetry/contrib/samplers/UrlSamplerTest.java
Outdated
Show resolved
Hide resolved
I think we have to use http.url, at least as a fallback. It might be faster to check the route first, then fall back to the url. if something has already parsed out that route, we can skip re-parsing it.
Has to be just the path portion, because the host/port will be all over the place. It might be worth benchmark parsing of the URL vs. just using regex on the full url, to see which is faster/fewer allocations.
I think it should, yes. It will be a very fast enum comparison to check against it. Digging into attributes will be slower.
My gut tells me regex, but I'd like to know how fast we can make the ant-style path patching, compared to precompiled regexes. |
I feel that we can avoid route entirely given we're configuring regexes in the rules anyways. We already have enough issues with the URL vs target, etc issue to add the sometimes but not always available route to the mix :P
Parsing out an entire URL is definitely expensive, I think just finding the As for URL vs path, URL is more useful I guess since someone may need to match against host (sample all requests to some external API in a particular way would need the host I guess) - in practice it's rare though so only path could be OK (it solves the highly demanded health check sampling). I guess we could either
We have similar discussion in the instrumentation repo and it really seems like supporting both URL and triplet causes a lot of complexity. Perhaps this sampler is also a motivation to revisit the spec and consider dropping one of these - it's basically spaghetti within the spec, and we may have anticipated it only extending to spaghetti in backends, but I'm starting to see it affect instrumentation and SDK plugins like this one too, potential performance wins from having the choice of a triplet seems to not be worth it potentially.
I've seen ant-style path matching in libraries like Spring implemented by translating to regex - so that's what I did too since X-Ray uses glob patterns From what I understand a version of Java improved regex implementation such that the regex equivalent of a glob pattern performs about the same as dynamic programming instead of potentially going exponential. The advantage of the glob pattern is probably more that it prevents being able to define patterns with random regex syntax like captures / backreferences which can hurt.
While having a configuration to limit to kinds would be fine, I don't think the default to limit to any kind. There's enough use case to filter on client (external API requests, client-side-load-balancing clients which issue health checks). |
Even in the case when any given deployment wants to have both server- and client-side calls sampled I would expect them to have different patterns. Thus every given instance of url-based sampler will probably work with only one kind. |
I don't know how common it is, but there is a pattern of using client side health checks with load balancing. It means that filtering on the standard path like |
@anuraaga I am still thinking about filtering on span kind. But your examples for client side sampling is still about path, not host+path. So to be sure: do you want to match full url or just path part of it? |
@iNikem I think path is much more common - but I still think URL is more useful in general since I can think of plenty of URLs where the host part does have meaning since the host often indicates a type of API. Though realized, why do we make it a url-based sampler anyways - how about AttributeSampler which accepts a map from attribute key to patterns? Even for the path, or url, use case, it doesn't seem much harder to use than one specialized to URLs. The X-Ray sampler supports this if it's a useful reference point (it also natively supports some of the attributes only because X-Ray data model has native fields for them and does not populate them to attributes - but I don't think ours needs that). |
This does not change my question in any way. What I am trying to understand now is this: do we expect an end-user to provide "/healthcheck" or "http://example.com/healtcheck" or ".*/healtcheck" as the configuration for this sampler? Path or full url? |
It means that both something like |
I will try this out, thanks for the suggestion :) |
@anuraaga PTAL Several open questions:
|
I would suggest the same for all of them - don't do anything special, so the pattern is always just applied to the value of the key. It's easy to add |
url-sampler/src/main/java/io/opentelemetry/contrib/samplers/StringAttributeSampler.java
Outdated
Show resolved
Hide resolved
Also would love to hear suggestions for module name. |
private final SpanKind kind; | ||
private final Sampler delegate; | ||
|
||
public StringAttributeSampler(Map<AttributeKey<String>, ? extends Collection<String>> patterns, SpanKind kind, Sampler delegate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may as well keep the delegate sampler together with the pattern, UX should be straight forward with a builder StringAttributeSamplerBuilder.addRule(key, pattern, delegate)
. Easy and opens up easy setting of different sampling rates for different endpoints.
Also careful with Map since order is important, both to determine precedence and to allow a user to optimize performance by (if they want) by having common rules up front. List<SamplingRule>
for example could be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may as well keep the delegate sampler together with the pattern, UX should be straight forward with a builder
StringAttributeSamplerBuilder.addRule(key, pattern, delegate)
. Easy and opens up easy setting of different sampling rates for different endpoints.
You are making it more and more generic/complex :) I usually prefer to start with as simple as possible and then iterate based on the user feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is more generic, but is it really more complex? I think "setting a delegate Sampler for a pattern" is about as simple as it gets conceptually, it's almost like just populating a key/value set at that point. And while healthcheck is probably the #1 ask, it is common for systems to have APIs with vastly different QPS, so we can expect the new #1 ask to immediately want to be able to tweak their sampling rate :) I suspect it is worth working on this expectation, it doesn't seem to me to make things overly complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait. I realised that your suggestion is totally different from my original idea. I wanted to have "deny list": if attribute matches one of the patterns, then drop, otherwise delegate to the "main" sampler. E.g. my main sampler for the whole application is ParentBased, but I want to drop "/healthcheck". Deny list.
Your idea is more like "router": match some patterns to samplers. Which means that I have to provides AlwaysOff
sampler for "/healthcheck" pattern and I have to remember to provide "catchAll" pattern to direct to ParentBased
sampler. Router.
We probably may want both samplers, one simpler (including simpler to think about), one more powerful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, doesn't it sort of make sense in this case to start with the more powerful one? I feel it will be a more productive discussion when bringing it to the spec, to talk about the generic sampling router vs the specific use case of denying URLs.
This is also because methods on SamplerBuilder
seem like they can make up the difference between the two, e.g. deny(key, pattern) { addRule(key, pattern, Sampler.alwaysOff()) }
is a simple shortcut that wouldn't require a totally separate Sampler implementation.
url-sampler/src/main/java/io/opentelemetry/contrib/samplers/StringAttributeSampler.java
Outdated
Show resolved
Hide resolved
url-sampler/src/main/java/io/opentelemetry/contrib/samplers/StringMatcher.java
Outdated
Show resolved
Hide resolved
@@ -22,3 +22,4 @@ include(":aws-xray") | |||
include(":dependencyManagement") | |||
include(":example") | |||
include(":jmx-metrics") | |||
include(":url-sampler") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since package is samplers
this should also be samplers
I guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial idea was to have every single contrib sampler in a separate module. They then can have separate lifecycles and maintainers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm - need to imagine more what other samplers there could be. Presumably you mean different modules even for samplers that aren't vendor-specific. Perhaps a rate-limiting sampler. Realistically, would these sampler primitives really have different maintainers? I figured this is prototyping with the intention of moving into the SDK samplers package eventually as these seem pretty "core" for lack of a better term for me. Though then I guess it brings the idea of just using the tracing-incubator too - https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/tracing-incubator.
I'm -1 to the idea of this or similar important Samplers being in contrib long term - they seem like tier 1 components of the SDK to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkwatson WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this at SIG and agree that it's very likely this sort of sampler will end up in the core SDK, after this is a PoC, then spec discussion. So tracing-incubator
is also a fine place to put this sampler and a great use case for that package.
To the initial point of the thread, though, I think we can assume the maintainers for this sampler are the Java maintainers, and that this artifact is temporary as a PoC, so we wouldn't need to generalize our approach to "every single contrib sampler", as we should think of this not as a "contrib sampler" but as a PoC for a "core sampler". @iNikem I guess you can pick whether you think it's easier to work on this in tracing-incubator or a new artifact in this contrib repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once this is officially in the spec, then it should absolutely be in the core repo. Until then, tracing-incubator is acceptable if contrib doesn't work for you.
I'd like to think about having a formal decision-making process on this kind of thing.
Perhaps:
- If we think a new component should be a core component, create a spec issue for it, and do prototyping in the tracing-incubator (eventually metrics-incubator, or some other clearly
-alpha
component in the core repo). If/when the spec issue gets approved, then it can be moved to a normal "extension" or "sdk-extension" artifact, as appropriate. - If we don't think it's going to be a core component, then put it into contrib and make sure it has an owner.
The other question is...how do we prune out failed incubator experiments, or unloved contrib components?
} | ||
|
||
tasks { | ||
shadowJar { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to shadow at least at this layer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? If I want to grab this sampler and add it as an extension to my agent, then I need a full shadowed jar, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well we definitely don't want to shade in the SDK as the agent distro will be bringing it in. This sampler doesn't seem to have any dependencies that we want to shade.
Anyways in your issue you also mention "At this moment I don't propose nor plan any auto-configuration support for this sampler." :P Let's just implement this as we would a normal Java library for now.
private final SpanKind kind; | ||
private final Sampler delegate; | ||
|
||
public RuleBasedRoutingSampler(List<SamplingRule> rules, SpanKind kind, Sampler defaultDelegate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this package private and add builder(SpanKind, Sampler)
static method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to hide this constructor? There is no actual requirement to use builder, isn't there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry missed commenting on this. For libraries we expose as little API as possible to reduce surface. It there is a useful shortcut we could expose it but this doesn't seem like a shortcut compared to the builder. Instead, we can even hide SamplingRule itself from the API too if only having the builder.
url-sampler/src/main/java/io/opentelemetry/contrib/samplers/RuleBasedRoutingSampler.java
Show resolved
Hide resolved
url-sampler/src/main/java/io/opentelemetry/contrib/samplers/RuleBasedRoutingSamplerBuilder.java
Show resolved
Hide resolved
private final SpanKind kind; | ||
private final Sampler defaultDelegate; | ||
|
||
public RuleBasedRoutingSamplerBuilder(SpanKind kind, Sampler defaultDelegate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make package private
/** | ||
* @see RuleBasedRoutingSampler | ||
*/ | ||
public class SamplingRule { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like at least for now doesn't need to be public
Replaced by #70 |
Closes #65
Asking for the following feedback.
First, which semantic attribute should we match against?
http.url
orhttp.route
. The latter is not always available, the first has some problems, see below.Second, should we match the whole url in
http.url
or only path portion? If I configure SDK manually inside my application's code, then I don't know the hostname that my application will respond to. Thus it seems that we cannot match full url, only path portion. But this means that on every call toshouldSample
we have to parsehttp.url
attribute and extract path. This may be very expensive.Third, should narrow this sampler only to
SERVER
spans? The original request described in the issue seems to assume this.Fourth, if want to provide something more than just exact match, which pattern language should we use? Always assume
startsWith
? ant-style path matching? Regex?