Add probability sampler details #331
Basing the algorithm on trace id should not be a requirement. It's an optimization that is possible in some languages, but could be pretty expensive in others, considering that trace id is just a blob of bytes that needs to be interpreted as a number.
I think this is not a performance optimization (at least not primarily). Using the trace ID as the only input (and the same algorithm) ensures that all spans of the trace have the same sampling decision, which is required to avoid broken traces.
I would assume that the root span of a trace might use a probability sampler, and all the descendants would use the sampled bit from context, so there's only one decision, and I agree with @yurishkuro.
I'd also expect the sampler to respect the sampled bit by default, but we say here that it should be possible to configure the sampler to apply to all spans, or any span with a remote parent.
L119 actually says the default behavior should be to make a new sampling decision for spans with remote parents, but I don't know the motivation for this, and think there's a good argument for restricting this to root spans by default.
One benefit of this sampling algorithm is that lower-probability samplers always sample a subset of the traces of higher-probability samplers. Because of this, each service could set its own sampling rate and we'd still get complete traces for a percentage of requests equal to the lowest sampling rate. For example, if we have three services with sampling rates .3, .2, and .1 respectively, we'd get complete traces for .1 of requests. With independent random samplers we'd get traces for only .006 of requests.

Even if we didn't want to make it possible to change the sampling rate along the call chain, there are some clear benefits to making the sampling decision deterministic, e.g. checking that telemetry is working.
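To make the subset property concrete, here is a minimal sketch of the kind of trace-ID-based sampler being discussed. The choice to read the low 8 bytes of the trace ID as a big-endian unsigned integer is an assumption for illustration, not the spec's wording:

```python
import os

def sampled(trace_id: bytes, rate: float) -> bool:
    # Deterministic decision: interpret the low 8 bytes of the trace ID
    # as an unsigned big-endian integer and sample iff it falls below
    # rate * 2**64. Same trace ID + same rate => same decision everywhere.
    bound = int(rate * (1 << 64))
    return int.from_bytes(trace_id[-8:], "big") < bound

# Lower rates select a subset of the traces selected by higher rates,
# so services sampling at .3, .2, and .1 share complete traces for
# roughly .1 of requests.
trace_ids = [os.urandom(16) for _ in range(10_000)]
s1 = {t for t in trace_ids if sampled(t, 0.1)}
s2 = {t for t in trace_ids if sampled(t, 0.2)}
s3 = {t for t in trace_ids if sampled(t, 0.3)}
assert s1 <= s2 <= s3
```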
These are good arguments. 👍
As I understand it, @yurishkuro's complaint is that we have to convert the byte array into an int to do the comparison. FWIW an implementation could compute the bound as a byte array instead and do a left-to-right bytewise comparison (see this gist for an example in python), but it's not clear that this would be faster than converting to an int, especially if the language boxes the bytes to do the comparison.
DeterministicProbabilitySampler is a mouthful. I'd prefer PercentageSampler or RateSampler, but I think ProbabilitySampler is a fine description, and it has the benefit of having the same name in OpenCensus.

I can think of two obvious risks for people doing statistical inference on sampled traces. One is that trace IDs aren't even nominally randomly distributed: for example, if the trace ID includes a timestamp fragment, the sampled request rate might look cyclical or bursty even if the actual request rate is constant. The other is that the service itself handles requests differently depending on the trace ID. But these are bugs in the service or instrumentation, not problems of insufficient randomness.

AFAICT we don't actually need strong statistical randomness here. We just need the "random" part of the trace ID to be roughly uniformly distributed over all possible values, and not to be correlated with the stats we're monitoring (latency, error rate, etc.). The fact that PRNGs are more P than R doesn't seem unusually bad in this context, and in any case a RandomProbabilitySampler would suffer from the same problems.
How about if the sampler does not depend on the trace ID at all? There is no need to consistently select the same trace for sampling, since the sampling decision is mostly made when the root span is created.

The sampler can simply increment an atomic counter by 1 every time a root span is created. If the sampling rate is 0.01, then a trace is sampled if the count is a multiple of 100. The counter can be initialized to a random value once, so counters in different instances/processes are not synchronized. This sounds more like a RateSampler, though.
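For illustration, the counter-based idea could be sketched like this (the class name and API are hypothetical; a lock stands in for an atomic increment):

```python
import random
import threading

class CountingSampler:
    """Sample every Nth root span, starting the counter at a random
    offset so that independent processes are not synchronized."""

    def __init__(self, rate: float):
        self._n = round(1 / rate)                # rate 0.01 -> every 100th span
        self._count = random.randrange(self._n)  # random initial value
        self._lock = threading.Lock()            # stands in for an atomic counter

    def should_sample(self) -> bool:
        with self._lock:
            self._count += 1
            return self._count % self._n == 0
```

Unlike the trace-ID-based approach, two services running this sampler at the same rate would sample different, mostly disjoint sets of traces.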
I addressed this above in #331 (comment). We want to be able to make new sampling decisions for remote parent spans, which is why it's helpful to sample consistently.
Now that I read your description, RateSampler is clearly a better name for that than for the behavior I'm describing.
RateSampler would imply time dependency, like rate limiting; this is more of a RatioSampler.