Probability sampling basics for telemetry events #148

jmacd · 2021-03-06T08:02:50Z

This OTEP defines a foundation for probability sampling in OpenTelemetry.

It drafts a specification for how to encode sampling probabilities in a span, enabling statistical summarization from sampled traces using the sampler.name and sampler.adjusted_count attributes.

text/0148-sampling-probability.md

jmacd · 2021-03-11T20:16:23Z

I had an off-line meeting with @oertl, who mentioned that Dynatrace has a field named multiplicity used to convey sampling information. This is a term I had not considered, and one that I like.

paulosman

I really like this proposal, thank you!

The described approach solves two particular use cases that I think come up quite often:

Allowing a tracing system to easily estimate the size of a population (in visualizations, counts, etc)
Allowing clients (probably through tail sampling) to make dynamic sampling decisions

I've made some suggested changes, one that should make the linters happy and a couple of minor edits.

text/0148-sampling-probability.md

Co-authored-by: Paul Osman <paul@eval.ca> Co-authored-by: Steven E. Harris <seh@panix.com>

oertl · 2021-03-20T10:50:03Z

Maybe we should also consider allowing only discrete sampling rates. I'm thinking in particular of powers of 1/2 (1/2^0, 1/2^1, 1/2^2, 1/2^3, ...). In my opinion, this is not a big limitation in practice, but would have some really nice benefits:

A single byte for the exponent would be sufficient to represent the sampling rate.
The extrapolation factor (sample_count) would always be an integer. When extrapolating integer quantities such as counts, floating-point operations could be completely avoided and the estimate would be an integer as well.
Trace ID ratio based sampling would be much simpler. The sampling decision could be made simply based on the number of leading zeros (NLZ) of the trace ID. If the NLZ is equal to or greater than the exponent of the sampling rate, the span would be sampled. Only a few cheap CPU operations would be necessary for a sampling decision. Furthermore, by avoiding floating-point operations, sampling decisions would become more consistent across different platforms.
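
The leading-zeros decision described above can be sketched as follows. This is only an illustration, assuming trace IDs are uniformly random 16-byte values; the function names leading_zeros, should_sample, and adjusted_count are hypothetical and not part of any OpenTelemetry API.

```python
def leading_zeros(trace_id: bytes) -> int:
    """Count the leading zero bits (NLZ) of a trace ID given as raw bytes."""
    total_bits = len(trace_id) * 8
    value = int.from_bytes(trace_id, "big")
    # bit_length() is the position of the highest set bit (0 for value == 0).
    return total_bits - value.bit_length()


def should_sample(trace_id: bytes, exponent: int) -> bool:
    """Sample with probability 1/2**exponent.

    For uniformly random bits, P(NLZ >= exponent) == 2**-exponent,
    so no floating-point arithmetic is needed.
    """
    return leading_zeros(trace_id) >= exponent


def adjusted_count(exponent: int) -> int:
    """Each sampled span stands for 2**exponent spans; always an integer."""
    return 1 << exponent


if __name__ == "__main__":
    tid = bytes.fromhex("01" + "ff" * 15)  # 7 leading zero bits
    print(leading_zeros(tid), should_sample(tid, 7), should_sample(tid, 8))
    print(adjusted_count(10))
```

Because the decision depends only on the trace ID and an integer exponent, every participant that sees the same trace ID reaches the same decision, which is the consistency property mentioned in the comment.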

jmacd · 2021-05-04T19:27:20Z

Please take another look. @oertl I included your recipe for encoding inclusion probability as a good option. This OTEP no longer makes a specific recommendation about how to encode this information, only that when this information is encoded we know what it means.

text/0148-sampling-probability.md

oertl · 2021-07-20T16:11:20Z

Dynatrace has published a paper on partial trace sampling with a focus on the unbiased estimation from incomplete trace data https://arxiv.org/abs/2107.07703. It provides arguments for limiting the sampling rate to powers of 1/2 (see section "2.8 Practical Considerations").

jmacd · 2021-07-21T20:40:33Z

@yurishkuro Do you feel that we can merge this OTEP? I believe I addressed your concerns, and any remaining concerns or matters of opinion can be ironed out as we move on to update the specification.

The recent sampling SIG meeting, summarized in open-telemetry/opentelemetry-specification#1819, found little objection to the contents of this OTEP. We seem to have reached a consensus about the use of TraceIDRatio sampling.

jmacd · 2021-07-21T21:02:06Z

@oertl Thank you for posting your research. Partial Sampling is a fantastic addition to the state of the art, and now I understand why you've been proposing power-of-two sampling rates. The "negative base-2 logarithm" topic is mentioned in this OTEP already, and I can add more current information if that will help us merge this and move on.

I take it you would like to see the specification support a directly-encoded span probability, using this approach. It would be an unsigned integer field in the protocol containing the base-2 logarithm of the adjusted count (i.e., the negative base-2 logarithm of the inclusion probability):

0: The span's adjusted count is 1
1: The span counts for 2
2: The span counts for 4
3: The span counts for 8

So, perhaps the Span field should be named log2_adjusted_count? I will support that as an optimization, if there is a general agreement. Setting the log2_adjusted_count field to X is equivalent to setting the proposed span attribute sampling.adjusted_count to 2^X.

However, this field will not support tail sampling at arbitrary rates, which is an application with known potential and existing uses. In that sense, the proposed use of a span attribute sampling.adjusted_count seems less contentious and more flexible (e.g., supports integer and non-integer adjusted counts).
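
A minimal sketch of the relationship between the two encodings, assuming the field semantics described in this comment. The helper names are hypothetical; log2_adjusted_count is only the field name floated above, not an existing protocol field.

```python
import math


def adjusted_count_from_log2(log2_adjusted_count: int) -> int:
    """Setting the field to X is equivalent to an adjusted count of 2**X."""
    return 1 << log2_adjusted_count


def log2_from_probability(probability: float) -> int:
    """Return the negative base-2 logarithm of an inclusion probability.

    Raises ValueError when the probability is not an exact power of 1/2,
    which is the limitation for tail sampling at arbitrary rates.
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # frexp returns (m, e) with probability == m * 2**e and 0.5 <= m < 1.
    mantissa, exponent = math.frexp(probability)
    if mantissa != 0.5:
        raise ValueError("not a power of 1/2; use an adjusted-count attribute")
    return 1 - exponent


if __name__ == "__main__":
    for p in (1.0, 0.5, 0.25, 0.125):
        x = log2_from_probability(p)
        print(p, x, adjusted_count_from_log2(x))
```

A probability such as 0.1 is rejected by this sketch, whereas the attribute form can simply carry an adjusted count of 10.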

yurishkuro

> @yurishkuro Do you feel that we can merge this OTEP?

So I have two concerns:

The propagation story is not completely clear, see comments inline.
There is only one approval so far. I think this topic needs a lot more attention, especially from different vendors who already had to deal with sampling.

text/0148-sampling-probability.md

jmacd · 2021-07-22T16:40:25Z

@yurishkuro Thank you. I will address your questions with one more round of work on this PR; however, I want to highlight that the result of this OTEP is largely negative on propagating probability, and while I have given a proposed/draft specification for it, I do not expect we will move forward with a specification that involves propagating anything. We are left with a desire to complete the TraceIDRatio sampler (which is to achieve consistent pseudo-randomness) and (independently) to know when traces are complete.

I believe this OTEP has met the standard for an OTEP, discussed here. This OTEP is now too long to review and another PR is needed. The next step will be, I think, another PR to revise this document, to focus specifically on the questions you've raised. Another piece of text is needed to fully document how to count spans based on the adjusted count that is recorded in the span.

> Extrapolation from re-assembled trace is possible as long as the count is captured in the root.
> should degrade nicely into this legacy mode

I don't see any part of this proposal that would prevent legacy counting strategies. If you agree that this proposal needs more editing, please approve so we can merge it and open a new PR. Thanks!

yurishkuro · 2021-07-22T18:03:33Z

> I believe this OTEP has met the standard for an OTEP, discussed here.

Not sure I agree with that. The process we agreed on for OTEPs is that we do not capture OTEP status in the document, but use GitHub status as a proxy. That is, an approved and merged PR means an approved OTEP, which in turn means it is the official position of the project and is only pending mechanical translation into Spec changes. The link you provided does not say that a PR can be merged with the intention of revising it via another PR; it actually says the opposite, that the old PR should be closed if revisions are needed. Otherwise we are left with an official-looking document in OTEPs that does not reflect the agreements.

Concretely, if you do not believe that we should proceed with implementing probability propagation, then I would move that text into a Discussion area with pros/cons, and not include it in the normative portion that recommends the actual changes to the spec (which, incidentally, most of my comments are about since the proposed changes are inconsistent).

jmacd · 2021-07-22T18:23:18Z

If I remove the entire proposed specification section of this document, would you approve? The goal was to outline our options, and I included the specification text as an example. It states ("For example:") before that section of text. It's the least interesting part of this document to me; what it really did was show us how we do not want to propagate probability. I'm interested in updating this OTEP with a minimal summary of what we concluded and merging it, not continuing to address minor points in what is ultimately not a specification document.

yurishkuro · 2021-07-22T18:29:41Z

If you want to remove the proposed normative changes section, then it ceases to be an enhancement proposal (OTEP) imo. But I wouldn't mind merging it as it's a great read that can be referenced from other places.

jmacd · 2021-07-22T18:54:37Z

I think I will prefer to close and re-open a new PR. I will leave this open until then.

jmacd · 2021-07-23T00:29:21Z

@yurishkuro I wonder if we can salvage the bulk of this PR by removing the normative text that I had and specifying much less. My original goal with this OTEP was to specify a foundation, which is the basic idea that we can record an adjusted count when sampling to convey probability, and that adjusted count is a good way to convey that because users intuitively understand how to count but do not intuitively understand how to count inverses.

Thus, I've replaced all the text with a proposal for a new trace/semantic_conventions/sampling.md file with two attributes, sampler.adjusted_count and sampler.name. I think this is something we could all agree on, but it leaves a lot to be specified. It doesn't tell anyone how to modify the Samplers, but at least it would let users of OTLP who are not yet using OTel SDKs convey sampling information.

Originally I had proposed only an adjusted_count attribute, not the name attribute, but @oertl has shown why knowing which sampler computed the probability matters in some cases. Moreover, knowing the sampler name addresses an ambiguity that was discussed above: If the probability is not propagated then we cannot know the inclusion probability of a span written by the Parent sampler. We also should not presume the adjusted count is 1 in this case, so we have several possibilities:

Parent sampling, unknown probability:

sampler.name=Parent

Parent sampling, known probability (for 0.1 probability):

sampler.name=Parent
sampler.adjusted_count=10

TraceIDRatio sampling (for 0.1 probability):

sampler.name=TraceIDRatio
sampler.adjusted_count=10

And for no sampler, no attributes are needed. Spans carrying these attributes can be counted correctly; the only problem is Parent sampling with unknown probability. At least now we have clearly identified when the adjusted count is missing. See the replacement text here: c4c06cd
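
The counting rule implied by these cases can be sketched as follows. This is an illustrative assumption about how a consumer might interpret the two attributes, not specified behavior; estimate_span_count is a hypothetical helper, and each span is modeled simply as a dict of attributes.

```python
from typing import Any, Iterable, Mapping, Tuple


def estimate_span_count(spans: Iterable[Mapping[str, Any]]) -> Tuple[float, int]:
    """Estimate the population size represented by a set of sampled spans.

    Returns (estimated_count, uncountable), where uncountable is the number
    of spans that name a sampler but carry no adjusted count (for example,
    Parent sampling with unknown probability).
    """
    estimated = 0.0
    uncountable = 0
    for attrs in spans:
        if "sampler.adjusted_count" in attrs:
            # Known probability: the span stands for adjusted_count spans.
            estimated += float(attrs["sampler.adjusted_count"])
        elif "sampler.name" in attrs:
            # A sampler was used but its probability is unknown.
            uncountable += 1
        else:
            # No sampler attributes: the span counts for exactly itself.
            estimated += 1.0
    return estimated, uncountable


if __name__ == "__main__":
    spans = [
        {},  # no sampler
        {"sampler.name": "TraceIDRatio", "sampler.adjusted_count": 10},
        {"sampler.name": "Parent", "sampler.adjusted_count": 10},
        {"sampler.name": "Parent"},  # unknown probability
    ]
    print(estimate_span_count(spans))
```

Reporting the uncountable spans separately, rather than presuming an adjusted count of 1 for them, keeps the gap visible to the consumer.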

I will take all the discussion about how to modify Samplers to produce these attributes as well as how to optionally propagate inclusion probability into another PR. Thanks.

jmacd · 2021-07-23T16:11:48Z

> Why not always record sampler name?

The AlwaysOn sampler is (afaik) equivalent to no sampler, so I don't see any use for the name in that case. The adjusted count is 1 with or without the AlwaysOn sampler.

> avoid inventing new naming scheme here

Yes, I see this point and feel ambivalent about it. I do not see any viable uses for the existing Description other than to configure a composite sampler in the SDK. It has the appearance of something that can be logged and parsed, but I wouldn't use it.

If the sampler description is "jaeger_remote", for example, it tells me nothing useful. The piece of information that is needed is not the composite policy that was configured as a Sampler, it is the effective "leaf" Sampler that was selected. If the Jaeger remote sampler selects a TraceIDRatio policy, that's what I want to know.

For the TraceIDRatio description, it encodes a floating point probability which is also (IMO) not a great representation for conveying adjusted count. The specification talks about how much precision should be logged for example, which only adds to the confusion. If I am logging 1 in 1024 spans, which @oertl would prefer we represent using the number 10 (i.e., 1 byte to say sampling probability is 1/2^10), the TraceIDRatio description will read "0.000977" (i.e., 8 bytes to encode a number with the addition of a 0.05% error, since 1/0.000977 = 1023.5).
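
The arithmetic in this paragraph can be checked with a few lines. The six-decimal formatting is an assumption taken from the "0.000977" example above, not a statement about how any SDK renders the description.

```python
# Exact inclusion probability for sampling 1 in 1024 spans.
exact_probability = 1 / 1024

# Assumed rendering: six decimal places, as in the "0.000977" example.
description_text = f"{exact_probability:.6f}"
description_value = float(description_text)

# Adjusted count a reader would recover from the rounded description.
implied_adjusted_count = 1 / description_value
relative_error = abs(implied_adjusted_count - 1024) / 1024

# The power-of-two encoding needs only the exponent.
exponent = 10
exact_adjusted_count = 1 << exponent

if __name__ == "__main__":
    print(description_text)
    print(round(implied_adjusted_count, 1))
    print(f"{relative_error:.3%}")
    print(exact_adjusted_count)
```

The rounded description recovers an adjusted count near 1023.5 instead of 1024, while the exponent form recovers 1024 exactly.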

As a vendor, we aren't actually concerned with knowing the sampler name. As this OTEP hopes to convince us, we can count spans without anything more than an adjusted count. I had two reasons to include it here:

@oertl gave a convincing reason why it matters to know that, for example, TraceIDRatio was used as opposed to a tail-based sampling scheme.
We had identified a gap for the Parent sampler, where we may or may not know its probability. Knowing the sampler avoids any ambiguity here and leaves the door open for specifying how to propagate the Parent probability.

jmacd · 2021-07-23T21:37:33Z

See the related OTEP #168

reyang

LGTM.

jmacd · 2021-07-27T19:32:18Z

I will re-open this OTEP with a fresh PR. This is too long to review.

jmacd · 2021-07-27T19:36:32Z

This is replaced by #170.

jmacd added 4 commits March 5, 2021 23:35

Sampling basics

79c9dfe

More prior art

22abb00

Applicability

02e97fc

Edits

7301d5f

jmacd requested a review from a team March 6, 2021 08:02

jmacd added 5 commits March 6, 2021 00:45

Typos

3600e77

Recommended reading

7c66df0

0

e97eba7

Edits

9592846

Zero

2691e72

yurishkuro reviewed Mar 6, 2021

yurishkuro reviewed Mar 6, 2021

akehlenbeck reviewed Mar 7, 2021

jmacd mentioned this pull request Mar 15, 2021

Review approach & specify algorithm for TraceIdRatioBasedSampler (ProbabilitySampler) open-telemetry/opentelemetry-specification#1413

Open

paulosman approved these changes Mar 17, 2021

seh reviewed Mar 18, 2021

seh reviewed Mar 18, 2021

seh reviewed Mar 18, 2021

Apply suggestions from code review

b542d89

Co-authored-by: Paul Osman <paul@eval.ca> Co-authored-by: Steven E. Harris <seh@panix.com>

seh mentioned this pull request Apr 3, 2021

REQUEST: New membership for seh open-telemetry/community#695

Closed

6 tasks

jmacd added 2 commits May 4, 2021 12:18

Revisions based on feedback

c1ce969

Merge branch 'jmacd/sample' of github.com:jmacd/oteps into jmacd/sample

7801908

jmacd added 2 commits May 4, 2021 12:31

Edits

b9f55df

Lint

14d8653

pyohannes reviewed May 5, 2021

oertl reviewed May 5, 2021

oertl reviewed May 5, 2021

cyrille-leclerc mentioned this pull request Jul 19, 2021

[OpenTelemetry] Metrics derived from traces (Throughput, Latency and Errors) are not accurate when traces are sampled before being ingested by Elastic Observability elastic/apm#472

Closed

yurishkuro reviewed Jul 22, 2021

jmacd closed this Jul 22, 2021

jmacd reopened this Jul 22, 2021

propose less specification text: only two attributes

c4c06cd

Rename to ./trace

88d38b5

jmacd requested a review from a team July 23, 2021 19:36

jmacd mentioned this pull request Jul 23, 2021

Specify how to propagate consistent head sampling probability #168

Merged

remove <details>

3421058

reyang approved these changes Jul 27, 2021

jmacd added 4 commits July 27, 2021 12:00

typos

3587b11

refer to 168, shorten text on Parent sampler

a0012da

Remove final three sections, not necessary

0bfa686

more note

0795e72

jmacd closed this Jul 27, 2021

jmacd mentioned this pull request Jul 27, 2021

Probability sampling: Encode Span's head-adjusted count #170

Merged

jmacd deleted the jmacd/sample branch July 27, 2021 19:36
