Give samplers the span ID #3991

adriangb · 2024-04-05T21:01:47Z

There are various use cases, in particular span level sampling, that would benefit from the span ID being generated before calling the sampler.

Given that it's very cheap to generate the span ID, that it is often generated just a couple of lines of code after calling the sampler and that it was implemented that way for a while in Python at least I don't think this is a very big change. The current spec lists required arguments but does not prohibit extra arguments. Thus I believe that listing the span ID as a "recommended" or "suggested" (or however it's best worded) argument that will be required next major version in practice frees up SDKs to include it and only adds minor annoyance to ShouldSample implementations (if they need the argument they'll have to check if it was included, e.g. if span_id is not None in Python).

And it unblocks multiple use cases:

Sampling at the span level. Imagine you're doing an operation on an array of 1,000,000 items. It'd be nice to be able to sample some of the operations but not all. Currently that's not possible (unless the operation is the root of a trace, since we only sample at the trace level).
Something akin to log-levels for spans by having a sampler sample at 100% for anything above it's predetermined level and 0% for anything below.

This would not solve some other more complex requests, in particular ones such as #3867 where the idea is to not-emit a span but have it's children be emitted.

I've done a sample implementation in Python that is quite simple: https://github.com/open-telemetry/opentelemetry-python/pull/3574/files#diff-b51d60b85f22187922aa5324512b7ef69fc259a932ffb33ac9bc011cb50debcd

The text was updated successfully, but these errors were encountered:

dmathieu · 2024-04-08T07:21:36Z

Other SDKs, for example, Go already do it.
So while it's not specified, doing so is also entirely possible in Python (meaning you should be able to open a PR in the python repo with that change even before any specification change happens).

pellared · 2024-04-08T08:22:57Z

Other SDKs, for example, Go already do it.

Permalink just in case: https://github.com/open-telemetry/opentelemetry-go/blob/014c6fc3323554065532b2b260945a9d0cd62b74/sdk/trace/tracer.go#L78-L87

MrAlias · 2024-04-08T20:12:13Z

Go passes the trace ID. I do not think the span ID is provided.

pichlermarc · 2024-04-09T11:56:13Z

Without speaking for or against it:

I think this would also be a fairly simple change in JS. We generate the spanId before we get a sampling decision, we just don't pass it to the sampler.

dyladan · 2024-04-16T02:23:59Z

The issue asks for span ID not trace ID. I'm not sure I really understand the use case. Span level sampling can be done just as easily by generating a random number as it can by reading the span id (which is itself a random number), but the problem with span level sampling is that you end up with a broken trace if you cut out random spans. The log levels for spans as you describe it sounds more like an adaptive rate sampler which has been proposed before and again would require you to sample at the local root level, not the span level.

Other SDKs, for example, Go already do it. So while it's not specified, doing so is also entirely possible in Python (meaning you should be able to open a PR in the python repo with that change even before any specification change happens).

Go passes the trace ID. I do not think the span ID is provided.

Passing trace id to sampler is specified in that context is passed to sampler. Trace ID is in context anyway so it's just a convenience

adriangb · 2024-04-16T02:32:23Z

Why would you end up with broken traces? You’d just emit a non-recording span and all children spans would thus be non-recording.

jmacd · 2024-04-16T16:19:05Z

The Sampler API is missing:

Resource information
Instrumentation scope information
SpanID

I mostly agree with the claim by @dyladan ("can be done just as easily by generating a random number"). I'm sure someone can formulate a reason why consistent sampling would be useful here, in which case use of the SpanID would be nice to have. Do you need consistent sampling of Spans, @adriangb? (E.g., if you didn't care about trace completeness and wanted to sample logs and spans consistently, this is what you would want.)

adriangb · 2024-04-16T18:31:13Z

The random number question is a good one. I guess I was just thinking along the lines of doing the same thing as trace id / trace level sampling but there's a whole reasoning there related to consistent sampling across services that I have never implemented nor do I pretend to really understand. I will say that for tests I wire things up with sequentially incrementing trace / span ids so consistent sampling would be nice just for determinism. That's not a strong argument and even then you could pass a seed into the sampler or something, even if it is more boilerplate.

A random number implementation would look something like:

Example

from __future__ import annotations

import random
from typing import Sequence, cast

from opentelemetry import context as context_api, trace
from opentelemetry.sdk.trace import sampling
from opentelemetry.util.types import Attributes


class AttributeBasedSampler(sampling.Sampler):
    def __init__(self, seed: int | None = None) -> None:
        super().__init__()
        self.random = random.Random(seed)

    def set_seed(self, seed: int) -> None:
        self.random.seed(seed)

    def should_sample(
        self,
        parent_context: context_api.Context | None,
        trace_id: int,
        name: str,
        kind: trace.SpanKind | None = None,
        attributes: Attributes | None = None,
        links: Sequence[trace.Link] | None = None,
        trace_state: trace.TraceState | None = None,
    ) -> sampling.SamplingResult:
        sample_rate = cast(float | None, (attributes or {}).get('sample_rate'))
        if sample_rate is not None:
            if self.random.uniform(0, 1) < sample_rate:
                decision = sampling.Decision.RECORD_AND_SAMPLE
            else:
                decision = sampling.Decision.DROP
                attributes = None
            return sampling.SamplingResult(
                decision,
                attributes,
            )
        return sampling.SamplingResult(
            sampling.Decision.RECORD_AND_SAMPLE,
            attributes,
        )

    def get_description(self) -> str:
        return 'AttributeBasedSampler'


# usage

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

span_sampler = AttributeBasedSampler(seed=42)
provider = TracerProvider(sampler=sampling.ParentBased(
    root=sampling.TraceIdRatioBased(1),
    remote_parent_sampled=span_sampler,
    local_parent_sampled=span_sampler,
))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span('foo'):
    for _ in range(1000):
        with tracer.start_as_current_span('bar', attributes={'sample_rate': 0.001}):
            with tracer.start_as_current_span('baz'):
                pass

That said, I still feel that there's no reason to not give the sampler all of the information available. I'm not sure about the others that @jmacd mentioned but the span ID should be very cheap to provide (again it's generated just a couple of lines of code down from where the sampler is called).

dyladan · 2024-04-17T16:39:36Z

That said, I still feel that there's no reason to not give the sampler all of the information available. I'm not sure about the others that @jmacd mentioned but the span ID should be very cheap to provide (again it's generated just a couple of lines of code down from where the sampler is called).

I don't buy this argument. For one thing I don't accept the idea that generating the ID is cheap as that depends on a lot of implementation details that can't be guaranteed (some depend on things like clock, some random number generators are more expensive than others, allocations aren't free). Span creation is something that is done a lot often in tight loops and even "cheap" things can become expensive. "No reason not to" is not itself a reason to do something.

dyladan · 2024-04-17T16:43:05Z

E.g., if you didn't care about trace completeness and wanted to sample logs and spans consistently, this is what you would want

This is an interesting idea that might be more compelling IMO

adriangb · 2024-04-17T16:57:59Z

I may be missing something but where does the "not caring about trace completeness" (which I assume refers to the scenario where you end up with a trace that is missing spans in the middle) come into play? If you run the example I gave above (using random the random number approach) there are no "broken" traces. Or are we using "incomplete trace" to refer to a trace that is not "broken" but has had a tree of spans pruned from it?

I think the really important use case is something like this:

with span('processing_items'):
    for item_to_process in range(1000000):
        with span('processing_item'):
            # do work

Currently, as far as I know, you have to either use trace sampling and get few traces with all 1000000+ spans in them or just get all of the data and deal with it. The idea is that you can sample at the loop level so that you get a representative sample of the work being done in the loop.

dyladan · 2024-04-18T17:49:37Z

I still don't see how this requires span id. Seems like you could just do rng() % n to accomplish exactly the same goal.

adriangb · 2024-04-18T17:53:02Z

Yeah that's a fair point. Using a random number won't easily give you consistent, deterministic sampling, which I think is valuable but not a hill I'm willing to die on.

I still think it's worth clarifying what @dyladan and @jmacd are referring to by "trace completeness"

dyladan · 2024-04-18T18:00:26Z

"trace completeness" means exactly what you would expect. Pruning branches off the tree changes its meaning and breaks a lot of assumptions for analysis. In some cases it is what you want, but it complicates things quite a bit.

If you want to sample every nth span the span ID still doesn't help you as spans are generally randomly generated. Using an incremental counter for span ids specifically to affect your sampling is an anti-pattern in my opinion. Would be much simpler to have the sampler just maintain a counter that resets to 0 when it gets to n and wouldn't create dependencies between your span id generation algorithm and your sampling strategy.

i = 0 // counter
n = 4 // sample every 4th span
func shouldSample() {
  i = i % n; // counter is bounded and resets
  return (n == 0) // true every nth span
}

adriangb · 2024-04-18T18:10:07Z

If you want to sample every nth span the span ID still doesn't help you as spans are generally randomly generated

That's not what I want. I want to be able to get some information from the execution without being flooded with data. Random sampling achieves that, just as it does for traces. That would be one way of achieving it but isn't necessarily the only way. I think this is no different than the goals of sampling traces fundamentally and so it makes sense to me to use the same approach (with the caveat that if you don't care about determism you can use a random number).

Pruning branches off the tree changes its meaning and breaks a lot of assumptions for analysis. In some cases it is what you want, but it complicates things quite a bit.

Unless I'm missing something, I don't think this is true. Consider the valid program:

with span('parent'):
    if some_non_deterministic_thing_happening():
        with span('child'):
            ...

That's a super common pattern. The non-determinism could come from a network error, etc. Not every parent is going to have a child, sampling or not. You already have to have awareness of how the code is executing when you're doing analysis, this is just another such case. So yes you can't assume every parent will have a child but you can e.g. measure the average duration of a child or check if there are outliers from the sampled population.

jmacd · 2024-04-19T17:42:43Z

I've begun drafting a V2 sampler API, in order to develop support for OTEP 250. I think it is worth including the SpanID in the sampling parameters, for example because of #2918 -- if the number of Span Links becomes too great, they will need to be sampled using the SpanID.

Here is a definition for completeness, by the way.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md#trace-completeness

jack-berg · 2024-04-19T17:53:46Z

@jmacd I think the current interpretation of the triage:accepted:ready is that the issue is has a small / simple scope, and is ready to be worked on by anyone who volunteers. Is that the case with this issue?

cijothomas · 2024-04-26T01:52:47Z

Sharing the original PR for reference, where Span ID was explicitly removed as a sampler input : #621

adriangb added the spec:trace Related to the specification/trace directory label Apr 5, 2024

danielgblanco added the area:sampling Related to trace sampling label Apr 8, 2024

danielgblanco mentioned this issue Apr 8, 2024

Document issue triage labels #3990

Merged

austinlparker added the triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted label Apr 9, 2024

jmacd added triage:accepted:ready Ready to be implemented. Small enough or uncontroversial enough to be implemented without sponsor and removed triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted labels Apr 19, 2024

jmacd self-assigned this Apr 24, 2024

jmacd mentioned this issue Apr 24, 2024

Project Tracking: Sampling #4012

Open

kalyanaj mentioned this issue May 10, 2024

Sampler API V2: What, Why, and How? #4044

Open

austinlparker added this to 🔭 Main Backlog Jul 16, 2024

austinlparker moved this to Spec - Accepted in 🔭 Main Backlog Jul 16, 2024

austinlparker removed the triage:accepted:ready Ready to be implemented. Small enough or uncontroversial enough to be implemented without sponsor label Jul 16, 2024

austinlparker added the triage:accepted:ready-with-sponsor Ready to be implemented and has a specification sponsor assigned label Jul 16, 2024

austinlparker moved this from Spec - Accepted to Spec - In Progress in 🔭 Main Backlog Jul 16, 2024

jmacd mentioned this issue Oct 3, 2024

Randomness requirements following W3C Trace Context level 2 #4162

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Give samplers the span ID #3991

Give samplers the span ID #3991

adriangb commented Apr 5, 2024

dmathieu commented Apr 8, 2024 •

edited

Loading

pellared commented Apr 8, 2024

MrAlias commented Apr 8, 2024

pichlermarc commented Apr 9, 2024

dyladan commented Apr 16, 2024

adriangb commented Apr 16, 2024

jmacd commented Apr 16, 2024

adriangb commented Apr 16, 2024

dyladan commented Apr 17, 2024

dyladan commented Apr 17, 2024

adriangb commented Apr 17, 2024

dyladan commented Apr 18, 2024

adriangb commented Apr 18, 2024

dyladan commented Apr 18, 2024 •

edited

Loading

adriangb commented Apr 18, 2024 •

edited

Loading

jmacd commented Apr 19, 2024

jack-berg commented Apr 19, 2024

cijothomas commented Apr 26, 2024

Give samplers the span ID #3991

Give samplers the span ID #3991

Comments

adriangb commented Apr 5, 2024

dmathieu commented Apr 8, 2024 • edited Loading

pellared commented Apr 8, 2024

MrAlias commented Apr 8, 2024

pichlermarc commented Apr 9, 2024

dyladan commented Apr 16, 2024

adriangb commented Apr 16, 2024

jmacd commented Apr 16, 2024

adriangb commented Apr 16, 2024

dyladan commented Apr 17, 2024

dyladan commented Apr 17, 2024

adriangb commented Apr 17, 2024

dyladan commented Apr 18, 2024

adriangb commented Apr 18, 2024

dyladan commented Apr 18, 2024 • edited Loading

adriangb commented Apr 18, 2024 • edited Loading

jmacd commented Apr 19, 2024

jack-berg commented Apr 19, 2024

cijothomas commented Apr 26, 2024

dmathieu commented Apr 8, 2024 •

edited

Loading

dyladan commented Apr 18, 2024 •

edited

Loading

adriangb commented Apr 18, 2024 •

edited

Loading