[processor/tailsampling] invert_match precedence #34085

hyang023 · 2024-07-16T00:27:13Z

Component(s)

processor/tailsampling

What happened?

Description

a recent fix went in to address #33656 and align with expected behavior documented at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md
When there's an "inverted not sample" decision, the trace is not sampled

however, in the same README, the tail_sampling.policies[] example policy list includes this backwards compatibility policy

      {
        # Rule 1: use always_sample policy for services that don't belong to team_a and are not ready to use tail sampling
        name: backwards-compatibility-policy,
        type: and,
        and:
          {
            and_sub_policy:
              [
                {
                  name: services-using-tail_sampling-policy,
                  type: string_attribute,
                  string_attribute:
                    {
                      key: service.name,
                      values:
                        [
                          list,
                          of,
                          services,
                          using,
                          tail_sampling,
                        ],
                      invert_match: true,
                    },
                },
                { name: sample-all-policy, type: always_sample },
              ],
          },
      },

and using a collector with this fix means that service.name values [list, of, services, using, tail_sampling] no longer can have other sampling policies applied to them since their InvertNotSampled condition will override any applicable Sampled or InvertSampled

Is this change in behavior intended? And if so, what is the new recommended way to define a backwards compatibility policy?

Steps to Reproduce

use the tail sampling processor
define a set of tail sampling policies including a backwards compatibility policy

Expected Result

Expect to be able to define sampling policies for service.name values [list, of, services, using, tail_sampling]
and always_sample for any other service.name not in this list (or as specified in other policies)

Actual Result

no traces for service.name values [list, of, services, using, tail_sampling] can be exported since all are excluded as a result of InvertNotSampled

Collector version

0.104.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

github-actions · 2024-07-16T00:27:28Z

Pinging code owners:

processor/tailsampling: @jpkrohling

See Adding Labels via Comments if you do not have permissions to add labels yourself.

savar · 2024-07-30T11:09:52Z

Another (possible) issue with invert_match precedence is (or I am just not able to do that), to do something like:

probabilistic-ally catch http.route == /health for 0.1%
probabilistic-ally catch http.route != /health for 20%

The 1st one works easily, but the second one would need to not apply to /health and therefore an invert_match seems to be required, which, even in an and type will create a do not sample decision which will override the 1st rule.

I think the issue is the same as the backwards compatibility example.

github-actions · 2024-09-30T03:36:19Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

processor/tailsampling: @jpkrohling

See Adding Labels via Comments if you do not have permissions to add labels yourself.

avrong · 2024-10-01T13:26:58Z

The issue @savar mentioned is present for my use case.

What I want to do is a sampling of grpc requests traces that would depend on condition. It is useful when an application generates lots of traces with one of requests. In this one we're not really interested in, but it is still better to be shown in smaller scale. The problem is that large amount of traces takes more disk space and has impact on Grafana dashboard requests and web UI.

It worked like this: if rpc.method == methodA, then do probabilistic sampling. Otherwise, on all other requests, don't do anything and sample them all. It involved a hacky way of doing and policy with two branches -- one with string_attribute match in one case, and other one with inverted string_match on the same list (with anchor on the match list). With previous behavior it was possible, however after v0.103.0 I couldn't find a way to do that and did not manage to implement it with regular expressions or ottl conditions.

  tail_sampling:
    policies:
      # Applies probabilistic sampling to set of methods
      - name: probabilistic-to-listed-methods
        type: and
        and:
          and_sub_policy:
            - name: spans-with-listed-methods
              type: string_attribute
              string_attribute:
                key: rpc.method
                # List of methods to which probabilistic sampling applies
                values: &probabilistic-methods
                  - methodA
            - name: probabilistic-sampling
              type: probabilistic
              probabilistic:
                sampling_percentage: 5
      # Bypasses other traces
      - name: bypass-others
        type: and
        and:
          and_sub_policy:
            - name: other-spans
              type: string_attribute
              string_attribute:
                key: rpc.method
                values: *probabilistic-methods
                # Accept all except ones in list
                invert_match: true
            - name: catch-all
              type: always_sample

It would be great to have ability to do similar conditions again, probably in some other form.

karinhawk · 2024-11-06T10:00:21Z

Our team ran into this issue during an upgrade of the collector where our sampling stopped working because of this. The only way so far we've managed to find a workaround for the backwards compatibility policy is a bit horrible but if anyone is in a pinch it works!

Because "inverted not sample" policies take precedence above everything else, don't use it for the backwards compatible policy, instead use a "not sample" decision where you enable regex on an attribute to enable sampling e.g. sampling:true with true being a negative match.

The processor uses grep/re2 syntax, meaning it doesn't support the usual way of negative matching - SO post
However, the correct regex can be generated using this website made by the commenter in that post.

For negative matching on the attribute so services with that attribute in their resource attributes can then be sampled later while services with that value as false are sampled 100%, it looks a little like this:

     [ 
      {
        name: 'backwards-compatibility-policy',
        type: 'and',
        and: {
          and_sub_policy: [
            {
              name: 'services-using-tail_sampling-policy',
              type: 'string_attribute',
              string_attribute: {
                key: 'sampling',
                values: [
                 '^([^t]|t(t|r(t|ut))*([^rt]|r([^tu]|u[^et])))*(t(t|r(t|ut))*(ru?)?)?$',
                ],
                enabled_regex_matching: true
              },
            },
          ],
        },
      },
      ...sampling policies
    ]

It allows you to replicate the behaviour from the backwards compatibility policy in the documentation examples, which is great, even though it's a little ugly.

EDIT: Another way to tackle this problem without being so hacky, is to include the above sampling attribute in instrumentation in services you wish to sample. Then (if you are using the loadbalancing exporter collector -> sampler collector -> exporter collector/endpoints pattern) filter on this attribute in the sampler collector, and have only those traces enter the processor, and send the others straight to a backend of choice in a separate pipeline. This means you only need to define policies in the sampler which keep specified traces, making the need for a backwards compatibility policy redundant.

hyang023 · 2024-11-06T14:17:08Z

it could be helpful to hear from a code owner (@jpkrohling) what the intended/expected behavior is for invert sampling inside of a composite policy (e.g. should the invert-not-sample from a subpolicy apply to the scope of only that composite policy as a whole, as it did before, or should its scope extend beyond that, as it does now?)
and if so, what would be the new recommendation for implementing "backwards compatibility" policies given this breaking change?

jpkrohling · 2024-11-18T15:00:15Z

Our team ran into this issue during an upgrade of the collector where our sampling stopped working because of this

I'm sorry about that, this shouldn't have happened. In fact, I'm open to reverting that PR and go back to the drawing board for the original issue it intended to solve.

what the intended/expected behavior is for invert sampling inside of a composite policy

I believe the result of the combination of AND policies should be either SAMPLE or NOT SAMPLE. If the AND policy had an invert attribute, then it would be OK to return an invert sample/not sample result.

hyang023 · 2024-11-18T21:51:25Z

I'm open to reverting that PR and go back to the drawing board for the original issue it intended to solve.

I think reverting the PR first, and then having any additional discussions after, sounds like a good idea

the combination of AND policies should be either SAMPLE or NOT SAMPLE. If the AND policy had an invert attribute, then it would be OK to return an invert sample/not sample result.

This makes sense! Thanks for clarifying

mrsimo · 2024-11-20T11:55:49Z

Not to pile in, but we discovered we've been hitting this problem as well. We had one policy for known high volume endpoints that we want to sample less often (0.1%), and one for all the others (5%). We had implemented this with AND policies and string_attribute against the list of endpoints, one of them with invert_match.

Since we have a few endpoints I don't think @karinhawk's solution works for us.

I've been racking my brains to find an alternative but I can't come up with one. I've tried using an ottl_condition, but also not possible. It would be possible to increase the amount of sampling under a condition, but not the opposite.

A revert would be welcome. Having an invert_match on the and policy itself would make a lot of sense to me!

hyang023 added bug Something isn't working needs triage New item requiring triage labels Jul 16, 2024

github-actions bot added the processor/tailsampling Tail sampling processor label Jul 16, 2024

github-actions bot mentioned this issue Jul 16, 2024

Weekly Report: 2024-07-09 - 2024-07-16 #34087

Closed

jpkrohling self-assigned this Jul 16, 2024

github-actions bot mentioned this issue Jul 23, 2024

Weekly Report: 2024-07-16 - 2024-07-23 #34202

Closed

github-actions bot mentioned this issue Jul 30, 2024

Weekly Report: 2024-07-23 - 2024-07-30 #34301

Closed

github-actions bot mentioned this issue Aug 6, 2024

Weekly Report: 2024-07-30 - 2024-08-06 #34410

Closed

This was referenced Aug 13, 2024

Weekly Report: 2024-08-06 - 2024-08-13 #34626

Closed

Weekly Report: 2024-08-13 - 2024-08-20 #34743

Closed

This was referenced Aug 27, 2024

Weekly Report: 2024-08-20 - 2024-08-27 #34856

Closed

Weekly Report: 2024-08-27 - 2024-09-03 #34966

Closed

This was referenced Sep 10, 2024

Weekly Report: 2024-09-03 - 2024-09-10 #35086

Closed

Weekly Report: 2024-09-10 - 2024-09-17 #35228

Closed

github-actions bot mentioned this issue Sep 24, 2024

Weekly Report: 2024-09-17 - 2024-09-24 #35377

Closed

github-actions bot added the Stale label Sep 30, 2024

github-actions bot mentioned this issue Oct 1, 2024

Weekly Report: 2024-09-24 - 2024-10-01 #35498

Closed

github-actions bot removed the Stale label Oct 2, 2024

atoulme removed the needs triage New item requiring triage label Oct 2, 2024

karinhawk mentioned this issue Nov 6, 2024

[processor/tailsampling] fix InvertNotSampled decision precedence when inside and sub policy #33671

Merged

jpkrohling mentioned this issue Dec 4, 2024

[processor/tailsampling] Allow invert matches in composite policy to continue processing #36673

Merged

jpkrohling mentioned this issue Dec 4, 2024

[processor/tailsampling] invert_match not given precedence when inside and policy #33656

Open

evan-bradley closed this as completed in #36673 Dec 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[processor/tailsampling] invert_match precedence #34085

[processor/tailsampling] invert_match precedence #34085

hyang023 commented Jul 16, 2024

github-actions bot commented Jul 16, 2024

savar commented Jul 30, 2024

github-actions bot commented Sep 30, 2024

avrong commented Oct 1, 2024 •

edited

Loading

karinhawk commented Nov 6, 2024 •

edited

Loading

hyang023 commented Nov 6, 2024

jpkrohling commented Nov 18, 2024

hyang023 commented Nov 18, 2024

mrsimo commented Nov 20, 2024

[processor/tailsampling] invert_match precedence #34085

[processor/tailsampling] invert_match precedence #34085

Comments

hyang023 commented Jul 16, 2024

Component(s)

What happened?

Description

Steps to Reproduce

Expected Result

Actual Result

Collector version

Environment information

Environment

OpenTelemetry Collector configuration

Log output

Additional context

github-actions bot commented Jul 16, 2024

savar commented Jul 30, 2024

github-actions bot commented Sep 30, 2024

avrong commented Oct 1, 2024 • edited Loading

karinhawk commented Nov 6, 2024 • edited Loading

hyang023 commented Nov 6, 2024

jpkrohling commented Nov 18, 2024

hyang023 commented Nov 18, 2024

mrsimo commented Nov 20, 2024

avrong commented Oct 1, 2024 •

edited

Loading

karinhawk commented Nov 6, 2024 •

edited

Loading