Adaptive Sampling #365

Closed · 3 tasks done
yurishkuro opened this issue Sep 1, 2017 · 38 comments · Fixed by #2966

@yurishkuro
Member

yurishkuro commented Sep 1, 2017

Problem

The most common way of using Jaeger client libraries is with probabilistic sampling, which determines whether a new trace should be sampled. Sampling is necessary to control the amount of tracing data reaching the storage backend. There are two issues with the current approach:

  1. The individual microservices have little insight into what the appropriate sampling rate should be. For example, a 0.001 probability (one trace per second per service instance) might seem reasonable, but if the fanout in some downstream services is very high, it might flood the tracing backend.
  2. Sampling rates are defined on a per-service basis. If a service has two endpoints with vastly different throughputs, its sampling rate will be driven by the high-QPS endpoint, which may leave the low-QPS endpoint effectively never sampled. For example, if the QPS of the endpoints differs by a factor of 100 and the probability is set to 0.001, the low-QPS traffic has only a 1 in 100,000 chance of being sampled.

Proposed Solution

Adaptive sampling addresses these issues by:

  1. Assigning sampling probabilities per service + endpoint rather than just per service
  2. Using a lower-bound rate limiter to ensure that every endpoint is sampled at a certain minimal rate (a sketch of such a sampler follows this list)
  3. Observing the impact of sampling rates on the overall number of traces sampled from a service and dynamically adjusting the per-endpoint sampling rates to meet certain target rates.
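
For concreteness, here is a rough sketch (in Go, not the actual jaeger-client code) of the per-operation sampler that points 1 and 2 describe: a probabilistic sampler combined with a lower-bound rate limiter, where the probability is the knob that point 3 adjusts dynamically. All names here are illustrative.

package sampling

import (
    "math/rand"
    "sync"
    "time"
)

// rateLimiter is a small token bucket refilled at creditsPerSecond.
type rateLimiter struct {
    mu               sync.Mutex
    creditsPerSecond float64
    balance          float64
    maxBalance       float64
    lastTick         time.Time
}

func newRateLimiter(creditsPerSecond float64) *rateLimiter {
    return &rateLimiter{creditsPerSecond: creditsPerSecond, balance: 1, maxBalance: 1, lastTick: time.Now()}
}

func (r *rateLimiter) checkCredit(cost float64) bool {
    r.mu.Lock()
    defer r.mu.Unlock()
    now := time.Now()
    r.balance += now.Sub(r.lastTick).Seconds() * r.creditsPerSecond
    if r.balance > r.maxBalance {
        r.balance = r.maxBalance
    }
    r.lastTick = now
    if r.balance >= cost {
        r.balance -= cost
        return true
    }
    return false
}

// guaranteedThroughputSampler is kept per service+endpoint (point 1). The
// probabilistic part is what the backend adjusts dynamically (point 3); the
// lower bound guarantees a minimal rate for low-QPS endpoints (point 2).
type guaranteedThroughputSampler struct {
    probability float64
    lowerBound  *rateLimiter
}

func (s *guaranteedThroughputSampler) isSampled() bool {
    if rand.Float64() < s.probability {
        s.lowerBound.checkCredit(1) // consume credit so the lower bound doesn't double-sample
        return true
    }
    return s.lowerBound.checkCredit(1)
}

For example, newRateLimiter(1.0/60) would guarantee roughly one trace per minute for an endpoint even when the probabilistic part never fires.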

Status

Pending open-sourcing of the backend functionality; the client work is done.

@robdefeo

Any idea when the backend functionality will be open-sourced?

@billowqiu

"Adaptive Sampling" is now ok in backend?

@yurishkuro
Member Author

It's coming soon. @black-adder just finished rolling it out internally to all services, so it just needs a bit of clean-up (removing internal deps) before moving to open source.

@billowqiu

Thanks @yurishkuro, I am investigating Jaeger and Zipkin.

@trtg

trtg commented Jun 19, 2018

@yurishkuro any progress on this being released?

@yurishkuro
Member Author

Question for @black-adder: "he is a-cooking something up"

@sergeyklay

@black-adder any news?

@black-adder
Contributor

Sorry all, I just started to move the pieces over; hopefully we'll have the whole thing in OSS this week.

@agxp

agxp commented Oct 11, 2018

What's the status on this? I would like to configure Jaeger to sample all traces on low load, and on high load sample at a certain probability. It doesn't seem possible currently. Thanks.
@black-adder

@pravarag

Hi, is adaptive sampling still under way? I'm really eager to try it out for my sample app.

@yurishkuro
Member Author

The PRs are in progress/review. Unfortunately, a higher priority project has delayed this.

@Glowdable

It's very important

@capescuba

Another check-in on progress for this. We are considering an implementation, and this would be a great feature to add.

@csurfleet

Any further news on this?

@m1o1

m1o1 commented Apr 1, 2019

This feature would help us as well - it seems that under high load some spans are being dropped / never received by ES (since we are currently trying to sample all traces). We are hoping to sample 100% of "unique" traces (similar to what differentiates traces in "Compare" in the UI) within the last X amount of time. I heard about this idea from an OpenCensus talk; it sounded like they're working on a similar feature in their agent service.

@wuyupengwoaini

Any further news on this? I am looking forward to this feature

@yurishkuro
Member Author

the main code has been merged, pending wiring into collector's main

@trtg

trtg commented Apr 22, 2019

@capescuba @wuyupengwoaini @adinunzio84 @csurfleet not sure if this helps in your current context, but here is how we've been implementing an approximation of this feature for our use case (moderately high throughput: tens of thousands of requests per second): we set keys in redis that control sampling through the sampling-priority debug header. In other words, we set the default probabilistic sampling rate to zero, and the code checks a set of redis keys to decide whether it should sample. In my particular scenario, for example, we get requests from many different applications and devices, and in redis we set keys denoting which apps or devices we want to trace and what percentage of requests for those apps to trace. So if we have a specific issue to debug, we set redis keys to trace 100% of requests from the problematic app and some low percentage of requests from apps that we are passively monitoring.

For example, in Python we dynamically force tracing based on redis keys by setting this tag:

from opentracing.ext import tags as ext_tags  # provides SAMPLING_PRIORITY
span.set_tag(ext_tags.SAMPLING_PRIORITY, 1)
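
The equivalent pattern in Go might look like the following sketch, assuming go-redis and opentracing-go; the key scheme ("trace:app:<appID>") and the percentage semantics are assumptions for illustration, not part of Jaeger:

package main

import (
    "context"
    "math/rand"

    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
    "github.com/redis/go-redis/v9"
)

// maybeForceSample forces sampling for a request from appID when a redis key
// such as "trace:app:<appID>" holds a sampling percentage (0-100).
func maybeForceSample(ctx context.Context, rdb *redis.Client, span opentracing.Span, appID string) {
    pct, err := rdb.Get(ctx, "trace:app:"+appID).Float64()
    if err != nil {
        return // no key set: leave the default (0%) probabilistic sampling in place
    }
    if rand.Float64()*100 < pct {
        ext.SamplingPriority.Set(span, 1) // force this trace to be sampled
    }
}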

@wuyupengwoaini

@trtg Thanks for your advice. In fact, I think the sampling rate strategy can be divided into three steps:

  1. The sampling rate is configurable, i.e. there is a system to configure the sampling rate of each interface in each service.
  2. The configuration can be activated dynamically (e.g. by means of a configuration center or the like).
  3. The system automatically adjusts the sampling rate according to the pressure on the jaeger backend.

Your use of redis to dynamically configure the sample rate is essentially the second step above. It only modifies the sampling rate in real time, but that is enough.

@csurfleet

Hey all, I've created a NuGet package that allows per-request sampling based on anything in the incoming HttpRequest; feel free to have a play. For any suggestions, either let me know or send me a PR ;) I'm using this in production for one app so far, and I'll try to add more later on:
https://github.com/csurfleet/JaegerSamplingFilters
https://www.nuget.org/packages/JaegerSamplingFilters/

@DannyNoam

Hi @yurishkuro! Just reading about the different sampling strategies in Jaeger, and it's slightly unclear to me whether Adaptive Sampling would reference a central configuration (which would be somewhat less verbose than in the 'sampling strategies' file), or whether we'd specify the sampling rate in the service (and that the adaptive sampler would ensure lower QPS endpoints have their fair share of traces). Cheers!

@yurishkuro
Member Author

@DannyNoam This ticket refers to dynamic adaptive sampling where strategies are automatically calculated based on observed traffic volumes. In jaeger-collector the only configuration would be "target rate of sampled traces / second" (in the current code it's a single global setting, but we're looking to extend it to be configurable per service/endpoint).

@rscott231

> the main code has been merged, pending wiring into collector's main

@yurishkuro any timeline on when this will be wired into collector's main?

@lookfwd

lookfwd commented Jun 26, 2019

> Sampling rates are defined on a per-service basis. If a service has two endpoints with vastly different throughputs

I was thinking that, with a small generalization, one could extend the adaptive sampling idea into something between head-based and tail-based sampling that also detects outliers and has other benefits of tail-based sampling, while requiring fewer computational resources.

As part of adaptive sampling, sampling information is communicated from the collector to the agent.

The communicated data has, I guess, a format similar to the Collector Sampling Configuration. The documentation gives this example:

{
  "service_strategies": [
    {
      "service": "foo",
      "type": "probabilistic",
      "param": 0.8,
      "operation_strategies": [
        {
          "operation": "op1",
          "type": "probabilistic",
          "param": 0.2
        },
        {
          "operation": "op2",
          "type": "probabilistic",
          "param": 0.4
        }
      ]
    },
    {
      "service": "bar",
      "type": "ratelimiting",
      "param": 5
    }
  ],
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.5
  }
}

We can change it slightly to this:

{
  "sampling-rules": [
    {"name": "foo_op_1",
     "condition": {"and" : [
       {"==": [{"var" : "span.process.serviceName" }, "foo"]},
       {"==": [{"var" : "span.operationName"}, "op1"]}]
     },
     "strategy": {
       "type": "probabilistic",
       "param": 0.2
     }
    },
    {"name": "foo_op_2",
     "condition": {"and" : [
       {"==": [{"var" : "span.process.serviceName" }, "foo"]},
       {"==": [{"var" : "span.operationName"}, "op2"]}]
     },
     "strategy": {
       "type": "probabilistic",
       "param": 0.4
     }
    },
    {"name": "bar",
     "condition": {"==": [{"var" : "span.process.serviceName" }, "bar"]},
     "strategy": {
       "type": "ratelimiting",
       "param": 5
     }
    },
    {"name": "default",
     "strategy": {
       "type": "probabilistic",
       "param": 0.5
     }
    } 
  ],
  "tables": ...
}

What do we see here?

  1. There's a name on every rule that helps with debugging.
  2. There's a condition on each rule. I use JsonLogic.
  3. If multiple rules match, the first one wins, i.e. order within sampling-rules matters.
  4. There's a "tables" field that we will get to in a bit.

This small extension enables the client and the agent to do powerful cherry-picking of what to sample.
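
A small sketch of the matching semantics just described (rule names for debugging, per-rule conditions, first match wins), with the JsonLogic condition abstracted as a predicate over the span model; the types are illustrative only:

package rules

// Strategy mirrors the "strategy" object of a rule (e.g. probabilistic, ratelimiting).
type Strategy struct {
    Type  string
    Param float64
}

// Rule pairs a named condition with a strategy. Condition would be the compiled
// JsonLogic expression; here it is just a predicate over the span model.
type Rule struct {
    Name      string
    Condition func(span map[string]interface{}) bool // nil means "always matches" (default rule)
    Strategy  Strategy
}

// Match returns the first rule whose condition accepts the span;
// order within the slice matters, exactly as in the proposed config.
func Match(rules []Rule, span map[string]interface{}) (Rule, bool) {
    for _, r := range rules {
        if r.Condition == nil || r.Condition(span) {
            return r, true
        }
    }
    return Rule{}, false
}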

The rules defined above are re-evaluated whenever a Span changes. More specifically, Spans in Jaeger can be represented as JSON documents that adhere to this model. Examples can be found in example_trace.json:

    {
      "traceId": "AAAAAAAAAAAAAAAAAAAAEQ==",
      "spanId": "AAAAAAAAAAM=",
      "operationName": "example-operation-1",
      "references": [],
      "startTime": "2017-01-26T16:46:31.639875Z",
      "duration": "100000ns",
      "tags": [],
      "process": {
        "serviceName": "example-service-1",
        "tags": []
      },
      "logs": [
        {
          "timestamp": "2017-01-26T16:46:31.639875Z",
          "fields": []
        },
        {
          "timestamp": "2017-01-26T16:46:31.639875Z",
          "fields": []
        }
      ]
    }

Pieces of information are collected at different points in time, e.g. the duration isn't known until the end of the Span, while the operationName is known at the beginning. Whenever there is a change, the rules are re-evaluated and can yield a "sample" decision. Examples of when a Span model might be updated are:

  • Start
  • Finish
  • When new tags or logs are added, including when error conditions are detected
  • Carrier injection (spawning children), at which point, depending on the RPC mechanism, a Span might be able to get additional tags like the target/child serviceName, operationName, host, etc., since they're likely known (e.g. when I make an HTTP request, I know the target endpoint)
  • RPC timeout, or
  • RPC call return, at which point a Span might learn whether the child decided to sample itself and/or whether there were any errors downstream.

The above might seem like high CPU overhead, but I can imagine many optimizations that return quickly when no rules match a Span change. JIT compilation of the rules might be another way to accelerate things further. The performance of Chrome, the DOM and V8 suggests that fast implementations are possible.

Here are some basic examples. To evaluate them, I wrap the example Span presented above in a {"span": ... } object and use the JsonLogic prompt one can find here.

The following condition evaluates to true for the example Span:

{"and" : [
  {"==": [{"var" : "span.process.serviceName" }, "example-service-1"]},
  {"==": [{"var" : "span.operationName"}, "example-operation-1"]}]
}

There are two default "force sample" rules in Jaeger. Can we implement them using this framework? Yes:

{"some" : [ {"var" : "span.tags" },
            {"==": [{"var":"key"}, "jaeger-debug-id"]} ]}

and

{"some" : [ {"var" : "span.tags" },
            {"and": [{"==": [{"var":"key"}, "SAMPLING_PRIORITY"]},
                     {">": [{"var":"value"}, 0]}]} ]} 

In those two cases the strategy of the rule could return const: 1, which forces sampling.

There are two safety conditions we would like to guarantee:

  1. No matter how bad the configuration, the sampler shouldn’t slow down the collection process
  2. No matter how bad the configuration, the sampler shouldn’t overwhelm the infrastructure

As a result, proper engineering should be put in place. JsonLogic isn't Turing complete, but there could still be problems if Spans have tags with long string values or long arrays, or if the configuration contains overly long rules. Rules can be auto-generated, e.g. by some real-time trend-analysis system, so it's wise to put controls in the agent/client that reject configurations that might be slow. On the second point above, if the rules end up sampling everything while our infrastructure can handle only 1:100 sampling, we would not want to overwhelm it. As a result, we might need some form of cascading rules or a global mechanism that limits total throughput (see the sketch below).
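
On that last point, here is a minimal sketch of such a global mechanism, assuming golang.org/x/time/rate; the cap value would be chosen to match what the infrastructure can actually ingest:

package sampling

import "golang.org/x/time/rate"

// globalCap wraps any rule-based decision with a hard limit on sampled
// traces per second, so a bad configuration cannot overwhelm the backend.
type globalCap struct {
    limiter *rate.Limiter
}

func newGlobalCap(maxTracesPerSecond float64) *globalCap {
    return &globalCap{limiter: rate.NewLimiter(rate.Limit(maxTracesPerSecond), int(maxTracesPerSecond)+1)}
}

// Sample returns true only if the rules said "sample" AND the global budget allows it.
func (g *globalCap) Sample(ruleDecision bool) bool {
    return ruleDecision && g.limiter.Allow()
}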

Can we use this form of adaptive sampling to implement rules that sample outliers in terms of duration? Yes. Here's such a rule:

{">": [
  {"var": "span.duration"},
  {"var": [{"cat": ["tables.d_percentiles",
                 ".", {"var" : "span.process.serviceName" },
                 ".", {"var" : "span.operationName" }
                 ]}, 10000000]}]}

We can test it with a (simplified) example span:

{
  "span":     {
      "operationName": "example-operation-1",
      "duration": 100000,
      "process": {
        "serviceName": "example-service-1"
      }
    },
    "tables": {
        "d_percentiles": {
            "example-service-1": {
                "example-operation-1": 50000
            }
        }
    }
}

Note that we now need a tables.d_percentiles dict, which can be declared in the configuration. This dict can be dynamically calculated by an "adaptive sampling calculator". It defines a per-endpoint expected maximum duration, beyond which sampling will be forced (e.g. by a const: 1 strategy). We might want to tag this traffic not just as sampled but as force-sampled, so that we ignore it while calculating percentiles etc. Uniform sampling can still be used for percentiles and other such statistics, while this extra type of traffic might be used to ensure we don't miss important outliers.
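
In code, evaluating that rule amounts to something like the following sketch, assuming the collector periodically ships the d_percentiles table to the agent/client:

package rules

import "time"

// PercentileTable maps service -> operation -> duration threshold,
// i.e. the tables.d_percentiles dict from the configuration.
type PercentileTable map[string]map[string]time.Duration

// isDurationOutlier forces sampling when a finished span exceeds the expected
// maximum duration for its service+operation; the fallback threshold plays the
// role of the 10000000 default in the JsonLogic rule above.
func isDurationOutlier(t PercentileTable, service, operation string, duration, fallback time.Duration) bool {
    if ops, ok := t[service]; ok {
        if threshold, ok := ops[operation]; ok {
            return duration > threshold
        }
    }
    return duration > fallback
}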

Since we now define and use a Span model, it would be good to also formalize a few more attributes it could have that would be potentially useful. For example, we could have "span.parent.operationName" or "span.parent.parent.process.serviceName" on a Span (parent might need to be an array, but let's skip that for now). These don't need to be supported by any middleware or configuration, but if a middleware supports passing some tags or Span attributes downstream in-band (e.g. through baggage), it's nice to know where to find them. With this extension, one can write rules that sample on a given operation and parent operation. As mentioned before, it might also be possible to do this on the caller when making the request, instead of on the child span, but parent provides another way to do it.

As a natural extension of the above, we can define an attribute span.child. child might also be an array with arbitrary attributes, but let's skip that for now. An important attribute might be span.child.sampled, which is true if a child decided to sample itself and false otherwise. This makes sense because, under the extensions described above, a Span can decide to sample itself at several different points in time (instead of just at the beginning) if it detects interesting conditions. Many types of RPC middleware are able to return information upstream, e.g. in HTTP response headers. This means a Span might be updated with an up-to-date value of span.child.sampled when a response from a sampled child span is detected. This could be triggered by an error condition on the child, in which case, if the current span also decides to sample and propagates upstream, we can capture the "call stack" and also sample any subsequent Spans. Those traces might have an unusual shape because of missing information and most likely shouldn't be used for aggregate analytics, but they can still be used for alerting and debugging.

@yurishkuro
Member Author

@lookfwd great proposal! We're actively discussing it right now for an internal project. One major issue we bumped into with it is this: when possible strategies are represented as a list and need to be matched one by one for a given span, it works pretty well if the match process runs only once. But in your proposal there's no specific demarcation event that tells the tracer "do it now", instead matching can run multiple times as tags are being added to the span. The problem is that a list of strategies would typically include a default fallback strategy when nothing custom matches, and the default strategy will always match, even on the first try, so there won't be time to set span tags and potentially match any other custom strategies.

We could introduce the demarcation event artificially, e.g. as some static method in Jaeger or a special span tag that can be used to signal "apply sampling rules now". But it's kind of ugly and requires additional instrumentation in the code.

Thoughts?

@lookfwd

lookfwd commented Sep 6, 2019

@yurishkuro - sorry, I missed it. The points at which one needs to know whether sampling is true or false are when the context is about to be injected into a carrier, or when one is about to finish the Span. Using those as the "apply sampling rules now" trigger(s), i.e. lazily evaluating should_sample() when it's needed, should give good full or partial traces as far as I can see.

More specifically, I would expect the user/framework to set all the tags before it injects or calls finish().

> default fallback strategy when nothing custom matches, and the default strategy will always match, even on the first try

If the default strategy is a lower-bound rate limiter, it should be sufficient to surface the existence of "weird" spans that e.g. set tags after injecting, without flooding the system.
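
A sketch of the lazy-evaluation idea (illustrative types, not the jaeger-client API): the decision is computed at most once, the first time the span is injected into a carrier or finished, so any tags set before those points can still influence the rules:

package tracer

import "sync"

type lazySpan struct {
    mu      sync.Mutex
    tags    map[string]interface{}
    decided bool
    sampled bool
    // shouldSample runs the sampling rules over the current span model.
    shouldSample func(tags map[string]interface{}) bool
}

// decide runs the sampling rules lazily and memoizes the result.
func (s *lazySpan) decide() bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if !s.decided {
        s.sampled = s.shouldSample(s.tags)
        s.decided = true
    }
    return s.sampled
}

// Inject and Finish are the "apply sampling rules now" triggers.
func (s *lazySpan) Inject() bool { return s.decide() }
func (s *lazySpan) Finish() bool { return s.decide() }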

@yurishkuro
Member Author

@lookfwd you might be interested in these two PRs (jaegertracing/jaeger-client-node#377, jaegertracing/jaeger-client-node#380), which introduce a shared sampling state for spans in-process and allow delaying sampling decision.

@stevejobsmyguru

> @lookfwd great proposal! We're actively discussing it right now for an internal project. One major issue we bumped into with it is this: when possible strategies are represented as a list and need to be matched one by one for a given span, it works pretty well if the match process runs only once. But in your proposal there's no specific demarcation event that tells the tracer "do it now", instead matching can run multiple times as tags are being added to the span. The problem is that a list of strategies would typically include a default fallback strategy when nothing custom matches, and the default strategy will always match, even on the first try, so there won't be time to set span tags and potentially match any other custom strategies.
>
> We could introduce the demarcation event artificially, e.g. as some static method in Jaeger or a special span tag that can be used to signal "apply sampling rules now". But it's kind of ugly and requires additional instrumentation in the code.
>
> Thoughts?

I have some high-level thoughts along the following lines:

At the collector level, for each host, we could run some kind of baseline calculation on key KPIs such as error count, 90th-percentile response time, or throughput over every <X> observation window. The observation interval could be every 5 minutes, every 10 minutes, or <auto-calculated>; it could be auto-calculated from throughput (operations/sec at each service level). If the baseline from the previous observation window is breached, the collector can trigger adaptive sampling via a callback to the agent for each service; I mean the agent sends data to the collector, and if everything is safe within the baseline, the agent purges within its perimeter. The point here is that there may be broken parents or broken children (Spans) in the end-to-end transaction. Of course, it would be a complex design.

@yurishkuro: this is just a high-level design sketch. Please go through it and make your own call.

@agaudreault

agaudreault commented Jun 3, 2020

I am not sure where to post this (I also asked in Gitter), but I will ask here since it seems like a problem with adaptive sampling and the standard implementation in the client libraries.

In our collector, we define the remote sampling configuration strategies, and for some default endpoints (/health, /metrics, etc.) we set a probabilistic sampling rate of 0.0 (pretty much like the example in the jaeger documentation). From what I found when debugging, the client libraries will create a GuaranteedThroughputSampler whenever a list of operations is received. The default lowerbound value will always be 0, but when the RateLimiterSampler is created, if the value is smaller than 1, it is set to 1. This causes the /health, /metrics, etc. endpoints to be sampled at a rate of 1 TPS.

Is there currently a way to disable sampling for specific endpoints? Or perhaps it is a bug that the RateLimiterSampler cannot be instantiated with a value of 0?

We are using Jaeger collector 1.8 and the latest releases of jaeger-client-java and jaeger-client-go.

@yurishkuro
Member Author

@agaudreault-jive it's probably a limitation of the data model of GuaranteedThroughputSampler - it only supports probability value per-endpoint, while the lowerbound rate limiter applies across all endpoints.

// OperationSamplingStrategy defines a sampling strategy that randomly samples a fixed percentage of operation traces.
struct OperationSamplingStrategy {
    1: required string operation
    2: required ProbabilisticSamplingStrategy probabilisticSampling
}

// PerOperationSamplingStrategies defines a sampling strategy per each operation name in the service
// with a guaranteed lower bound per second. Once the lower bound is met, operations are randomly sampled
// at a fixed percentage.
struct PerOperationSamplingStrategies {
    1: required double defaultSamplingProbability
    2: required double defaultLowerBoundTracesPerSecond
    3: required list<OperationSamplingStrategy> perOperationStrategies
    4: optional double defaultUpperBoundTracesPerSecond
}

It's possible to extend the model, but it would require pretty substantial changes.

BTW, 1 TPS seems very high for the lowerbound; we're using a value several orders of magnitude smaller.

@Mario-Hofstaetter

Any update on this feature? There are many merged pull requests. Will this be finished?
Or will it be obsolete / superseded by work done in the new OpenTelemetry collector?

@Ashmita152
Contributor

Hi @yurishkuro

I don't know if my skill set is good enough to solve this ticket, but I would like to take a stab at it. Could you summarise the work that remains here? My understanding from looking at the two PRs is that we just need to invoke the adaptive sampling processor from collector/main.go.

Thank you.

@yurishkuro
Member Author

@Ashmita152 you're correct. I think pretty much all of the code is already in the repo; it just needs hooking up in the collector and exposing configuration parameters via flags. It would be fantastic if we could get this in, since this last piece has been outstanding for over 2 years.

@Ashmita152
Contributor

Sure Yuri, I will give it a try. Thank you.

@Yolo-Hao

Any news? Does it work now? Or is there any plan?

@yurishkuro
Member Author

@joe-elliott picked this up in #2966

@mrsuperguo

I've made some attempts in my project. I combined rate sampling and a rate limiter together. At the same time, I sampled pod CPU & memory status to auto-adjust the sampling rate. It works well in my production env.
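
For anyone curious what that might look like, here is a rough sketch of such an adjustment loop; readCPUPercent is a hypothetical helper (e.g. backed by cgroup stats or a metrics endpoint), and nothing here is part of Jaeger itself:

package adaptive

import (
    "math"
    "sync/atomic"
    "time"
)

// probability stores the current sampling probability as math.Float64bits for atomic access.
var probability uint64

func currentProbability() float64 { return math.Float64frombits(atomic.LoadUint64(&probability)) }

// adjustLoop lowers the sampling probability as CPU load rises and raises it
// back when load drops, clamped to [minP, maxP].
func adjustLoop(minP, maxP float64, readCPUPercent func() float64, interval time.Duration) {
    for range time.Tick(interval) {
        cpu := readCPUPercent() // hypothetical: returns 0-100
        p := maxP * (1 - cpu/100)
        if p < minP {
            p = minP
        }
        atomic.StoreUint64(&probability, math.Float64bits(p))
    }
}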

@yurishkuro
Member Author

FYI https://medium.com/jaegertracing/adaptive-sampling-in-jaeger-50f336f4334
