
add automatic tail sampling based on latency and error count thresholds #1080

Merged: 1 commit merged into kamon-io:master on Nov 17, 2021

Conversation

@ivantopo (Contributor) commented on Nov 16, 2021

I originally shared this idea on our Discord server and I'm copy/pasting it here:

This idea came to mind after chatting with one of our users, and it is about keeping only “interesting” traces, where “interesting” means they either have high latency or contain errors. A few versions ago we added a span-reporting-delay setting (see https://github.com/kamon-io/Kamon/blob/master/core/kamon-core/src/main/resources/reference.conf#L148-L156) that can keep Spans in memory for a few extra seconds, so that users can add a bit of extra logic in their controllers and decide whether to keep or drop a trace. The issue there is that users MUST add that extra logic by hand.

I would like to add three things here:

  • An error counter on our Trace object. This counter will be incremented when any Span for a trace is marked as failed.
  • A small piece of logic that runs after Spans finish and figures out whether the trace should be marked for keeping or not.
  • A configuration format that allows users to set this up with configuration only, something like “try to keep all traces with at least 2 errors, or with latency of 1+ seconds”.

For starters I think we can apply this logic only when Local Root spans are finished, and then we can try to expand this with a bit more functionality, like setting different rules for different operations, or maybe even trying to keep traces for different latency buckets. Many ideas come to mind but I think it would be good to keep it simple and see where it goes 😄
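To make the idea concrete, here is a minimal sketch in Scala of the check that could run when the local root Span finishes. The names used here (Trace, errorCount, keep(), TailSamplerSettings) are illustrative only and are not necessarily how Kamon models this internally:

```scala
import java.time.{Duration, Instant}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical trace handle: tracks failed Spans and whether the trace
// should be force-kept. Not Kamon's actual internal Trace type.
final class Trace {
  val errorCount: AtomicInteger = new AtomicInteger(0)
  @volatile var keepRequested: Boolean = false

  // Called whenever any Span in this trace is marked as failed.
  def spanFailed(): Unit = errorCount.incrementAndGet()

  // Overrides the sampling decision so reporters keep the trace.
  def keep(): Unit = keepRequested = true
}

// Thresholds coming from configuration, e.g. 2 errors or 1 second of latency.
final case class TailSamplerSettings(errorCountThreshold: Int, latencyThreshold: Duration)

object LocalTailSampler {

  // Runs when the local root Span finishes: if the trace accumulated enough
  // errors or took long enough, mark it for keeping.
  def onLocalRootFinish(trace: Trace, start: Instant, end: Instant, settings: TailSamplerSettings): Unit = {
    val latency = Duration.between(start, end)
    val tooManyErrors = trace.errorCount.get() >= settings.errorCountThreshold
    val tooSlow = latency.compareTo(settings.latencyThreshold) >= 0
    if (tooManyErrors || tooSlow)
      trace.keep()
  }
}
```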

This would work really well for people using Kamon in monolith-like applications, but the “forced” traces would be partial if you are doing distributed tracing. Still, I think it is better to have a partial trace than no trace at all. If you have the trace and its trace ID, you can maybe gather some extra data from correlated logs. And, not everybody is doing microservices!

This PR implements the idea described above, allowing traces to be force-kept once they reach a minimum error count or latency threshold. I still need to write some tests and make sure everything works fine. Fortunately, the implementation is pretty simple.
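For illustration, the configuration could look roughly like the snippet below. The span-reporting-delay setting already exists (see the reference.conf linked above); the local-tail-sampler block and its key names are assumptions used only to show the shape of the settings, so check the reference.conf shipped with the release for the actual names and defaults:

```hocon
kamon.trace {
  # Already exists: keeps finished Spans in memory for a while so the
  # sampling decision can still be changed after the local root finishes.
  span-reporting-delay = 2 seconds

  # Assumed key names, shown for illustration only.
  local-tail-sampler {
    enabled = yes
    error-count-threshold = 2
    latency-threshold = 1 second
  }
}
```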

@ivantopo force-pushed the local-tail-based-sampling branch from b954161 to d92118f on November 16, 2021 18:33
@ivantopo (Contributor, Author) commented:

Got the tests in and rebased from master. If nothing special comes up I'll merge this tomorrow and put out a minor release.

@ivantopo force-pushed the local-tail-based-sampling branch from 519dd7d to df2bad0 on November 16, 2021 21:36
@dpsoft (Contributor) commented on Nov 17, 2021

@ivantopo awesome! And most importantly, it's really simple and useful.

@ivantopo force-pushed the local-tail-based-sampling branch from df2bad0 to 6069c0c on November 17, 2021 14:50
@ivantopo merged commit 25bb796 into kamon-io:master on Nov 17, 2021
@ivantopo deleted the local-tail-based-sampling branch on November 17, 2021 15:40