
add automatic tail sampling based on latency and error count thresholds #1080

Merged: 1 commit merged into kamon-io:master on Nov 17, 2021

Conversation

@ivantopo (Contributor) commented on Nov 16, 2021

I originally shared this idea on our Discord server and I'm copy/pasting it here:

This idea came to mind after chatting with one of our users, and it is about keeping only “interesting” traces, where “interesting” means they either have high latency or contain errors. A few versions ago we added a span-reporting-delay setting (see https://github.com/kamon-io/Kamon/blob/master/core/kamon-core/src/main/resources/reference.conf#L148-L156) that can keep Spans in memory for a few extra seconds, so that users can add a bit of extra logic in their controllers and decide whether to keep or drop a trace. The issue there is that users MUST add that extra logic by hand.

I would like to add three things here:

  • An error counter on our Trace object. This counter will be incremented when any Span for a trace is marked as failed.
  • A small piece of logic that runs after Spans finish and figures out whether the trace should be marked for keeping or not.
  • A configuration format that allows users to set this up with configuration only, something like “try to keep all traces with at least 2 errors, or with latency of 1+ seconds”.

For starters I think we can apply this logic only when Local Root spans are finished, and then we can try to expand this with a bit more functionality, like setting different rules for different operations, or maybe even trying to keep traces for different latency buckets. Many ideas come to mind but I think it would be good to keep it simple and see where it goes 😄
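To make the idea concrete, here is a minimal sketch in Scala of the check that could run when the local root Span finishes. The names used here (Trace, errorCount, keep(), TailSamplerSettings) are illustrative only and are not necessarily how Kamon models this internally:

```scala
import java.time.{Duration, Instant}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical trace handle: tracks failed Spans and whether the trace
// should be force-kept. Not Kamon's actual internal Trace type.
final class Trace {
  val errorCount: AtomicInteger = new AtomicInteger(0)
  @volatile var keepRequested: Boolean = false

  // Called whenever any Span in this trace is marked as failed.
  def spanFailed(): Unit = errorCount.incrementAndGet()

  // Overrides the sampling decision so reporters keep the trace.
  def keep(): Unit = keepRequested = true
}

// Thresholds coming from configuration, e.g. 2 errors or 1 second of latency.
final case class TailSamplerSettings(errorCountThreshold: Int, latencyThreshold: Duration)

object LocalTailSampler {

  // Runs when the local root Span finishes: if the trace accumulated enough
  // errors or took long enough, mark it for keeping.
  def onLocalRootFinish(trace: Trace, start: Instant, end: Instant, settings: TailSamplerSettings): Unit = {
    val latency = Duration.between(start, end)
    val tooManyErrors = trace.errorCount.get() >= settings.errorCountThreshold
    val tooSlow = latency.compareTo(settings.latencyThreshold) >= 0
    if (tooManyErrors || tooSlow)
      trace.keep()
  }
}
```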

This would work really well for people using Kamon in monolith-like applications, but the “forced” traces would be partial if you are doing distributed tracing. Still, I think it is better to have a partial trace than no trace at all. If you have the trace and its trace ID, you can maybe gather some extra data from correlated logs. And, not everybody is doing microservices!

This PR implements the idea described above, allowing traces to be force-kept once they reach a minimum error count or latency threshold. I still need to write some tests and make sure everything works fine. Fortunately, the implementation is pretty simple.
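For illustration, the configuration could look roughly like the snippet below. The span-reporting-delay setting already exists (see the reference.conf linked above); the local-tail-sampler block and its key names are assumptions used only to show the shape of the settings, so check the reference.conf shipped with the release for the actual names and defaults:

```hocon
kamon.trace {
  # Already exists: keeps finished Spans in memory for a while so the
  # sampling decision can still be changed after the local root finishes.
  span-reporting-delay = 2 seconds

  # Assumed key names, shown for illustration only.
  local-tail-sampler {
    enabled = yes
    error-count-threshold = 2
    latency-threshold = 1 second
  }
}
```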

@ivantopo force-pushed the local-tail-based-sampling branch from b954161 to d92118f on November 16, 2021 18:33
@ivantopo (Contributor, Author) commented:

Got the tests in and rebased from master. If nothing special comes up I'll merge this tomorrow and put out a minor release.

@ivantopo force-pushed the local-tail-based-sampling branch from 519dd7d to df2bad0 on November 16, 2021 21:36
@dpsoft (Contributor) commented on Nov 17, 2021

@ivantopo awesome! And most importantly, it's really simple and useful.

@ivantopo force-pushed the local-tail-based-sampling branch from df2bad0 to 6069c0c on November 17, 2021 14:50
@ivantopo merged commit 25bb796 into kamon-io:master on Nov 17, 2021
@ivantopo deleted the local-tail-based-sampling branch on November 17, 2021 15:40