Fix memory leak when OpenTelemetry spans get filtered #3198
Conversation
private bool NeedsPruning()
{
    lock (_pruningLock)
This reads as potential contention code, and since we're just doing some math with the date, is there a way we can do something atomic here instead with Interlocked?
Suggestion: unix timestamp as seconds and Interlocked.CompareExchange.
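For illustration, that suggestion might look something like this (a minimal sketch rather than the code that ended up in the PR; the _lastPrunedSeconds field and PruningIntervalSeconds constant are made-up names):

using System;
using System.Threading;

internal class PruningCheckSketch
{
    // Unix timestamp (seconds) of the last prune; 0 means "never pruned".
    private long _lastPrunedSeconds;
    private const long PruningIntervalSeconds = 5; // illustrative interval

    public bool NeedsPruning()
    {
        var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
        var last = Interlocked.Read(ref _lastPrunedSeconds);
        if (now - last < PruningIntervalSeconds)
        {
            return false;
        }

        // Only the thread that successfully swaps the timestamp gets to prune;
        // everyone else sees a changed value and backs off.
        return Interlocked.CompareExchange(ref _lastPrunedSeconds, now, last) == last;
    }
}

The compare-exchange means at most one caller per interval observes true, without ever blocking.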
I've replaced the lock with a rough equivalent using Interlocked... In the benchmarks I ran, I wasn't able to detect a significant performance difference between the two, but that might have been a shortcoming of the benchmark itself. I'll have another crack at it, as intuitively the two should behave quite differently.
OK, I put together some new benchmarks... Interestingly, lock actually comes out a bit faster than either Interlocked or SemaphoreSlim. The difference isn't wild and I'm not sure how much I trust these - I'm worried there's some error in my design of the benchmark tests themselves. Code is here, however.
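For what it's worth, a BenchmarkDotNet comparison of the three approaches could be structured roughly like this (a sketch, not the benchmark code linked above; the one-second interval and field names are arbitrary):

using System;
using System.Threading;
using BenchmarkDotNet.Attributes;

public class NeedsPruningBenchmarks
{
    private readonly object _gate = new();
    private readonly SemaphoreSlim _semaphore = new(1, 1);
    private long _lastPrunedLock;
    private long _lastPrunedCas;
    private long _lastPrunedSem;

    [Benchmark(Baseline = true)]
    public bool WithLock()
    {
        lock (_gate)
        {
            var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
            if (now - _lastPrunedLock < 1) return false;
            _lastPrunedLock = now;
            return true;
        }
    }

    [Benchmark]
    public bool WithInterlocked()
    {
        var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
        var last = Interlocked.Read(ref _lastPrunedCas);
        return now - last >= 1 &&
               Interlocked.CompareExchange(ref _lastPrunedCas, now, last) == last;
    }

    [Benchmark]
    public bool WithSemaphore()
    {
        _semaphore.Wait();
        try
        {
            var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
            if (now - _lastPrunedSem < 1) return false;
            _lastPrunedSem = now;
            return true;
        }
        finally
        {
            _semaphore.Release();
        }
    }
}

Run via BenchmarkRunner.Run<NeedsPruningBenchmarks>(), this measures the per-call cost on a single thread only; without concurrent callers it won't show contention effects, which could explain counter-intuitive results.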
interestingly, lock actually comes out a bit faster than either Interlocked or SemaphoreSlim.
I really don't trust the benchmark then :)
the link 404's to me
Whoops... that repo was private. Just made it public.
Looks like we're not alone in this: getsentry/sentry-python#2722
It might be worth checking back how it gets solved there.
Interesting... But this fix doesn't affect the public API, so no need to align with the Python SDK, I don't think. It'd be good to review and merge ASAP as it's quite a serious issue for anyone using filtering with OTEL.
If it fixes the problem let's go with it; possible lock contention is better than a certain OOM crash.
Fixes #3197
Solution
For the Sentry spans we hold in SentrySpanProcessor._map, we need a way to detect when the associated Activities have been filtered out. The OpenTelemetry SDK doesn't provide a mechanism for this, so I had to get a bit creative. Previously we only held on to the Activity.Id for OpenTelemetry spans that we'd processed, and there's no way to look up an Activity by its Id.

With this PR we now retain weak references to the Activities themselves (not just the Activity.Id), and we periodically loop through each of the spans in our _map to check whether the associated Activities have been filtered out. When an Activity gets filtered, IsAllDataRequested is set to false and the ActivityTraceFlags.Recorded flag gets removed (see code). When that happens, we know we're not going to receive any more OnStart/OnEnd events relating to that Activity, so we can remove it from our map.

Additionally, when finishing transactions that were instrumented using OpenTelemetry, we now surgically remove any spans that were filtered, since although those spans were removed from the _map, they still exist in SentryTransaction.Spans.
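In rough terms, the pruning pass over those weak references looks something like this (a simplified sketch rather than the PR's exact code; the _activities dictionary and its string key are illustrative, since the real _map holds the Sentry spans themselves):

using System;
using System.Collections.Concurrent;
using System.Diagnostics;

internal class FilteredActivityPruningSketch
{
    // Weak references so this sketch never keeps a filtered Activity alive itself.
    private readonly ConcurrentDictionary<string, WeakReference<Activity>> _activities = new();

    public void Track(Activity activity) =>
        _activities[activity.SpanId.ToHexString()] = new WeakReference<Activity>(activity);

    public void PruneFilteredSpans()
    {
        foreach (var entry in _activities)
        {
            // If the Activity has been collected, or a sampler filtered it out
            // (IsAllDataRequested is false and the Recorded flag has been cleared),
            // no further OnStart/OnEnd will arrive for it, so drop the entry.
            if (!entry.Value.TryGetTarget(out var activity) ||
                (!activity.IsAllDataRequested &&
                 (activity.ActivityTraceFlags & ActivityTraceFlags.Recorded) == 0))
            {
                _activities.TryRemove(entry.Key, out _);
            }
        }
    }
}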
Testing Manually
To test this manually, this line in our sample project can be changed to:
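(The exact line from the sample isn't reproduced here; the following is just an illustration of filtering every outgoing HTTP request with OpenTelemetry's HttpClient instrumentation, and the sample's actual wiring will differ.)

using OpenTelemetry;
using OpenTelemetry.Trace;

// Assumes the OpenTelemetry and OpenTelemetry.Instrumentation.Http packages.
// Returning false from the filter drops every outgoing HTTP request, so none
// of those Activities are ever recorded.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddHttpClientInstrumentation(options =>
        options.FilterHttpRequestMessage = _ => false)
    .Build();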
That basically filters every outgoing HTTP request. You can then see in the resulting traces that these have magically disappeared (there's a suspicious gap in the timings, but there's no trace event for them).
Before this PR, this was not the case... we were seeing:
a) A bunch of unfinished spans relating to outgoing HTTP requests that were filtered by OTEL
b) An accumulation of those spans in our SentrySpanProcessor, resulting in ever increasing memory consumption