
Fix memory leak when OpenTelemetry spans get filtered #3198

Merged
merged 5 commits into main from otel-filtered-spans on Mar 8, 2024

Conversation

@jamescrosswell (Collaborator) commented on Mar 6, 2024

Fixes #3197

Solution

For the Sentry spans we hold in SentrySpanProcessor._map, we need a way to detect when the associated Activities have been filtered out. The OpenTelemetry SDK doesn't provide a mechanism for this, so I had to get a bit creative.

Previously we only held on to the Activity.Id for OpenTelemetry spans that we'd processed, and there's no way to look up an Activity by its Id.

With this PR we now retain weak references to the Activities themselves (not just the Activity.Id), and we periodically loop through each of the spans in our _map to check whether the associated Activities have been filtered out. When an Activity gets filtered, IsAllDataRequested is set to false and the ActivityTraceFlags.Recorded flag is removed (see code). When that happens, we know we're not going to receive any more OnStart/OnEnd events relating to the Activity, so we can remove it from our map.
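
As a rough illustration of that pruning check, here is a minimal sketch (the class and method names are simplified stand-ins; the real SentrySpanProcessor._map also holds the corresponding Sentry spans):

    using System.Collections.Concurrent;
    using System.Diagnostics;

    // Minimal sketch: track weak references to Activities and drop map entries whose
    // Activity has been filtered out (or garbage collected). Simplified stand-in for
    // SentrySpanProcessor._map, which also stores the corresponding Sentry spans.
    internal class FilteredActivityTracker
    {
        private readonly ConcurrentDictionary<string, WeakReference<Activity>> _map = new();

        public void Track(Activity activity)
        {
            if (activity.Id is { } id)
            {
                _map[id] = new WeakReference<Activity>(activity);
            }
        }

        public void PruneFilteredActivities()
        {
            foreach (var entry in _map)
            {
                // A filtered Activity has IsAllDataRequested == false and the Recorded trace
                // flag removed, so OnEnd will never fire for it; a collected Activity is
                // equally dead. Either way the entry can be dropped.
                if (!entry.Value.TryGetTarget(out var activity) ||
                    (!activity.IsAllDataRequested &&
                     (activity.ActivityTraceFlags & ActivityTraceFlags.Recorded) == 0))
                {
                    _map.TryRemove(entry.Key, out _);
                }
            }
        }
    }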

Additionally, when finishing transactions that were instrumented using OpenTelemetry, we now surgically remove any spans that were filtered: although those spans were removed from the _map, they still exist in SentryTransaction.Spans.

Testing Manually

To test this manually, this line in our sample project can be changed to:

            .AddHttpClientInstrumentation(o => o.FilterHttpRequestMessage = _ => false)

That basically filters every outgoing HTTP request. You can then see in the resulting traces that these have magically disappeared (there's a suspicious gap in the timings, but there's no trace event for these requests).
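
For context, the filter hangs off the OpenTelemetry HTTP client instrumentation options; the wiring looks roughly like this (an assumed minimal setup, not the sample project's exact code):

    using OpenTelemetry;
    using OpenTelemetry.Trace;

    // Assumed minimal setup (not the sample's exact code): returning false from
    // FilterHttpRequestMessage drops every outgoing HTTP request span, which is what
    // exercises the leak this PR fixes.
    using var tracerProvider = Sdk.CreateTracerProviderBuilder()
        .AddHttpClientInstrumentation(o => o.FilterHttpRequestMessage = _ => false)
        .AddSentry() // registers Sentry's OpenTelemetry span processor
        .Build();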

Before this PR, this was not the case... we were seeing:
a) A bunch of unfinished spans relating to outgoing HTTP requests that were filtered by OTEL
b) An accumulation of those spans in our SentrySpanProcessor, resulting in ever-increasing memory consumption


private bool NeedsPruning()
{
    lock (_pruningLock)
A Member commented on this code:

This reads as potential contention code and since we're just doing some math with the date, is there a way we can do something atomic here instead with Interlocked?

A Member added:

suggestion: unix timestamp as seconds and Interlocked.CompareExchange
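
A sketch of what that suggestion could look like (the class, field, and interval names here are illustrative, not the SDK's actual members):

    using System;
    using System.Threading;

    // Illustrative only: a unix-seconds timestamp guarded with Interlocked.CompareExchange
    // instead of a lock. The names below are made up for the sketch.
    internal class PruneScheduler
    {
        private const long PruneIntervalSeconds = 10;
        private long _lastPruneUnixSeconds;

        public bool NeedsPruning()
        {
            var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
            var last = Interlocked.Read(ref _lastPruneUnixSeconds);
            if (now - last < PruneIntervalSeconds)
            {
                return false; // pruned recently enough
            }

            // Only the thread that wins the exchange prunes; everyone else sees the
            // updated timestamp and backs off.
            return Interlocked.CompareExchange(ref _lastPruneUnixSeconds, now, last) == last;
        }
    }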

@jamescrosswell (Collaborator, Author) replied:

I've replaced lock with a rough equivalent using Interlocked... In the benchmarks I ran, I wasn't able to detect a significant performance difference between the two, but that might have been a shortcoming in the benchmark itself. I'll have another crack at it, as intuitively the two should behave quite differently.

@jamescrosswell (Collaborator, Author) commented on Mar 7, 2024:

OK, I put together some new benchmarks... interestingly, lock actually comes out a bit faster than either Interlocked or SemaphoreSlim. The difference isn't wild, and I'm not sure how much I trust these results - I'm worried there's some error in the design of the benchmark tests themselves.

The benchmark code is here, in any case.
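
For reference, the shape of such a comparison with BenchmarkDotNet might be something like the following (a sketch, not the code in the linked repository):

    using System.Threading;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    // Sketch of the kind of comparison being discussed: the cost of a trivial shared-counter
    // update guarded by lock vs Interlocked vs SemaphoreSlim.
    [MemoryDiagnoser]
    public class PruneCheckBenchmarks
    {
        private readonly object _lock = new();
        private readonly SemaphoreSlim _semaphore = new(1, 1);
        private long _counter;

        [Benchmark(Baseline = true)]
        public bool WithLock()
        {
            lock (_lock)
            {
                return ++_counter % 100 == 0;
            }
        }

        [Benchmark]
        public bool WithInterlocked()
            => Interlocked.Increment(ref _counter) % 100 == 0;

        [Benchmark]
        public bool WithSemaphoreSlim()
        {
            _semaphore.Wait();
            try
            {
                return ++_counter % 100 == 0;
            }
            finally
            {
                _semaphore.Release();
            }
        }

        public static void Main() => BenchmarkRunner.Run<PruneCheckBenchmarks>();
    }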

A Member replied:

> interestingly, lock actually comes out a bit faster than either Interlocked or SemaphoreSlim.

I really don't trust the benchmark then :)

A Member commented:

The link 404s for me.

@jamescrosswell (Collaborator, Author) replied:

Whoops... that repo was private. Just made it public.

@jamescrosswell marked this pull request as ready for review on March 7, 2024, 06:16
@bitsandfoxes (Contributor) left a comment:

Looks like we're not alone in this: getsentry/sentry-python#2722
It might be worth checking back on how it gets solved there.

@jamescrosswell (Collaborator, Author) replied:

> Looks like we're not alone in this: getsentry/sentry-python#2722
> It might be worth checking back on how it gets solved there.

Interesting... but this fix doesn't affect the public API, so I don't think we need to align with the Python SDK.

It would be good to review and merge this ASAP, as it's quite a serious issue for anyone using filtering with OTEL.

@bruno-garcia (Member) left a comment:

If it fixes the problem, let's go with it; possible lock contention is better than a certain OOM crash.

@bruno-garcia merged commit a448cc4 into main on Mar 8, 2024
33 checks passed
@bruno-garcia deleted the otel-filtered-spans branch on March 8, 2024, 03:13
Successfully merging this pull request may close these issues.

Http.HttpRequestOut OpenTelemetry spans not finished if filtered
3 participants