
Fix memory leak when OpenTelemetry spans get filtered #3198

Merged
merged 5 commits into main from otel-filtered-spans on Mar 8, 2024

Conversation

@jamescrosswell (Collaborator) commented on Mar 6, 2024

Fixes #3197

Solution

For the Sentry spans we hold in SentrySpanProcessor._map, we need a way to detect when the associated Activities have been filtered out. The OpenTelemetry SDK doesn't provide a mechanism for this, so I had to get a bit creative.

Previously we only held on to the Activity.Id for OpenTelemetry spans that we'd processed, and there's no way to look up an Activity by its Id.

With this PR we now retain weak references to the Activities themselves (not just the Activity.Id), and we periodically loop through each of the spans in our _map to check whether the associated Activities have been filtered out. When an Activity gets filtered, IsAllDataRequested is set to false and the ActivityTraceFlags.Recorded flag is removed (see code). When that happens, we know we're not going to receive any more OnStart/OnEnd events relating to the Activity, so we can remove it from our map.
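
As a rough illustration of that pruning check, here is a minimal sketch (the class and method names are simplified stand-ins; the real SentrySpanProcessor._map also holds the corresponding Sentry spans):

    using System.Collections.Concurrent;
    using System.Diagnostics;

    // Minimal sketch: track weak references to Activities and drop map entries whose
    // Activity has been filtered out (or garbage collected). Simplified stand-in for
    // SentrySpanProcessor._map, which also stores the corresponding Sentry spans.
    internal class FilteredActivityTracker
    {
        private readonly ConcurrentDictionary<string, WeakReference<Activity>> _map = new();

        public void Track(Activity activity)
        {
            if (activity.Id is { } id)
            {
                _map[id] = new WeakReference<Activity>(activity);
            }
        }

        public void PruneFilteredActivities()
        {
            foreach (var entry in _map)
            {
                // A filtered Activity has IsAllDataRequested == false and the Recorded trace
                // flag removed, so OnEnd will never fire for it; a collected Activity is
                // equally dead. Either way the entry can be dropped.
                if (!entry.Value.TryGetTarget(out var activity) ||
                    (!activity.IsAllDataRequested &&
                     (activity.ActivityTraceFlags & ActivityTraceFlags.Recorded) == 0))
                {
                    _map.TryRemove(entry.Key, out _);
                }
            }
        }
    }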

Additionally, when finishing transactions that were instrumented using OpenTelemetry, we now surgically remove any spans that were filtered: although those spans were removed from the _map, they still exist in SentryTransaction.Spans.

Testing Manually

To test this manually, this line in our sample project can be changed to:

            .AddHttpClientInstrumentation(o => o.FilterHttpRequestMessage = _ => false)

That basically filters every outgoing HTTP request. You can then see in the resulting traces that these have magically disappeared (there's a suspicious gap in the timings, but there's no trace event for these requests).
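
For context, the filter hangs off the OpenTelemetry HTTP client instrumentation options; the wiring looks roughly like this (an assumed minimal setup, not the sample project's exact code):

    using OpenTelemetry;
    using OpenTelemetry.Trace;

    // Assumed minimal setup (not the sample's exact code): returning false from
    // FilterHttpRequestMessage drops every outgoing HTTP request span, which is what
    // exercises the leak this PR fixes.
    using var tracerProvider = Sdk.CreateTracerProviderBuilder()
        .AddHttpClientInstrumentation(o => o.FilterHttpRequestMessage = _ => false)
        .AddSentry() // registers Sentry's OpenTelemetry span processor
        .Build();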

Before this PR, this was not the case... we were seeing:
a) A bunch of unfinished spans relating to outgoing HTTP requests that were filtered by OTEL
b) An accumulation of those spans in our SentrySpanProcessor, resulting in ever-increasing memory consumption


private bool NeedsPruning()
{
    lock (_pruningLock)
A Member commented on this code:

This reads as potential contention code and since we're just doing some math with the date, is there a way we can do something atomic here instead with Interlocked?

A Member added:

suggestion: unix timestamp as seconds and Interlocked.CompareExchange
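
A sketch of what that suggestion could look like (the class, field, and interval names here are illustrative, not the SDK's actual members):

    using System;
    using System.Threading;

    // Illustrative only: a unix-seconds timestamp guarded with Interlocked.CompareExchange
    // instead of a lock. The names below are made up for the sketch.
    internal class PruneScheduler
    {
        private const long PruneIntervalSeconds = 10;
        private long _lastPruneUnixSeconds;

        public bool NeedsPruning()
        {
            var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
            var last = Interlocked.Read(ref _lastPruneUnixSeconds);
            if (now - last < PruneIntervalSeconds)
            {
                return false; // pruned recently enough
            }

            // Only the thread that wins the exchange prunes; everyone else sees the
            // updated timestamp and backs off.
            return Interlocked.CompareExchange(ref _lastPruneUnixSeconds, now, last) == last;
        }
    }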

@jamescrosswell (Collaborator, Author) replied:

I've replaced lock with a rough equivalent using Interlocked... In the benchmarks I ran, I wasn't able to detect a significant performance difference between the two, but that might have been a shortcoming in the benchmark itself. I'll have another crack at it, as intuitively the two should behave quite differently.

@jamescrosswell (Collaborator, Author) commented on Mar 7, 2024:

OK, I put together some new benchmarks... interestingly, lock actually comes out a bit faster than either Interlocked or SemaphoreSlim. The difference isn't wild, and I'm not sure how much I trust these results - I'm worried there's some error in the design of the benchmark tests themselves.

The benchmark code is here, in any case.
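
For reference, the shape of such a comparison with BenchmarkDotNet might be something like the following (a sketch, not the code in the linked repository):

    using System.Threading;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    // Sketch of the kind of comparison being discussed: the cost of a trivial shared-counter
    // update guarded by lock vs Interlocked vs SemaphoreSlim.
    [MemoryDiagnoser]
    public class PruneCheckBenchmarks
    {
        private readonly object _lock = new();
        private readonly SemaphoreSlim _semaphore = new(1, 1);
        private long _counter;

        [Benchmark(Baseline = true)]
        public bool WithLock()
        {
            lock (_lock)
            {
                return ++_counter % 100 == 0;
            }
        }

        [Benchmark]
        public bool WithInterlocked()
            => Interlocked.Increment(ref _counter) % 100 == 0;

        [Benchmark]
        public bool WithSemaphoreSlim()
        {
            _semaphore.Wait();
            try
            {
                return ++_counter % 100 == 0;
            }
            finally
            {
                _semaphore.Release();
            }
        }

        public static void Main() => BenchmarkRunner.Run<PruneCheckBenchmarks>();
    }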

A Member replied:

> interestingly, lock actually comes out a bit faster than either Interlocked or SemaphoreSlim.

I really don't trust the benchmark then :)

A Member commented:

The link 404s for me.

@jamescrosswell (Collaborator, Author) replied:

Whoops... that repo was private. Just made it public.

@jamescrosswell marked this pull request as ready for review on March 7, 2024, 06:16
@bitsandfoxes (Contributor) left a comment:

Looks like we're not alone in this: getsentry/sentry-python#2722
It might be worth checking back on how it gets solved there.

@jamescrosswell (Collaborator, Author) replied:

> Looks like we're not alone in this: getsentry/sentry-python#2722
> It might be worth checking back on how it gets solved there.

Interesting... but this fix doesn't affect the public API, so I don't think we need to align with the Python SDK.

It would be good to review and merge this ASAP, as it's quite a serious issue for anyone using filtering with OTEL.

@bruno-garcia (Member) left a comment:

If it fixes the problem, let's go with it; possible lock contention is better than a certain OOM crash.

@bruno-garcia merged commit a448cc4 into main on Mar 8, 2024
33 checks passed
@bruno-garcia deleted the otel-filtered-spans branch on March 8, 2024, 03:13
Successfully merging this pull request may close these issues.

Http.HttpRequestOut OpenTelemetry spans not finished if filtered
3 participants