Memory Leak in MessageBrokerSegmentData / Segment #2986
I've opened a case under our account. Cheers,
The stack is:-
Cheers,
Dug into a few random ones using
Looks like we have two issues:
For the first issue, I noticed a few things about the newrelic.config file:
I think the issue with the distributed tracing element is that it is duplicated. For the errorCollector element, I'd suggest grabbing the default config and pasting it into your config file, then just setting enabled to false, e.g.:
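Roughly like this (the ignoreClasses/ignoreStatusCodes children below mirror the defaults):

  <errorCollector enabled="false">
    <ignoreClasses>
      <errorClass>System.IO.FileNotFoundException</errorClass>
      <errorClass>System.Threading.ThreadAbortException</errorClass>
    </ignoreClasses>
    <ignoreStatusCodes>
      <code>401</code>
      <code>404</code>
    </ignoreStatusCodes>
  </errorCollector>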
Since this config file is failing validation, the disabling of the agent might not be working as expected. At any rate, it looks like you've figured out that disabling CLR profiling entirely works around the problem.

Regarding the second issue, about memory growth/hoarding in message broker segments associated with AWS SQS calls:
Thanks for flagging that, that is now all fixed:-
<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc. All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="false">
<appSettings>
<add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
</appSettings>
<service licenseKey="REDACTED"/>
<application/>
<log level="off" enabled="false"/>
<allowAllHeaders enabled="true"/>
<attributes enabled="true">
<exclude>request.headers.cookie</exclude>
<exclude>request.headers.authorization</exclude>
<exclude>request.headers.proxy-authorization</exclude>
<exclude>request.headers.x-*</exclude>
<include>request.headers.*</include>
</attributes>
<transactionTracer enabled="false"/>
<distributedTracing enabled="false"/>
<errorCollector enabled="false">
<ignoreClasses>
<errorClass>System.IO.FileNotFoundException</errorClass>
<errorClass>System.Threading.ThreadAbortException</errorClass>
</ignoreClasses>
<ignoreStatusCodes>
<code>401</code>
<code>404</code>
</ignoreStatusCodes>
</errorCollector>
<browserMonitoring autoInstrument="false"/>
<threadProfiling>
<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
</threadProfiling>
<applicationLogging enabled="false"/>
<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
<slowSql enabled="false"/>
<codeLevelMetrics enabled="false"/>
</configuration>
And it passes:-
[user@INDY-PC .NET Agent]$ xmllint --schema ./newrelic.xsd ./newrelic.config --noout
./newrelic.config validates
I wonder if this is because we left the appsettings as:-
{
"Logging": {
"LogLevel": {
"Default": "Information",
"Log4Net": "Debug",
"Microsoft": "Warning",
"Microsoft.Hosting.Lifetime": "Information"
}
},
"AllowedHosts": "*",
"FavouriteColour": "Green",
"NewRelic.AppName": "DEMO_Propono",
"NewRelic.AgentEnabled": "true"
}
Which according to the documentation:-
If that is true, why is the rest of the agent (seemingly) disabled, but only that segment stuff is active? I shall flick the NewRelic.AgentEnabled appsetting to false.
Happy to do this if toggling the appsetting doesn't do the trick. Cheers,
I think that's because the global agentEnabled="false" in newrelic.config is being overridden by the application-level NewRelic.AgentEnabled appsetting, which is still set to "true".
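If that's the case, flipping just that appsetting should be enough to fully disable the agent for that app, e.g. (only the relevant keys from the appsettings shown above):

  {
    "NewRelic.AppName": "DEMO_Propono",
    "NewRelic.AgentEnabled": "false"
  }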
Your apps will need to be restarted for any of those three changes to take effect.

Our team spent some time digging into one of the dump files you provided (thanks for those). We focused on the MessageBrokerSegmentData instances that were accumulating.
Here are screenshots showing some of the above:

Overall, this indicates some kind of a bug in our agent regarding how we are handling SQS operations within an ASP.NET Core web context for your particular scenario. While I understand that your primary concern is getting rid of the memory growth in your demo environment by following one of the three options discussed above, we would appreciate it if you could give us a redacted code sample showing how your application is interacting with SQS within a web endpoint, so that we can fix this bug. Is it creating background tasks by any chance?
Based on the information in the dump file I suspect that the transactions are started by some sort of web service request. These web service requests are in turn starting up some long-running background jobs. What I think is happening is the following:
1. A web service request starts a transaction in the agent.
2. That request starts a long-running background job, and the transaction context is captured along with the background work.
3. The SQS calls made by that background job keep adding message broker segments to the still-open transaction, which never ends, so the segments accumulate and memory grows.
The code samples would help us with confirming this theory, as well as help us with determining why the SQS segments are not ending. It would also be good to know if there is any custom instrumentation being used (such as custom XML instrumentation files or calls to the agent API).

When background work is started, our agent is unable to determine automatically whether the work being started is meant to be a long-running job, or whether a transaction should or should not flow to that background job (this decision typically requires application-specific knowledge). Some APIs for starting background jobs, such as Task.Run, capture the current execution context, which includes the agent's async-local transaction state, so the transaction created for the web request flows into the background work.
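As a rough illustration of that capture behaviour (simplified, using a plain AsyncLocal<string> as a stand-in for the agent's internal transaction storage, not a real agent type):

using System;
using System.Threading;
using System.Threading.Tasks;

class AsyncLocalCaptureDemo
{
    // Stand-in for the agent's ambient transaction storage.
    private static readonly AsyncLocal<string> CurrentTransaction = new AsyncLocal<string>();

    static async Task Main()
    {
        // Imagine this is set when the web request's transaction starts.
        CurrentTransaction.Value = "transaction-for-web-request";

        // Task.Run captures the current ExecutionContext, including async-locals,
        // so the "transaction" silently flows into the background work.
        var background = Task.Run(async () =>
        {
            Console.WriteLine($"Background task sees: {CurrentTransaction.Value ?? "<none>"}");
            await Task.Delay(100);
        });

        await background;
    }
}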
Happy to share the code we've written around SNS/SQS; it might take a while to get it into a shareable state. My guess is this is more to do with the SQS side of receiving/listening for messages using long polling.
This has indeed stopped the memory leak entirely.
That sounds about right. The app in question, "propono", has many queues; once the app is started it listens for messages from SQS (e.g. ReceiveMessageAsync) in a long-lived background task (each task lives for the lifetime of the application), and there is a listening task per queue. We don't do anything fancy; IIRC we do something like this:-
_runningTasks[queueUrl] = Task.Run(() => ListenForMessages(queueUrl));
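Roughly, the listener loop looks something like this (a simplified sketch, not the exact propono code; message handling details omitted):

using System.Threading;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

class QueueListener
{
    private readonly IAmazonSQS _sqs = new AmazonSQSClient();

    // One long-lived task per queue; runs for the lifetime of the application.
    public async Task ListenForMessages(string queueUrl, CancellationToken token = default)
    {
        while (!token.IsCancellationRequested)
        {
            // Long polling: wait up to 20 seconds for messages to arrive.
            var response = await _sqs.ReceiveMessageAsync(new ReceiveMessageRequest
            {
                QueueUrl = queueUrl,
                WaitTimeSeconds = 20,
                MaxNumberOfMessages = 10
            }, token);

            foreach (var message in response.Messages)
            {
                // Handle the message, then delete it from the queue.
                await _sqs.DeleteMessageAsync(queueUrl, message.ReceiptHandle, token);
            }
        }
    }
}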
No custom instrumentation, no custom XML instrumentation files either. Cheers,
Using the Task.Run API this way means the ambient transaction context is captured along with the work. This problem is not unique to New Relic: the same thing happens with OpenTelemetry, where the current Activity (System.Diagnostics.Activity, from the System.Diagnostics.DiagnosticSource package), which represents the current span, is stored in async-local storage and is likewise captured by the Task.Run API. Any new activities/spans created and started in that background task then have the captured span/activity as their parent, which may not be desired.

The main difference is that the New Relic data model is based on Transactions, which are meant to summarize everything that happens during the life of a transaction. As a result, long-running transactions do not work well in this data model, because you can't see what's going on in that long-running job until the entire transaction is done. That's where custom instrumentation is required in order to get more meaningful transactions that are shorter in duration and can be used to help you understand what's going on in your application.
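To make that concrete, here is a tiny stand-alone example (nothing New Relic specific, just System.Diagnostics) showing a span started inside Task.Run being parented to the captured Activity:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class ActivityCaptureDemo
{
    private static readonly ActivitySource Source = new ActivitySource("Demo");

    static async Task Main()
    {
        // A listener is required so StartActivity returns non-null activities.
        ActivitySource.AddActivityListener(new ActivityListener
        {
            ShouldListenTo = _ => true,
            Sample = (ref ActivityCreationOptions<ActivityContext> _) => ActivitySamplingResult.AllData
        });

        using var requestActivity = Source.StartActivity("IncomingRequest");

        // Activity.Current is async-local, so Task.Run captures it...
        await Task.Run(() =>
        {
            // ...and any span started here is parented to the request's span.
            using var backgroundActivity = Source.StartActivity("BackgroundWork");
            Console.WriteLine($"Parent of background span: {backgroundActivity?.ParentId}");
            Console.WriteLine($"Request span id:           {requestActivity?.Id}");
        });
    }
}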
Hi, I've attached our internal code responsible for managing our interaction with SNS/SQS to the case under our account. Cheers,
Based on your description of the code as well as the code samples that you provided, I think that my suspicion about what is happening is correct. That is, the transaction created to track the work in the async request is being captured by the call to Task.Run in the SNS/SQS code, and as a result the transaction is being kept alive indefinitely. There are a few ways to work around this problem, and they each have their own set of tradeoffs.
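For what it's worth, one general-purpose .NET technique for this kind of situation (not specific to the New Relic agent, and only appropriate when the background listener doesn't need the ambient context) is to suppress ExecutionContext flow when starting the long-lived task, so async-local state such as the current transaction or Activity isn't captured:

using System.Threading;
using System.Threading.Tasks;

class ListenerStartup
{
    private Task? _listenerTask;

    public void StartListening(string queueUrl)
    {
        // SuppressFlow prevents the current ExecutionContext (and its async-locals,
        // e.g. the ambient transaction/Activity) from flowing into the new task.
        using (ExecutionContext.SuppressFlow())
        {
            _listenerTask = Task.Run(() => ListenForMessages(queueUrl));
        }
    }

    private async Task ListenForMessages(string queueUrl)
    {
        // The long-polling receive loop would go here (see the earlier sketch).
        await Task.Delay(Timeout.Infinite);
    }
}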
Unfortunately there isn't an automatic way for the agent to detect this type of scenario. Eventually the agent should detect that too many segments have been added to a transaction and forcibly end it, but that doesn't solve the underlying problem.
Currently we have the New Relic dotnet agent installed on a DEMO environment but it is disabled (see this thread for more context).
We recently reviewed our memory usage across our DEMO environment and spotted this leak:-
This sawtooth pattern coincides with our deployment of NR to our DEMO environment.
The odd thing is that the agent is currently disabled on DEMO with this config:-
We've taken a few dumps over the weekend (when no deploys occur) and the process isn't recycled:-
To test out the theory that it was indeed the dotnet NR agent, we set:
on one of the four boxes (DEMOIIS102), and you can see from the clear drop when that change was made.
We've collected a few dumps:-
I'll get these uploaded to the NR support portal shortly.
I did take a quick look using dotnet-dump's dumpheap -stat:-
Cheers,
Indy