Memory Leak in MessageBrokerSegmentData / Segment #2986

Open
indy-singh opened this issue Feb 10, 2025 · 13 comments
Labels
bug Something isn't working community To tag external issues and PRs

Comments

@indy-singh

Currently we have the New Relic dotnet agent installed on a DEMO environment but it is disabled (see this thread for more context).

We recently reviewed our memory usage across our DEMO environment and spotted this leak:-

Image

This sawtooth pattern coincides with our deployment of NR to our DEMO environment.

The odd thing is that the agent is currently disabled on DEMO with this config:-

<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc.  All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="false">
  <appSettings>
    <add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
  </appSettings>
  <service licenseKey="REDACTED"/>
  <application/>
  <log level="off" enabled="false"/>
  <allowAllHeaders enabled="true"/>
  <attributes enabled="true">
    <exclude>request.headers.cookie</exclude>
    <exclude>request.headers.authorization</exclude>
    <exclude>request.headers.proxy-authorization</exclude>
    <exclude>request.headers.x-*</exclude>
    <include>request.headers.*</include>
  </attributes>
  <transactionTracer enabled="false"/>
  <distributedTracing enabled="false"/>
  <errorCollector enabled="false"/>
  <browserMonitoring autoInstrument="false"/>
  <threadProfiling>
    <ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
    <ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
  </threadProfiling>
  <applicationLogging enabled="false"/>
  <utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
  <slowSql enabled="false"/>
  <distributedTracing enabled="false"/>
  <codeLevelMetrics enabled="false"/>
</configuration>

We've taken a few dumps over the weekend (when no deploys occur) and the process isn't recycled:-

Image

To test the theory that it was indeed the dotnet NR agent, we set:

CORECLR_ENABLE_PROFILING=0

on one of the four boxes (DEMOIIS102), and you can see from the clear drop when that change was made.

We've collected a few dumps:-

2025-02-07T23.48.33.7081789+00.00_DEMOIIS102_propono.blue.zip
2025-02-08T21.43.06.3628835+00.00_DEMOIIS102_propono.blue.zip
2025-02-09T22.42.44.0685293+00.00_DEMOIIS102_propono.blue.zip
2025-02-10T07.27.43.1452122+00.00_DEMOIIS102_propono.blue.zip

I'll get these uploaded to the NR support portal shortly.

I did take a quick look in dotnet-dump using dumpheap -stat:-

7ffb3754e6a8 113,394 10,885,824 NewRelic.Agent.Core.Segments.MessageBrokerSegmentData
7ffb370c6968 115,358 17,534,416 NewRelic.Agent.Core.Segments.Segment

Cheers,
Indy

@indy-singh indy-singh added the bug Something isn't working label Feb 10, 2025

@github-actions github-actions bot added the community To tag external issues and PRs label Feb 10, 2025
@indy-singh
Author

I've opened a case under account 3110122 with case reference #00248564. There isn't much on the ticket except a link to this thread and zipped memory dumps.

Cheers,
Indy

@indy-singh
Author

The stack is:-

  • on prem
  • dotnet 8 apps hosted in IIS
  • windows server 2022
  • nr infra agent 1.57.1 (disabled)
  • nr dotnet agent 10.30.0.0 (disabled)

Cheers,
Indy

@indy-singh
Author

Dug into a few random ones using dotnet-dump and they all appear to be AWS/SQS related:-

> dumpheap -stat
7ffb3754e6a8   674,422  64,744,512 NewRelic.Agent.Core.Segments.MessageBrokerSegmentData
7ffb370c6968   678,318 103,104,336 NewRelic.Agent.Core.Segments.Segment
> dumpheap -mt 7ffb3754e6a8
...
    022960e56b88     7ffb3754e6a8             96
    022960e56e20     7ffb3754e6a8             96
    022960e57050     7ffb3754e6a8             96
    022960e57278     7ffb3754e6a8             96
    022960e58268     7ffb3754e6a8             96
    022960e58498     7ffb3754e6a8             96
...
> do 022960e56b88
Name:        NewRelic.Agent.Core.Segments.MessageBrokerSegmentData
MethodTable: 00007ffb3754e6a8
EEClass:     00007ffb37553080
Tracked Type: false
Size:        96(0x60) bytes
File:        C:\Program Files\New Relic\.NET Agent\netcore\NewRelic.Agent.Core.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffb370c6188  40003ec        8 ...ISegmentDataState  0 instance 0000022960e56ad8 _segmentState
00007ffb3125cf88  4000412       10        System.String  0 instance 00000269ecad2180 <Vendor>k__BackingField
00007ffb3125cf88  4000413       18        System.String  0 instance 0000022960e569d8 <Destination>k__BackingField
00007ffb3754e308  4000414       48         System.Int32  1 instance                0 <DestinationType>k__BackingField
00007ffb3754e288  4000415       4c         System.Int32  1 instance                1 <Action>k__BackingField
00007ffb3125cf88  4000416       20        System.String  0 instance 00000269ecb7e4d0 <MessagingSystemName>k__BackingField
00007ffb3125cf88  4000417       28        System.String  0 instance 0000022960e569a8 <CloudAccountId>k__BackingField
00007ffb3125cf88  4000418       30        System.String  0 instance 0000022960e56a88 <CloudRegion>k__BackingField
00007ffb3125cf88  4000419       38        System.String  0 instance 0000000000000000 <ServerAddress>k__BackingField
00007ffb31333298  400041a       50 ...Private.CoreLib]]  1 instance 0000022960e56bd8 <ServerPort>k__BackingField
00007ffb3125cf88  400041b       40        System.String  0 instance 0000000000000000 <RoutingKey>k__BackingField
> do 00000269ecad2180
Name:        System.String
MethodTable: 00007ffb3125cf88
EEClass:     00007ffb3123a2c0
Tracked Type: false
Size:        28(0x1c) bytes
File:        C:\Program Files\dotnet\shared\Microsoft.NETCore.App\8.0.6\System.Private.CoreLib.dll
String:      SQS
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffb311e0928  400033b        8         System.Int32  1 instance                3 _stringLength
00007ffb311eab40  400033c        c          System.Char  1 instance               53 _firstChar
00007ffb3125cf88  400033a       c8        System.String  0   static 00000269ec9c0008 Empty

@nr-ahemsath
Member

nr-ahemsath commented Feb 10, 2025

I've opened a case under account 3110122 with case reference #00248564. There isn't much on the ticket except a link to this thread and zipped memory dumps.

Cheers, Indy

Since we have an internal support case for this, we'll work the problem there and update this issue with our findings once we have them. Never mind: there's more info here, and it's probably better that we interact directly.

@nr-ahemsath
Member

Looks like we have two issues:

  1. Why is there any agent activity when the agent is supposed to be disabled?
  2. Why is memory usage growing associated with message broker/queue data segments?

For the first issue, I noticed a few things about the newrelic.config file you shared. I ran it through XML validation against our XSD file and it came up with this:

PS C:\workspace> xmllint --schema .\newrelic.xsd .\newrelic.indysingh.config --noout
Element '{urn:newrelic-config}errorCollector': Missing child element(s). Expected is one of ( {urn:newrelic-config}ignoreErrors, {urn:newrelic-config}ignoreClasses, {urn:newrelic-config}ignoreMessages, {urn:newrelic-config}ignoreStatusCodes ).
Element '{urn:newrelic-config}distributedTracing': This element is not expected.
.\newrelic.indysingh.config fails to validate

I think the issue with the distributed tracing element is that it is duplicated. For the errorCollector element, I'd suggest grabbing the default config and pasting it into your config file, then just setting enabled to false, e.g.:

	<errorCollector enabled="false">
		<ignoreClasses>
			<errorClass>System.IO.FileNotFoundException</errorClass>
			<errorClass>System.Threading.ThreadAbortException</errorClass>
		</ignoreClasses>
		<ignoreStatusCodes>
			<code>401</code>
			<code>404</code>
		</ignoreStatusCodes>
	</errorCollector>

Since this config file is failing validation, the disabling of the agent might not be working as expected. At any rate, it looks like you've figured out that disabling CLR profiling entirely (CORECLR_ENABLE_PROFILING=0) is the most reliable way to prevent the New Relic agent from having any effect on your applications.

Regarding the second issue about memory growth/hoarding in message broker segments associated with AWS SQS calls:

  1. We'd like to see FINEST-level logs from your application (obviously the agent would have to actually be enabled) covering a period of time when SQS-related activity was going on. These should be shared through the support portal.
  2. Some redacted code samples or an example app showing how your application is using SQS so we can try to reproduce the memory growth issue locally would be the most helpful thing possible.

@indy-singh
Author

For the first issue, I noticed a few things about the newrelic.config file you shared and I ran it through XML validation against our XSD file and it came up with this:
...
I think the issue with the distributed tracing element is that it is duplicated. For the errorCollector element, I'd suggest grabbing the default config and pasting it into your config file, then just setting enabled to false, e.g.:

Thanks for flagging that, that is now all fixed:-

<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc.  All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="false">
  <appSettings>
    <add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
  </appSettings>
  <service licenseKey="REDACTED"/>
  <application/>
  <log level="off" enabled="false"/>
  <allowAllHeaders enabled="true"/>
  <attributes enabled="true">
    <exclude>request.headers.cookie</exclude>
    <exclude>request.headers.authorization</exclude>
    <exclude>request.headers.proxy-authorization</exclude>
    <exclude>request.headers.x-*</exclude>
    <include>request.headers.*</include>
  </attributes>
  <transactionTracer enabled="false"/>
  <distributedTracing enabled="false"/>
  <errorCollector enabled="false">
    <ignoreClasses>
      <errorClass>System.IO.FileNotFoundException</errorClass>
      <errorClass>System.Threading.ThreadAbortException</errorClass>
    </ignoreClasses>
    <ignoreStatusCodes>
      <code>401</code>
      <code>404</code>
    </ignoreStatusCodes>
  </errorCollector>
  <browserMonitoring autoInstrument="false"/>
  <threadProfiling>
    <ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
    <ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
  </threadProfiling>
  <applicationLogging enabled="false"/>
  <utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
  <slowSql enabled="false"/>
  <codeLevelMetrics enabled="false"/>
</configuration>

And it passes xmllint too:-

[user@INDY-PC .NET Agent]$ xmllint --schema ./newrelic.xsd ./newrelic.config --noout
./newrelic.config validates

I wonder if this is because we left the appsettings as:-

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Log4Net": "Debug",
      "Microsoft": "Warning",
      "Microsoft.Hosting.Lifetime": "Information"
    }
  },
  "AllowedHosts": "*",
  "FavouriteColour": "Green",
  "NewRelic.AppName": "DEMO_Propono",
  "NewRelic.AgentEnabled": "true"
}

Which according to the documentation:-

If the agent is disabled in the local or global newrelic.config:
The NewRelic.AgentEnabled settings in appsettings.json will enable the agent.

If that is true, why is the rest of the agent (seemingly) disabled, but only that segment stuff is active?

I shall flick the appsettings.json to false and see if the memory leak disappears.

Regarding the second issue about memory growth/hoarding in message broker segments associated with AWS SQS calls:

  1. We'd like to see FINEST-level logs from your application (obviously the agent would have to actually be enabled) covering a period of time when SQS-related activity was going on. These should be shared through the support portal.
  2. Some redacted code samples or an example app showing how your application is using SQS so we can try to reproduce the memory growth issue locally would be the most helpful thing possible.

Happy to do this if toggling the appsettings.json value doesn't fix it.

Cheers,
Indy

@nr-ahemsath
Member

If that is true, why is the rest of the agent (seemingly) disabled, but only that segment stuff is active?

I think that's because the global newrelic.config has individually disabled most of the data types that the agent would normally send: Transaction Traces, Errors, agent logs, application log data forwarding, distributed traces, etc. So it appears the agent isn't doing anything, but it is still enabled. However, the instrumentation code is still being loaded into your application by the profiler and doing what it does: creating transactions and adding segment data to them. I expect that any of the following will make your memory growth issue go away:

  1. As you said, flip "NewRelic.AgentEnabled": "true" in appsettings to false. This should disable the agent.
  2. Disable the AWS.SDK instrumentation specifically, by removing the NewRelic.Providers.Wrapper.AwsSdk.Instrumentation.xml file from the agent's extensions directory.
  3. As I said yesterday, disabling CLR profiling entirely is the surest way of making sure the agent has no effect on your applications, so set CORECLR_ENABLE_PROFILING=0 in the environment of your apps.

Your apps will need to be restarted for any of those three changes to take effect.

Our team spent some time digging into one of the dump files you provided (thanks for those). We focused on 2025-02-09T22.42.44.0685293+00.00_DEMOIIS102_propono.blue.dmp (not for any particular reason, just to pick one). What we could see is:

  1. The application has 5 Transaction objects in the managed heap.
  2. All five Transactions are web-type transactions, probably ASP.NET Core based on the transaction names.
  3. Two of the five transactions each hold a list of Segment objects, roughly 1.3 million and 1.1 million entries long respectively.
  4. All of those Segment objects have a MessageBrokerSegmentData object which, upon inspection, represents an SQS Consume operation.
  5. These transactions are very long-lived. Some Segments being held by the transaction have a RelativeStartTime (i.e. when the segment was created relative to when the transaction was created) greater than one day.
  6. These transactions also have a very high "UnitOfWorkCount", e.g. 359095. This is kind of a reverse semaphore we use per transaction to track the number of pending sub-tasks the agent is waiting on to finish before the transaction can be finished and harvested.

Here are screenshots showing some of the above:

Image
Image

Overall, this indicates some kind of a bug in our agent regarding how we are handling SQS operations within an ASP.NET Core web context for your particular scenario.

While I understand that your primary concern is getting rid of the memory growth in your demo environment by following one of the three options discussed above, we would appreciate it if you could give us a redacted code sample showing how your application is interacting with SQS within a web endpoint, so that we can fix this bug. Is it creating background tasks by any chance?

@nrcventura
Member

Based on the information in the dump file I suspect that the transactions are started by some sort of web service request. These web service requests are in turn starting up some long running background jobs. What I think is happening is the following:

  1. The New Relic agent starts a transaction for the web service request.
  2. The transaction is stored in async local storage so that it can flow with the async execution of the request.
  3. Background work is initiated using one of the typical apis to start background work.
  4. All async local data is captured by the background jobs including the transaction.
  5. Because there is a transaction available on the background jobs, the SQS instrumentation is adding segments to that transaction.
  6. The work on the background thread is incrementing the unit of work and not decrementing it, probably because the SQS segments are not ending.
  7. When the initial web service request that started the transaction ends, the transaction isn't ending because the transaction still has incomplete units of work to track (probably the SQS segments).

The code samples would help us with confirming this theory, as well as help us with determining why the SQS segments are not ending. It would also be good to know if there is any custom instrumentation being used (such as Transaction and Trace attributes, or custom xml instrumentation files).

When background work is started, our agent is unable to automatically determine whether the work being started is meant to be a long-running job, or whether a transaction should or should not flow to that background job (this decision typically requires application-specific knowledge). Some APIs for starting background jobs capture the current ExecutionContext (which async local storage is built upon), and some do not. Our agent assumes the most common scenario, where tasks are short-lived and most likely part of the transaction they were started from, so storing the transaction in async local storage makes the transaction available in that task. There are some scenarios where it is desirable to treat the async work as a separate transaction, and we allow custom instrumentation to be defined to opt into that behavior.
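
To make that capture behavior concrete, here is a minimal, self-contained sketch (not agent code; the AsyncLocal below just stands in for where the agent stores the transaction) showing that a value set before Task.Run flows into the background task:

using System;
using System.Threading;
using System.Threading.Tasks;

class AsyncLocalCaptureDemo
{
    // Stand-in for the async local storage the agent uses for the transaction.
    private static readonly AsyncLocal<string> CurrentTransaction = new AsyncLocal<string>();

    static async Task Main()
    {
        CurrentTransaction.Value = "web-request-transaction";

        // Task.Run captures the current ExecutionContext, so the async local
        // value (the "transaction") is visible inside the background task.
        await Task.Run(() => Console.WriteLine(CurrentTransaction.Value));
        // prints: web-request-transaction
    }
}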

@indy-singh
Author

Happy to share the code we've written around SNS/SQS; it might take a while to get it into a shareable state. My guess is this is more to do with the SQS side of receiving/listening for messages using long polling.

As you said, flip "NewRelic.AgentEnabled": "true" in appsettings to false. This should disable the agent.

This has indeed stopped the memory leak entirely.

These transactions are very long-lived. Some Segments being held by the transaction have a RelativeStartTime (i.e. when the segment was created relative to when the transaction was created) greater than one day.
These transactions also have a very high "UnitOfWorkCount", e.g. 359095. This is kind of a reverse semaphore we use per transaction to track the number of pending sub-tasks the agent is waiting on to finish before the transaction can be finished and harvested.
Is it creating background tasks by any chance?

That sounds about right. The app in question "propono" has many queues, and once the app is started it listens for messages from SQS (e.g. ReceiveMessageAsync) in a long-lived background task (each task lives for the lifetime of the application), and there is a listening task per queue.

We don't do anything fancy; from memory we do something like this:-

_runningTasks[queueUrl] = Task.Run(() => ListenForMessages(queueUrl));
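
The listener itself is roughly a long-polling loop along these lines (a simplified sketch with illustrative names, not our exact code):-

using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public class QueueListener
{
    private readonly IAmazonSQS _sqs = new AmazonSQSClient();

    // One of these loops runs per queue for the lifetime of the application.
    private async Task ListenForMessages(string queueUrl)
    {
        while (true)
        {
            var response = await _sqs.ReceiveMessageAsync(new ReceiveMessageRequest
            {
                QueueUrl = queueUrl,
                WaitTimeSeconds = 20,      // long polling
                MaxNumberOfMessages = 10
            });

            foreach (var message in response.Messages)
            {
                // ... dispatch the message to a handler ...
                await _sqs.DeleteMessageAsync(queueUrl, message.ReceiptHandle);
            }
        }
    }
}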

It would also be good to know if there is any custom instrumentation being used (such as Transaction and Trace attributes, or custom xml instrumentation files).

No custom instrumentation, no custom xml instrumentation files either.

Cheers,
Indy

@nrcventura
Member

Using the ExecutionContext.SuppressFlow API (or alternative background job APIs that do not capture the execution context), as described in dotnet/runtime#80000, is one way to prevent the transaction created by the New Relic agent from being applied to work done in those background jobs.
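
As a rough sketch of what that could look like in the listener startup code (illustrative names based on the snippet shared earlier, not a drop-in implementation):

using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class SqsListenerHost
{
    private readonly ConcurrentDictionary<string, Task> _runningTasks = new ConcurrentDictionary<string, Task>();

    public void StartListening(string queueUrl)
    {
        Task listenerTask;

        // Suppress ExecutionContext flow so the agent's async-local transaction
        // (and any other ambient async-local state) is not captured by the
        // long-running listener task.
        using (ExecutionContext.SuppressFlow())
        {
            listenerTask = Task.Run(() => ListenForMessages(queueUrl));
        }

        _runningTasks[queueUrl] = listenerTask;
    }

    private async Task ListenForMessages(string queueUrl)
    {
        // Long-polling consume loop as in the earlier snippet.
        await Task.CompletedTask;
    }
}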

This problem is not unique to New Relic; the same thing happens with OpenTelemetry, where the current Activity (System.Diagnostics.DiagnosticSource.Activity), which represents the current span, is stored in async local storage and is likewise captured by the Task.Run API, so any new activities/spans created and started in that background task have the captured span/activity as their parent, which may not be desired. The main difference is that the New Relic data model is based on Transactions, which are meant to summarize everything that happens during the life of a transaction. As a result, long-running transactions do not work well in this data model, because you can't see what's going on in that long-running job until the entire transaction is done. That's where custom instrumentation is required in order to get more meaningful, shorter-duration transactions that can help you understand what's going on in your application.

@indy-singh
Author

Hi,

I've attached our internal code responsible for managing our interaction with SNS/SQS under account 3110122 with case reference #00248564.

Cheers,
Indy

@nrcventura
Member

Based on your description of the code as well as the code samples that you provided, I think that my suspicion about what is happening is correct. That is, the transaction created to track the work in the async request is being captured by the call to Task.Run in the SNS/SQS code, and as a result the transaction is being kept alive indefinitely.

There are a few ways to work around this problem, and they each have their own set of tradeoffs.

  1. Disable SQS instrumentation. One way to do that is to use the newrelic.config file to disable the SQS instrumentation by following the example in feat: Allows instrumentation to be disabled from a newrelic.config file. #2250. The simplest method might be to just mark the AWSSDK.Core assembly as ignored. This will prevent the agent from instrumenting anything within the AWS SDK. It won't prevent the transaction from being captured, but it will prevent all of the SQS segments from being created and using up the app's memory.
  2. Update the SQS helper library that you shared to suppress the flow of the execution context when starting the consume task.
  3. Use custom instrumentation (xml file) and refactor the code passed to Task.Run to be a method on a class instead of a lambda, so that the custom instrumentation file can reference the assembly name, class name, and method name to call the DetachWrapper, which will remove the captured transaction from that long-running task (see the sketch after this list). As an example, this is how we use the DetachWrapper (in xml instrumentation) to handle this same problem in older versions of the Serilog PeriodicBatchingSink:

    <tracerFactory name="DetachWrapper">
      <match assemblyName="Serilog.Sinks.PeriodicBatching" className="Serilog.Sinks.PeriodicBatching.PeriodicBatchingSink">
        <exactMethodMatcher methodName="OnTick" />
      </match>
    </tracerFactory>
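
For option 3, the refactor could look roughly like this (illustrative names; the xml instrumentation would then match the assembly name, class name, and ListenForMessages method with the DetachWrapper, similar to the Serilog example above):

using System.Collections.Concurrent;
using System.Threading.Tasks;

namespace MyApp.Messaging
{
    public class SqsQueueListener
    {
        private readonly ConcurrentDictionary<string, Task> _runningTasks = new ConcurrentDictionary<string, Task>();

        public void Start(string queueUrl)
        {
            _runningTasks[queueUrl] = Task.Run(() => ListenForMessages(queueUrl));
        }

        // A named method on a named class gives the xml instrumentation a
        // concrete assemblyName/className/methodName to attach the
        // DetachWrapper to, which removes the captured transaction.
        public async Task ListenForMessages(string queueUrl)
        {
            // Long-polling consume loop as in the earlier snippets.
            await Task.CompletedTask;
        }
    }
}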

Unfortunately there isn't an automatic way for the agent to detect this type of scenario. Eventually the agent should detect that too many segments were added to a transaction and manually end it, but that doesn't solve the problem.
