The .Net Agent alongside usage of HttpClient is causing Unknown Socket Error in the .Net runtime #897

Closed
slidecraft104 opened this issue Jan 27, 2022 · 13 comments
Labels
bug Something isn't working community To tag external issues and PRs

Comments

@slidecraft104

slidecraft104 commented Jan 27, 2022

Description

We have a .Net 5/6 application that is experiencing an inordinate number of SocketExceptions. We use a Web API base template and implement background processing that interacts heavily with AWS resources. We are using New Relic functionality extensively throughout.

Removing any and all calls to the New Relic Agent, and all New Relic instrumentation, has alleviated the issue almost completely.

We have an issue filed directly with Microsoft that details the issue here.

Expected Behavior

The NR .Net Agent should work well with HttpClient and IHttpClientFactory, and should not contribute to or cause SocketExceptions deep in the .Net runtime.

Troubleshooting or NR Diag results

We have gone through our code intensively to make sure we are properly using the IHttpClientFactory, which uses pooling under the hood to correctly use HttpClient/HttpRequestMessage. It can be plugged into all of the AWS clients, so they use it properly as well. We have removed all (other) 3rd party libraries that we could determine do not properly use these patterns. The errors persist.

Since we use New Relic extensively, we refrained from removing it, until I found this. You are apparently using WebRequest, which has been deprecated and whose use is discouraged, as it is known to cause socket exhaustion and SocketExceptions.

Once we completely removed all calls to the agent, and any references to NR namespaces and projects, the problems ceased. Several hours of load testing have confirmed this.

Steps to Reproduce

I don't have a sample application to share with you, I can only suggest that:

  • a basic Web API project (dotnet new webapi) be created
  • some background processing implemented in IHostedServices that:
    interacts with external web resources (using GETs, PUTs, and POSTs to match our use case)
    properly uses either singleton HttpClient or the IHttpClientFactory
  • test both with and without usage of the NR agent integrated in there somewhere (preferably liberally)

The above should reproduce the error. Test targeting both .Net 5 and 6 and you should see the errors increase in frequency with .Net 6.

Your Environment

  • .Net 5/6
  • New Relic .Net agent v9.x (various versions tried)
  • Linux in containers (Ubuntu 20.04)
  • Kubernetes

Additional context

Again, reference this issue on the dotnet/runtime board for more detail. I can also answer questions directly here.

Please let me know what else I can do.

@slidecraft104 slidecraft104 added the bug Something isn't working label Jan 27, 2022
@angelatan2 angelatan2 added the community To tag external issues and PRs label Jan 28, 2022
@vuqtran88
Contributor

@sliderhouserules
Thanks for bringing this to our attention. The next step for us is trying to repro the issue based on the information you shared.

Regarding the usage of WebRequest, it's still unclear to me whether it is the real issue, because HttpClient is used under the hood in .Net Core. If we switched to HttpClient, I think this issue would still occur.

@JcolemanNR JcolemanNR self-assigned this May 5, 2022
@JcolemanNR
Contributor

Hello @sliderhouserules,

Sorry we still haven't gotten around to looking at this. This is still on our radar, and we will spend some time attempting to repro this in the next few weeks.

Can you provide an update on your current thinking surrounding this issue?

Do you still believe that New Relic is causing this issue to occur? I reviewed the issue you filed with Microsoft, and it sounds like they suspect a root cause related to File handle access?

@slidecraft104
Author

Yeah, I've been working with MS, but I don't think the root cause they discovered for the other person with a similar issue/symptoms is my issue. When I remove all New Relic calls from my application, the errors go away. When I put the calls back in, the SocketException occurs.

I will be able to get back to testing this and attempting to solve it soon, and would love your assistance. But for now, I am not able to do anything with this due to other higher priority items in my work queue.

If you get to it in the meantime, I can try to provide more information as you need it.

@angelatan2
Contributor

angelatan2 commented Dec 6, 2022

From @vuqtran88:
One approach to addressing this issue without a reproduction is to replace WebRequest with HttpClient and get feedback from customers:

The main drivers for this change are the reported issues #897 and #1202. There were no definitive conclusions about whether using WebRequest directly caused these issues, but my recommendation is that we replace WebRequest with HttpClient. The reasons are:
Microsoft has encouraged developers to move away from WebRequest and use HttpClient for new development. In fact, MSFT marked WebRequest as deprecated in .NET 6.0. Even though MSFT still keeps WebRequest for .NET FW development, I believe that is mainly for backward compatibility.

For development/testing purposes only: https://github.com/newrelic/newrelic-dotnet-agent/tree/use-httpclient-proto

@workato-integration

Jira CommentId: 124228
Commented by ahemsath:

[~angelatan]'s comment above is a copy-paste of my standup status from yesterday morning.

As of this morning, the only remaining known problem with using HttpClient instead of WebRequest for agent->collector communications is the issue with SSL certificates in the integration tests that use the "MockNewRelic" facility. As detailed above, I have some ideas on how to fix this that I plan to try today.

@workato-integration

Jira CommentId: 125225
Commented by ahemsath:

Update: I got the integration tests that use MockNewRelic to pass using the default development SSL certificate (see https://learn.microsoft.com/en-us/dotnet/core/tools/dotnet-dev-certs).

CI was updated to trust the dev cert before running the integration tests. After doing this I was able to get a completely green CI run: https://github.com/newrelic/newrelic-dotnet-agent/actions/runs/3698709592

Remaining tasks:
Run integration tests on Linux and make sure those pass.
Limited performance testing (just for sanity checking purposes).

@workato-integration

workato-integration bot commented Dec 15, 2022

Jira CommentId: 125261
Commented by ahemsath:

Linux integration testing results: Failed! - Failed: 8, Passed: 311, Skipped: 0, Total: 319, Duration: 1 h 4 m

The 8 failures are explained by:

  1. ThreadProfileNetCore 5/6 failing because I didn't take the time to figure out how to trust the dev SSL cert on Ubuntu (it's not as easy as a single command like it is on Windows)
  2. log4net logging tests that use log4net version 2.0.10 failing because of this bug: https://issues.apache.org/jira/browse/LOG4NET-652

I'm satisfied that the agent's use of HttpClient works on Linux as well as on Windows.

@workato-integration

Jira CommentId: 128176
Commented by ahemsath:

Updating status: there is a PR in review for this bug.  We still need to do performance testing before we can merge.  We have not been able to get this done due to the holiday and resource limitations.  We hope to get this done, merged and released soon.

@workato-integration

Jira CommentId: 129289
Commented by ahemsath:

Status update: same as last week, but I have dedicated time this week to work on getting this one across the finish line.  Performance testing and ad-hoc proxy testing will be necessary.

@workato-integration

Jira CommentId: 130647
Commented by ahemsath:

Proxy testing update:

  1. Overall the agent appears to be working fine when using the agent's explicit proxy configuration.  I have Squid running on my dev system, listening on the default port of 3128.  When I set these environment variables: NEW_RELIC_PROXY_HOST="localhost" and NEW_RELIC_PROXY_PORT="3128", the agent (built from the branch with the changes that replace WebRequest with HttpClient) is working fine with a simple .NET 6.0 test application.  By working fine, I mean that it connects to New Relic, there are no errors in the agent log file, and the data I'm expecting to see for the test app in question shows up in New Relic.
  2. I wanted to make really sure that all the traffic was going through the proxy (it's hard to convince yourself of this just by looking at the Squid access log, as it takes a minute or more for anything to be logged after the agent starts and connects to NR).  I decided to use Wireshark to make sure.  Since Wireshark doesn't have the ability to filter traffic by process ID or name, I needed to set up a proxy on another host.  I used the Windows performance testing NUC.  I installed Squid and Wireshark on that system, and changed my NEW_RELIC_PROXY_HOST setting to be the NUC's IP address.  By running Wireshark on both my development system and the NUC, I can see that a) no traffic is being sent from my system to any 162.0.0.0/8 IP address (which is where the NR collectors are) and b) traffic to 162.0.0.0/8 shows up on the NUC as soon as the agent starts on my development system.  From this I'm convinced the proxy setting is working as intended.
  3. I also wanted to make sure that the built-in Windows system proxy settings would also work as expected.  I ran the same test as in number 2 above, but instead of setting NEW_RELIC_PROXY_HOST/PORT, I set HTTP_PROXY and HTTPS_PROXY to point to the Squid instance on the NUC.  I got the same result, which is good: agent works as expected, no traffic to NR from my dev box where the agent is running, traffic seen to NR on the NUC where the proxy server is.

Based on these results, I'm concluding that proxy support is working as expected with the agent using HttpClient instead of WebRequest.
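
For anyone repeating these checks, the two configurations from items 1 and 3 can be sketched as shell helpers. This is only a sketch, assuming a POSIX shell: the function names and the 192.0.2.10 address are placeholders (the NUC's real address isn't given in this thread), and only the variable names and port 3128 come from the testing described above.

```shell
#!/bin/sh
# Hypothetical helpers; only the variable names and port 3128 come from the tests above.

# Items 1 and 2: the agent's explicit proxy configuration.
use_agent_proxy() {
  unset HTTP_PROXY HTTPS_PROXY
  export NEW_RELIC_PROXY_HOST="$1"
  export NEW_RELIC_PROXY_PORT="$2"
}

# Item 3: the generic proxy variables instead of the agent-specific ones.
use_env_proxy() {
  unset NEW_RELIC_PROXY_HOST NEW_RELIC_PROXY_PORT
  export HTTP_PROXY="http://$1:$2"
  export HTTPS_PROXY="http://$1:$2"
}

use_agent_proxy localhost 3128      # Squid on the development system
echo "agent proxy: $NEW_RELIC_PROXY_HOST:$NEW_RELIC_PROXY_PORT"

use_env_proxy 192.0.2.10 3128       # placeholder address for the remote Squid host
echo "env proxy: $HTTPS_PROXY"
```

Start the instrumented test application from the same shell after calling one of the helpers so that it inherits the variables.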

As a bonus, this testing shows pretty convincingly that the switch to HttpClient is doing one of the things that we want and expect it to: reducing the number of connections created/destroyed by the agent as it operates.  On the PR branch, the Squid proxy log for an agent running for several minutes (and therefore sending multiple data payloads to NR) looks like this:

04/Jan/2023:12:12:04 -0800  74917 ::1 TCP_TUNNEL/200 4153 CONNECT staging-collector.newrelic.com:443 - HIER_DIRECT/staging-collector.newrelic.com -
04/Jan/2023:12:17:08 -0800 377082 ::1 TCP_TUNNEL_ABORTED/200 66829 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -1

On the current main branch, it looks like this:

04/Jan/2023:12:00:19 -0800    975 ::1 TCP_TUNNEL/200 4153 CONNECT staging-collector.newrelic.com:443 - HIER_DIRECT/staging-collector.newrelic.com -
04/Jan/2023:12:00:21 -0800    982 ::1 TCP_TUNNEL/200 28641 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:22 -0800    667 ::1 TCP_TUNNEL/200 463 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:32 -0800    472 ::1 TCP_TUNNEL/200 387 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:33 -0800    429 ::1 TCP_TUNNEL/200 387 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:37 -0800    135 ::1 TCP_TUNNEL/200 387 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:38 -0800    142 ::1 TCP_TUNNEL/200 387 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:43 -0800    137 ::1 TCP_TUNNEL/200 387 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:43 -0800    139 ::1 TCP_TUNNEL/200 387 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:48 -0800    131 ::1 TCP_TUNNEL/200 387 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:48 -0800    138 ::1 TCP_TUNNEL/200 387 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:53 -0800    141 ::1 TCP_TUNNEL/200 387 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -
04/Jan/2023:12:00:53 -0800    146 ::1 TCP_TUNNEL/200 387 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -

etc.  
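
A quick way to compare the two logs is to count CONNECT tunnels per destination. The following is a minimal sketch, assuming the whitespace-separated layout shown above (method in field 7, target in field 8); the function name and the temp file are illustrative only, and the sample input is the two PR-branch lines quoted earlier.

```shell
#!/bin/sh
# Hedged sketch: count CONNECT tunnels per destination in a Squid access log.
# Assumes whitespace-separated fields with the method in field 7 and the target in field 8.
count_tunnels() {
  awk '$7 == "CONNECT" { n[$8]++ } END { for (h in n) print n[h], h }' "$1" | sort -k2
}

# Sample input: the two PR-branch log lines quoted above.
pr_log=$(mktemp)
cat > "$pr_log" <<'EOF'
04/Jan/2023:12:12:04 -0800  74917 ::1 TCP_TUNNEL/200 4153 CONNECT staging-collector.newrelic.com:443 - HIER_DIRECT/staging-collector.newrelic.com -
04/Jan/2023:12:17:08 -0800 377082 ::1 TCP_TUNNEL_ABORTED/200 66829 CONNECT collector-1.staging-collectors.newrelic.com:443 - HIER_DIRECT/collector-1.staging-collectors.newrelic.com -1
EOF

# Prints one "count target" line per destination.
count_tunnels "$pr_log"
rm -f "$pr_log"
```

Pointing the same function at a log captured from the main branch should show a much higher count for the collector host over a comparable time window, which is the difference visible in the excerpts above.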

@workato-integration

Jira CommentId: 130561
Commented by chynes:

Excellent work [~ahemsath]! Very thorough.

@workato-integration

Work has been completed on this issue.
