Using excessive file descriptors #798

Open
ResDiaryLewis opened this issue Dec 12, 2019 · 10 comments
Labels
  • agents:scraper - All issues related to the scraping agent
  • bug - Something isn't working
  • help-wanted - All issues where people can contribute to the project

Comments

@ResDiaryLewis

Promitor uses too many Linux file descriptors, mainly sockets. We're running Promitor on Kubernetes, and this has caused the Node to run out of file descriptors a few times, at which point other Pods crash with a "too many open files" error.

Expected Behavior

Promitor should re-use sockets or release them after use.

Actual Behavior

Promitor opens sockets and doesn't seem to release them in a timely fashion.

Steps to Reproduce the Problem

  1. Run a Promitor container, with a scrape schedule of once per minute.
  2. Attach to the container and count the open file descriptors regularly; for example, here is a container that has been running for 6 days:
    /app # lsof | wc -l
    376504
    /app # lsof | wc -l
    376845
  3. The number of open file descriptors grows over time, approaching the limit of 810243.
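
For completeness, the same count can also be watched from inside the process by enumerating /proc/self/fd on Linux. The snippet below is a minimal sketch of that, not part of Promitor, and assumes a Linux container with procfs available (the usual case):

using System;
using System.IO;
using System.Linq;

// Minimal sketch, not Promitor code: count this process's open file descriptors
// by listing /proc/self/fd, which is available in Linux containers.
public static class FileDescriptorProbe
{
    public static int CountOpenDescriptors()
    {
        // Each entry in /proc/self/fd is a symlink to an open file or socket.
        return Directory.EnumerateFileSystemEntries("/proc/self/fd").Count();
    }

    public static void Main()
    {
        Console.WriteLine($"Open file descriptors: {CountOpenDescriptors()}");
    }
}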

Configuration

Provide insights into the configuration that you are using:

  • Configured scraping schedule: */1 * * * *
Used scraping configuration

This is our staging configuration, which scrapes fewer resources but still encounters the same issue.

version: v1
azureMetadata:
  tenantId:
  subscriptionId:
  resourceGroupName:
metricDefaults:
  aggregation:
    interval: "00:05:00"
  scraping:
    schedule: "*/1 * * * *"
metrics:
- name: "azure_sql_cpu_percent_average"
  description: "'cpu_percent' with aggregation 'Average'"
  resourceType: "Generic"
  labels:
    component: "sql-database"
  azureMetricConfiguration:
    metricName: "cpu_percent"
    aggregation:
      type: "Average"
  resources:
  - resourceGroupName: "groupName"
    resourceUri: "Microsoft.Sql/servers/serverName/databases/dbName"
  # 7 more databases
# 11 more sql metrics
# 6 loadbalancer metrics
# 12 redis metrics
# 10 service bus metrics

Specifications

  • Version: 1.0.0 (image tag)
  • Platform: Docker (Linux)
  • Subsystem:
@ResDiaryLewis ResDiaryLewis added the bug Something isn't working label Dec 12, 2019
@tomkerkhove tomkerkhove added this to the v1.1.0 milestone Dec 12, 2019
@tomkerkhove
Owner

Hm, that's interesting. I'll have to dig into this as I'm not sure what is causing it - sorry for the inconvenience!

@ResDiaryLewis
Author

Cheers Tom! I've started to take a look today and I'll probably try to spend some time on it next week too, but you'll obviously be more familiar with the code.

@tomkerkhove
Owner

Yes, but unfortunately I don't have the bandwidth to fix this soon - sorry :/

Happy to go through a PR if you come up with something! Hopefully upgrading to .NET Core 3.1 fixes this, but we're not there yet (#718).

It would be good to identify the culprit, indeed.

@tomkerkhove
Owner

This might be due to HttpClient instances not being re-used: https://aspnetmonsters.com/2016/08/2016-08-27-httpclientwrong/

Have to take a closer look though.
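
For reference, the pattern that article warns about looks roughly like the sketch below (illustrative only, not Promitor's actual code): creating a new HttpClient per request leaves each connection's socket behind, which would match the descriptor growth reported above, whereas a shared instance keeps connections pooled.

using System.Net.Http;
using System.Threading.Tasks;

// Illustrative sketch only, not Promitor's code.
public static class HttpClientUsage
{
    // Anti-pattern: a new HttpClient per call. Dispose() returns quickly, but the
    // underlying sockets linger, so frequent scrapes steadily consume descriptors.
    public static async Task<string> FetchWithFreshClientAsync(string url)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(url);
        }
    }

    // Re-use: one shared instance (HttpClient is safe for concurrent requests),
    // so connections are pooled instead of being re-opened on every scrape.
    private static readonly HttpClient SharedClient = new HttpClient();

    public static Task<string> FetchWithSharedClientAsync(string url)
        => SharedClient.GetStringAsync(url);
}

On .NET Core 2.1+ the usual alternative is IHttpClientFactory (registered via services.AddHttpClient()), which pools and recycles the underlying handlers and also avoids the stale-DNS problem of a single long-lived client.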

@tomkerkhove
Owner

@ResDiaryLewis was the error due to an OS-level issue or did Promitor throw an exception? If so, do you still have the exception & stack trace?

@ResDiaryLewis
Author

ResDiaryLewis commented Jan 6, 2020

@tomkerkhove it was an error emitted by several other applications that happened to be scheduled onto the same Nodes as Promitor in our cluster.
Here's one example from a Prometheus Pod:

level=error ts=2019-12-13T13:56:21.008558682Z caller=runutil.go:43 msg="function failed. Retrying" err="trigger reload: reload request failed: Post http://127.0.0.1:9090/-/reload: dial tcp 127.0.0.1:9090: socket: too many open files in system"

As a stop-gap solution, we've now added a CronJob to our cluster that periodically kills Promitor.

EDIT:
Actually Promitor caught the same error during this incident too:

System.Net.Http.HttpRequestException: Too many open files in system ---> System.Net.Sockets.SocketException: Too many open files in system
   at System.Net.Sockets.Socket..ctor(AddressFamily addressFamily, SocketType socketType, ProtocolType protocolType)
   at System.Net.Sockets.DualSocketMultipleConnectAsync..ctor(SocketType socketType, ProtocolType protocolType)
   at System.Net.Sockets.Socket.ConnectAsync(SocketType socketType, ProtocolType protocolType, SocketAsyncEventArgs e)
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)

Some other Pods, including Tiller, nginx-ingress, cert-manager, and a TeamCity agent, were crashing at the same time. We determined that Promitor was the culprit using the aforementioned shell commands.

@tomkerkhove tomkerkhove modified the milestones: v1.1.0, v1.2.0, v1.3.0 Jan 6, 2020
@tomkerkhove
Owner

tomkerkhove commented Jan 16, 2020

Thanks for the additional information and sorry for that!

Will have a look, but since we don't use HttpClient directly it must be one of our dependencies, so all information/telemetry is welcome.

@tomkerkhove
Owner

Might have found something in #844 @ResDiaryLewis.

Looks like the Azure SDK creates a ton of HttpClients.
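
If a client (and the HttpClient underneath it) ends up being constructed on every scrape cycle, the effect is the same kind of leak as above. The sketch below shows the reuse idea only, with placeholder types rather than the real SDK surface or the actual fix in #844:

using System;
using System.Collections.Concurrent;

// Hypothetical sketch - the client type and factory are placeholders, not the
// real Azure SDK surface. The idea: build one metrics client per subscription
// and reuse it across scrapes instead of constructing a new client (and a new
// HttpClient underneath) every cycle.
public sealed class MetricsClientCache<TClient>
{
    private readonly Func<string, TClient> _createClient;
    private readonly ConcurrentDictionary<string, TClient> _clients =
        new ConcurrentDictionary<string, TClient>();

    public MetricsClientCache(Func<string, TClient> createClient)
    {
        _createClient = createClient;
    }

    // Creates the client at most once per subscription id, then returns the
    // cached instance on every subsequent call.
    public TClient GetOrCreate(string subscriptionId) =>
        _clients.GetOrAdd(subscriptionId, _createClient);
}

Usage would be something along the lines of new MetricsClientCache<SomeAzureMetricsClient>(BuildClientForSubscription), where both names stand in for whatever Promitor actually uses to talk to Azure Monitor.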

@ResDiaryLewis
Author

Nice find @tomkerkhove 👍

@tomkerkhove tomkerkhove modified the milestones: v1.3.0, v1.4.0 Jan 30, 2020
@tomkerkhove tomkerkhove modified the milestones: v1.4.0, v1.6.0, v1.5.0 Mar 20, 2020
@tomkerkhove tomkerkhove modified the milestones: v1.5.0, v1.6.0, v1.7.0 Apr 6, 2020
@tomkerkhove tomkerkhove added the agents:scraper All issues related to the scraping agent label Apr 21, 2020
@tomkerkhove tomkerkhove modified the milestones: v1.7.0, v1.8.0 May 11, 2020
@tomkerkhove
Owner

Are you still seeing this?
