Using excessive file descriptors #798

Open
ResDiaryLewis opened this issue Dec 12, 2019 · 10 comments
Labels
  • agents:scraper - All issues related to the scraping agent
  • bug - Something isn't working
  • help-wanted - All issues where people can contribute to the project

Comments

@ResDiaryLewis

Promitor uses too many Linux file descriptors, mainly sockets. We're running Promitor on Kubernetes, and this has caused the Node to run out of file descriptors a few times, at which point other Pods crash with a "too many open files" error.

Expected Behavior

Promitor should re-use sockets or release them after use.

Actual Behavior

Promitor opens sockets and doesn't seem to release them in a timely fashion.

Steps to Reproduce the Problem

  1. Run a Promitor container, with a scrape schedule of once per minute.
  2. Attach to the container and count the open file descriptors regularly; for example, here is a container that has been running for 6 days:
    /app # lsof | wc -l
    376504
    /app # lsof | wc -l
    376845
  3. The number of open file descriptors grows over time, approaching the limit of 810243.
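
For completeness, the same count can also be watched from inside the process by enumerating /proc/self/fd on Linux. The snippet below is a minimal sketch of that, not part of Promitor, and assumes a Linux container with procfs available (the usual case):

using System;
using System.IO;
using System.Linq;

// Minimal sketch, not Promitor code: count this process's open file descriptors
// by listing /proc/self/fd, which is available in Linux containers.
public static class FileDescriptorProbe
{
    public static int CountOpenDescriptors()
    {
        // Each entry in /proc/self/fd is a symlink to an open file or socket.
        return Directory.EnumerateFileSystemEntries("/proc/self/fd").Count();
    }

    public static void Main()
    {
        Console.WriteLine($"Open file descriptors: {CountOpenDescriptors()}");
    }
}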

Configuration

Provide insights into the configuration that you are using:

  • Configured scraping schedule: */1 * * * *
Used scraping configuration

This is our staging configuration, which scrapes fewer resources but still encounters the same issue.

version: v1
azureMetadata:
  tenantId:
  subscriptionId:
  resourceGroupName:
metricDefaults:
  aggregation:
    interval: "00:05:00"
  scraping:
    schedule: "*/1 * * * *"
metrics:
- name: "azure_sql_cpu_percent_average"
  description: "'cpu_percent' with aggregation 'Average'"
  resourceType: "Generic"
  labels:
    component: "sql-database"
  azureMetricConfiguration:
    metricName: "cpu_percent"
    aggregation:
      type: "Average"
  resources:
  - resourceGroupName: "groupName"
    resourceUri: "Microsoft.Sql/servers/serverName/databases/dbName"
  # 7 more databases
# 11 more sql metrics
# 6 loadbalancer metrics
# 12 redis metrics
# 10 service bus metrics

Specifications

  • Version: 1.0.0 (image tag)
  • Platform: Docker (Linux)
  • Subsystem:
@ResDiaryLewis ResDiaryLewis added the bug Something isn't working label Dec 12, 2019
@tomkerkhove tomkerkhove added this to the v1.1.0 milestone Dec 12, 2019
@tomkerkhove
Owner

Hm, that's interesting. I'll have to dig into this as I'm not sure what is causing it - sorry for the inconvenience!

@ResDiaryLewis
Author

Cheers Tom! I've started to take a look today and I'll probably try to spend some time on it next week too, but you'll obviously be more familiar with the code.

@tomkerkhove
Owner

Yes, but unfortunately I don't have the bandwidth to fix this soon - sorry :/

Happy to go through a PR if you come up with something! Hopefully upgrading to .NET Core 3.1 fixes this, but we're not there yet (#718).

It would be good to identify the culprit, indeed.

@tomkerkhove
Owner

This might be due to HttpClient instances not being re-used: https://aspnetmonsters.com/2016/08/2016-08-27-httpclientwrong/

Have to take a closer look though.
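
For reference, the pattern that article warns about looks roughly like the sketch below (illustrative only, not Promitor's actual code): creating a new HttpClient per request leaves each connection's socket behind, which would match the descriptor growth reported above, whereas a shared instance keeps connections pooled.

using System.Net.Http;
using System.Threading.Tasks;

// Illustrative sketch only, not Promitor's code.
public static class HttpClientUsage
{
    // Anti-pattern: a new HttpClient per call. Dispose() returns quickly, but the
    // underlying sockets linger, so frequent scrapes steadily consume descriptors.
    public static async Task<string> FetchWithFreshClientAsync(string url)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(url);
        }
    }

    // Re-use: one shared instance (HttpClient is safe for concurrent requests),
    // so connections are pooled instead of being re-opened on every scrape.
    private static readonly HttpClient SharedClient = new HttpClient();

    public static Task<string> FetchWithSharedClientAsync(string url)
        => SharedClient.GetStringAsync(url);
}

On .NET Core 2.1+ the usual alternative is IHttpClientFactory (registered via services.AddHttpClient()), which pools and recycles the underlying handlers and also avoids the stale-DNS problem of a single long-lived client.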

@tomkerkhove
Owner

@ResDiaryLewis was the error due to an OS-level issue or did Promitor throw an exception? If so, do you still have the exception & stack trace?

@ResDiaryLewis
Author

ResDiaryLewis commented Jan 6, 2020

@tomkerkhove it was an error emitted by several other applications that happened to be scheduled onto the same Nodes as Promitor in our cluster.
Here's one example from a Prometheus Pod:

level=error ts=2019-12-13T13:56:21.008558682Z caller=runutil.go:43 msg="function failed. Retrying" err="trigger reload: reload request failed: Post http://127.0.0.1:9090/-/reload: dial tcp 127.0.0.1:9090: socket: too many open files in system"

As a stop-gap solution, we've now added a CronJob to our cluster that periodically kills Promitor.

EDIT:
Actually Promitor caught the same error during this incident too:

System.Net.Http.HttpRequestException: Too many open files in system ---> System.Net.Sockets.SocketException: Too many open files in system
   at System.Net.Sockets.Socket..ctor(AddressFamily addressFamily, SocketType socketType, ProtocolType protocolType)
   at System.Net.Sockets.DualSocketMultipleConnectAsync..ctor(SocketType socketType, ProtocolType protocolType)
   at System.Net.Sockets.Socket.ConnectAsync(SocketType socketType, ProtocolType protocolType, SocketAsyncEventArgs e)
   at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)

Some other Pods, including Tiller, nginx-ingress, cert-manager, and a TeamCity agent, were crashing at the same time. We determined that Promitor was the culprit using the aforementioned shell commands.

@tomkerkhove tomkerkhove modified the milestones: v1.1.0, v1.2.0, v1.3.0 Jan 6, 2020
@tomkerkhove
Owner

tomkerkhove commented Jan 16, 2020

Thanks for the additional information and sorry for that!

Will have a look, but since we don't use HttpClient directly it must be one of our dependencies, so all information/telemetry is welcome.

@tomkerkhove
Owner

Might have found something in #844 @ResDiaryLewis.

Looks like the Azure SDK creates a ton of HttpClients.
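
If a client (and the HttpClient underneath it) ends up being constructed on every scrape cycle, the effect is the same kind of leak as above. The sketch below shows the reuse idea only, with placeholder types rather than the real SDK surface or the actual fix in #844:

using System;
using System.Collections.Concurrent;

// Hypothetical sketch - the client type and factory are placeholders, not the
// real Azure SDK surface. The idea: build one metrics client per subscription
// and reuse it across scrapes instead of constructing a new client (and a new
// HttpClient underneath) every cycle.
public sealed class MetricsClientCache<TClient>
{
    private readonly Func<string, TClient> _createClient;
    private readonly ConcurrentDictionary<string, TClient> _clients =
        new ConcurrentDictionary<string, TClient>();

    public MetricsClientCache(Func<string, TClient> createClient)
    {
        _createClient = createClient;
    }

    // Creates the client at most once per subscription id, then returns the
    // cached instance on every subsequent call.
    public TClient GetOrCreate(string subscriptionId) =>
        _clients.GetOrAdd(subscriptionId, _createClient);
}

Usage would be something along the lines of new MetricsClientCache<SomeAzureMetricsClient>(BuildClientForSubscription), where both names stand in for whatever Promitor actually uses to talk to Azure Monitor.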

@ResDiaryLewis
Author

Nice find @tomkerkhove 👍

@tomkerkhove tomkerkhove modified the milestones: v1.3.0, v1.4.0 Jan 30, 2020
@tomkerkhove tomkerkhove modified the milestones: v1.4.0, v1.6.0, v1.5.0 Mar 20, 2020
@tomkerkhove tomkerkhove modified the milestones: v1.5.0, v1.6.0, v1.7.0 Apr 6, 2020
@tomkerkhove tomkerkhove added the agents:scraper All issues related to the scraping agent label Apr 21, 2020
@tomkerkhove tomkerkhove modified the milestones: v1.7.0, v1.8.0 May 11, 2020
@tomkerkhove
Owner

Are you still seeing this?
