This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Metrics scraper pod overloads and crashes #38

Open
Joseph-Goergen opened this issue Jan 28, 2021 · 14 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@Joseph-Goergen

We have a cluster here with 34 nodes and 2497 pods. The metrics scraper appeared to reach 5000m of CPU and 6.7G of memory before eventually crashing.

  • Dashboard version v2.0.5
  • Metric scraper version v1.0.6

The metrics scraper produces roughly 500,000 log lines per hour (about 140 requests per second), and they look like this:

Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 874 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 875 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 878 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 888 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 892 "" "dashboard/v2.0.5"
Jan 27 20:07:24 dashboard-metrics-scraper-5cccbddcc-fpr6k dashboard-metrics-scraper 172.30.160.74 - - [27/Jan/2021:18:07:24 +0000] "GET /api/v1/dashboard/nodes/<node>/metrics/cpu/usage_rate HTTP/1.1" 200 891 "" "dashboard/v2.0.5" 

It seems to be handling the requests as it should; it is simply getting overloaded and cannot cope with that volume. I don't think adding a CPU and memory limit would help much, because I believe the pod would just keep crashing once it hits the limit.
That is about as much information as I have for this cluster. The user did delete the pod, but it came back, overloaded, and crashed again.
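
For readers following the limits discussion: below is a minimal sketch of the kind of resources stanza that would be involved, assuming the scraper runs as the usual dashboard-metrics-scraper Deployment in the kubernetes-dashboard namespace (as in the recommended manifests). The request and limit values are placeholders for illustration, not tuned recommendations for a cluster of this size.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dashboard-metrics-scraper
  namespace: kubernetes-dashboard
spec:
  template:
    spec:
      containers:
        - name: dashboard-metrics-scraper
          resources:
            requests:
              cpu: 100m       # placeholder baseline request
              memory: 200Mi
            limits:
              cpu: "2"        # placeholder ceiling; an overloaded scraper gets throttled here
              memory: 2Gi     # placeholder ceiling; exceeding it OOM-kills the pod

As noted above, a limit like this does not fix the overload itself: the CPU limit only throttles scraping, and the memory limit turns the crash into an explicit OOM kill rather than preventing it.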

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 28, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Joseph-Goergen
Author

/reopen

@k8s-ci-robot
Contributor

@Joseph-Goergen: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jun 28, 2021
@Joseph-Goergen
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 28, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 26, 2021
@Joseph-Goergen
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2021
@Joseph-Goergen
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 4, 2022
@maciaszczykm
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels May 10, 2022