Aged summaries report as zero, but never pruned #492

Closed
babbottscott opened this issue Feb 10, 2022 · 5 comments

Comments

@babbottscott

We have a chronic issue with one of our services: over time, the metric count per instance grows to the point where it overwhelms our Prometheus scraper. I think I've tracked it down to stale label sets that have been aged out (zeroed) but are still reported on the endpoint. Is there any functionality to reap stale summaries, rather than continuing to report them as 0?

@brice-morin

We have a related issue. We collect metrics from a number of devices. Each device is typically online for some hours and then offline for the rest of the day. The period of activity varies from device to device. In Grafana we have dashboards that summarize those metrics (basically computing avg, std, etc.) so that we can get an overview of our fleet of devices. The problem is that some devices occasionally report very extreme values just before they go offline. Those extreme values keep being exported for as long as the devices are offline, and they skew the avg and std we compute in Grafana.

Ideally, when a device goes offline and its metrics are no longer updated through prom-client, we would like the metrics for that device to stop being exposed until the device is back online and fresh metrics are produced.

There is an interesting discussion on this topic in the Prometheus documentation for how to write exporters. See section "Pushes", "Firstly, when do you expire metrics?"

Do you have any plan to provide some TTL (time to live) option or some expiry delay to stop old/stale metrics from being exposed?
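
To make the request concrete, here is a minimal sketch of the kind of TTL behaviour described above. It is not part of prom-client; `ExpiringGauge`, `ttlMs` and the injectable `now` clock are hypothetical names used for illustration only. The idea is to remember when each label set was last updated and to drop stale label sets at collection time.

```typescript
// Hypothetical helper, NOT a prom-client API: a gauge-like store that
// forgets label sets that have not been updated within `ttlMs`.
type Labels = Record<string, string>;

interface Sample {
  labels: Labels;
  value: number;
  updatedAt: number; // timestamp (ms) of the last set() call
}

class ExpiringGauge {
  private samples = new Map<string, Sample>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so the expiry logic can be exercised with a fake clock.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Build a stable key so {a: "1", b: "2"} and {b: "2", a: "1"} match.
  private static key(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}=${labels[k]}`)
      .join(",");
  }

  set(labels: Labels, value: number): void {
    this.samples.set(ExpiringGauge.key(labels), {
      labels,
      value,
      updatedAt: this.now(),
    });
  }

  // Drop label sets older than the TTL, then return what is left to expose.
  collect(): Sample[] {
    const cutoff = this.now() - this.ttlMs;
    this.samples.forEach((sample, key) => {
      if (sample.updatedAt < cutoff) {
        this.samples.delete(key);
      }
    });
    return Array.from(this.samples.values());
  }
}
```

With something like this, a device that stops reporting disappears from the exposed samples once its TTL elapses, and reappears on its next `set()` call.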

@rilpires
Contributor

rilpires commented Feb 8, 2023

Reporting web requests with the path as a label is almost impractical, since there will always be some trash routes (probably breach attempts against the server) whose label sets never expire.

Couldn't we all agree that pruning an expired summary label set instead of zeroing it could at least be a configuration option? I think it could even be the default behavior.
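
The difference between zeroing and pruning can be illustrated with a simplified sketch. This is not prom-client's implementation: `WindowedSummary`, `maxAgeMs` and `pruneAged` are hypothetical names, and the sketch tracks only a count and a sum per path rather than real quantiles.

```typescript
// Hypothetical sketch, NOT prom-client's implementation: a per-path sliding
// window that either keeps reporting empty series (zeroing) or drops them
// (pruning), depending on the `pruneAged` flag.
interface Observation {
  value: number;
  at: number; // timestamp (ms) of the observation
}

interface WindowStats {
  count: number;
  sum: number;
}

class WindowedSummary {
  private series = new Map<string, Observation[]>();
  private maxAgeMs: number;
  private pruneAged: boolean;
  private now: () => number;

  constructor(maxAgeMs: number, pruneAged: boolean, now: () => number = Date.now) {
    this.maxAgeMs = maxAgeMs;
    this.pruneAged = pruneAged;
    this.now = now;
  }

  observe(path: string, value: number): void {
    const list = this.series.get(path) || [];
    list.push({ value, at: this.now() });
    this.series.set(path, list);
  }

  // Returns one entry per exposed series, computed over the live window.
  collect(): Map<string, WindowStats> {
    const cutoff = this.now() - this.maxAgeMs;
    const out = new Map<string, WindowStats>();
    this.series.forEach((list, path) => {
      const fresh = list.filter((o) => o.at >= cutoff);
      if (fresh.length === 0 && this.pruneAged) {
        // Pruning: forget the label set entirely.
        this.series.delete(path);
        return;
      }
      // Zeroing: the series stays exposed, possibly with count 0.
      this.series.set(path, fresh);
      const sum = fresh.reduce((acc, o) => acc + o.value, 0);
      out.set(path, { count: fresh.length, sum });
    });
    return out;
  }
}
```

With `pruneAged` set to `false`, a path that was hit once keeps producing a zero-valued series forever, which is how the number of exposed series grows without bound. With `pruneAged` set to `true`, the series disappears once its window is empty.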

@zbjornson
Collaborator

> There is an interesting discussion on this topic in the Prometheus documentation for how to write exporters. See section "Pushes", "Firstly, when do you expire metrics?"

I think this is only relevant when using Pushgateway.

Nonetheless, I understand the use case/issue. Do other Prometheus client libraries have this TTL/pruning feature?

@rilpires
Contributor

> Nonetheless, I understand the use case/issue. Do other prometheus client libraries have this TTL/pruning feature?

Not exactly for summaries, but cAdvisor, for example, doesn't export metrics for dead containers.

@zbjornson
Collaborator

Looks like this was fixed in #540, thanks @rilpires.
