
Should Be Able to Specify that Prometheus Metrics Never Expire #7137

Open
jaloren opened this issue Jul 17, 2019 · 2 comments
Labels
community-sentiment Tracking high-profile issues from the community core/metric core/telemetry enhancement

Comments

@jaloren
Contributor

jaloren commented Jul 17, 2019

Is your feature request related to a problem? Please describe.
Many monitoring and alerting systems have difficulty handling null (not zero) values, or metrics that disappear and reappear. To avoid these problems, it's useful to publish every metric that can be exported and to assign a zero value when there is no data.

Vault's handling of metrics is principally event-based, which means that metrics only appear for scraping once they exist and then disappear when they expire. For example, the metric vault_core_leadership_lost_sum appears once leadership is lost in the cluster and then disappears when the retention period has been reached.

Additionally, Prometheus scraping cannot be turned on unless a non-zero expiration time is passed in via the telemetry config. Interestingly, passing in a retention time is what enables Prometheus metrics, which is non-obvious.
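For reference, the coupling described above shows up in Vault's telemetry stanza: setting prometheus_retention_time is what both enables the Prometheus endpoint (served at /v1/sys/metrics?format=prometheus) and sets the expiry. A minimal sketch, with illustrative values:

```hcl
telemetry {
  # A non-zero retention time is required; setting it is what
  # turns the Prometheus metrics endpoint on, and it doubles as
  # the TTL after which metrics disappear from scrapes.
  prometheus_retention_time = "24h"
  disable_hostname          = true
}
```

There is no separate boolean to enable Prometheus, which is exactly the non-obvious coupling this issue asks to decouple.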

Describe the solution you'd like

  • whatever metrics vault can generate should always be exported when a scrape occurs
  • users should be able to enable the prometheus endpoint without expiring any metrics
  • nice-to-have: decouple enabling prometheus from specifying the retention time.

Describe alternatives you've considered
An alternative is creating a proxy that maintains the values for us persistently.

@jaloren jaloren changed the title Should Be Able to Specify Prometheus Metrics Never Expire Should Be Able to Specify that Prometheus Metrics Never Expire Jul 17, 2019
@heatherezell heatherezell added community-sentiment Tracking high-profile issues from the community core/telemetry and removed community-sentiment Tracking high-profile issues from the community labels Jan 14, 2022
@banks
Member

banks commented Jul 21, 2023

I can add some context for why this is the way it is that might help a future investigator.

This is a problem with all HashiCorp tools (at least as far as I know), as they are instrumented using hashicorp/go-metrics, which started as a statsd wrapper and eventually gained support for Prometheus too. As such, its semantics don't really map onto Prometheus perfectly.

The expiration time seems to have been an attempt, when initially writing the Prometheus sink, to match the statsd/graphite expectation that you can throw whatever dynamic metrics in at runtime without declaring them upfront, and they will be flushed within a few seconds, so there is no memory consumption to worry about. Prometheus, on the other hand, expects all metrics to be declared up front and to remain static during the life of the program (more or less).

In Consul we partly fixed this by adding declarations for at least a subset of the core metrics and updating the Prometheus sink to never expire those metrics.

But we didn't go as far as disallowing undefined metrics, mostly because many layers of HashiCorp's libraries (e.g. Raft, Serf, and Memberlist) all use go-metrics and don't pre-define their metrics, so there is a risk that we'd miss a case and lose metrics, or need to refactor all of those call sites.

Summary: we require a TTL and expire metrics specifically because the go-metrics API was born in the pre-Prometheus era, when metrics were treated as ephemeral rather than as static pre-defined data types. Just assuming all metrics are static might be safe, or might leak memory, depending on how individual metrics are used. Even if safe, it still means metrics won't be scraped until they have been observed at least once, unless we make the changes to actually pre-define all metrics up front in Vault.

I'd really love to change this. The ideal fix for this would be to switch all of HashiCorp's products and libraries to instrumenting their code with a more modern library (either the Prometheus client directly or something else like OTEL?). The biggest challenges to that are:

  1. It's a lot of work
  2. It would be backwards incompatible with existing metrics sinks and configs, which could cause issues for lots of users. Most other projects choose one exposition format (e.g. Prometheus) and rely on third-party solutions like Teleport for folks who want to push to other backends. HashiCorp tools, however, support a wide variety of metrics "sinks" out of the box, which is convenient but also makes it much harder to change how we instrument.
  3. Ideally we'd change all of HashiCorp's tools to be consistent which makes it even more work!

More realistically, Vault could potentially adopt a similar approach to Consul of starting to define metrics up front so at least those don't disappear.

We could then optionally not need a retention time set when configuring Prometheus metrics and pick a default short one just in case other metrics creep through. We would probably still want it to be explicitly enabled since it uses memory and exposes additional APIs that users of other metrics sinks will not want.

@ameflorenti

Is there any update on this?
