-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should Be Able to Specify that Prometheus Metrics Never Expire #7137
Comments
I can add some context for why this is the way it is that might help a future investigator. This is a problem with all HashiCorp tools (at least as far as I know) as they are instrumented using The expiration time seems to have been an attempt when initially writing the Prometheus sink to match the statsd/graphite expectation that you can throw whatever dynamic metrics in at runtime without declaring them upfront and they will be flushed within a few seconds so no memory consumption to worry about. Prometheus on the other hand expects all metrics to be declared up front and to remain static during the life of the program (more or less). In Consul we partly fixed this by adding declarations for at least a subset of the core metrics and updating the Prometheus sink to never expire those metrics. But we didn't go as far as disallowing undefined metrics mostly because many layers of HashiCorp's libraries (e.g. Raft and Serf and Memberlist) all use go-metrics and all don't pre-defined their metrics so there is a risk that we'd miss a case and loose metrics or need to refactor all of those call sites. Summary: we require a TTL and expire metrics specifically because I'd really love to change this. The ideal fix for this would be to switch all of HashiCorp's products and libraries to instrumenting their code with a more modern library (either the Prometheus client directly or something else like OTEL?). The biggest challenges to that are:
More realistically, Vault could potentially adopt a similar approach to Consul of starting to define metrics up front so at least those don't disappear. We could then optionally not need a retention time set when configuring Prometheus metrics and pick a default short one just in case other metrics creep through. We would probably still want it to be explicitly enabled since it uses memory and exposes additional APIs that users of other metrics sinks will not want. |
Is there any update on this? |
Is your feature request related to a problem? Please describe.
Many monitoring and alerting systems have difficulty handling null (not zero) values or metrics that disappear and reappear. To avoid these problems, its useful to publish all the metrics one can export and if there's no data to be associated assign a zero value.
Vault's handling of metrics is principally event based, which means that metrics only appear for scraping when they exist and then will disappear when they expire. For example, the metric vault_core_leadership_lost_sum will appear once leadership is lost in the cluster and will then disappear when the retention period has been reached.
Additionally, prometheus scraping cannot be turned unless a non-zero expiration time is passed in via the telemetry config. Interestingly, by passing in an retention time, enables prometheus metrics, which is non-obvious.
Describe the solution you'd like
Describe alternatives you've considered
Are alternative is creating a proxy that maintains the values for us persistently.
The text was updated successfully, but these errors were encountered: