Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

controller: disable prometheus metric processor memory #511

Merged
merged 2 commits into from
Aug 9, 2023

Conversation

cbgbt
Copy link
Contributor

@cbgbt cbgbt commented Aug 9, 2023

Issue number:
Closes #441

Description of changes:
Per #441 and testing, the gauges reported by the controller's /metrics endpoint would often report stale data, double-counting hosts as it showed the historic state alongside the current state.

This is because with_memory() was enabled, which explicitly maintains previous metrics when only new subsets are reported.

This change disables metric processor memory to make it truly stateless, which is how we were computing the metrics anyways.

In this case, prometheus will still report old state until the metrics are stale. In order to attempt to clear old state faster, we also explicitly report 0 counts for prior metrics when we can. Because the controller is supposed to be stateless and can be rescheduled, this is done as a best-effort.

Testing done:
Watched the controller /metrics endpoint with and without this change.

After the change was made, the metrics endpoint always surfaced the current state of the cluster, e.g.:

# HELP brupop_hosts_state Brupop host's state
# TYPE brupop_hosts_state gauge
brupop_hosts_state{service_name="unknown_service",state="Idle"} 2
brupop_hosts_state{service_name="unknown_service",state="RebootedIntoUpdate"} 1
brupop_hosts_state{service_name="unknown_service",state="StagedAndPerformedUpdate"} 0
# HELP brupop_hosts_version Brupop host's bottlerocket version
# TYPE brupop_hosts_version gauge
brupop_hosts_version{bottlerocket_version="1.14.0",service_name="unknown_service"} 2
brupop_hosts_version{bottlerocket_version="1.14.2",service_name="unknown_service"} 1

(Note that StagedAndPerformedUpdate has a count of 0, because the host that is now in RebootedIntoUpdate had it's prior count intentionally wiped)

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

This commit begins to move some of the metrics functionality out of the
controller implementation, which is already too multifaceted.
Our metric computations are stateless; however, processor memory causes
the metric processor to remember old gauges that are not reported each
time metrics are generated.
@cbgbt
Copy link
Contributor Author

cbgbt commented Aug 9, 2023

^ Fixed a typo in the git commit message, plus rebased onto develop.

@cbgbt cbgbt merged commit 880e401 into bottlerocket-os:develop Aug 9, 2023
2 checks passed
@cbgbt cbgbt deleted the prometheus-gauge branch August 10, 2023 08:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect Prometheus metric values
3 participants