-
Notifications
You must be signed in to change notification settings - Fork 529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade upstream Prometheus: add keep_firing_for support to alerting rules and track 499 on canceled requests #4099
Conversation
Makefile
Outdated
@@ -327,7 +327,14 @@ lint: check-makefiles | |||
# Note that we don't automatically suggest replacing sort.Float64s() with slices.Sort() as the documentation for slices.Sort() | |||
# at the time of writing warns that slices.Sort() may not correctly handle NaN values. | |||
faillint -paths \ | |||
"sort.{Strings,Ints}=golang.org/x/exp/slices.Sort" \ | |||
"sort.{Strings,Ints}=golang.org/x/exp/slices.{Sort}" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: it wasn't working as expected. Functions must be wrapped by {}
.
@@ -188,19 +188,25 @@ func New(cfg *Config, reg *prometheus.Registry) (*Alertmanager, error) { | |||
am.state = newReplicatedStates(cfg.UserID, cfg.ReplicationFactor, cfg.Replicator, cfg.Store, am.logger, am.registry) | |||
am.persister = newStatePersister(cfg.PersisterConfig, cfg.UserID, am.state, cfg.Store, am.logger, am.registry) | |||
|
|||
am.wg.Add(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: moved below because this was used to track the maintenance goroutine which we're not responsible to start.
am.nflog, err = nflog.New( | ||
nflog.WithRetention(cfg.Retention), | ||
nflog.WithSnapshot(filepath.Join(cfg.TenantDataDir, notificationLogSnapshot)), | ||
nflog.WithMaintenance(maintenancePeriod, am.stop, am.wg.Done, nil), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: this option doesn't exist anymore because now you're expected to start the maintenance goroutine (done below).
Labels labels.Labels `json:"labels"` | ||
Annotations labels.Labels `json:"annotations"` | ||
State string `json:"state"` | ||
ActiveAt *time.Time `json:"activeAt,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to reviewers: added omitempty
to keep parity with Prometheus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with comments.
Is there any chance to get the shardfunc onto Prometheus main? I feel we need to review any update to tsdb from now to check if it hardcodes using labels.Hash
@@ -69,4 +69,6 @@ message AlertStateDesc { | |||
[(gogoproto.nullable) = false, (gogoproto.stdtime) = true]; | |||
google.protobuf.Timestamp valid_until = 9 | |||
[(gogoproto.nullable) = false, (gogoproto.stdtime) = true]; | |||
google.protobuf.Timestamp keep_firing_since = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] not sure if it's a problem but the generated definition does not have the omittempty flag in the json tag, due nullable=false I think. For both this and active_at
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use JSON here. The exported JSON is based on the other manually defined struct
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other manually defined struct
This is what we expose through the API:
https://github.com/grafana/mimir/pull/4099/files#diff-eee58232370c5c7c01c7f2ab7f27cbc4371ca8d99eaf5b180e69d8f1ac57f2bbR53
We should talk to @bboreham about it. |
a53745f
to
c03d3b1
Compare
Comment on Slack from @colega:
I remember we have for distributors but I don't remember for compactor. I will check it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for doing the lord's work on implementing keep_firing_for
so quickly.
38b7586
to
7647c66
Compare
I confirm we have one for distributors: mimir/pkg/distributor/distributor_test.go Line 1306 in a9a30e0
|
ebfce8b
to
304b184
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you for the tests.
Please fix this comment before merging:
mimir/pkg/querier/error_translate_queryable.go
Lines 25 to 26 in 304b184
// promql.ErrQueryCanceled, mapped to 503 | |
// promql.ErrQueryTimeout, mapped to 503 |
…rules and track 499 on canceled requests Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
304b184
to
f0978fc
Compare
Signed-off-by: Marco Pracucci <marco@pracucci.com>
What this PR does
This PR updates the upstream Prometheus. The user facing changes are:
keep_firing_for
support to alerting rulescortex_request_duration_seconds_count
for canceled queriesNote to reviewers
labels.StableHash()
function and all new tests failed.keep_firing_for
and looks working as expected. I've also manually tested the API endpoints/api/v1/rules
and/api/v1/alerts
: they both looks working as Prometheus ones.What has changed in updated vendored libs
Alertmanager changes:
github.com/hashicorp/golang-lru/v2
UserKeyFile
andTokenFile
nflog.Log
options and maintenance (commit)Prometheus changes:
KeepFiringFor
toRule
andAlertingRule
, andKeepFiringSince
toAlert
. The 'keep_firing_for' field specifies the minimum amount of time that an alert should remain firing, even if the expression does not return any results.main
, so didn't review changes in depth)Checked the diff and AWS SDK just added new endpoints:
Which issue(s) this PR fixes or relates to
Fixes #4071
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]