## Alertmanager: Replicate state using the Ring #3839
```diff
@@ -46,6 +46,11 @@ type alertmanagerMetrics struct {
 	// The alertmanager config hash.
 	configHashValue *prometheus.Desc

+	partialMerges       *prometheus.Desc
+	partialMergesFailed *prometheus.Desc
+	replicationTotal    *prometheus.Desc
+	replicationFailed   *prometheus.Desc
 }

 func newAlertmanagerMetrics() *alertmanagerMetrics {
```
```diff
@@ -147,6 +152,22 @@ func newAlertmanagerMetrics() *alertmanagerMetrics {
 			"cortex_alertmanager_config_hash",
 			"Hash of the currently loaded alertmanager configuration.",
 			[]string{"user"}, nil),
+		partialMerges: prometheus.NewDesc(
+			"cortex_alertmanager_partial_state_merges_total",
+			"Number of times we have received a partial state to merge for a key.",
+			[]string{"key"}, nil),
+		partialMergesFailed: prometheus.NewDesc(
+			"cortex_alertmanager_partial_state_merges_failed_total",
+			"Number of times we have failed to merge a partial state received for a key.",
+			[]string{"key"}, nil),
+		replicationTotal: prometheus.NewDesc(
+			"cortex_alertmanager_state_replication_total",
+			"Number of times we have tried to replicate a state to other alertmanagers",
+			[]string{"key"}, nil),
+		replicationFailed: prometheus.NewDesc(
+			"cortex_alertmanager_state_replication_failed_total",
+			"Number of times we have failed to replicate a state to other alertmanagers",
+			[]string{"key"}, nil),
 	}
 }
```

> **Review discussion** on the label set of the new `partialMerges` metric:
>
> - "We don't have user as part of these labels because …"
> - "We have per-user registries, so we could do per-user output, aggregating over all keys. WDYT?"
> - "Is that different from the current approach? An aggregation across all keys is not possible if the keys are per user e.g. …"
> - "If I understand it correctly, we have per-user registries, which only use the `key` label. During …"
```diff
@@ -155,7 +176,7 @@ func (m *alertmanagerMetrics) addUserRegistry(user string, reg *prometheus.Regis
 }

 func (m *alertmanagerMetrics) removeUserRegistry(user string) {
-	// We neeed to go for a soft deletion here, as hard deletion requires
+	// We need to go for a soft deletion here, as hard deletion requires
 	// that _all_ metrics except gauges are per-user.
 	m.regs.RemoveUserRegistry(user, false)
 }
```
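The soft-deletion comment above reflects a general Prometheus constraint: if a removed user's counter values simply vanished, the aggregated totals would drop, which scrapers would misread as a counter reset. A stdlib-only sketch of the idea — the `registrySet` type and its methods are illustrative assumptions, not Cortex's actual `RemoveUserRegistry` implementation:

```go
package main

import "fmt"

// registrySet tracks live per-user counter values plus the frozen values
// of soft-deleted users, so aggregated counters never decrease.
type registrySet struct {
	live    map[string]float64 // current per-user counter value
	removed map[string]float64 // last seen value of soft-deleted users
}

func newRegistrySet() *registrySet {
	return &registrySet{live: map[string]float64{}, removed: map[string]float64{}}
}

// removeUser performs a soft delete: the user's last counter value is
// retained so the aggregated total stays monotonic.
func (r *registrySet) removeUser(user string) {
	r.removed[user] = r.live[user]
	delete(r.live, user)
}

// total aggregates over both live and soft-deleted users.
func (r *registrySet) total() float64 {
	var sum float64
	for _, v := range r.live {
		sum += v
	}
	for _, v := range r.removed {
		sum += v
	}
	return sum
}

func main() {
	r := newRegistrySet()
	r.live["user-1"] = 5
	r.live["user-2"] = 3
	r.removeUser("user-1")
	fmt.Println(r.total()) // prints 8: the removed user's count is retained
}
```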
```diff
@@ -185,6 +206,10 @@ func (m *alertmanagerMetrics) Describe(out chan<- *prometheus.Desc) {
 	out <- m.silencesPropagatedMessagesTotal
 	out <- m.silences
 	out <- m.configHashValue
+	out <- m.partialMerges
+	out <- m.partialMergesFailed
+	out <- m.replicationTotal
+	out <- m.replicationFailed
 }

 func (m *alertmanagerMetrics) Collect(out chan<- prometheus.Metric) {
```
```diff
@@ -218,4 +243,9 @@ func (m *alertmanagerMetrics) Collect(out chan<- prometheus.Metric) {
 	data.SendSumOfGaugesPerUserWithLabels(out, m.silences, "alertmanager_silences", "state")

 	data.SendMaxOfGaugesPerUser(out, m.configHashValue, "alertmanager_config_hash")
+
+	data.SendSumOfCountersWithLabels(out, m.partialMerges, "alertmanager_partial_state_merges_total", "key")
+	data.SendSumOfCountersWithLabels(out, m.partialMergesFailed, "alertmanager_partial_state_merges_failed_total", "key")
+	data.SendSumOfCountersWithLabels(out, m.replicationTotal, "alertmanager_state_replication_total", "key")
+	data.SendSumOfCountersWithLabels(out, m.replicationFailed, "alertmanager_state_replication_failed_total", "key")
 }
```
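The `SendSumOfCountersWithLabels` calls above sum each counter across all per-user registries while keeping only the `key` label. A minimal stdlib-only sketch of that aggregation idea — the `perUser` data shape and the `sumByKey` helper are assumptions for illustration, not Cortex's actual `util` package:

```go
package main

import "fmt"

// sumByKey collapses per-user counter values down to per-key totals,
// which is what summing across per-user registries while keeping only
// the "key" label amounts to.
func sumByKey(perUser map[string]map[string]float64) map[string]float64 {
	totals := make(map[string]float64)
	for _, byKey := range perUser {
		for key, v := range byKey {
			totals[key] += v
		}
	}
	return totals
}

func main() {
	// Two tenants, each with their own registry of counters labelled by key.
	perUser := map[string]map[string]float64{
		"user-1": {"sil:user-1": 3, "nfl:user-1": 1},
		"user-2": {"sil:user-2": 2},
	}
	fmt.Println(sumByKey(perUser))
}
```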
> **Review comment:** `WaitReady()` is a blocking call, and should take a context as an argument so that the caller can cancel or time out the wait if needed. That also implies returning an error to communicate success.
>
> **Reply:** The implementations for `WaitReady` and `Settle` are part of the next PR; if you don't mind, I'd like to leave them out of this PR for now.