
Alertmanager: Replicate state using the Ring #3839

Merged
merged 3 commits into cortexproject:master from the alertmanager-replication branch on Mar 8, 2021

Conversation

@gotjosh (Contributor) commented Feb 18, 2021

What this PR does:

Alertmanager typically uses the memberlist gossip-based protocol to replicate state across replicas. In Cortex, we used the same fundamentals to provide a high-availability mode.

Now that we have support for sharding instances across many machines, we
can leverage the ring to find the corresponding instances and send the
updates via gRPC.
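
For a rough picture of the mechanism, the replication path looks something like the sketch below. This is an illustration only: the names (Ring, ReplicationSet, StateClient, Part, replicateStateForUser) stand in for the real Cortex and Alertmanager types and are not the exact API added by this PR.

// Hypothetical sketch: hash the tenant onto the ring's token space, look up
// the replicas that own that token, and push the state update to each of
// them over gRPC.
package sketch

import "context"

// Part is a partial state update (e.g. a silence or notification-log entry).
type Part struct {
	Key  string
	Data []byte
}

type Instance struct{ Addr string }

type ReplicationSet struct{ Instances []Instance }

type Ring interface {
	// Get returns the set of instances that own the given token.
	Get(token uint32) (ReplicationSet, error)
}

type StateClient interface {
	// UpdateState sends a partial state update to a single replica.
	UpdateState(ctx context.Context, p *Part) error
}

func replicateStateForUser(ctx context.Context, r Ring, clientFor func(addr string) (StateClient, error), token uint32, p *Part) error {
	replicas, err := r.Get(token)
	if err != nil {
		return err
	}
	for _, inst := range replicas.Instances {
		c, err := clientFor(inst.Addr)
		if err != nil {
			return err
		}
		if err := c.UpdateState(ctx, p); err != nil {
			return err
		}
	}
	return nil
}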

This is part of the proposal #3574, and follows up on #3664 and #3671.

Marking it as a draft as I have a few TODOs left to address, but the logic is pretty much what it says on the tin at the moment.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

gotjosh marked this pull request as draft February 18, 2021 20:46
gotjosh force-pushed the alertmanager-replication branch 6 times, most recently from da19b35 to 5e83714, on February 23, 2021 14:00
pracucci requested review from pracucci and pstibrany and removed request for pracucci February 23, 2021 17:17
gotjosh marked this pull request as ready for review February 23, 2021 20:09
@jtlisi (Contributor) commented Feb 25, 2021

Fixes #2650

gotjosh force-pushed the alertmanager-replication branch 3 times, most recently from 08c3a32 to 27686f0, on February 26, 2021 16:16
jtlisi linked an issue Mar 1, 2021 that may be closed by this pull request
gotjosh force-pushed the alertmanager-replication branch 2 times, most recently from 796ab51 to e747559, on March 2, 2021 14:11
@@ -189,14 +239,16 @@ func New(cfg *Config, reg *prometheus.Registry) (*Alertmanager, error) {
}

am.dispatcherMetrics = dispatch.NewDispatcherMetrics(am.registry)

am.state.WaitReady()

Contributor:

Two questions:

  1. Why is it needed now, when it wasn't needed before?
  2. New() is called while holding am.alertmanagersMtx.Lock() in MultitenantAlertmanager.setConfig(). What are the implications of waiting for ready? How long could it take before it's ready?

Generally speaking, I believe it's bad design to "wait for something" in a constructor function (like this New()), but I would like to better understand why we need it.

Contributor:

We may remove the WaitReady() call from this PR to unblock it and address it separately.

Contributor:

Generally speaking, I believe it's bad design to "wait for something" in a constructor function (like this New()), but I would like to better understand why we need it.

100%.

(Personally, I would like to see this follow our Services model, but I am not sure it makes sense here.)

type State interface {
AddState(string, cluster.State, prometheus.Registerer) cluster.ClusterChannel
Position() int
WaitReady()

Contributor:

WaitReady() is a blocking call, and should take a context as an argument so that the caller can cancel or time out waiting if needed. That also implies returning an error to communicate success.
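
For illustration, the context-aware shape being suggested could look like this (a sketch based on the State interface quoted above, not code from this PR):

type State interface {
	AddState(string, cluster.State, prometheus.Registerer) cluster.ClusterChannel
	Position() int
	// WaitReady blocks until the state has settled, or returns the context's
	// error if the caller cancels or times out first.
	WaitReady(ctx context.Context) error
}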

Contributor Author:

The implementations of WaitReady and Settle are part of the next PR; if you don't mind, I'd like to leave them out of this PR for now.

partialMerges: prometheus.NewDesc(
"cortex_alertmanager_partial_state_merges_total",
"Number of times we have received a partial state to merge for a key.",
[]string{"key"}, nil),

Contributor:

By using key as a label, we will have at least 2 label values per user, right? (one for notifications, one for silences). Do we need so many new metrics? Would it make sense to use "user" only?

Contributor Author:

We don't have user as part of these labels because key already includes the user: it's a combination of prefix + userID. So by using key we "technically get both", even though it breaks the nomenclature (of always using user).

Contributor:

We have per-user registries, so we could do per-user output, aggregating over all keys. WDYT?

Contributor Author:

Is that different from the current approach? An aggregation across all keys is not possible if the keys are per user, e.g. sil:user-3 or nfl:user-2.
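
For illustration, with the key label the output contains series like the following (made-up values, prefixes taken from the examples above), whereas a per-user aggregation would collapse them into one series per tenant:

cortex_alertmanager_partial_state_merges_total{key="sil:user-3"} 2
cortex_alertmanager_partial_state_merges_total{key="nfl:user-3"} 1

versus:

cortex_alertmanager_partial_state_merges_total{user="user-3"} 3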

Contributor:

If I understand it correctly, we have per-user registries, which only use the "key" label. During alertmanagerMetrics.Collect, we then call SendSumOfCountersWithLabels with the key label. We could instead call SendSumOfCountersPerUser (e.g. data.SendSumOfCountersPerUser(out, m.partialMerges, "alertmanager_partial_state_merges_total")), which would return sum(alertmanager_partial_state_merges_total) per user registry, and then add the user label to the output. I think that would be enough. WDYT?
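
Sketched concretely (assuming the descriptor m.partialMerges is redefined with a user label instead of key; both helper functions are the ones named in the comment above):

// Current: sum the counter across the per-user registries, keyed by the "key" label.
data.SendSumOfCountersWithLabels(out, m.partialMerges, "alertmanager_partial_state_merges_total", "key")

// Suggested: aggregate over all keys within each per-user registry and emit
// one series per user instead.
data.SendSumOfCountersPerUser(out, m.partialMerges, "alertmanager_partial_state_merges_total")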


// Settle waits until the alertmanagers are ready (and sets the appropriate internal state when it is).
// The idea is that we don't want to start working before we get a chance to know most of the notifications and/or silences.
func (s *state) Settle(ctx context.Context, _ time.Duration) {

Contributor:

Settle currently ignores the context. It should not do that, and should report an error back if the context is finished before settling has finished.
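
A minimal sketch of that suggestion, assuming a readyc channel that is closed once the initial state merge completes (not the code in this PR):

// Settle blocks until the state has settled or the context is done,
// whichever happens first, and reports which one it was.
func (s *state) Settle(ctx context.Context, _ time.Duration) error {
	select {
	case <-s.readyc: // assumed field, closed once the initial state merge has completed
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}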

Contributor Author:

I'm leaving the implementation of WaitReady and Settle to the next PR - is it OK if we skip them for now? These two involve full state replication, and I fear we need their implementation to make sense of the big picture.

}

s.stateReplicationTotal.WithLabelValues(p.Key).Inc()
ctx := context.Background() //TODO: I probably need a better context

Contributor:

Hint: if state were a services.Service (i.e. an object with a lifecycle in Cortex), it would already have its own context. :)

Contributor Author:

Can you help me break this down?

I have not seen us use services within services (that's different from services with sub-services) and, even if we do, in its current state it seems like overkill. It's more code to manage the service, in particular because the State interface is satisfied by two different things (our state or the upstream Peer), which in my eyes brings us little benefit.

Contributor:

I'm only talking about state being a service (and not Peer), since it runs a background process and its lifecycle needs to be managed.

Another possibility is to use a custom context initialized in the constructor, and cancelled when stopping.
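
A rough sketch of the first option, with state as a service whose running context is cancelled on shutdown (msgc and replicate are assumed names; the services package is Cortex's pkg/util/services):

type state struct {
	services.Service
	msgc chan *clusterpb.Part // queued state updates to replicate (assumed field)
}

func newReplicatedState() *state {
	s := &state{msgc: make(chan *clusterpb.Part, 1)}
	// running receives a context that is cancelled when the service stops,
	// replacing the ad-hoc context.Background() above.
	s.Service = services.NewBasicService(nil, s.running, nil)
	return s
}

func (s *state) running(ctx context.Context) error {
	for {
		select {
		case p := <-s.msgc:
			s.replicate(ctx, p) // in-flight replication is cancelled on shutdown
		case <-ctx.Done():
			return nil
		}
	}
}

func (s *state) replicate(ctx context.Context, p *clusterpb.Part) {
	// fan the update out to the owning replicas via the ring and gRPC
}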

partialStateMergesTotal: promauto.With(r).NewCounterVec(prometheus.CounterOpts{
Name: "alertmanager_partial_state_merges_total",
Help: "Number of times we have received a partial state to merge for a key.",
}, []string{"key"}),

Contributor:

Do we need per-key metrics or would per-user (i.e. per-state) be enough?

Contributor Author:

per-key includes both user and state, so it would always be at most 2 series per user.

func (s *state) WaitReady() {
//TODO: At the moment we settle in a separate goroutine (see multitenant.go, where we create the Peer); we should
// mimic that behaviour here once we have full state replication.
s.Settle(context.Background(), time.Second)

Contributor:

We don't need to copy WaitReady from upstream. We can define our own interface with what makes sense for us, and adapt the upstream type to our interface via a wrapper.

Introducing context to upstream's WaitReady seems pretty straightforward too.
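
As a sketch of the wrapper idea (upstream's cluster.Peer.WaitReady() takes no arguments here, so the adapter bridges it to a context-aware signature; illustrative only, not code from this PR):

// peerAdapter wraps the upstream cluster.Peer so it satisfies our own
// context-aware interface.
type peerAdapter struct {
	*cluster.Peer
}

func (p peerAdapter) WaitReady(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		p.Peer.WaitReady() // blocking upstream call
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}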

gotjosh force-pushed the alertmanager-replication branch 2 times, most recently from 7a640d1 to 57f6670, on March 3, 2021 19:03
@gotjosh (Contributor Author) commented Mar 3, 2021

Needs #3903


Alertmanager typically uses the memberlist gossip based protocol to
replicate state across replicas. In cortex, we used the same fundamentals
to provide some sort of high availability mode.

Now that we have support for sharding instances across many machines, we
can leverage the ring to find the corresponding instances and send the
updates via gRPC.

Signed-off-by: gotjosh <josue@grafana.com>
Signed-off-by: gotjosh <josue@grafana.com>
Signed-off-by: gotjosh <josue@grafana.com>

@pstibrany (Contributor) left a comment:

LGTM in its current state, with the caveat that sharding support is still a work in progress (don't use it yet! :)). The non-sharding code seems to be unaffected, though.

pstibrany merged commit 2dae12a into cortexproject:master Mar 8, 2021
roystchiang pushed a commit to roystchiang/cortex that referenced this pull request Apr 6, 2022
* Alertmanager: Replicate state using the Ring

Alertmanager typically uses the memberlist gossip based protocol to
replicate state across replicas. In cortex, we used the same fundamentals
to provide some sort of high availability mode.

Now that we have support for sharding instances across many machines, we
can leverage the ring to find the corresponding instances and send the
updates via gRPC.

Signed-off-by: gotjosh <josue@grafana.com>

* Appease the linter and wordsmithing

Signed-off-by: gotjosh <josue@grafana.com>

* Always wait for the missing metrics

Signed-off-by: gotjosh <josue@grafana.com>
Successfully merging this pull request may close these issues.

Reuse ring memberlist client in Alertmanager