
Alertmanager: Replicate state using the Ring #3839

Merged
merged 3 commits into cortexproject:master from the alertmanager-replication branch on Mar 8, 2021

Conversation

@gotjosh (Contributor) commented Feb 18, 2021

What this PR does:

Alertmanager typically uses the memberlist gossip-based protocol to replicate state across replicas. In Cortex, we used the same fundamentals to provide a high-availability mode.

Now that we have support for sharding instances across many machines, we
can leverage the ring to find the corresponding instances and send the
updates via gRPC.
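
For a rough picture of the mechanism, the replication path looks something like the sketch below. This is an illustration only: the names (Ring, ReplicationSet, StateClient, Part, replicateStateForUser) stand in for the real Cortex and Alertmanager types and are not the exact API added by this PR.

// Hypothetical sketch: hash the tenant onto the ring's token space, look up
// the replicas that own that token, and push the state update to each of
// them over gRPC.
package sketch

import "context"

// Part is a partial state update (e.g. a silence or notification-log entry).
type Part struct {
	Key  string
	Data []byte
}

type Instance struct{ Addr string }

type ReplicationSet struct{ Instances []Instance }

type Ring interface {
	// Get returns the set of instances that own the given token.
	Get(token uint32) (ReplicationSet, error)
}

type StateClient interface {
	// UpdateState sends a partial state update to a single replica.
	UpdateState(ctx context.Context, p *Part) error
}

func replicateStateForUser(ctx context.Context, r Ring, clientFor func(addr string) (StateClient, error), token uint32, p *Part) error {
	replicas, err := r.Get(token)
	if err != nil {
		return err
	}
	for _, inst := range replicas.Instances {
		c, err := clientFor(inst.Addr)
		if err != nil {
			return err
		}
		if err := c.UpdateState(ctx, p); err != nil {
			return err
		}
	}
	return nil
}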

This is part of the proposal #3574, and follows up on #3664 and #3671.

Marking it as a draft as I have a few TODOs left to address, but the logic is pretty much what it says on the tin at the moment.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

gotjosh marked this pull request as draft February 18, 2021 20:46
gotjosh force-pushed the alertmanager-replication branch 6 times, most recently from da19b35 to 5e83714, on February 23, 2021 14:00
pracucci requested review from pracucci and pstibrany and removed request for pracucci February 23, 2021 17:17
gotjosh marked this pull request as ready for review February 23, 2021 20:09
@jtlisi (Contributor) commented Feb 25, 2021

Fixes #2650

gotjosh force-pushed the alertmanager-replication branch 3 times, most recently from 08c3a32 to 27686f0, on February 26, 2021 16:16
jtlisi linked an issue Mar 1, 2021 that may be closed by this pull request
gotjosh force-pushed the alertmanager-replication branch 2 times, most recently from 796ab51 to e747559, on March 2, 2021 14:11
@@ -189,14 +239,16 @@ func New(cfg *Config, reg *prometheus.Registry) (*Alertmanager, error) {
}

am.dispatcherMetrics = dispatch.NewDispatcherMetrics(am.registry)

am.state.WaitReady()

Contributor:

Two questions:

  1. Why is it needed now, when it wasn't needed before?
  2. New() is called while holding am.alertmanagersMtx.Lock() in MultitenantAlertmanager.setConfig(). What are the implications of waiting for ready? How long could it take before it's ready?

Generally speaking, I believe it's bad design to "wait for something" in a constructor function (like this New()), but I would like to better understand why we need it.

Contributor:

We may remove the WaitReady() call from this PR to unblock it and address it separately.

Contributor:

Generally speaking, I believe it's bad design to "wait for something" in a constructor function (like this New()), but I would like to better understand why we need it.

100%.

(Personally, I would like to see this follow our Services model, but I am not sure it makes sense here.)

type State interface {
AddState(string, cluster.State, prometheus.Registerer) cluster.ClusterChannel
Position() int
WaitReady()

Contributor:

WaitReady() is a blocking call, and should take a context as an argument so that the caller can cancel or time out waiting if needed. That also implies returning an error to communicate success.
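
For illustration, the context-aware shape being suggested could look like this (a sketch based on the State interface quoted above, not code from this PR):

type State interface {
	AddState(string, cluster.State, prometheus.Registerer) cluster.ClusterChannel
	Position() int
	// WaitReady blocks until the state has settled, or returns the context's
	// error if the caller cancels or times out first.
	WaitReady(ctx context.Context) error
}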

Contributor Author:

The implementations of WaitReady and Settle are part of the next PR; if you don't mind, I'd like to leave them out of this PR for now.

partialMerges: prometheus.NewDesc(
"cortex_alertmanager_partial_state_merges_total",
"Number of times we have received a partial state to merge for a key.",
[]string{"key"}, nil),

Contributor:

By using key as a label, we will have at least 2 label values per user, right? (one for notifications, one for silences). Do we need so many new metrics? Would it make sense to use "user" only?

Contributor Author:

We don't have user as part of these labels because key already includes the user: it's a combination of prefix + userID. So by using key we "technically get both", even though it breaks the nomenclature (of always using user).

Contributor:

We have per-user registries, so we could do per-user output, aggregating over all keys. WDYT?

Contributor Author:

Is that different from the current approach? An aggregation across all keys is not possible if the keys are per user, e.g. sil:user-3 or nfl:user-2.
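
For illustration, with the key label the output contains series like the following (made-up values, prefixes taken from the examples above), whereas a per-user aggregation would collapse them into one series per tenant:

cortex_alertmanager_partial_state_merges_total{key="sil:user-3"} 2
cortex_alertmanager_partial_state_merges_total{key="nfl:user-3"} 1

versus:

cortex_alertmanager_partial_state_merges_total{user="user-3"} 3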

Contributor:

If I understand it correctly, we have per-user registries, which only use the "key" label. During alertmanagerMetrics.Collect, we then call SendSumOfCountersWithLabels with the key label. We could instead call SendSumOfCountersPerUser (e.g. data.SendSumOfCountersPerUser(out, m.partialMerges, "alertmanager_partial_state_merges_total")), which would return sum(alertmanager_partial_state_merges_total) per user registry, and then add the user label to the output. I think that would be enough. WDYT?
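
Sketched concretely (assuming the descriptor m.partialMerges is redefined with a user label instead of key; both helper functions are the ones named in the comment above):

// Current: sum the counter across the per-user registries, keyed by the "key" label.
data.SendSumOfCountersWithLabels(out, m.partialMerges, "alertmanager_partial_state_merges_total", "key")

// Suggested: aggregate over all keys within each per-user registry and emit
// one series per user instead.
data.SendSumOfCountersPerUser(out, m.partialMerges, "alertmanager_partial_state_merges_total")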


// Settle waits until the alertmanagers are ready (and sets the appropriate internal state when it is).
// The idea is that we don't want to start working before we get a chance to know most of the notifications and/or silences.
func (s *state) Settle(ctx context.Context, _ time.Duration) {

Contributor:

Settle currently ignores the context. It should not do that, and should report an error back if the context is finished before settling has finished.
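
A minimal sketch of that suggestion, assuming a readyc channel that is closed once the initial state merge completes (not the code in this PR):

// Settle blocks until the state has settled or the context is done,
// whichever happens first, and reports which one it was.
func (s *state) Settle(ctx context.Context, _ time.Duration) error {
	select {
	case <-s.readyc: // assumed field, closed once the initial state merge has completed
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}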

Contributor Author:

I'm leaving the implementation of WaitReady and Settle to the next PR - is it OK if we skip them for now? These two involve full state replication, and I fear we need their implementation to make sense of the big picture.

}

s.stateReplicationTotal.WithLabelValues(p.Key).Inc()
ctx := context.Background() //TODO: I probably need a better context

Contributor:

Hint: if state were a services.Service (i.e. an object with a lifecycle in Cortex), it would already have its own context. :)

Contributor Author:

Can you help me break this down?

I have not seen us use services within services (that's different from services with sub-services) and, even if we do, in its current state it seems like overkill. It's more code to manage the service, in particular because the State interface is satisfied by two different things (our state or the upstream Peer), which in my eyes brings us little benefit.

Contributor:

I'm only talking about state being a service (and not Peer), since it runs a background process and its lifecycle needs to be managed.

Another possibility is to use a custom context initialized in the constructor, and cancelled when stopping.
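
A rough sketch of the first option, with state as a service whose running context is cancelled on shutdown (msgc and replicate are assumed names; the services package is Cortex's pkg/util/services):

type state struct {
	services.Service
	msgc chan *clusterpb.Part // queued state updates to replicate (assumed field)
}

func newReplicatedState() *state {
	s := &state{msgc: make(chan *clusterpb.Part, 1)}
	// running receives a context that is cancelled when the service stops,
	// replacing the ad-hoc context.Background() above.
	s.Service = services.NewBasicService(nil, s.running, nil)
	return s
}

func (s *state) running(ctx context.Context) error {
	for {
		select {
		case p := <-s.msgc:
			s.replicate(ctx, p) // in-flight replication is cancelled on shutdown
		case <-ctx.Done():
			return nil
		}
	}
}

func (s *state) replicate(ctx context.Context, p *clusterpb.Part) {
	// fan the update out to the owning replicas via the ring and gRPC
}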

partialStateMergesTotal: promauto.With(r).NewCounterVec(prometheus.CounterOpts{
Name: "alertmanager_partial_state_merges_total",
Help: "Number of times we have received a partial state to merge for a key.",
}, []string{"key"}),

Contributor:

Do we need per-key metrics or would per-user (i.e. per-state) be enough?

Contributor Author:

per-key includes both user and state, so it would always be at most 2 series per user.

func (s *state) WaitReady() {
//TODO: At the moment we settle in a separate goroutine (see multitenant.go, where we create the Peer); we should
// mimic that behaviour here once we have full state replication.
s.Settle(context.Background(), time.Second)

Contributor:

We don't need to copy WaitReady from upstream. We can define our own interface with what makes sense for us, and adapt the upstream type to our interface via a wrapper.

Introducing context to upstream's WaitReady seems pretty straightforward too.
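
As a sketch of the wrapper idea (upstream's cluster.Peer.WaitReady() takes no arguments here, so the adapter bridges it to a context-aware signature; illustrative only, not code from this PR):

// peerAdapter wraps the upstream cluster.Peer so it satisfies our own
// context-aware interface.
type peerAdapter struct {
	*cluster.Peer
}

func (p peerAdapter) WaitReady(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		p.Peer.WaitReady() // blocking upstream call
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}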

gotjosh force-pushed the alertmanager-replication branch 2 times, most recently from 7a640d1 to 57f6670, on March 3, 2021 19:03
@gotjosh (Contributor Author) commented Mar 3, 2021

Needs #3903


Alertmanager typically uses the memberlist gossip based protocol to
replicate state across replicas. In cortex, we used the same fundamentals
to provide some sort of high availability mode.

Now that we have support for sharding instances across many machines, we
can leverage the ring to find the corresponding instances and send the
updates via gRPC.

Signed-off-by: gotjosh <josue@grafana.com>
Signed-off-by: gotjosh <josue@grafana.com>
Signed-off-by: gotjosh <josue@grafana.com>

@pstibrany (Contributor) left a comment:

LGTM in its current state, with the caveat that sharding support is still a work in progress (don't use it yet! :)). The non-sharding code seems to be unaffected, though.

pstibrany merged commit 2dae12a into cortexproject:master Mar 8, 2021
roystchiang pushed a commit to roystchiang/cortex that referenced this pull request Apr 6, 2022
* Alertmanager: Replicate state using the Ring

Alertmanager typically uses the memberlist gossip based protocol to
replicate state across replicas. In cortex, we used the same fundamentals
to provide some sort of high availability mode.

Now that we have support for sharding instances across many machines, we
can leverage the ring to find the corresponding instances and send the
updates via gRPC.

Signed-off-by: gotjosh <josue@grafana.com>

* Appease the linter and wordsmithing

Signed-off-by: gotjosh <josue@grafana.com>

* Always wait for the missing metrics

Signed-off-by: gotjosh <josue@grafana.com>
Successfully merging this pull request may close these issues.

Reuse ring memberlist client in Alertmanager