[Proposal] Scalable Alertmanager #3574

Merged 3 commits on Dec 17, 2020


Contributor

@gotjosh gotjosh commented Dec 8, 2020

What this PR does:

Introduces a series of changes for scaling the Alertmanager. Given the proposal is quite lengthy, I have tried to keep it to the point on most parts.

Please do let me know if there's an area that requires expanding.

Signed-off-by: gotjosh <josue@grafana.com>

Checklist

  • N/A Tests updated
  • Documentation added
  • N/A CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

We can either run it as a separate service or embed it. **I propose we simply embed it**: at its core, it’ll be simpler to operate. Future work could make it possible to run it as a separate service so that operators can scale it when/if needed.


## Conclusion
Contributor Author

One thing I have not talked about is sequencing - is it worth mentioning that now? My plan is to leave the current alertmanager struct untouched and create a new one that can be selected via a flag. The PRs can be slices of the sections described, keeping in mind that I may break those down further as I see fit to prioritise simpler and less risky reviews.

Contributor

I'd be interested in your ideas for sequencing, but seems like it could be separate from this PR.

Contributor Author

I'm prepping the first PR, I'll tag you in it so that you can get an idea of what my plan is.

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Contributor

@pracucci pracucci left a comment

We discussed offline and LGTM, thanks! I just pushed a commit to fix the image location.


![Scalable Alertmanager Architecture](/images/proposals/scalable-am.png)

**POST /api/v1/alerts (from the ruler) can go to any Alertmanager replica.** The AM distributor uses the ring to write alerts to a quorum of AM managers (reusing the existing code). We continue to use the same in-memory data structure from the upstream Alertmanager to save alerts and notify other pieces
Contributor

question: In this architecture, the ruler will still use HTTP to forward alerts to the Alertmanager. To avoid multiple deliveries, will the ruler only be configured with a single address? If so, does this assume the user will need to place a load balancer between the ruler and the am-distributor to achieve an equal distribution of traffic and ensure availability?

Contributor Author

I think the short answer is yes.

Just like today, you need to use a headless service (or include all the Alertmanager URLs manually) to avoid load balancing traffic between Alertmanagers. You'll need to ensure you have a way to load balance traffic between the ruler and the Alertmanager.

Contributor

👍
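
For readers following this thread, the quorum write mentioned in the proposal excerpt above can be illustrated with a minimal Go sketch. Everything here is an assumption made for illustration: the `Alert`, `Replica`, `memReplica`, and `WriteQuorum` names are hypothetical, quorum is taken to mean a simple majority, and none of this is the actual Cortex distributor or ring code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Alert is a hypothetical, minimal alert payload.
type Alert struct {
	Tenant string
	Name   string
}

// Replica is a hypothetical interface for one Alertmanager instance.
type Replica interface {
	Push(a Alert) error
}

// memReplica keeps alerts in memory; fail simulates an unavailable instance.
type memReplica struct {
	mu     sync.Mutex
	alerts []Alert
	fail   bool
}

func (m *memReplica) Push(a Alert) error {
	if m.fail {
		return errors.New("replica unavailable")
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.alerts = append(m.alerts, a)
	return nil
}

// WriteQuorum pushes the alert to every replica concurrently, waits for all
// responses, and succeeds only if a majority (len/2 + 1) acknowledged it.
func WriteQuorum(replicas []Replica, a Alert) error {
	if len(replicas) == 0 {
		return errors.New("no replicas")
	}
	results := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(r Replica) { results <- r.Push(a) }(r)
	}
	needed := len(replicas)/2 + 1
	acks := 0
	for range replicas {
		if err := <-results; err == nil {
			acks++
		}
	}
	if acks < needed {
		return fmt.Errorf("quorum not reached: %d of %d acks, need %d", acks, len(replicas), needed)
	}
	return nil
}

func main() {
	// Three replicas, one of them down: a majority can still acknowledge.
	replicas := []Replica{&memReplica{}, &memReplica{fail: true}, &memReplica{}}
	err := WriteQuorum(replicas, Alert{Tenant: "tenant-a", Name: "HighLatency"})
	fmt.Println("write error:", err)
}
```

Because the write tolerates a minority of failed replicas, the ruler can send `POST /api/v1/alerts` to any single address, which is why the load-balancing question above matters for availability rather than for correctness.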

Contributor

@jtlisi jtlisi left a comment

LGTM

Contributor

@ranton256 ranton256 left a comment

The proposal LGTM. Thanks for doing this.


To achieve horizontal scalability, we need to distribute the workload among replicas of the service. We need to choose an appropriate field to use to distribute the workload. The field must be present on all the API requests to the Alertmanager service.

**We propose sharding on tenant ID**. The simplicity of this implementation would allow us to get up and running relatively quickly whilst helping us validate assumptions. We intend to use the existing ring code to manage this. Other options, such as tenant ID + receiver or tenant ID + route, are relatively complex because distributor components (in this case the Ruler) would need to be aware of the Alertmanager configuration.
Contributor

I think this con is fine for the time being. If the current design can support ~2000 average tenants, it could, with a little handholding, support a tenant 2000 times normal by fiddling with sharding overrides, so I don't think this limitation is reason enough not to choose the simple option 1, shard on tenant ID.
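
To make the tenant ID sharding discussed in this thread more concrete, here is a small Go sketch of a hash ring that maps a tenant to a set of replicas. It is an illustration under stated assumptions: the `Ring`, `NewRing`, and `ReplicasFor` names, the FNV hash, and the token count are all hypothetical, and this is a simplified stand-in rather than the existing Cortex ring code the proposal intends to reuse.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// token is one point on the hash ring, owned by a replica.
type token struct {
	value   uint32
	replica string
}

// Ring is a hypothetical, simplified hash ring (not the Cortex ring package).
type Ring struct {
	tokens []token // sorted by value
}

// hash32 hashes a string with FNV-1a (an arbitrary choice for this sketch).
func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places tokensPerReplica tokens on the ring for each replica.
func NewRing(replicas []string, tokensPerReplica int) *Ring {
	r := &Ring{}
	for _, name := range replicas {
		for i := 0; i < tokensPerReplica; i++ {
			v := hash32(fmt.Sprintf("%s-%d", name, i))
			r.tokens = append(r.tokens, token{value: v, replica: name})
		}
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i].value < r.tokens[j].value })
	return r
}

// ReplicasFor returns up to rf distinct replicas for a tenant, walking the
// ring clockwise from the first token at or after the tenant's hash.
func (r *Ring) ReplicasFor(tenantID string, rf int) []string {
	if len(r.tokens) == 0 {
		return nil
	}
	key := hash32(tenantID)
	start := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i].value >= key })
	seen := map[string]bool{}
	var out []string
	for i := 0; i < len(r.tokens) && len(out) < rf; i++ {
		t := r.tokens[(start+i)%len(r.tokens)] // wraps around the ring
		if !seen[t.replica] {
			seen[t.replica] = true
			out = append(out, t.replica)
		}
	}
	return out
}

func main() {
	ring := NewRing([]string{"am-0", "am-1", "am-2"}, 64)
	for _, tenant := range []string{"tenant-a", "tenant-b", "tenant-c"} {
		fmt.Println(tenant, "->", ring.ReplicasFor(tenant, 2))
	}
}
```

Since only the tenant ID is hashed, any component that knows the ring can locate a tenant's replicas without reading the Alertmanager configuration, which is the simplicity argument made for option 1.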

@pracucci
Contributor

Going to merge it. As usual, we'll address any post-merge comment.

@pracucci pracucci merged commit 48e4ae3 into cortexproject:master Dec 17, 2020
gotjosh added a commit to gotjosh/cortex that referenced this pull request Jan 19, 2021
The first part of the proposal in cortexproject#3574 introduces sharding via the ring for the Alertmanager component.

Signed-off-by: gotjosh <josue@grafana.com>
pracucci pushed a commit that referenced this pull request Jan 19, 2021
* Alertmanager: Allow sharding of alertmanager tenants

The first part of the proposal in #3574 introduces sharding via the ring for the Alertmanager component.

Signed-off-by: gotjosh <josue@grafana.com>

* Appease the linter

Signed-off-by: gotjosh <josue@grafana.com>

* Update CHANGELOG to warn about Alertmanager sharding

Signed-off-by: gotjosh <josue@grafana.com>

* Fix, last typo.

Signed-off-by: gotjosh <josue@grafana.com>
tomwilkie pushed a commit to grafana/mimir that referenced this pull request Jul 13, 2021
Former-commit-id: 3aba107
aknuds1 pushed a commit to grafana/dskit that referenced this pull request Aug 28, 2021
treid314 pushed a commit to treid314/dskit that referenced this pull request Sep 1, 2021
aknuds1 pushed a commit to grafana/dskit that referenced this pull request Sep 2, 2021
aknuds1 pushed a commit to grafana/dskit that referenced this pull request Sep 7, 2021
aknuds1 pushed a commit to grafana/dskit that referenced this pull request Sep 8, 2021
aknuds1 pushed a commit to grafana/dskit that referenced this pull request Sep 9, 2021
aknuds1 pushed a commit to grafana/dskit that referenced this pull request May 26, 2023