[Proposal] Scalable Alertmanager #3574
Introduces a series of changes for scaling the Alertmanager. Given the proposal is quite lengthy, I have tried to keep it to the point on most parts. Please do let me know if there's any area that requires expanding.
Signed-off-by: gotjosh <josue@grafana.com>
We can either run it as a separate service or embed it. **I propose we simply embed it**. At its core, it’ll be simpler to operate, and future work can make it possible to run it as a separate service so that operators can scale it when and if needed.

## Conclusion
One thing I have not talked about is sequencing. Is it worth mentioning that now? My plan is to leave the current alertmanager struct untouched and create a new one that can be enabled via a flag. The PRs can be slices of the sections described, keeping in mind that I can break those down as I see fit to prioritise simpler and less risky reviews.
I'd be interested in your ideas for sequencing, but seems like it could be separate from this PR.
I'm prepping the first PR, I'll tag you in it so that you can get an idea of what my plan is.
Signed-off-by: Marco Pracucci <marco@pracucci.com>
We discussed offline and LGTM, thanks! I just pushed a commit to fix the image location.
![Scalable Alertmanager Architecture](/images/proposals/scalable-am.png)

**POST /api/v1/alerts (from the ruler) can go to any Alertmanager replica.** The AM distributor uses the ring to write alerts to a quorum of AM managers (reusing the existing code). We continue to use the same in-memory data structure from the upstream Alertmanager to save alerts and notify other pieces
question: In this architecture the ruler will still use HTTP to forward alerts to the Alertmanager. In order to avoid multiple deliveries, will the ruler only be configured with a single address? If so, does this assume the user will need to place a load balancer between the ruler and the am-distributor to achieve an equal distribution of traffic and ensure availability?
I think the short answer is yes.
Just like today, you need to use a headless service (or include all the Alertmanager URLs manually) to avoid load balancing traffic between Alertmanagers. With this change, you'll need to ensure you have a way to load balance traffic between the ruler and the Alertmanager.
👍
LGTM
The proposal LGTM. Thanks for doing this.
To achieve horizontal scalability, we need to distribute the workload among replicas of the service. We need to choose an appropriate field to use to distribute the workload. The field must be present on all the API requests to the Alertmanager service.

**We propose sharding on tenant ID**. The simplicity of this implementation would allow us to get up and running relatively quickly whilst helping us validate assumptions. We intend to use the existing ring code to manage this. Other options such as tenant ID + receiver or tenant ID + route are relatively complex, as distributor components (in this case the Ruler) would need to be aware of Alertmanager configuration.
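As a rough illustration of sharding on tenant ID, the sketch below hashes the tenant ID onto a token ring and picks the replica that owns the next token. It is a simplified stand-in, not the existing Cortex ring code: the `token` type, the `owner` function, the FNV hash and the replica names are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// token is a hypothetical ring entry: a point in the 32-bit hash space
// together with the replica that owns it.
type token struct {
	value   uint32
	replica string
}

// shardKey hashes a tenant ID onto the ring's hash space.
func shardKey(tenantID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	return h.Sum32()
}

// owner returns the replica owning the first token at or after the
// tenant's hash, wrapping around to the first token if there is none.
// The ring must be non-empty and sorted by token value.
func owner(ring []token, tenantID string) string {
	key := shardKey(tenantID)
	i := sort.Search(len(ring), func(i int) bool { return ring[i].value >= key })
	if i == len(ring) {
		i = 0 // wrap around the ring
	}
	return ring[i].replica
}

func main() {
	ring := []token{
		{value: 1 << 30, replica: "am-0"},
		{value: 2 << 30, replica: "am-1"},
		{value: 3 << 30, replica: "am-2"},
	}
	for _, tenant := range []string{"tenant-a", "tenant-b", "tenant-c"} {
		fmt.Println(tenant, "->", owner(ring, tenant))
	}
}
```

Because the tenant ID is the only input, any component that knows the tenant of a request can locate the owning replica without reading the tenant's Alertmanager configuration, which is the simplicity argument made above.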
I think this con is fine for the time being. If the current design can support ~2000 average tenants, it could, with a little handholding, support a tenant 2000 times the normal size by fiddling with sharding overrides, so I don't think this limitation is reason enough not to choose the simple option 1, shard on tenant ID.
Going to merge it. As usual, we'll address any post-merge comment.
The first part of the work proposed in cortexproject#3574 introduces sharding via the ring for the Alertmanager component. Signed-off-by: gotjosh <josue@grafana.com>
* Alertmanager: Allow sharding of alertmanager tenants. The first part of the work proposed in cortexproject/cortex#3574 introduces sharding via the ring for the Alertmanager component. Signed-off-by: gotjosh <josue@grafana.com>
* Appease the linter. Signed-off-by: gotjosh <josue@grafana.com>
* Update CHANGELOG to warn about Alertmanager sharding. Signed-off-by: gotjosh <josue@grafana.com>
* Fix, last typo. Signed-off-by: gotjosh <josue@grafana.com>

Former-commit-id: 3aba107
What this PR does:
Introduces a series of changes for scaling the Alertmanager. Given the proposal is quite lengthy, I have tried to keep it to the point on most parts.
Please do let me know if there's an area that requires expanding.
Signed-off-by: gotjosh <josue@grafana.com>
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`