Suspend AlarmNotifications at start of an autoscale deploy and resume after #83
Merged
… at end. This is so that autoscaling does not interfere with deploys.
@philwills will you take a look please?
:+1: Looks good
sihil added a commit that referenced this pull request on Apr 12, 2013
Suspend AlarmNotifications at start of an autoscale deploy and resume after
rtyley added a commit that referenced this pull request on May 23, 2024
This aims to address, to some extent, issue #1342: the problem that *apps cannot auto-scale* until an autoscaling deploy has successfully completed. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker.

Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy (`ResumeAlarmNotifications`), once deployment has successfully completed. In December 2016, with #403, an additional `WaitForStabilization` was added as a penultimate deploy step, with the aim of ensuring that the cull of old instances has _completed_ before the deploy ends.

However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description indicates this was a necessary choice. We can see the argument for it (the ASG will be quicker to stabilise if it's not being auto-scaled by alarms), but if the ASG instances are already overloaded and recycling, the ASG will _never_ stabilise, because it needs to scale up to handle the load it's experiencing.

By simply putting the final `WaitForStabilization` step _after_ `ResumeAlarmNotifications`, the Ophan outage would have been shortened from 1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or twice: the `WaitForStabilization` condition will still be easily satisfied, and the deploy will be reported as a success.
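The stabilisation condition and the step reordering described in this commit message can be illustrated with a minimal Python sketch. It is not Riff Raff's code: `AsgSnapshot`, `is_stabilized`, and `scaling_enabled_while_waiting` are hypothetical names invented for illustration, and the step lists are abbreviated to the three tasks named in the message.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AsgSnapshot:
    """Hypothetical point-in-time view of an ASG (not Riff Raff's real model)."""
    desired_capacity: int
    asg_in_service: int  # instances the ASG reports as in service
    elb_healthy: int     # instances the load balancer reports as healthy


def is_stabilized(snapshot: AsgSnapshot) -> bool:
    """True when both the ASG and the ELB instance counts match the desired size."""
    return (snapshot.asg_in_service == snapshot.desired_capacity
            and snapshot.elb_healthy == snapshot.desired_capacity)


# Abbreviated step orders; only the task names come from the commit message.
OLD_ORDER = ["SuspendAlarmNotifications", "WaitForStabilization", "ResumeAlarmNotifications"]
NEW_ORDER = ["SuspendAlarmNotifications", "ResumeAlarmNotifications", "WaitForStabilization"]


def scaling_enabled_while_waiting(steps: Sequence[str]) -> bool:
    """True if alarms are resumed before the deploy waits for stabilisation."""
    return steps.index("ResumeAlarmNotifications") < steps.index("WaitForStabilization")
```

Under these assumptions, an ASG that has scaled up once (say, desired size 4 with 4 instances in service and healthy) still satisfies `is_stabilized`, which is why resuming alarms first does not prevent the wait from succeeding.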
rtyley added a commit that referenced this pull request on May 30, 2024
rtyley added a commit that referenced this pull request on Jun 4, 2024
rtyley added a commit that referenced this pull request on Jun 4, 2024
This aims to address, to some extent, issue #1342: the problem that *apps cannot auto-scale* until an autoscaling deploy has successfully completed. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker.

Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy (`ResumeAlarmNotifications`), once deployment has successfully completed. In December 2016, with #403, an additional `WaitForStabilization` was added as a penultimate deploy step, with the aim of ensuring that the cull of old instances has _completed_ before the deploy ends.

However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and if the ASG instances are already overloaded and recycling, the ASG will _never_ stabilise, because it _needs to scale up_ to handle the load it's experiencing.

In this change, we introduce a new task, `WaitForCullToComplete`, that can establish whether the cull has completed or not, regardless of whether the ASG is scaling: it simply checks that there are no remaining instances tagged for termination. Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ that old instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`, and then `WaitForCullToComplete` _afterwards_.

With this change in place, the Ophan outage would have been shortened from 1 hour to ~2 minutes, a much better outcome!

Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has been factored out into a new `CullSummary` class.
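To illustrate why a "no instances remain tagged for termination" check is independent of scaling, here is a minimal Python sketch. It is only an assumption about the shape of the idea: the `Instance` type, the `TERMINATION_TAG` value, and the fields and properties of this `CullSummary` are hypothetical and are not taken from Riff Raff's actual `CullSummary` class.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical tag key; the real tag name is not given in the commit message.
TERMINATION_TAG = "marked-for-termination"


@dataclass
class Instance:
    """Hypothetical stand-in for an EC2 instance in the ASG."""
    instance_id: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class CullSummary:
    """Illustrative summary of which ASG instances are still awaiting the cull."""
    instances: List[Instance]

    @property
    def tagged_for_termination(self) -> List[Instance]:
        # Instances that were marked as old and have not yet gone away.
        return [i for i in self.instances if TERMINATION_TAG in i.tags]

    @property
    def cull_complete(self) -> bool:
        # Complete when no tagged instance remains, whatever the ASG size is.
        return not self.tagged_for_termination
```

In this sketch, adding untagged instances through a scale-up changes the size of `instances` but not `cull_complete`, so the check stays meaningful after alarms have been resumed.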