Suspend AlarmNotifications at start of an autoscale deploy and resume after #83

gklopper · 2013-04-11T17:12:06Z

This is so that autoscaling does not interfere with deploys.


gklopper · 2013-04-11T17:13:35Z

@philwills will you take a look please?

philwills · 2013-04-12T09:22:40Z

:+1: Looks good


This aims to address, to some extent, issue #1342 - the problem that *apps can not auto-scale* until an autoscaling deploy has successfully completed. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker.

Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy (`ResumeAlarmNotifications`), once deployment has successfully completed. In December 2016, with #403, an additional `WaitForStabilization` was added as the penultimate deploy step, with the aim of ensuring that the cull of old instances has _completed_ before the deploy ends.

However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description indicates this was a necessary choice. We can see the argument for it - obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by alarms - but if the ASG instances are already overloaded and recycling, the ASG will _never_ stabilise, because it needs to scale up to handle the load it's experiencing.

By simply putting the final `WaitForStabilization` step _after_ `ResumeAlarmNotifications`, the Ophan outage would have been shortened from 1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or twice: the `WaitForStabilization` condition will still be easily satisfied, and the deploy will be reported as a success.
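To make the reordering concrete, here is a minimal Python sketch of the stabilisation condition and of the two step orders. It is an illustration only, not Riff Raff's actual code: `AsgSnapshot`, `is_stabilized` and `alarms_enabled_during` are hypothetical names, and only the three steps named above are modelled (all other deploy steps are left out).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsgSnapshot:
    """Hypothetical point-in-time view of an ASG (field names are assumptions)."""
    desired_capacity: int
    asg_instance_count: int
    elb_instance_count: int


def is_stabilized(snapshot: AsgSnapshot) -> bool:
    """The condition as described above: ASG and ELB counts match the desired size."""
    return (
        snapshot.asg_instance_count == snapshot.desired_capacity
        and snapshot.elb_instance_count == snapshot.desired_capacity
    )


# Only the steps named in the description; every other deploy step is omitted.
ORDER_BEFORE = ["SuspendAlarmNotifications", "WaitForStabilization", "ResumeAlarmNotifications"]
ORDER_AFTER = ["SuspendAlarmNotifications", "ResumeAlarmNotifications", "WaitForStabilization"]


def alarms_enabled_during(step: str, order: list[str]) -> bool:
    """Return True if scaling alarms are active while `step` runs."""
    enabled = True  # alarms are assumed active before the deploy starts
    for current in order:
        if current == "SuspendAlarmNotifications":
            enabled = False
        elif current == "ResumeAlarmNotifications":
            enabled = True
        if current == step:
            return enabled
    raise ValueError(f"step {step!r} is not in the given order")
```

With `ORDER_BEFORE`, `alarms_enabled_during("WaitForStabilization", ...)` is false, so an overloaded ASG cannot scale up while the deploy waits. With `ORDER_AFTER` it is true, so the ASG can raise its desired size and then satisfy `is_stabilized` at the new size.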

This aims to address, to some extent, issue #1342 - the problem that *apps can not auto-scale* until an autoscaling deploy has successfully completed. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker.

Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy (`ResumeAlarmNotifications`), once deployment has successfully completed. In December 2016, with #403, an additional `WaitForStabilization` was added as the penultimate deploy step, with the aim of ensuring that the cull of old instances has _completed_ before the deploy ends.

However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and if the ASG instances are already overloaded and recycling, the ASG will _never_ stabilise, because it _needs to scale up_ to handle the load it's experiencing.

In this change, we introduce a new task, `WaitForCullToComplete`, that can establish whether the cull has completed or not, regardless of whether the ASG is scaling - it simply checks that there are no remaining instances tagged for termination. Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ that old instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`, and then run `WaitForCullToComplete` _afterwards_.

With this change in place, the Ophan outage would have been shortened from 1 hour to ~2 minutes, a much better outcome!

Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has been factored out into a new `CullSummary` class.
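As a rough illustration of why this check is independent of scaling, here is a small Python sketch. It is not the code from this change: the tag name, the `Instance` type, and the shape of `CullSummary` shown here are all assumptions made for the example. Only the idea, "the cull is complete when no instance still carries the termination tag", comes from the description above.

```python
from dataclasses import dataclass

# Hypothetical tag name; the real tag used by the deploy is not given here.
TERMINATION_TAG = "example-termination-tag"


@dataclass(frozen=True)
class Instance:
    """Hypothetical ASG instance, reduced to an id and a set of tag names."""
    instance_id: str
    tags: frozenset = frozenset()


@dataclass(frozen=True)
class CullSummary:
    """Illustrative stand-in for the shared logic; not the real class's API."""
    instances: tuple

    @property
    def tagged_for_termination(self) -> list:
        # Instances that were asked to terminate but are still present.
        return [i for i in self.instances if TERMINATION_TAG in i.tags]

    @property
    def is_cull_complete(self) -> bool:
        # Depends only on the tag, never on the total instance count,
        # so the ASG is free to scale up or down while we wait.
        return not self.tagged_for_termination
```

For example, an ASG holding one tagged old instance and three untagged new ones is not yet culled. Once the tagged instance has gone, the cull is complete whether the ASG then holds three instances or has scaled up to six.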

Suspend AlarmNotifications at start of an autoscale deploy and resume…

91f4ec0

… at end. This is so that autoscaling does not interfere with deploys.

sihil added a commit that referenced this pull request Apr 12, 2013

Merge pull request #83 from guardian/alarm-notifications

52bc76c

Suspend AlarmNotifications at start of an autoscale deploy and resume after

sihil merged commit 52bc76c into master Apr 12, 2013

philwills deleted the alarm-notifications branch April 12, 2013 10:48

rtyley mentioned this pull request May 23, 2024

Apps can not auto-scale until an autoscaling deploy has successfully completed #1342

Open

rtyley mentioned this pull request May 23, 2024

autoscaling deploy: re-enable ASG scaling before final stabilisation check #1345

Merged
