Wait again for stabilisation after instance termination #403

sihil · 2016-12-05T12:59:28Z

Fixing CheckForStabilization in #391 has revealed that an ASG can continue to be unstable after a deploy has completed. This happens because the CullInstancesWithTerminationTag task is asynchronous and causes an ASG to become unstable again (the ASG desired number < actual number) whilst the ASG kills off the excess instances.

This state doesn't last for long, but long enough that the CheckForStabilization task of a subsequent deploy can fail if it runs immediately after the initial deploy.

This PR introduces a small pause and then a further execution of the WaitForStabilization task to wait for the ASG to become stable again. I believe that this should probably only add the pause (default 10s) to most deploys, but might add another 30s if the check doesn't immediately pass.

philwills · 2016-12-05T15:04:18Z

Making deploys slower (even if it's only apparent completion) makes me sad, but trading for consistency is worth a 👍

jfsoul · 2016-12-05T15:10:17Z

Looks good to me, I will keep an eye out for close-together deploys that may have previously caused some failures. Thanks!

#403

This aims to address, to some extent, issue #1342 - the problem that *apps can not auto-scale* until an autoscaling deploy has successfully completed. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker. Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy, (`ResumeAlarmNotifications`) once deployment has successfully completed. In December 2016, with #403, an additional `WaitForStabilization` was added as a penultimate deploy step, with the aim of ensuring that the cull of old instances has _completed_ before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description indicates this was a necessary choice. We can see the argument for it - obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by alarms - but if the ASG instances are already overloaded and recycling, the ASG will _never_ stabilise, because it needs to scale up to handle the load it's experiencing. By simply putting the final `WaitForStabilization` step _after_ `ResumeAlarmNotifications`, the Ophan outage would have been shortened from 1 hour to ~2 minutes. The `WaitForStabilization` step itself simply checks that the number of instances in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or twice, and the `WaitForStabilization` condition will still be easily satisfied, and the deploy will be reported as success.

This aims to address, to some extent, issue #1342 - the problem that *apps can not auto-scale* until an autoscaling deploy has successfully completed. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker. Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy, (`ResumeAlarmNotifications`) once deployment has successfully completed. In December 2016, with #403, an additional `WaitForStabilization` was added as a penultimate deploy step, with the aim of ensuring that the cull of old instances has _completed_ before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and if the ASG instances are already overloaded and recycling, the ASG will _never_ stabilise, because it _needs to scale up_ to handle the load it's experiencing. In this change, we introduce a new task, `WaitForCullToComplete`, that can establish whether the cull has completed or not, regardless of whether the ASG is scaling - it simply checks that there are no remaining instances tagged for termination. Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ old instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`, and then `WaitForCullToComplete` _afterwards_. With this change in place, the Ophan outage would have been shortened from 1 hour to ~2 minutes, a much better outcome! Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has been factored out into a new `CullSummary` class.

Wait again for stabilisation after instance termination

6e894a3

sihil assigned jfsoul Dec 5, 2016

sihil merged commit 2b40c17 into master Dec 5, 2016

jukecraft deleted the sh-stabilise-after-termination branch September 4, 2018 16:36

rtyley added a commit that referenced this pull request May 23, 2024

Re-enable ASG scaling before final stabilisation

94f50b9

#403

rtyley mentioned this pull request May 23, 2024

autoscaling deploy: re-enable ASG scaling before final stabilisation check #1345

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wait again for stabilisation after instance termination #403

Wait again for stabilisation after instance termination #403

sihil commented Dec 5, 2016

philwills commented Dec 5, 2016

jfsoul commented Dec 5, 2016 •

edited

Loading

Wait again for stabilisation after instance termination #403

Wait again for stabilisation after instance termination #403

Conversation

sihil commented Dec 5, 2016

philwills commented Dec 5, 2016

jfsoul commented Dec 5, 2016 • edited Loading

jfsoul commented Dec 5, 2016 •

edited

Loading