Skip to content

Commit

Permalink
Re-enable ASG scaling before final stabilisation
Browse files Browse the repository at this point in the history
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
  • Loading branch information
rtyley committed May 23, 2024
1 parent c743364 commit 6c8f8f7
Showing 1 changed file with 2 additions and 2 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -197,12 +197,12 @@ object AutoScaling extends DeploymentType with BucketParameters {
target.region,
terminationGrace(pkg, target, reporter) * 1000
),
ResumeAlarmNotifications(autoScalingGroup, target.region),
WaitForStabilization(
autoScalingGroup,
secondsToWait(pkg, target, reporter) * 1000,
target.region
),
ResumeAlarmNotifications(autoScalingGroup, target.region)
)
)
}
val groupsToUpdate: List[AutoScalingGroupInfo] =
Expand Down

0 comments on commit 6c8f8f7

Please sign in to comment.