Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait again for stabilisation after instance termination #403

Merged
merged 1 commit into from
Dec 5, 2016

Conversation

sihil
Copy link
Contributor

@sihil sihil commented Dec 5, 2016

Fixing CheckForStabilization in #391 has revealed that an ASG can continue to be unstable after a deploy has completed. This happens because the CullInstancesWithTerminationTag task is asynchronous and causes an ASG to become unstable again (the ASG desired number < actual number) whilst the ASG kills off the excess instances.

This state doesn't last for long, but long enough that the CheckForStabilization task of a subsequent deploy can fail if it runs immediately after the initial deploy.

This PR introduces a small pause and then a further execution of the WaitForStabilization task to wait for the ASG to become stable again. I believe that this should probably only add the pause (default 10s) to most deploys, but might add another 30s if the check doesn't immediately pass.

@philwills
Copy link
Contributor

Making deploys slower (even if it's only apparent completion) makes me sad, but trading for consistency is worth a 👍

@jfsoul
Copy link
Contributor

jfsoul commented Dec 5, 2016

Looks good to me, I will keep an eye out for close-together deploys that may have previously caused some failures. Thanks!

@sihil sihil merged commit 2b40c17 into master Dec 5, 2016
@jukecraft jukecraft deleted the sh-stabilise-after-termination branch September 4, 2018 16:36
rtyley added a commit that referenced this pull request May 23, 2024
rtyley added a commit that referenced this pull request May 23, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
rtyley added a commit that referenced this pull request May 23, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
rtyley added a commit that referenced this pull request May 30, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
rtyley added a commit that referenced this pull request Jun 4, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
rtyley added a commit that referenced this pull request Jun 4, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_
`ResumeAlarmNotifications`, rather than _after_, and if the ASG instances are
already overloaded and recycling, the ASG will _never_ stabilise, because it _needs
to scale up_ to handle the load it's experiencing.

In this change, we introduce a new task, `WaitForCullToComplete`, that can establish
whether the cull has completed or not, regardless of whether the ASG is scaling -
it simply checks that there are no remaining instances tagged for termination.
Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ old
instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`,
and then `WaitForCullToComplete` _afterwards_.

With this change in place, the Ophan outage would have been shortened from
1 hour to ~2 minutes, a much better outcome!

Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has
been factored out into a new `CullSummary` class.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants