Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suspend AlarmNotifications at start of an autoscale deploy and resume after #83

Merged
merged 1 commit into from
Apr 12, 2013

Conversation

gklopper
Copy link
Contributor

This is so that autoscaling does not interfere with deploys.

riff-raff

… at end.

This is so that autoscaling does not interfere with deploys.
@gklopper
Copy link
Contributor Author

@philwills will you take a look please?

@philwills
Copy link
Contributor

:+1 Looks good

sihil added a commit that referenced this pull request Apr 12, 2013
Suspend AlarmNotifications at start of an autoscale deploy and resume after
@sihil sihil merged commit 52bc76c into master Apr 12, 2013
@philwills philwills deleted the alarm-notifications branch April 12, 2013 10:48
rtyley added a commit that referenced this pull request May 23, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
rtyley added a commit that referenced this pull request May 23, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
rtyley added a commit that referenced this pull request May 30, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
rtyley added a commit that referenced this pull request Jun 4, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_, and nothing in the PR description
indicates this was a necessary choice. We can see the argument for it -
obviously, the ASG will be quicker to stabilise if it's not being auto-scaled by
alarms - but if the ASG instances are already overloaded and recycling, the ASG
will _never_ stabilise, because it needs to scale up to handle the load it's
experiencing.

By simply putting the final `WaitForStabilization` step _after_
`ResumeAlarmNotifications`, the Ophan outage would have been shortened from
1 hour to ~2 minutes.

The `WaitForStabilization` step itself simply checks that the number of instances
in the ASG & ELB matches the _desired size_ of the ASG. So long as the ASG is not
scaling up & down very rapidly, we can easily tolerate the ASG scaling up once or
twice, and the `WaitForStabilization` condition will still be easily satisfied,
and the deploy will be reported as success.
rtyley added a commit that referenced this pull request Jun 4, 2024
This aims to address, to some extent, issue #1342 -
the problem that *apps can not auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_
`ResumeAlarmNotifications`, rather than _after_, and if the ASG instances are
already overloaded and recycling, the ASG will _never_ stabilise, because it _needs
to scale up_ to handle the load it's experiencing.

In this change, we introduce a new task, `WaitForCullToComplete`, that can establish
whether the cull has completed or not, regardless of whether the ASG is scaling -
it simply checks that there are no remaining instances tagged for termination.
Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ old
instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`,
and then `WaitForCullToComplete` _afterwards_.

With this change in place, the Ophan outage would have been shortened from
1 hour to ~2 minutes, a much better outcome!

Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has
been factored out into a new `CullSummary` class.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants