This repository has been archived by the owner on Sep 17, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #12 from guardian/aa/scale-out-mid-deploy
Add observations of mid-deploy scale-out events
Showing 2 changed files with 128 additions and 0 deletions.
@@ -0,0 +1,61 @@
# What happens when a scale-out event occurs mid-deployment?

In this test we updated the [application version](../dist) from `ABC` to `XYZ` in the stack
[`ScalingAsgRollingUpdate`](../packages/cdk/lib/scaling-asg-rolling-update.ts) (CFN stack `playground-CODE-scaling-asg-rolling-update`).

The aim of this test is to understand the behaviour in the ["Ophan scenario"](https://github.com/guardian/riff-raff/issues/1342).

## Highlights
The deployment leaves the service under-capacity.

> [!TIP]
> Full details can be seen in [the dashboard](https://metrics.gutools.co.uk/d/cdvsv1d6vhp1cb/testing-asg-rolling-update?orgId=1&from=1725344340000&to=1725344705000&var-App=scaling).

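
To line the dashboard window up with the timestamps in the event logs, the `from` and `to` query parameters of the link can be converted to UTC. This is a minimal sketch that assumes those parameters are Unix epoch milliseconds (the usual Grafana convention); the helper name `dashboard_window` is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

# Dashboard link copied from the tip above.
DASHBOARD_URL = (
    "https://metrics.gutools.co.uk/d/cdvsv1d6vhp1cb/testing-asg-rolling-update"
    "?orgId=1&from=1725344340000&to=1725344705000&var-App=scaling"
)


def dashboard_window(url: str) -> tuple[str, str]:
    """Return the link's from/to bounds as ISO 8601 UTC timestamps.

    Assumption: both parameters are epoch milliseconds.
    """
    query = parse_qs(urlparse(url).query)

    def to_utc(epoch_ms: str) -> str:
        moment = datetime.fromtimestamp(int(epoch_ms) / 1000, tz=timezone.utc)
        return moment.strftime("%Y-%m-%dT%H:%M:%SZ")

    return to_utc(query["from"][0]), to_utc(query["to"][0])


print(dashboard_window(DASHBOARD_URL))
```

Under that assumption the window runs from 06:19:00Z to 06:25:05Z on 2024-09-03, which brackets the ASG activity timestamps quoted in the timeline.
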
## Timeline
1. [Build number 98 was deployed](https://riffraff.gutools.co.uk/deployment/view/60d5b8d1-3535-4948-a096-adf924c8ee43) to start from a clean slate, running artifact `ABC`.
2. [Build number 100 was deployed](https://riffraff.gutools.co.uk/deployment/view/49d4a159-0c64-4594-a95b-b9bd12205aa6), updating to use artifact `XYZ`.
3. The CFN stack `playground-CODE-scaling-asg-rolling-update` begins to update the ASG, setting the min and desired to 6.

   From the CFN events:
   > Temporarily setting autoscaling group MinSize and DesiredCapacity to 6.

   Consequently, the ASG now has a capacity of:

   | Capacity | Value |
   |----------|-------|
   | Min      | 6     |
   | Desired  | 6     |
   | Max      | 9     |

4. The [`scale-out` script](../script/scale-out) was executed twice, raising the desired capacity from 6 to 7, then from 7 to 8.

   From the ASG activity:
   > At 2024-09-03T06:22:07Z a user request executed policy playground-CODE-scaling-asg-rolling-update-ScaleOut-hhE4chjHHTWQ changing the desired capacity from 7 to 8.
   > At 2024-09-03T06:22:15Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 7 to 8.

   The ASG now has a capacity of:

   | Capacity | Value |
   |----------|-------|
   | Min      | 6     |
   | Desired  | 8     |
   | Max      | 9     |

5. CloudFormation receives `SUCCESS` signals and proceeds to terminate the old instances and to alter the ASG capacity.

   From the ASG activity:
   > At 2024-09-03T06:23:23Z a user request update of AutoScalingGroup constraints to min: 3, max: 9, desired: 6 changing the desired capacity from 8 to 6.
   > At 2024-09-03T06:23:29Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 5 to 3.
   > At 2024-09-03T06:23:29Z instance i-0220958315f410f0b was selected for termination.
   > At 2024-09-03T06:23:29Z instance i-02b01e80c9a3a6d99 was selected for termination.

   Curiously, CloudFormation has not acknowledged the updated desired capacity of 8 from step 4.
6. The CloudFormation update finishes, and the final capacity of the ASG is:

   | Capacity | Value |
   |----------|-------|
   | Min      | 3     |
   | Desired  | 3     |
   | Max      | 9     |

   This is under-capacity: the desired capacity has dropped from 8 to 3 even though no scale-in event has occurred.

   Once the scale-out alarm is next evaluated, the ASG capacity will be updated.
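
The effect on the desired capacity can be checked with a little arithmetic over the capacity tables in the timeline above. The following is an illustrative sketch only: the `Capacity` type, the `SNAPSHOTS` mapping and the `desired_change` helper are hypothetical names, and the values are copied from the tables rather than read from AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    """One row set from a capacity table in the timeline."""
    minimum: int
    desired: int
    maximum: int


# Values copied from the capacity tables in steps 3, 4 and 6.
SNAPSHOTS = {
    "step 3": Capacity(minimum=6, desired=6, maximum=9),  # rolling update starts
    "step 4": Capacity(minimum=6, desired=8, maximum=9),  # after two scale-outs
    "step 6": Capacity(minimum=3, desired=3, maximum=9),  # update finished
}


def desired_change(before: Capacity, after: Capacity) -> int:
    """Return the change in desired capacity between two snapshots."""
    return after.desired - before.desired


scale_out = desired_change(SNAPSHOTS["step 3"], SNAPSHOTS["step 4"])
after_deploy = desired_change(SNAPSHOTS["step 4"], SNAPSHOTS["step 6"])
print(f"scale-out added {scale_out}, deployment changed desired by {after_deploy}")
```

The two scale-outs add 2 to the desired capacity, while the end of the deployment reduces it by 5, leaving the ASG at its minimum of 3.
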
@@ -0,0 +1,67 @@
# What happens when a scale-out event occurs mid-deployment of an unhealthy artifact?

In this test we updated the [application version](../dist) from `ABC` to `500` in the stack
[`ScalingAsgRollingUpdate`](../packages/cdk/lib/scaling-asg-rolling-update.ts) (CFN stack `playground-CODE-scaling-asg-rolling-update`).

The aim of this test is to understand the behaviour in the ["Ophan scenario"](https://github.com/guardian/riff-raff/issues/1342).

## Highlights
The deployment leaves the service over-capacity.

> [!TIP]
> Full details can be seen in [the dashboard](https://metrics.gutools.co.uk/d/cdvsv1d6vhp1cb/testing-asg-rolling-update?orgId=1&from=1725346500000&to=1725347100000&var-App=scaling).

## Timeline
1. [Build number 98 was deployed](https://riffraff.gutools.co.uk/deployment/view/b74ca6b1-8c76-4189-89c1-5b8e480b72e9) to start from a clean slate, running artifact `ABC`.
2. [Build number 99 was deployed](https://riffraff.gutools.co.uk/deployment/view/273d3784-88a9-4445-9948-91fc0e1af389), updating to use artifact `500`.
3. The CFN stack `playground-CODE-scaling-asg-rolling-update` begins to update the ASG, setting the min and desired to 6.

   From the CFN events:
   > Temporarily setting autoscaling group MinSize and DesiredCapacity to 6.

   Consequently, the ASG now has a capacity of:

   | Capacity | Value |
   |----------|-------|
   | Min      | 6     |
   | Desired  | 6     |
   | Max      | 9     |

4. The [`scale-out` script](../script/scale-out) was executed twice, raising the desired capacity from 6 to 7, then from 7 to 8.
   The ASG launches two new instances with the updated (broken) launch template.

   From the ASG activity:
   > At 2024-09-03T06:56:23Z a user request executed policy playground-CODE-scaling-asg-rolling-update-ScaleOut-hhE4chjHHTWQ changing the desired capacity from 7 to 8.
   > At 2024-09-03T06:56:31Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 7 to 8.

   The ASG now has a capacity of:

   | Capacity | Value |
   |----------|-------|
   | Min      | 6     |
   | Desired  | 8     |
   | Max      | 9     |

5. CloudFormation does not receive any `SUCCESS` signals, and proceeds to roll back.

   From the CFN events:
   > Rolling update initiated.
   > Terminating 5 obsolete instance(s) in batches of 5, while keeping at least 3 instance(s) in service.
   > Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.

6. CloudFormation has rolled back the launch template and now proceeds to launch new instances.

   From the CFN events:
   > New instance(s) added to autoscaling group - Waiting on 5 resource signal(s) with a timeout of PT5M.

   From the ASG activity:
   > At 2024-09-03T07:01:15Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 3 to 8.

7. The CloudFormation rollback completes, and the final capacity of the ASG is:

   | Capacity | Value |
   |----------|-------|
   | Min      | 3     |
   | Desired  | 8     |
   | Max      | 9     |

   This is at (scaled out) capacity. The instances are running artifact `ABC`.

   It is worth noting that, as expected, the target group has eight healthy hosts only at this stage.
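
Capacity tables like the ones in this timeline could be produced from the Auto Scaling API rather than transcribed by hand. The sketch below is an assumption about how one might do that, not the method used for this test: `capacity_table` and `fetch_group` are hypothetical helpers, and the ASG name passed to `fetch_group` is a placeholder because the document only names the CFN stack. The boto3 call `describe_auto_scaling_groups` and the `MinSize`, `DesiredCapacity` and `MaxSize` response fields are part of the real API.

```python
def capacity_table(group: dict) -> str:
    """Render an ASG description as a Markdown capacity table."""
    rows = [
        ("Min", group["MinSize"]),
        ("Desired", group["DesiredCapacity"]),
        ("Max", group["MaxSize"]),
    ]
    lines = ["| Capacity | Value |", "|----------|-------|"]
    lines += [f"| {name} | {value} |" for name, value in rows]
    return "\n".join(lines)


def fetch_group(asg_name: str) -> dict:
    """Fetch one ASG description; needs boto3 and AWS credentials."""
    import boto3  # third-party, imported lazily so the renderer works without it

    client = boto3.client("autoscaling")
    response = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    return response["AutoScalingGroups"][0]


# Final state from step 7, written out by hand instead of calling AWS.
final_state = {"MinSize": 3, "DesiredCapacity": 8, "MaxSize": 9}
print(capacity_table(final_state))
```

Calling `capacity_table(fetch_group(name))` at each step of a test run would give a snapshot to paste into the timeline, with `name` set to the ASG created by the stack.
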