This repository has been archived by the owner on Sep 17, 2024. It is now read-only.

Merge pull request #12 from guardian/aa/scale-out-mid-deploy
Add observations of mid-deploy scale-out events
akash1810 authored Sep 4, 2024
2 parents 784693a + 811b9d7 commit 3d0f02c
Showing 2 changed files with 128 additions and 0 deletions.
61 changes: 61 additions & 0 deletions observations/healthy-scale-out-mid-deploy.md
@@ -0,0 +1,61 @@
# What happens when a scale-out event occurs mid-deployment?

In this test we updated the [application version](../dist) from `ABC` to `XYZ` in the stack
[`ScalingAsgRollingUpdate`](../packages/cdk/lib/scaling-asg-rolling-update.ts) (CFN stack `playground-CODE-scaling-asg-rolling-update`).

The aim of this test is to understand the behaviour in the ["Ophan scenario"](https://github.com/guardian/riff-raff/issues/1342).

## Highlights
The deployment leaves the service under-capacity: the instances added by the mid-deploy scale-out are removed when the deployment completes.

> [!TIP]
> Full details can be seen in [the dashboard](https://metrics.gutools.co.uk/d/cdvsv1d6vhp1cb/testing-asg-rolling-update?orgId=1&from=1725344340000&to=1725344705000&var-App=scaling).

## Timeline
1. [Build number 98 was deployed](https://riffraff.gutools.co.uk/deployment/view/60d5b8d1-3535-4948-a096-adf924c8ee43) to start from a clean slate, running artifact `ABC`
2. [Build number 100 was deployed](https://riffraff.gutools.co.uk/deployment/view/49d4a159-0c64-4594-a95b-b9bd12205aa6) updating to use artifact `XYZ`
3. The CFN stack `playground-CODE-scaling-asg-rolling-update` begins to update the ASG, setting the min and desired to 6:

From the CFN events:
> Temporarily setting autoscaling group MinSize and DesiredCapacity to 6.

Consequently, the ASG now has a capacity of:

| Capacity | Value |
|----------|-------|
| Min | 6 |
| Desired | 6 |
| Max | 9 |

4. The [`scale-out` script](../script/scale-out) was executed twice, setting the desired from 6 to 7, then 7 to 8:

From the ASG activity:
> At 2024-09-03T06:22:07Z a user request executed policy playground-CODE-scaling-asg-rolling-update-ScaleOut-hhE4chjHHTWQ changing the desired capacity from 7 to 8.
> At 2024-09-03T06:22:15Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 7 to 8.

The ASG now has a capacity of:

| Capacity | Value |
|----------|-------|
| Min | 6 |
| Desired | 8 |
| Max | 9 |
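
The ASG activity messages quoted in step 4 follow a regular shape, so the capacity transitions can be pulled out mechanically when comparing runs. The sketch below is illustrative only: `parseCapacityChange` and `CapacityChange` are hypothetical names, and the pattern is inferred from the messages quoted in this document rather than from any documented AWS format.

```typescript
// Hypothetical helper: extracts the "from X to Y" capacity transition
// from an ASG activity message of the shape quoted in this document.
interface CapacityChange {
  at: string; // timestamp exactly as it appears in the message
  from: number;
  to: number;
}

function parseCapacityChange(message: string): CapacityChange | undefined {
  // Assumed message shape: "At <timestamp> ... capacity from <n> to <m>."
  const match = message.match(/^At (\S+) .* capacity from (\d+) to (\d+)\.$/);
  if (!match) return undefined;
  return {
    at: String(match[1]),
    from: Number(match[2]),
    to: Number(match[3]),
  };
}

const activity: string[] = [
  "At 2024-09-03T06:22:07Z a user request executed policy playground-CODE-scaling-asg-rolling-update-ScaleOut-hhE4chjHHTWQ changing the desired capacity from 7 to 8.",
  "At 2024-09-03T06:22:15Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 7 to 8.",
];

const changes = activity.map(parseCapacityChange);
console.log(changes);
```

Messages that do not describe a capacity change, such as the "selected for termination" lines, yield `undefined` and can be filtered out.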

5. CloudFormation receives `SUCCESS` signals and proceeds to terminate the old instances and alter the ASG capacity.

From the ASG activity:
> At 2024-09-03T06:23:23Z a user request update of AutoScalingGroup constraints to min: 3, max: 9, desired: 6 changing the desired capacity from 8 to 6.
> At 2024-09-03T06:23:29Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 5 to 3.
> At 2024-09-03T06:23:29Z instance i-0220958315f410f0b was selected for termination.
> At 2024-09-03T06:23:29Z instance i-02b01e80c9a3a6d99 was selected for termination.

Curiously, CloudFormation has not taken the scaled-out capacity from step 4 into account.

6. The CloudFormation update finishes, and the final capacity of the ASG is:

| Capacity | Value |
|----------|-------|
| Min | 3 |
| Desired | 3 |
| Max | 9 |

This is under-capacity: the two instances added by the scale-out in step 4 have been removed, even though no scale-in event has occurred.

The ASG capacity will be updated once the scale-out alarm is next evaluated.
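
The arithmetic behind "under-capacity" can be replayed as a small model. This is a sketch under stated assumptions, not a description of how CloudFormation or the ASG is implemented: the event names are hypothetical, the pre-deployment desired capacity of 3 is inferred from the final table, and each scale-out is assumed to add one instance, as in step 4.

```typescript
// Illustrative model only. Event names are hypothetical.
type CapacityEvent =
  | { kind: "rollingUpdateStart"; desired: number }
  | { kind: "scaleOut"; step: number }
  | { kind: "rollingUpdateFinish"; desired: number };

// Replays events against a starting desired capacity and returns the result.
function replay(initialDesired: number, events: CapacityEvent[]): number {
  return events.reduce<number>((desired, event) => {
    switch (event.kind) {
      case "scaleOut":
        // A scaling policy adjusts the desired capacity relative to its current value.
        return desired + event.step;
      case "rollingUpdateStart":
      case "rollingUpdateFinish":
        // CloudFormation writes an absolute value, discarding earlier adjustments.
        return event.desired;
    }
  }, initialDesired);
}

const baseline = 3; // assumption: desired capacity before the deployment

const deployment: CapacityEvent[] = [
  { kind: "rollingUpdateStart", desired: 6 }, // step 3
  { kind: "scaleOut", step: 1 }, // step 4: 6 to 7
  { kind: "scaleOut", step: 1 }, // step 4: 7 to 8
  { kind: "rollingUpdateFinish", desired: 3 }, // step 6
];

const observed = replay(baseline, deployment);
// What the same two scale-outs would give outside a deployment.
const expected = replay(baseline, [
  { kind: "scaleOut", step: 1 },
  { kind: "scaleOut", step: 1 },
]);
const shortfall = expected - observed;

console.log({ observed, expected, shortfall });
// { observed: 3, expected: 5, shortfall: 2 }
```

Under these assumptions the service ends the deployment two instances short of where the scale-out would otherwise have left it.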
67 changes: 67 additions & 0 deletions observations/unhealthy-scale-out-mid-deploy.md
@@ -0,0 +1,67 @@
# What happens when a scale-out event occurs mid-deployment of an unhealthy artifact?

In this test we updated the [application version](../dist) from `ABC` to the unhealthy `500` in the stack
[`ScalingAsgRollingUpdate`](../packages/cdk/lib/scaling-asg-rolling-update.ts) (CFN stack `playground-CODE-scaling-asg-rolling-update`).

The aim of this test is to understand the behaviour in the ["Ophan scenario"](https://github.com/guardian/riff-raff/issues/1342).

## Highlights
The deployment leaves the service over-capacity: after the rollback, the desired capacity stays at 8.

> [!TIP]
> Full details can be seen in [the dashboard](https://metrics.gutools.co.uk/d/cdvsv1d6vhp1cb/testing-asg-rolling-update?orgId=1&from=1725346500000&to=1725347100000&var-App=scaling).

## Timeline
1. [Build number 98 was deployed](https://riffraff.gutools.co.uk/deployment/view/b74ca6b1-8c76-4189-89c1-5b8e480b72e9) to start from a clean slate, running artifact `ABC`
2. [Build number 99 was deployed](https://riffraff.gutools.co.uk/deployment/view/273d3784-88a9-4445-9948-91fc0e1af389) updating to use artifact `500`
3. The CFN stack `playground-CODE-scaling-asg-rolling-update` begins to update the ASG, setting the min and desired to 6:

From the CFN events:
> Temporarily setting autoscaling group MinSize and DesiredCapacity to 6.

Consequently, the ASG now has a capacity of:

| Capacity | Value |
|----------|-------|
| Min | 6 |
| Desired | 6 |
| Max | 9 |

4. The [`scale-out` script](../script/scale-out) was executed twice, setting the desired from 6 to 7, then 7 to 8.
The ASG launches two new instances with the updated (broken) launch template.

From the ASG activity:
> At 2024-09-03T06:56:23Z a user request executed policy playground-CODE-scaling-asg-rolling-update-ScaleOut-hhE4chjHHTWQ changing the desired capacity from 7 to 8.
> At 2024-09-03T06:56:31Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 7 to 8.

The ASG now has a capacity of:

| Capacity | Value |
|----------|-------|
| Min | 6 |
| Desired | 8 |
| Max | 9 |

5. CloudFormation does not receive any `SUCCESS` signals, and proceeds to roll back.

From the CFN events:
> Rolling update initiated.
> Terminating 5 obsolete instance(s) in batches of 5, while keeping at least 3 instance(s) in service.
> Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.

6. CloudFormation has rolled back the launch template and now proceeds to launch new instances:

From the CFN events:
> New instance(s) added to autoscaling group - Waiting on 5 resource signal(s) with a timeout of PT5M.

From the ASG activity:
> At 2024-09-03T07:01:15Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 3 to 8.

7. The CloudFormation rollback completes, and the final capacity of the ASG is:

| Capacity | Value |
|----------|-------|
| Min | 3 |
| Desired | 8 |
| Max | 9 |

The desired capacity remains at the scaled-out value of 8 from step 4. The instances are running artifact `ABC`.

As expected, it is only at this stage that the target group has eight healthy hosts.
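
One way to read "over-capacity" is to compare the final desired capacity with what the same two scale-outs would have produced outside a deployment. The sketch below is illustrative only: `verdict` is a hypothetical helper, the pre-deployment desired capacity of 3 is inferred from the CFN and ASG messages above, and each scale-out is assumed to add one instance.

```typescript
type CapacityVerdict = "under-capacity" | "at capacity" | "over-capacity";

// Hypothetical helper: compares the final desired capacity with the
// pre-deployment baseline plus the instances requested by scale-outs.
function verdict(
  finalDesired: number,
  baseline: number,
  scaleOutSteps: number,
): CapacityVerdict {
  const expected = baseline + scaleOutSteps;
  if (finalDesired < expected) return "under-capacity";
  if (finalDesired > expected) return "over-capacity";
  return "at capacity";
}

const baseline = 3; // assumption: desired capacity before the deployment
const scaleOutSteps = 2; // step 4: 6 to 7, then 7 to 8
const finalDesired = 8; // step 7: desired capacity after the rollback

const surplus = finalDesired - (baseline + scaleOutSteps);
console.log(verdict(finalDesired, baseline, scaleOutSteps), surplus);
// over-capacity 3
```

On this reading the rollback leaves three more instances running than the scale-out alone would have requested.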
