Skip to content
This repository has been archived by the owner on Sep 17, 2024. It is now read-only.

Commit

Permalink
docs: Add observation for partially scaled up deployment (#9)
Browse files Browse the repository at this point in the history
  • Loading branch information
jacobwinch authored Sep 3, 2024
1 parent 614a024 commit 784693a
Showing 1 changed file with 54 additions and 0 deletions.
54 changes: 54 additions & 0 deletions observations/healthy-to-healthy-partially-scaled.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# What happens when deploying a 'good' build when the service is already partially scaled up?

In this test we went from [application version](../dist) `ABC` to `XYZ` in the stack:
- [`ScalingAsgRollingUpdate`](../packages/cdk/lib/scaling-asg-rolling-update.ts) (CFN stack `playground-CODE-scaling-asg-rolling-update`)

The main aim of this test was to establish whether deploying whilst a service is partially scaled up works as desired.

## Highlights

The current implementation leads to a temporary scale down during deployment; this is undesirable!

Some potential options for mitigating this problem are discussed below.

## Timeline

1. [Build number 79 was deployed](https://riffraff.gutools.co.uk/deployment/view/322bf36b-fd4d-4fe4-bf72-ce56bb789a08) (in order to start the test from a clean state - running build `ABC`)
2. Artificial traffic was sent to `https://scaling.rolling-update.gutools.co.uk/healthcheck` in order to trigger a scale-up event
3. The service scales up to 6 instances (from 3)
4. [Build number 80 was deployed](https://riffraff.gutools.co.uk/deployment/view/da3d810c-d055-4765-9602-141233e82b45) (which updates to build `XYZ`)
5. The CFN stack `playground-CODE-scaling-asg-rolling-update` started updating:

First:
> Rolling update initiated. Terminating 6 obsolete instance(s) in batches of 6, while keeping at least 3 instance(s) in service. Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.
Then the ASG capacity is updated:
> Temporarily setting autoscaling group MinSize and DesiredCapacity to 9.
6. Once three `SUCCESS` signals are received, six instances are terminated. _At this point we are under-provisioned._
7. Three more new instances are launched.
8. Once three more `SUCCESS` signals are received, the deployment completes. _At this point we are provisioned correctly again._

Unfortunately this means that the deployment causes us to temporarily run with 3 instances serving traffic when we really need 6 to cope with the load
(see [healthy hosts panel](https://metrics.gutools.co.uk/goto/Tt1IPB3SR?orgId=1)).

Full details can be seen in the [dashboard](https://metrics.gutools.co.uk/d/cdvsv1d6vhp1cb/testing-asg-rolling-update?orgId=1&from=1725025200000&to=1725026399000&var-App=scaling).

## Potential Mitigations

I think this problem could be mitigated if:

1. We restricted [maxBatchSize](https://github.com/guardian/cdk/blob/00ef0467d7797629015f088f969e2bcdab472046/src/experimental/patterns/ec2-app.ts#L49) to 1[^1].

OR

2. We (Riff-Raff?) injected the desired capacity at the start of the deployment in order to set the [`minInstancesInService` property](https://github.com/guardian/cdk/blob/00ef0467d7797629015f088f969e2bcdab472046/src/experimental/patterns/ec2-app.ts#L50).

Unfortunately both of these approaches have drawbacks.

Option 1 still involves being slightly under-provisioned (by 1 instance). It would also slow down deployments considerably, and may not be viable if
a service is running a lot of instances.

Option 2 increases complexity and introduces a race condition; if the service scales up after the desired capacity is checked, the deployment will still cause a temporary scale down.

[^1]: We also tried setting `maxBatchSize=minimumCapacity`, but the behaviour was worse than the current solution. The desired capacity was never increased; instead AWS terminated 3 instances and waited for 3 new ones (then repeated this process), meaning that we were under-provisioned twice during the deployment (instead of once).

0 comments on commit 784693a

Please sign in to comment.