# What happens when deploying a 'good' build when the service is already partially scaled up?

In this test we went from [application version](../dist) `ABC` to `XYZ` in the stack:
- [`ScalingAsgRollingUpdate`](../packages/cdk/lib/scaling-asg-rolling-update.ts) (CFN stack `playground-CODE-scaling-asg-rolling-update`)

The main aim of this test was to establish whether deploying whilst a service is partially scaled up works as desired.

## Highlights

The current implementation leads to a temporary scale-down during deployment; this is undesirable!

Some potential options for mitigating this problem are discussed below.

## Timeline

1. [Build number 79 was deployed](https://riffraff.gutools.co.uk/deployment/view/322bf36b-fd4d-4fe4-bf72-ce56bb789a08) (in order to start the test from a clean state, running build `ABC`)
2. Artificial traffic was sent to `https://scaling.rolling-update.gutools.co.uk/healthcheck` in order to trigger a scale-up event
3. The service scaled up from 3 instances to 6
4. [Build number 80 was deployed](https://riffraff.gutools.co.uk/deployment/view/da3d810c-d055-4765-9602-141233e82b45) (which updated to build `XYZ`)
5. The CFN stack `playground-CODE-scaling-asg-rolling-update` started updating:

   First:

   > Rolling update initiated. Terminating 6 obsolete instance(s) in batches of 6, while keeping at least 3 instance(s) in service. Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.

   Then the ASG capacity was updated:

   > Temporarily setting autoscaling group MinSize and DesiredCapacity to 9.
6. Once three `SUCCESS` signals were received (from the three new instances launched when the capacity was raised to 9), the six obsolete instances were terminated. _At this point we are under-provisioned._
7. Three more new instances were launched.
8. Once three more `SUCCESS` signals were received, the deployment completed. _At this point we are provisioned correctly again._

Unfortunately this means that the deployment causes us to temporarily run with 3 instances serving traffic when we really need 6 to cope with the load
(see [healthy hosts panel](https://metrics.gutools.co.uk/goto/Tt1IPB3SR?orgId=1)).
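
For context, the shape of rolling-update policy that produces this behaviour can be sketched in plain `aws-cdk-lib` terms as below. The values are illustrative, chosen to mirror the CFN events above rather than the stack's actual configuration; in this repo the equivalent knobs are the `maxBatchSize` and `minInstancesInService` properties of guardian/cdk's experimental ec2-app pattern (linked under Potential Mitigations).

```ts
import { Duration } from "aws-cdk-lib";
import { UpdatePolicy } from "aws-cdk-lib/aws-autoscaling";

// Illustrative values only: a batch size covering the whole (scaled-up) capacity,
// with minInstancesInService left at the minimum, is what yields the
// "batches of 6 ... keeping at least 3 instance(s) in service" event above.
const rollingUpdate = UpdatePolicy.rollingUpdate({
  maxBatchSize: 6, // replace all obsolete instances in one batch
  minInstancesInService: 3, // only the pre-scale-up minimum is guaranteed
  waitOnResourceSignals: true,
  pauseTime: Duration.minutes(5), // the PT5M signal timeout seen in the CFN events
});
// `rollingUpdate` would then be passed as the `updatePolicy` prop of the AutoScalingGroup.
```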

Full details can be seen in the [dashboard](https://metrics.gutools.co.uk/d/cdvsv1d6vhp1cb/testing-asg-rolling-update?orgId=1&from=1725025200000&to=1725026399000&var-App=scaling).

## Potential Mitigations

I think this problem could be mitigated if:

1. We restricted [maxBatchSize](https://github.com/guardian/cdk/blob/00ef0467d7797629015f088f969e2bcdab472046/src/experimental/patterns/ec2-app.ts#L49) to 1[^1].

OR

2. We (Riff-Raff?) injected the desired capacity at the start of the deployment in order to set the [`minInstancesInService` property](https://github.com/guardian/cdk/blob/00ef0467d7797629015f088f969e2bcdab472046/src/experimental/patterns/ec2-app.ts#L50).

Unfortunately both of these approaches have drawbacks.

Option 1 still involves being slightly under-provisioned (by 1 instance). It would also slow down deployments considerably, and may not be viable if a service is running a lot of instances.
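
As a rough illustration: with `maxBatchSize: 1`, replacing the 6 instances in this test would need 6 launch-and-signal cycles rather than the 2 observed above; if each cycle took anywhere near the PT5M signal timeout, the rolling update alone could take on the order of 30 minutes, and proportionally longer for bigger fleets.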

Option 2 increases complexity and introduces a race condition; if the service scales up after the desired capacity is checked, the deployment will still cause a temporary scale-down.
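
To make option 2 concrete, a minimal sketch of the capacity lookup it would need is below. The helper name, and the idea of feeding the result into the template via something like a `MinInstancesInService` CloudFormation parameter, are assumptions for illustration; nothing like this exists in Riff-Raff or the stack today.

```ts
import {
  AutoScalingClient,
  DescribeAutoScalingGroupsCommand,
} from "@aws-sdk/client-auto-scaling";

// Hypothetical helper: read the ASG's desired capacity at the start of a deployment
// so it could be injected as the stack's minInstancesInService. Note the race
// condition discussed above: if the ASG scales up after this call, the value is stale.
async function currentDesiredCapacity(asgName: string): Promise<number> {
  const client = new AutoScalingClient({});
  const { AutoScalingGroups } = await client.send(
    new DescribeAutoScalingGroupsCommand({ AutoScalingGroupNames: [asgName] })
  );
  const desired = AutoScalingGroups?.[0]?.DesiredCapacity;
  if (desired === undefined) {
    throw new Error(`Could not determine desired capacity for ASG ${asgName}`);
  }
  return desired;
}
```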

[^1]: We also tried setting `maxBatchSize=minimumCapacity`, but the behaviour was worse than the current solution. The desired capacity was never increased; instead AWS terminated 3 instances and waited for 3 new ones (then repeated this process), meaning that we were under-provisioned twice during the deployment (instead of once).