From 784693a718d74d613adf718f40af3d691b4d352d Mon Sep 17 00:00:00 2001
From: Jacob Winch
Date: Tue, 3 Sep 2024 14:03:48 +0100
Subject: [PATCH] docs: Add observation for partially scaled up deployment (#9)

---
 .../healthy-to-healthy-partially-scaled.md | 54 +++++++++++++++++++
 1 file changed, 54 insertions(+)
 create mode 100644 observations/healthy-to-healthy-partially-scaled.md

diff --git a/observations/healthy-to-healthy-partially-scaled.md b/observations/healthy-to-healthy-partially-scaled.md
new file mode 100644
index 0000000..febd929
--- /dev/null
+++ b/observations/healthy-to-healthy-partially-scaled.md
@@ -0,0 +1,54 @@
+# What happens when deploying a 'good' build when the service is already partially scaled up?
+
+In this test we went from [application version](../dist) `ABC` to `XYZ` in the stack:
+- [`ScalingAsgRollingUpdate`](../packages/cdk/lib/scaling-asg-rolling-update.ts) (CFN stack `playground-CODE-scaling-asg-rolling-update`)
+
+The main aim of this test was to establish whether deploying whilst a service is partially scaled up works as desired, i.e. without losing the extra capacity.
+
+## Highlights
+
+The current implementation leads to a temporary scale-down during deployment; this is undesirable!
+
+Some potential options for mitigating this problem are discussed below.
+
+## Timeline
+
+1. [Build number 79 was deployed](https://riffraff.gutools.co.uk/deployment/view/322bf36b-fd4d-4fe4-bf72-ce56bb789a08) (in order to start the test from a clean state, running build `ABC`)
+2. Artificial traffic was sent to `https://scaling.rolling-update.gutools.co.uk/healthcheck` in order to trigger a scale-up event
+3. The service scaled up from 3 instances to 6
+4. [Build number 80 was deployed](https://riffraff.gutools.co.uk/deployment/view/da3d810c-d055-4765-9602-141233e82b45) (updating to build `XYZ`)
+5. The CFN stack `playground-CODE-scaling-asg-rolling-update` started updating:
+
+   First:
+   > Rolling update initiated. Terminating 6 obsolete instance(s) in batches of 6, while keeping at least 3 instance(s) in service. Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.
+
+   Then the ASG capacity was updated:
+   > Temporarily setting autoscaling group MinSize and DesiredCapacity to 9.
+
+6. Once three `SUCCESS` signals were received, the six obsolete instances were terminated. _At this point we were under-provisioned._
+7. Three more new instances were launched.
+8. Once three more `SUCCESS` signals were received, the deployment completed. _At this point we were provisioned correctly again._
+
+Unfortunately this means that the deployment caused us to temporarily run with only 3 instances serving traffic when we really needed 6 to cope with the load
+(see [healthy hosts panel](https://metrics.gutools.co.uk/goto/Tt1IPB3SR?orgId=1)).
+
+Full details can be seen in the [dashboard](https://metrics.gutools.co.uk/d/cdvsv1d6vhp1cb/testing-asg-rolling-update?orgId=1&from=1725025200000&to=1725026399000&var-App=scaling).
+
+## Potential Mitigations
+
+I think this problem could be mitigated if:
+
+1. We restricted [`maxBatchSize`](https://github.com/guardian/cdk/blob/00ef0467d7797629015f088f969e2bcdab472046/src/experimental/patterns/ec2-app.ts#L49) to 1[^1] (see the sketch below).
+
+OR
+
+2. We (Riff-Raff?) injected the desired capacity at the start of the deployment in order to set the [`minInstancesInService` property](https://github.com/guardian/cdk/blob/00ef0467d7797629015f088f969e2bcdab472046/src/experimental/patterns/ec2-app.ts#L50).
+
+Unfortunately both of these approaches have drawbacks.
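+
+For illustration, option 1 would amount to something like the following. This is a minimal sketch using plain `aws-cdk-lib` (rather than the
+`@guardian/cdk` experimental pattern linked above); the class name, instance type, AMI and capacities are placeholders:
+
+```ts
+import { Duration, Stack, type StackProps } from "aws-cdk-lib";
+import { AutoScalingGroup, Signals, UpdatePolicy } from "aws-cdk-lib/aws-autoscaling";
+import { InstanceType, MachineImage, Vpc } from "aws-cdk-lib/aws-ec2";
+import type { Construct } from "constructs";
+
+export class ScalingAsgRollingUpdateSketch extends Stack {
+  constructor(scope: Construct, id: string, props?: StackProps) {
+    super(scope, id, props);
+
+    // Placeholder networking; the real stack would look up an existing VPC.
+    const vpc = new Vpc(this, "Vpc");
+
+    new AutoScalingGroup(this, "ScalingAsg", {
+      vpc,
+      instanceType: new InstanceType("t4g.micro"), // placeholder
+      machineImage: MachineImage.latestAmazonLinux2023(), // placeholder
+      minCapacity: 3,
+      maxCapacity: 12,
+      // Wait for cfn-signal from new instances, mirroring the PT5M timeout seen in the timeline above.
+      signals: Signals.waitForMinCapacity({ timeout: Duration.minutes(5) }),
+      updatePolicy: UpdatePolicy.rollingUpdate({
+        maxBatchSize: 1, // option 1: replace one instance at a time
+        minInstancesInService: 3, // still the *minimum* capacity, not the current desired capacity
+        waitOnResourceSignals: true,
+        pauseTime: Duration.minutes(5),
+      }),
+    });
+  }
+}
+```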
+
+Option 1 still involves being slightly under-provisioned (by 1 instance). It would also slow down deployments considerably, and may not be viable if
+a service is running a lot of instances.
+
+Option 2 increases complexity and introduces a race condition; if the service scales up after the desired capacity is checked, the deployment will still cause a temporary scale-down. A rough sketch of how the desired capacity could be injected is included at the end of this note.
+
+[^1]: We also tried setting `maxBatchSize=minimumCapacity`, but the behaviour was worse than the current solution. The desired capacity was never increased; instead AWS terminated 3 instances and waited for 3 new ones (then repeated this process), meaning that we were under-provisioned twice during the deployment (instead of once).
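+
+For completeness, one hypothetical shape option 2 could take is sketched below: the deployment tool would read the ASG's current desired capacity and pass it in as a CloudFormation parameter, which the stack then uses for `MinInstancesInService`. The parameter name and the escape-hatch wiring are illustrative only; this is not something the current deployment does.
+
+```ts
+import { CfnParameter } from "aws-cdk-lib";
+import { AutoScalingGroup, CfnAutoScalingGroup } from "aws-cdk-lib/aws-autoscaling";
+
+// `asg` is the AutoScalingGroup defined elsewhere in the stack.
+export function useDesiredCapacityDuringDeploy(asg: AutoScalingGroup): void {
+  // Hypothetical parameter: the deployment tool would set this to the ASG's current
+  // desired capacity immediately before starting the CloudFormation update.
+  const minInService = new CfnParameter(asg.stack, "MinInstancesInServiceDuringDeploy", {
+    type: "Number",
+    default: 3, // fall back to the normal minimum capacity
+  });
+
+  // Escape hatch: write the UpdatePolicy directly so that it can reference the parameter.
+  const cfnAsg = asg.node.defaultChild as CfnAutoScalingGroup;
+  cfnAsg.cfnOptions.updatePolicy = {
+    autoScalingRollingUpdate: {
+      maxBatchSize: 6, // mirrors the current batch size from the timeline above
+      minInstancesInService: minInService.valueAsNumber,
+      waitOnResourceSignals: true,
+      pauseTime: "PT5M",
+    },
+  };
+}
+```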