# What happens when deploying a 'good' build when the service is already partially scaled up?

In this test we went from [application version](../dist) `ABC` to `XYZ` in the stack:
- [`ScalingAsgRollingUpdate`](../packages/cdk/lib/scaling-asg-rolling-update.ts) (CFN stack `playground-CODE-scaling-asg-rolling-update`)

The main aim of this test was to establish whether deploying whilst a service is partially scaled up works as desired.

## Highlights

The current implementation leads to a temporary scale-down during deployment; this is undesirable!

Some potential options for mitigating this problem are discussed below.

## Timeline

1. [Build number 79 was deployed](https://riffraff.gutools.co.uk/deployment/view/322bf36b-fd4d-4fe4-bf72-ce56bb789a08) (in order to start the test from a clean state, running build `ABC`)
2. Artificial traffic was sent to `https://scaling.rolling-update.gutools.co.uk/healthcheck` in order to trigger a scale-up event
3. The service scaled up from 3 instances to 6
4. [Build number 80 was deployed](https://riffraff.gutools.co.uk/deployment/view/da3d810c-d055-4765-9602-141233e82b45) (which updated to build `XYZ`)
5. The CFN stack `playground-CODE-scaling-asg-rolling-update` started updating:

   First:

   > Rolling update initiated. Terminating 6 obsolete instance(s) in batches of 6, while keeping at least 3 instance(s) in service. Waiting on resource signals with a timeout of PT5M when new instances are added to the autoscaling group.

   Then the ASG capacity was updated:

   > Temporarily setting autoscaling group MinSize and DesiredCapacity to 9.
6. Once three `SUCCESS` signals were received (from the three new instances launched when the capacity was raised to 9), the six obsolete instances were terminated. _At this point we are under-provisioned._
7. Three more new instances were launched.
8. Once three more `SUCCESS` signals were received, the deployment completed. _At this point we are provisioned correctly again._

Unfortunately this means that the deployment causes us to temporarily run with 3 instances serving traffic when we really need 6 to cope with the load
(see [healthy hosts panel](https://metrics.gutools.co.uk/goto/Tt1IPB3SR?orgId=1)).
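
For context, the shape of rolling-update policy that produces this behaviour can be sketched in plain `aws-cdk-lib` terms as below. The values are illustrative, chosen to mirror the CFN events above rather than the stack's actual configuration; in this repo the equivalent knobs are the `maxBatchSize` and `minInstancesInService` properties of guardian/cdk's experimental ec2-app pattern (linked under Potential Mitigations).

```ts
import { Duration } from "aws-cdk-lib";
import { UpdatePolicy } from "aws-cdk-lib/aws-autoscaling";

// Illustrative values only: a batch size covering the whole (scaled-up) capacity,
// with minInstancesInService left at the minimum, is what yields the
// "batches of 6 ... keeping at least 3 instance(s) in service" event above.
const rollingUpdate = UpdatePolicy.rollingUpdate({
  maxBatchSize: 6, // replace all obsolete instances in one batch
  minInstancesInService: 3, // only the pre-scale-up minimum is guaranteed
  waitOnResourceSignals: true,
  pauseTime: Duration.minutes(5), // the PT5M signal timeout seen in the CFN events
});
// `rollingUpdate` would then be passed as the `updatePolicy` prop of the AutoScalingGroup.
```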

Full details can be seen in the [dashboard](https://metrics.gutools.co.uk/d/cdvsv1d6vhp1cb/testing-asg-rolling-update?orgId=1&from=1725025200000&to=1725026399000&var-App=scaling).

## Potential Mitigations

I think this problem could be mitigated if:

1. We restricted [maxBatchSize](https://github.com/guardian/cdk/blob/00ef0467d7797629015f088f969e2bcdab472046/src/experimental/patterns/ec2-app.ts#L49) to 1[^1].

OR

2. We (Riff-Raff?) injected the desired capacity at the start of the deployment in order to set the [`minInstancesInService` property](https://github.com/guardian/cdk/blob/00ef0467d7797629015f088f969e2bcdab472046/src/experimental/patterns/ec2-app.ts#L50).

Unfortunately both of these approaches have drawbacks.

Option 1 still involves being slightly under-provisioned (by 1 instance). It would also slow down deployments considerably, and may not be viable if a service is running a lot of instances.
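
As a rough illustration: with `maxBatchSize: 1`, replacing the 6 instances in this test would need 6 launch-and-signal cycles rather than the 2 observed above; if each cycle took anywhere near the PT5M signal timeout, the rolling update alone could take on the order of 30 minutes, and proportionally longer for bigger fleets.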

Option 2 increases complexity and introduces a race condition; if the service scales up after the desired capacity is checked, the deployment will still cause a temporary scale-down.
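
To make option 2 concrete, a minimal sketch of the capacity lookup it would need is below. The helper name, and the idea of feeding the result into the template via something like a `MinInstancesInService` CloudFormation parameter, are assumptions for illustration; nothing like this exists in Riff-Raff or the stack today.

```ts
import {
  AutoScalingClient,
  DescribeAutoScalingGroupsCommand,
} from "@aws-sdk/client-auto-scaling";

// Hypothetical helper: read the ASG's desired capacity at the start of a deployment
// so it could be injected as the stack's minInstancesInService. Note the race
// condition discussed above: if the ASG scales up after this call, the value is stale.
async function currentDesiredCapacity(asgName: string): Promise<number> {
  const client = new AutoScalingClient({});
  const { AutoScalingGroups } = await client.send(
    new DescribeAutoScalingGroupsCommand({ AutoScalingGroupNames: [asgName] })
  );
  const desired = AutoScalingGroups?.[0]?.DesiredCapacity;
  if (desired === undefined) {
    throw new Error(`Could not determine desired capacity for ASG ${asgName}`);
  }
  return desired;
}
```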

[^1]: We also tried setting `maxBatchSize=minimumCapacity`, but the behaviour was worse than the current solution. The desired capacity was never increased; instead AWS terminated 3 instances and waited for 3 new ones (then repeated this process), meaning that we were under-provisioned twice during the deployment (instead of once).