-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(experimental-ec2-pattern): Pattern to deploy ASG updates w/CFN #2417
Conversation
🦋 Changeset detectedLatest commit: 7a12f60 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
2812ba0
to
11e2fb6
Compare
7a1f2e7
to
d6e6f9e
Compare
new CfnParameter(this.stack, `MinInstancesInServiceFor${autoScalingGroup.app}`, { | ||
type: "Number", | ||
default: parseInt(cfnAutoScalingGroup.minSize), | ||
maxValue: parseInt(cfnAutoScalingGroup.maxSize) - 1, | ||
}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the app
contains hyphens (-
), or other illegal characters for a CFN Parameter, they'll get stripped out. This means Riff-Raff cannot reliably just find CFN Parameters that start with MinInstancesInServiceFor
. Therefore Riff-Raff's logic would be something like this:
- Find all
AWS::AutoScaling::ScalingPolicy
s in the CloudFormation template that hasMinInstancesInService
set via a CloudFormation Parameter - Find current
DesiredCapacity
of ASG that relates to the Scaling Policy - Set the CFN Parameter identified in step 1 to the value identified in step 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if our Aspect
could add Outputs
for Riff-Raff to read.
This would mean that step 1 can be simplified, as Riff-Raff would not need to parse the full template or lookup the associated ASG name (which will not be included in the template unless this property is set).
For example:
Outputs:
MinInstancesInServiceForAppA:
Description: The AutoScaling group name associated with the MinInstancesInServiceForAppA parameter.
Value: !Ref AsgForAppA
MinInstancesInServiceForAppB:
Description: The AutoScaling group name associated with the MinInstancesInServiceForAppB parameter.
Value: !Ref AsgForAppB
I think we can obtain these outputs pretty easily via the existing describeStack
function and IIUC using !Ref
here will mean that we can avoid looking up the ASG name ourselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh interesting idea! Tempted to make these changes in a separate PR once we've know more about what the changes in Riff-Raff would look like. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tempted to make these changes in a separate PR once we've know more about what the changes in Riff-Raff would look like. WDYT?
Yes, that sounds good to me 👍
AutoScalingRollingUpdate
)AutoScalingRollingUpdate
)
AutoScalingRollingUpdate
)f6dc526
to
b4801a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work 💯
…a CloudFormation (`AutoScalingRollingUpdate`)
…arkers This should make it easier to parse a user data string if ever one is debugging.
During a deployment, CloudFormation updates the min and desired. During a rollback (e.g. if the healthcheck failed), CloudFormation only resets the min. The desired is still elevated, meaning the service is over provisioned. Explicitly setting the desired property of the ASG ensures CloudFormation rollback puts the ASG back to the initial state, e.g. correctly provisioned.
…ng and updating ASG
A scale-in event fires during a rolling update can cause service disruption. Suspend scaling events during a rolling update for safety.
…where scaling policy present Some practical testing of `AutoScalingRollingUpdate` has demonstrated that when an ASG has a scaling policy, it is safest to dynamically set the `MinInstancesInService` property. Add an aspect to do that. See also https://github.com/guardian/testing-asg-rolling-update.
The `ec2metadata` command was failing with a 401 with AMIable CODE in deployTools account: ```console root@ip-10-248-51-213:/var/lib/cloud/instance# ec2metadata --instance-id Traceback (most recent call last): File "/usr/bin/ec2metadata", line 249, in <module> main() File "/usr/bin/ec2metadata", line 245, in main display(metaopts, burl, prefix) File "/usr/bin/ec2metadata", line 192, in display value = m.get(metaopt) File "/usr/bin/ec2metadata", line 177, in get return self._get('meta-data/' + metaopt) File "/usr/bin/ec2metadata", line 137, in _get resp = urllib_request.urlopen(urllib_request.Request(url)) File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen return opener.open(url, data, timeout) File "/usr/lib/python3.8/urllib/request.py", line 531, in open response = meth(req, response) File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response response = self.parent.error( File "/usr/lib/python3.8/urllib/request.py", line 569, in error return self._call_chain(*args) File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain result = func(*args) File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default raise HTTPError(req.full_url, code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 401: Unautho ``` This service uses IMDSv2. A 401 response usually happens when a request is made without a token. However `ec2metadata` does exchange a token. Switch to a more reliable mechanism. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html.
The behaviour being tested is already covered by `should only adjust properties of a horizontally scaling service`.
… period Matching these properties allows rollbacks to happen as quickly as possible.
6276b0e
to
7a12f60
Compare
Usage of this experimental pattern can be found via Service Catalogue. |
What does this change?
Reduce the “time spent broken” after failed deploys, making these manual steps redundant, by adding a new experimental pattern
GuEc2AppExperimental
to provision an EC2 based service with anAutoScalingRollingUpdate
update policy. The configuration is based on a number of tests we ran; we've documented some various observations.This is an alternative to #2395. It improves on the solution proposed in #2395 by providing continuity of ASG metrics and ASG activity history.
AutoScalingRollingUpdate
offers an improved DX over theautoscaling
Riff-Raff deployment type when a deployment fails. On deployment failure, theautoscaling
Riff-Raff deployment type leaves the ASG over-capacity, requiring manual intervention to resolve, whereasAutoScalingRollingUpdate
will perform a rollback automatically. This means deployments are more atomic.Note
Unfortunately testing has shown unpredictable behaviour if a scale-in event occurs during deployment. Therefore we suspend the
AlarmNotification
process during a rolling update.This means scaling events cannot occur during deployment and guardian/riff-raff#1342 is not fixed.
With
AutoScalingRollingUpdate
CloudFormation will deploy application updates to an ASG in a rolling fashion. The rate at which the update rolls out is dependant on how scaled-out an ASG is:If an ASG does not have any scaling policies, deployment will behave similar to Riff-Raff; the
DesiredCapacity
will be doubled, once new instances are healthy the capacity will be halved removing the old instances.For an ASG that has a scaling policy, the rate will be dynamic:
DesiredCapacity
will be doubled, once new instances are healthy the capacity will be halved removing the old instances (testing observations).MinCapacity
/ (MaxCapacity
-DesiredCapacity
)) (testing observations).This will be achieved by Riff-Raff setting
MinInstancesInService
at deploy time via a CloudFormation Parameter.This functionality is yet to be added to Riff-Raff. It doesn't block shipping this change, however services that use scaling policies cannot use this experimental pattern until then.
As application updates are performed entirely by CloudFormation, it is not necessary to use the
autoscaling
RIff-Raff deployment type1. If it is used, it shouldn't break anything. However it does mean deployments will take a long time, as the instances will be rotated twice; once by CloudFormation and once by Riff-Raff's custom logic.Changes to UserData
With
AutoScalingRollingUpdate
CloudFormation will remain in theUPDATE_IN_PROGRESS
state whilst waiting for aSUCCESS
signal to be received from the instance. This signal is sent manually.The
GuEc2AppExperimental
pattern will add this signal emission to the UserData. ASUCCESS
signal will be sent once the instance has successfully registered with the Target Group.Why
experimental
?There are a few requirements for this approach:
cfn-signal
binaries need to be available and on thePATH
(this is added via the AMIgo roleaws-tools
).We're not (yet) validating these are met, as its tricky.
Proposed rollout
The rollout plan looks something like this:
How to test
These changes have been informed by "real-world" testing that has been documented in https://github.com/guardian/testing-asg-rolling-update.
This branch is being successfully used in guardian/amiable, guardian/cdk-playground and guardian/security-hq.
How can we measure success?
A DX improvement when a deployment fails.
Once fully adopted, we will be able to remove the
autoscaling
deployment type in Riff-Raff, reducing it's complexity.Have we considered potential risks?
As this new pattern is
experimental
there risk levels are fairly small. We should monitor usage and gather feedback before promoting to stable.Checklist
Footnotes
We still need the
uploadArtifacts
action of theautoscaling
deployment type. Thecloud-formation
deployment type should depend on this step too. ↩Consider whether this is something that will mean changes to projects that have already been migrated, or to the CDK CLI tool. If changes are required, consider adding a checklist here and/or linking to related PRs. ↩
If you are adding a new construct or pattern, has new documentation been added? If you are amending defaults or changing behaviour, are the existing docs still valid? ↩