feat: invalidate SSM cache upon AMI deprecation #7301

Merged
merged 3 commits into from
Oct 31, 2024

Conversation

@jmdeal jmdeal commented Oct 29, 2024

Fixes #N/A

Description

When an EKS-optimized AMI is deprecated and a user is using a family@latest alias, Karpenter will continue to launch with that AMI until the 24-hour SSM cache entry has expired. This cache ensures Karpenter doesn't cause a thundering herd by upgrading all users within a region simultaneously upon an SSM parameter rollout. However, Karpenter should respond faster in the event of an EKS-optimized AMI deprecation.

This PR introduces a mechanism to invalidate the SSM cache upon detection of a deprecated EKS-optimized AMI. This cache invalidation is still staggered over 30 minutes to reduce the risk of a thundering herd, but that is still up to 48x faster than Karpenter's previous reaction time (30 minutes rather than 24 hours). Additionally, thanks to the 24-hour cache, only a subset of users using family@latest will have been upgraded to the deprecated AMI in the first place, further reducing the chances of a thundering herd upon rollback.
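One way to picture the invalidation is as a periodic sweep over cached parameters that evicts any entry whose AMI is reported deprecated. The sketch below assumes that shape; `invalidateDeprecated`, the `isDeprecated` callback, and the sample data are hypothetical, and the actual controller in this PR may be structured differently. It also checks the 48x figure as 24 hours divided by 30 minutes.

```go
package main

import (
	"fmt"
	"time"
)

const (
	cacheTTL      = 24 * time.Hour   // cache expiry from the PR description
	sweepInterval = 30 * time.Minute // assumed sweep period, matching the 30-minute stagger
)

// invalidateDeprecated removes every cached parameter whose AMI is
// reported deprecated and returns how many entries were evicted.
// The cache maps a parameter name to the AMI ID it resolved to.
func invalidateDeprecated(cache map[string]string, isDeprecated func(amiID string) bool) int {
	evicted := 0
	for param, amiID := range cache {
		if isDeprecated(amiID) {
			delete(cache, param) // deleting while ranging over a map is safe in Go
			evicted++
		}
	}
	return evicted
}

func main() {
	// Hypothetical parameter names and AMI IDs.
	cache := map[string]string{
		"/example/al2023/latest":       "ami-bad",
		"/example/bottlerocket/latest": "ami-good",
	}
	deprecated := map[string]bool{"ami-bad": true}

	n := invalidateDeprecated(cache, func(id string) bool { return deprecated[id] })
	fmt.Println(n, len(cache)) // 1 1

	// Worst-case reaction time improves by the ratio of the two intervals.
	fmt.Println(int(cacheTTL / sweepInterval)) // 48
}
```

Evicting only the deprecated entries leaves the rest of the cache intact, so non-deprecated AMIs keep their normal 24-hour rollout behavior.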

Some assumptions made:

  • The latest SSM parameter for EKS-optimized AMIs should be rolled back before AMI deprecation. If this is not the case, Karpenter will continuously invalidate the SSM cache until the parameter points to a non-deprecated AMI.
  • Karpenter will only need to roll back a subset of users if the issue is detected and the AMI is deprecated within 24 hours of the AMI's release. Otherwise, we expect all users with a family@latest alias (and non-blocking drift budgets / PDBs) to have been upgraded to the now-deprecated AMI.
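The first assumption can be illustrated with a small simulation: if the parameter still points at the deprecated AMI when it is re-resolved, each subsequent sweep invalidates the entry again until the rollback lands. The function `sweepsUntilStable` and its inputs are hypothetical and only model the behavior described above.

```go
package main

import "fmt"

// sweepsUntilStable simulates periodic sweeps against a hypothetical SSM
// parameter. paramAt(i) returns the AMI ID the parameter points to at
// sweep i. Every sweep that finds a deprecated AMI counts as one
// invalidation; the loop stops at the first non-deprecated AMI.
func sweepsUntilStable(paramAt func(sweep int) string, deprecated map[string]bool, maxSweeps int) int {
	invalidations := 0
	for i := 0; i < maxSweeps; i++ {
		if !deprecated[paramAt(i)] {
			break
		}
		invalidations++
	}
	return invalidations
}

func main() {
	deprecated := map[string]bool{"ami-bad": true} // hypothetical AMI ID

	// The parameter is rolled back just before the fourth sweep (index 3).
	paramAt := func(sweep int) string {
		if sweep < 3 {
			return "ami-bad"
		}
		return "ami-good"
	}

	fmt.Println(sweepsUntilStable(paramAt, deprecated, 10)) // 3
}
```

Rolling the parameter back before deprecating the AMI brings the count to zero, which is why the description lists that ordering as an assumption.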

How was this change tested?
make test (additional tests in-progress)

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jmdeal jmdeal requested a review from a team as a code owner October 29, 2024 19:48
@jmdeal jmdeal requested a review from ellistarn October 29, 2024 19:48
netlify bot commented Oct 29, 2024

Deploy Preview for karpenter-docs-prod canceled.

Name Link
🔨 Latest commit 106b0e7
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/672415096affee00083a328f

@jmdeal jmdeal left a comment

/karpenter snapshot

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-93bd2cd2c2124923e5686e057a4af017b25b360a.
To install, you must log in to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-93bd2cd2c2124923e5686e057a4af017b25b360a" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

coveralls commented Oct 29, 2024

Pull Request Test Coverage Report for Build 11621534000

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 90 of 117 (76.92%) changed or added relevant lines in 12 files are covered.
  • 9 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.2%) to 82.679%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/controllers/controllers.go | 0 | 2 | 0.0% |
| pkg/apis/v1/ec2nodeclass.go | 15 | 18 | 83.33% |
| pkg/operator/operator.go | 0 | 4 | 0.0% |
| pkg/controllers/providers/ssm/invalidation/controller.go | 35 | 43 | 81.4% |
| pkg/apis/v1/zz_generated.deepcopy.go | 0 | 10 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/providers/instance/instance.go | 2 | 89.12% |
| pkg/providers/instancetype/types.go | 7 | 96.63% |

Totals
Change from base Build 11567721194: -0.2%
Covered Lines: 5704
Relevant Lines: 6899

💛 - Coveralls

@jmdeal jmdeal changed the title feat: invalidate SSM cache upon AMI invalidation feat: invalidate SSM cache upon AMI deprecation Oct 29, 2024
pkg/controllers/nodeclass/status/ami.go (outdated, resolved)
pkg/controllers/nodeclass/status/ami.go (outdated, resolved)
pkg/controllers/nodeclass/status/ami.go (outdated, resolved)
pkg/providers/amifamily/al2.go (outdated, resolved)
pkg/providers/amifamily/ami.go (resolved)
pkg/providers/ssm/provider.go (outdated, resolved)
pkg/providers/ssm/provider.go (outdated, resolved)
pkg/providers/ssm/provider.go (outdated, resolved)
pkg/test/environment.go (outdated, resolved)
pkg/apis/v1/ec2nodeclass.go (resolved)
@jmdeal jmdeal force-pushed the feat/ssm-cache-invalidation branch from 93bd2cd to 5bc8b31 Compare October 31, 2024 00:22
@jmdeal jmdeal force-pushed the feat/ssm-cache-invalidation branch from 5bc8b31 to ace5bc9 Compare October 31, 2024 00:28
@jonathan-innis jonathan-innis left a comment

Very nice 🎉

cmd/controller/main.go (resolved)
pkg/providers/ssm/types.go (resolved)
pkg/providers/amifamily/ami.go (resolved)
pkg/providers/ssm/provider.go (outdated, resolved)
pkg/controllers/providers/ssm/invalidation/controller.go (outdated, resolved)
pkg/controllers/providers/ssm/invalidation/controller.go (outdated, resolved)
jonathan-innis previously approved these changes Oct 31, 2024
@jonathan-innis jonathan-innis left a comment

LGTM 🚀

pkg/controllers/providers/ssm/invalidation/suite_test.go (outdated, resolved)
Co-authored-by: Jonathan Innis <jonathan.innis.ji@gmail.com>
@jonathan-innis jonathan-innis left a comment

LGTM 🚀

@jmdeal jmdeal enabled auto-merge (squash) October 31, 2024 23:44
@jmdeal jmdeal merged commit 0f72ae0 into aws:main Oct 31, 2024
17 checks passed
jmdeal added a commit to jmdeal/karpenter-provider-aws that referenced this pull request Nov 1, 2024
Co-authored-by: Jonathan Innis <jonathan.innis.ji@gmail.com>
jmdeal added a commit that referenced this pull request Nov 6, 2024
Co-authored-by: Jonathan Innis <jonathan.innis.ji@gmail.com>
3 participants