aws-eks: Updating KubernetesManifest deletes it instead #33406

Open
esun74 opened this issue Feb 12, 2025 · 3 comments
Labels
@aws-cdk/aws-eks (Related to Amazon Elastic Kubernetes Service), bug (This issue is a bug.), effort/medium (Medium work item – several days of effort), p2

Comments

@esun74

esun74 commented Feb 12, 2025

Describe the bug

Updating a KubernetesManifest resource through CDK can actually cause it to get deleted.

During a resource replacement, if overwrite: true and the previous manifest has any overlap with the new manifest, the overlapping resources are lost. When the manifest is unchanged, the entire resource is deleted. The issue cannot be mitigated by a rollback or code revert and will repeat on every subsequent update.

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

No response

Expected Behavior

Replacing a KubernetesManifest should at most delete and re-create the underlying EKS manifest resources. A minimal update to a KubernetesManifest should not cause a loss of cluster functionality due to missing Kubernetes resources.

Current Behavior

Updates to a KubernetesManifest that are applied as a replacement cause cluster resources to be wiped. Rollbacks and reverts do not bring the cluster back to a healthy state.

Given Manifest A (previous) and Manifest B (new) are based on the same YAML, replacing the KubernetesManifest resource looks like this:

  1. CloudFormation first applies Manifest B, which overwrites Manifest A in EKS; since the content is identical, this is effectively a no-op
  2. Manifest A and B both exist in the CloudFormation stack, and the manifest contents are correctly configured in EKS
  3. CloudFormation deletes Manifest A, which deletes the manifest's resources from EKS
  4. CloudFormation has now "updated" to Manifest B, but nothing is left in EKS

Reproduction Steps

Setup

import * as eks from 'aws-cdk-lib/aws-eks';

// assumes `cluster` is an existing eks.Cluster in this stack
new eks.KubernetesManifest(cluster, 'Sleeper', {
  manifest: [
    {
      apiVersion: 'v1',
      kind: 'Pod',
      metadata: {
        name: 'test-sleeper',
      },
      spec: {
        containers: [
          {
            name: 'sleeper',
            image: 'alpine:latest',
            imagePullPolicy: 'Always',
            command: ['/bin/sleep', 'infinity'],
          },
        ],
      },
    },
  ],
  cluster,
  overwrite: true,
});
> kubectl get pods

NAME           READY   STATUS    RESTARTS   AGE
test-sleeper   1/1     Running   0          40s

Minimal Change

- new eks.KubernetesManifest(cluster, 'Sleeper', {
+ new eks.KubernetesManifest(cluster, 'Sleeper1', {

CloudFormation Events

Timestamp Logical ID Status
2025-02-11 13:49:18 UTC-0800 ClusterSleeper0E1728F7 DELETE_COMPLETE
2025-02-11 13:48:38 UTC-0800 ClusterSleeper0E1728F7 DELETE_IN_PROGRESS
2025-02-11 13:48:37 UTC-0800 <stack> UPDATE_COMPLETE_CLEANUP_IN_PROGRESS
2025-02-11 13:48:17 UTC-0800 ClusterSleeper1A9127B4A CREATE_COMPLETE
2025-02-11 13:48:17 UTC-0800 ClusterSleeper1A9127B4A CREATE_IN_PROGRESS (Resource creation Initiated)
2025-02-11 13:48:05 UTC-0800 ClusterSleeper1A9127B4A CREATE_IN_PROGRESS
> kubectl get pods

No resources found in default namespace.

Reverts Are Ineffective

- new eks.KubernetesManifest(cluster, 'Sleeper1', {
+ new eks.KubernetesManifest(cluster, 'Sleeper', {

Similar events to the above: the sleeper pod is created, then deleted again.

> kubectl get pods

No resources found in default namespace.

Possible Solution

Immediate Mitigating Options:

  • Trigger a minimal replacement and set the manifest's deletion policy to RETAIN (see the sketch below)
    (manifest.node.defaultChild as CfnResource).applyRemovalPolicy(RemovalPolicy.RETAIN);
  • Remove the manifest from CDK entirely, deploy, then add it back

Note: Using RemovalPolicy.RETAIN comes with the natural downside of having to clean up the dangling resources manually.
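
A minimal sketch of the RETAIN mitigation, assuming an existing eks.Cluster named cluster and reusing the sleeper pod from the setup above (construct and resource names are placeholders):

import { CfnResource, RemovalPolicy } from 'aws-cdk-lib';
import * as eks from 'aws-cdk-lib/aws-eks';

declare const cluster: eks.Cluster; // existing cluster in this stack

const manifest = new eks.KubernetesManifest(cluster, 'Sleeper', {
  cluster,
  overwrite: true,
  manifest: [
    {
      apiVersion: 'v1',
      kind: 'Pod',
      metadata: { name: 'test-sleeper' },
      spec: {
        containers: [
          { name: 'sleeper', image: 'alpine:latest', command: ['/bin/sleep', 'infinity'] },
        ],
      },
    },
  ],
});

// Retain the underlying custom resource so that, when a later logical-ID change
// forces a replacement, deleting the old resource does not remove the Kubernetes
// objects the new resource has just (re)applied.
(manifest.node.defaultChild as CfnResource).applyRemovalPolicy(RemovalPolicy.RETAIN);

The downside noted above still applies: retained objects have to be cleaned up by hand once they are no longer needed.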

Additional Information/Context

Additional Risks:

If we update manifests and there is any overlap between the original and subsequent manifests, CloudFormation might silently delete parts of a manifest. For example, if manifest version 1.0 is deployed and replaced with manifest version 2.0, the intersecting resources (1.0 ∩ 2.0) will be deleted when cleaning up 1.0, as illustrated in the sketch below.
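
A hedged illustration of that risk, with hypothetical resource names; the ConfigMap below is the "1.0 ∩ 2.0" piece that would disappear when version 1.0 is cleaned up:

import * as eks from 'aws-cdk-lib/aws-eks';

declare const cluster: eks.Cluster; // existing cluster in this stack

// Renaming the construct id (e.g. 'AppManifest' -> 'AppManifestV2') forces a
// replacement: CloudFormation applies the new manifest first (overwrite leaves
// app-config untouched), then deletes the old manifest, and app-config with it,
// even though the new manifest still declares it.
new eks.KubernetesManifest(cluster, 'AppManifestV2', { // was 'AppManifest'
  cluster,
  overwrite: true,
  manifest: [
    {
      apiVersion: 'v1',
      kind: 'ConfigMap',
      metadata: { name: 'app-config' }, // present in both version 1.0 and 2.0
      data: { LOG_LEVEL: 'info' },
    },
    // ...resources that are new in version 2.0...
  ],
});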

CDK CLI Version

2.160.0

Framework Version

No response

Node.js Version

18

OS

Amazon Linux 2 x86_64

Language

TypeScript

Language Version

5.0.4

Other information

Sev2: P199049085
Tracking: P200043360
Case ID 173931643600782

@esun74 esun74 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Feb 12, 2025
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Feb 12, 2025
@rantoniuk

IMO since you're changing the construct identifier, that's why it gets recreated - that's a standard CDK behaviour and not a no-op.

@pahud pahud self-assigned this Feb 12, 2025
@pahud
Contributor

pahud commented Feb 12, 2025

Thank you for the detailed report. After investigating the code, I can confirm this is a significant issue with how CloudFormation's resource replacement sequence interacts with Kubernetes resource management.

Root Cause:
The issue occurs during CloudFormation's resource replacement when overwrite: true is set:

  1. CloudFormation creates the new manifest (Sleeper1) using kubectl apply, which does not re-create the pod since the content is unchanged
  2. Both the old (Sleeper) and new (Sleeper1) manifests exist in CloudFormation, sharing the same pod, and CloudFormation has no awareness of that overlap
  3. CloudFormation deletes the old manifest (Sleeper)
  4. Since both manifests reference the same Kubernetes resources, deleting Sleeper also deletes the resources that were just applied by Sleeper1

This is what's happening under the hood, and you are right.

Short-term Workarounds:

  1. Use RemovalPolicy.RETAIN to prevent the deletion:
import { CfnResource, RemovalPolicy } from 'aws-cdk-lib';
import * as eks from 'aws-cdk-lib/aws-eks';

const manifest = new eks.KubernetesManifest(cluster, 'MyManifest', {
  // ... other props ...
});
(manifest.node.defaultChild as CfnResource).applyRemovalPolicy(RemovalPolicy.RETAIN);

Long-term Fix: We need to modify how the custom resource handles deletions. Possible approaches:

  1. Add a "force" flag that can be used to skip the delete operation during resource replacement
  2. Implement resource adoption logic in the handler to transfer ownership from old to new manifest
  3. Change the handler to use server-side apply with proper field ownership (not sure if it's possible with CDK); a rough kubectl-level sketch of the idea follows
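
For approach 3, a rough sketch of what server-side apply with explicit field ownership looks like at the kubectl level. The file name and field-manager name are made up, and this is not how the current handler works:

import { execFileSync } from 'node:child_process';

// Each manifest revision applies with its own field manager. With server-side
// apply, ownership of shared objects moves to whichever manager applied last,
// so a later "delete only what this manager still owns" step would leave the
// shared objects alone instead of wiping them.
execFileSync('kubectl', [
  'apply',
  '--server-side',
  '--field-manager', 'cdk-manifest-v2', // hypothetical per-revision manager name
  '-f', 'manifest-v2.yaml',             // hypothetical rendered manifest file
], { stdio: 'inherit' });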

I will bring this up to the team for further inputs.

@pahud pahud added p2 and removed needs-triage This issue or PR still needs to be triaged. labels Feb 12, 2025
@pahud pahud removed their assignment Feb 12, 2025
@pahud pahud added the effort/medium Medium work item – several days of effort label Feb 12, 2025
@esun74
Author

esun74 commented Feb 12, 2025

IMO since you're changing the construct identifier, that's why it gets recreated - that's a standard CDK behaviour and not a no-op.

@rantoniuk I am fine with replacement - a bit of downtime while switching between resources is totally okay. You are correct that this is not really a "no-op" on the CFN side; however, the issue is that the replacement deletes the Kubernetes resources entirely and leaves them missing from the cluster. Edit: updated the issue to reflect this.

@pahud Appreciate you taking a look :)
