aws-eks: Updating KubernetesManifest deletes it instead #33406

Open
esun74 opened this issue Feb 12, 2025 · 3 comments
Labels
@aws-cdk/aws-eks (Related to Amazon Elastic Kubernetes Service), bug (This issue is a bug.), effort/medium (Medium work item – several days of effort), p2

Comments

@esun74

esun74 commented Feb 12, 2025

Describe the bug

Updating a KubernetesManifest resource through CDK can actually cause it to get deleted.

During a resource replacement, if overwrite: true and the previous manifest has any overlap with the new manifest, the overlapping resources are lost. When the manifest is unchanged, the entire resource is deleted. The issue cannot be mitigated by a rollback or code revert and will repeat on every subsequent update.

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

No response

Expected Behavior

Replacing a KubernetesManifest should at most delete and re-create the underlying EKS manifest resources. A minimal update to a KubernetesManifest should not cause a loss of cluster functionality due to missing Kubernetes resources.

Current Behavior

Updates to a KubernetesManifest that are applied as a replacement cause cluster resources to be wiped. Rollbacks and reverts do not bring the cluster back to a healthy state.

Given Manifest A (previous) and Manifest B (new) are based on the same YAML, replacing the KubernetesManifest resource looks like this:

  1. CloudFormation first applies Manifest B, which overwrites Manifest A in EKS; since the content is identical, this is effectively a no-op
  2. Manifest A and B both exist in the CloudFormation stack, and the manifest contents are correctly configured in EKS
  3. CloudFormation deletes Manifest A, which deletes the manifest's resources from EKS
  4. CloudFormation has now "updated" to Manifest B, but nothing is left in EKS

Reproduction Steps

Setup

import * as eks from 'aws-cdk-lib/aws-eks';

// assumes `cluster` is an existing eks.Cluster in this stack
new eks.KubernetesManifest(cluster, 'Sleeper', {
  manifest: [
    {
      apiVersion: 'v1',
      kind: 'Pod',
      metadata: {
        name: 'test-sleeper',
      },
      spec: {
        containers: [
          {
            name: 'sleeper',
            image: 'alpine:latest',
            imagePullPolicy: 'Always',
            command: ['/bin/sleep', 'infinity'],
          },
        ],
      },
    },
  ],
  cluster,
  overwrite: true,
});
> kubectl get pods

NAME           READY   STATUS    RESTARTS   AGE
test-sleeper   1/1     Running   0          40s

Minimal Change

- new eks.KubernetesManifest(cluster, 'Sleeper', {
+ new eks.KubernetesManifest(cluster, 'Sleeper1', {

CloudFormation Events

Timestamp Logical ID Status
2025-02-11 13:49:18 UTC-0800 ClusterSleeper0E1728F7 DELETE_COMPLETE
2025-02-11 13:48:38 UTC-0800 ClusterSleeper0E1728F7 DELETE_IN_PROGRESS
2025-02-11 13:48:37 UTC-0800 <stack> UPDATE_COMPLETE_CLEANUP_IN_PROGRESS
2025-02-11 13:48:17 UTC-0800 ClusterSleeper1A9127B4A CREATE_COMPLETE
2025-02-11 13:48:17 UTC-0800 ClusterSleeper1A9127B4A CREATE_IN_PROGRESS (Resource creation Initiated)
2025-02-11 13:48:05 UTC-0800 ClusterSleeper1A9127B4A CREATE_IN_PROGRESS
> kubectl get pods

No resources found in default namespace.

Reverts Are Ineffective

- new eks.KubernetesManifest(cluster, 'Sleeper1', {
+ new eks.KubernetesManifest(cluster, 'Sleeper', {

Similar events to the above: the sleeper pod is created, then deleted again.

> kubectl get pods

No resources found in default namespace.

Possible Solution

Immediate Mitigating Options:

  • Trigger a minimal replacement and set the manifest's deletion policy to RETAIN (see the sketch below)
    (manifest.node.defaultChild as CfnResource).applyRemovalPolicy(RemovalPolicy.RETAIN);
  • Remove the manifest from CDK entirely, deploy, then add it back

Note: Using RemovalPolicy.RETAIN comes with the natural downside of having to clean up the dangling resources manually.
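
A minimal sketch of the RETAIN mitigation, assuming an existing eks.Cluster named cluster and reusing the sleeper pod from the setup above (construct and resource names are placeholders):

import { CfnResource, RemovalPolicy } from 'aws-cdk-lib';
import * as eks from 'aws-cdk-lib/aws-eks';

declare const cluster: eks.Cluster; // existing cluster in this stack

const manifest = new eks.KubernetesManifest(cluster, 'Sleeper', {
  cluster,
  overwrite: true,
  manifest: [
    {
      apiVersion: 'v1',
      kind: 'Pod',
      metadata: { name: 'test-sleeper' },
      spec: {
        containers: [
          { name: 'sleeper', image: 'alpine:latest', command: ['/bin/sleep', 'infinity'] },
        ],
      },
    },
  ],
});

// Retain the underlying custom resource so that, when a later logical-ID change
// forces a replacement, deleting the old resource does not remove the Kubernetes
// objects the new resource has just (re)applied.
(manifest.node.defaultChild as CfnResource).applyRemovalPolicy(RemovalPolicy.RETAIN);

The downside noted above still applies: retained objects have to be cleaned up by hand once they are no longer needed.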

Additional Information/Context

Additional Risks:

If we update manifests and there is any overlap between the original and subsequent manifests, CloudFormation might silently delete parts of a manifest. For example, if manifest version 1.0 is deployed and replaced with manifest version 2.0, the intersecting resources (1.0 ∩ 2.0) will be deleted when cleaning up 1.0, as illustrated in the sketch below.
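
A hedged illustration of that risk, with hypothetical resource names; the ConfigMap below is the "1.0 ∩ 2.0" piece that would disappear when version 1.0 is cleaned up:

import * as eks from 'aws-cdk-lib/aws-eks';

declare const cluster: eks.Cluster; // existing cluster in this stack

// Renaming the construct id (e.g. 'AppManifest' -> 'AppManifestV2') forces a
// replacement: CloudFormation applies the new manifest first (overwrite leaves
// app-config untouched), then deletes the old manifest, and app-config with it,
// even though the new manifest still declares it.
new eks.KubernetesManifest(cluster, 'AppManifestV2', { // was 'AppManifest'
  cluster,
  overwrite: true,
  manifest: [
    {
      apiVersion: 'v1',
      kind: 'ConfigMap',
      metadata: { name: 'app-config' }, // present in both version 1.0 and 2.0
      data: { LOG_LEVEL: 'info' },
    },
    // ...resources that are new in version 2.0...
  ],
});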

CDK CLI Version

2.160.0

Framework Version

No response

Node.js Version

18

OS

Amazon Linux 2 x86_64

Language

TypeScript

Language Version

5.0.4

Other information

Sev2: P199049085
Tracking: P200043360
Case ID 173931643600782

@esun74 esun74 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Feb 12, 2025
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Feb 12, 2025
@rantoniuk

IMO since you're changing the construct identifier, that's why it gets recreated - that's a standard CDK behaviour and not a no-op.

@pahud pahud self-assigned this Feb 12, 2025
@pahud
Contributor

pahud commented Feb 12, 2025

Thank you for the detailed report. After investigating the code, I can confirm this is a significant issue with how CloudFormation's resource replacement sequence interacts with Kubernetes resource management.

Root Cause:
The issue occurs during CloudFormation's resource replacement when overwrite: true is set:

  1. CloudFormation creates the new manifest (Sleeper1) using kubectl apply, which does not re-create the pod since the content is unchanged
  2. Both the old (Sleeper) and new (Sleeper1) manifests exist in CloudFormation, sharing the same pod, and CloudFormation has no awareness of that overlap
  3. CloudFormation deletes the old manifest (Sleeper)
  4. Since both manifests reference the same Kubernetes resources, deleting Sleeper also deletes the resources that were just applied by Sleeper1

This is what's happening under the hood, and you are right.

Short-term Workarounds:

  1. Use RemovalPolicy.RETAIN to prevent the deletion:
import { CfnResource, RemovalPolicy } from 'aws-cdk-lib';
import * as eks from 'aws-cdk-lib/aws-eks';

const manifest = new eks.KubernetesManifest(cluster, 'MyManifest', {
  // ... other props ...
});
(manifest.node.defaultChild as CfnResource).applyRemovalPolicy(RemovalPolicy.RETAIN);

Long-term Fix: We need to modify how the custom resource handles deletions. Possible approaches:

  1. Add a "force" flag that can be used to skip the delete operation during resource replacement
  2. Implement resource adoption logic in the handler to transfer ownership from old to new manifest
  3. Change the handler to use server-side apply with proper field ownership (not sure if it's possible with CDK); a rough kubectl-level sketch of the idea follows
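
For approach 3, a rough sketch of what server-side apply with explicit field ownership looks like at the kubectl level. The file name and field-manager name are made up, and this is not how the current handler works:

import { execFileSync } from 'node:child_process';

// Each manifest revision applies with its own field manager. With server-side
// apply, ownership of shared objects moves to whichever manager applied last,
// so a later "delete only what this manager still owns" step would leave the
// shared objects alone instead of wiping them.
execFileSync('kubectl', [
  'apply',
  '--server-side',
  '--field-manager', 'cdk-manifest-v2', // hypothetical per-revision manager name
  '-f', 'manifest-v2.yaml',             // hypothetical rendered manifest file
], { stdio: 'inherit' });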

I will bring this up to the team for further inputs.

@pahud pahud added p2 and removed needs-triage This issue or PR still needs to be triaged. labels Feb 12, 2025
@pahud pahud removed their assignment Feb 12, 2025
@pahud pahud added the effort/medium Medium work item – several days of effort label Feb 12, 2025
@esun74
Author

esun74 commented Feb 12, 2025

IMO since you're changing the construct identifier, that's why it gets recreated - that's a standard CDK behaviour and not a no-op.

@rantoniuk I am fine with replacement - a bit of downtime while switching between resources is totally okay. You are correct that this is not really a "no-op" on the CFN side; however, the issue is that the replacement deletes the Kubernetes resources entirely and leaves them missing from the cluster. Edit: updated the issue to reflect this.

@pahud Appreciate you taking a look :)
