
If a ManagedNodeGroup update fails, the subsequent pulumi up does not recognize that it needs to update the statefile. #875

Closed
tusharshahrs opened this issue Apr 13, 2023 · 8 comments
Labels
customer/feedback Feedback from customers customer/lighthouse Lighthouse customer bugs kind/bug Some behavior is incorrect or out of spec resolution/fixed This issue was fixed

Comments

@tusharshahrs

What happened?

Issue with EKS ManagedNodeGroups. I changed some PodDisruptionBudgets, which led to an update of a ManagedNodeGroup failing. However, after fixing the underlying issue, Pulumi didn't think it needed to do another update of the ManagedNodeGroup. I had to run a refresh; only then did it do the update again.

In short: there is an MNG update because of a LaunchTemplate change from version 1 to 2, the MNG update fails, and a subsequent pulumi up no longer tries to update the MNG. If an MNG rotation fails and I run another update afterwards, Pulumi doesn't pick up that the nodegroup rotation actually failed.
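
For context, a minimal sketch (TypeScript, illustrative names; subnets/VPC and IAM policy attachments omitted) of how a ManagedNodeGroup is typically wired to a launch template in a program like this. The actual repro program is linked later in the thread.

    import * as aws from "@pulumi/aws";
    import * as eks from "@pulumi/eks";

    // Illustrative sketch only - not the actual repro program.
    const nodeRole = new aws.iam.Role("mydemo-node-role", {
        assumeRolePolicy: aws.iam.assumeRolePolicyForPrincipal({ Service: "ec2.amazonaws.com" }),
        // Managed policy attachments (worker node, CNI, ECR read-only) omitted for brevity.
    });

    const cluster = new eks.Cluster("mydemo-eks", {
        skipDefaultNodeGroup: true,
        instanceRoles: [nodeRole],
    });

    const launchTemplate = new aws.ec2.LaunchTemplate("mydemo-launchtemplate", {
        instanceType: "t3a.small",
        // Any change here (instance type, imageId, ...) produces a new
        // launch template version, which should roll the node group.
    });

    const nodeGroup = new eks.ManagedNodeGroup("mydemo-eksNodeGroup", {
        cluster: cluster,
        nodeRole: nodeRole,
        scalingConfig: { desiredSize: 2, minSize: 1, maxSize: 3 },
        launchTemplate: {
            id: launchTemplate.id,
            // The node group pins a specific launch template version; this is
            // the value recorded in the statefile (e.g. "version": "11" in the
            // export later in this thread).
            version: launchTemplate.latestVersion.apply(v => `${v}`),
        },
    });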

Expected Behavior

If an MNG rotation fails and I run another update (pulumi up) afterwards, then Pulumi should pick up the failed rotation and update the state file.

Steps to reproduce

Pending at the time of filing; a full repro is provided in a comment below.

Output of pulumi about

Using pulumi/sdk/v3 v3.61.0 and CLI v3.60.1

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@tusharshahrs tusharshahrs added needs-triage Needs attention from the triage team kind/bug Some behavior is incorrect or out of spec labels Apr 13, 2023
@mikhailshilkov
Member

Hey @tusharshahrs thank you for this report. Is there a program that I could run to reproduce the issue?

@mikhailshilkov mikhailshilkov added awaiting-feedback Blocked on input from the author and removed needs-triage Needs attention from the triage team labels Apr 17, 2023
@rquitales
Member

@tusharshahrs Bumping to see if you have some code we could use to reproduce this issue. It'd be great to have a definitive program that causes the ManagedNodeGroup update to fail. Thanks!

@tusharshahrs
Author

Here is the reproduction of the issue.

pulumi about output:
pulumi about
CLI          
Version      3.70.0
Go Version   go1.20.4
Go Compiler  gc

Plugins
NAME        VERSION
aws         5.41.0
aws         5.31.0
awsx        1.0.2
docker      3.6.1
eks         1.0.2
kubernetes  3.28.1
nodejs      unknown

Host     
OS       darwin
Version  11.7.7
Arch     x86_64

Backend        
Name           pulumi.com
URL            https://app.pulumi.com/tushar-pulumi-corp
User           tushar-pulumi-corp
Organizations 

Dependencies:
NAME                VERSION
@pulumi/awsx        1.0.2
@pulumi/eks         1.0.2
@pulumi/kubernetes  3.28.1
@pulumi/pulumi      3.70.0
@types/node         16.18.35
@pulumi/aws         5.41.0

Steps

  1. git clone https://github.com/tusharshahrs/pulumi-home/tree/identityprofile

  2. cd aws-classic-ts-eks-launchtemplate-poddisruptionbudget
    Then bring up the stack as follows:

  3. Initialize the stack
    pulumi stack init dev

  4. Install dependencies
    npm install

  5. Set the region
    pulumi config set aws:region us-east-2 # any valid aws region

  6. Launch
    pulumi up -y

  7. Once the stack is up, make the following change:
    Uncomment the line that adds a hard-coded image ID to the launch template, i.e. add `imageId: "ami-055c9a441998a8f28"` (this change is sketched after the error output below).

  8. Run pulumi up -y, and wait about 7-10 minutes

  9. Run pulumi up -y and wait for it to time out with an error (about 10-25 minutes).
    This will create a NEW nodegroup, and you will see the following error:

        * waiting for EKS Node Group (mydemo-eks-f22490e:mydemo-eksNodeGroup-1908cec) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
        * i-0ef122bffbacb6606, i-0fc7c9d505eed0956: NodeCreationFailure: Instances failed to join the kubernetes cluster
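
For reference, this is roughly the kind of change step 7 makes, as a sketch with illustrative names (the real code is index.ts in the repo cloned in step 1):

    import * as aws from "@pulumi/aws";

    const launchTemplate = new aws.ec2.LaunchTemplate("mydemo-launchtemplate", {
        instanceType: "t3a.nano",
        // Step 7 uncomments this line. Hard-coding the AMI bumps the launch
        // template to a new version, which triggers the ManagedNodeGroup update
        // that then fails with the NodeCreationFailure error shown above (the
        // new instances never join the cluster).
        imageId: "ami-055c9a441998a8f28",
    });
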
  10. Now revert the changes so that the PodDisruptionBudget is fixed and the instances go back to the original size (these two toggles are sketched after the update output at the end of this comment):

    • Switch the instance type on the launch template by uncommenting t3a.small and commenting out t3a.nano - this changes the launch template version (which is what we want).
    • Update the PodDisruptionBudget settings by uncommenting minAvailable: "100%" and commenting out minAvailable: "101%", i.e. reverting minAvailable from the invalid value back to the working one. The PodDisruptionBudget will then be working again.
  11. Undo the imageId change from step 7 by commenting out the imageId block in the launch template, so that it looks like this: https://github.com/tusharshahrs/pulumi-home/blob/identityprofile/aws-classic-ts-eks-launchtemplate-poddisruptionbudget/index.ts#L60

  12. Delete the CREATE_FAILED nodegroup from the AWS console.

  13. Run pulumi refresh -y once you are done.

  14. Then run pulumi up -y and the original nodegroup will get new instances.

  15. Check the outputs for the nodegroup and it will show something like this:

pulumi stack output eksnodegroup_name

Results:

mydemo-eksNodeGroup-d9647fc
  16. Now check the state file: pulumi stack export --file mystack3.json (the attachment below was renamed to .json.txt in order to upload it; rename it back to .json).
    mystack3.json.txt

It shows that the launch template for the instances is:

                    "launchTemplate": {
                        "__defaults": [],
                        "id": "lt-075b435416b002519",
                        "version": "11"
                    },

The new launch template does not show up in the state file at all. For example, lt-06c37a2ece3a2afeb is NOT found anywhere in the file.

The AWS console for the EKS managed node groups (screenshot: Compute tab of the mydemo-eks-f22490e cluster in Amazon EKS) now shows that the node group is stuck on launch template version 11. From this point on, no matter what change I make to the launch template - for example, swapping out the instance size and saving the file - pulumi up reports no changes and no new instances are launched, even though the AWS console shows the nodegroup is using a different launch template whose versioning has started over from 1. Any change made to the launch template in the code is no longer reflected.

  17. pulumi up now shows no changes:
View in Browser (Ctrl+O): https://app.pulumi.com/tushar-pulumi-corp/aws-classic-ts-eks-launchtemplate-poddisruptionbudget/dev/updates/54

     Type                 Name                                                       Status     
     pulumi:pulumi:Stack  aws-classic-ts-eks-launchtemplate-poddisruptionbudget-dev             


Outputs:
    eksnodegroup_name          : "mydemo-eksNodeGroup-d9647fc"
    kubeconfig                 : [secret]
    mylaunchTemplate_id        : "lt-075b435416b002519"
    mylaunchTemplate_version   : 11
    mynamespace_name           : "mydemo-namespace-e8caec1e"
    myvpc_id                   : "vpc-069fbae2744f8a01c"
    myvpc_private_subnets      : [
        [0]: "subnet-0ea9d0a1aabafba51"
        [1]: "subnet-08c56d65e36341360"
        [2]: "subnet-0db7449c629300e0d"
    ]
    myvpc_public_subnets       : [
        [0]: "subnet-0eba3d99f12d41a25"
        [1]: "subnet-0d9a113e3b7162d87"
        [2]: "subnet-0d9a5fff69cdf1539"
    ]
    pdb_name                   : "mydemo-pdb-60cab9b2"
    securitygroup_eksnode_id   : "sg-0399be12b463618ad"
    securitygroup_eksnode_name : "mydemo-eks-cluster-sg-a6d06e1"
    securitygroup_eksnode_tags : {
        Name: "mydemo-eks-cluster-sg"
    }
    securitygroup_eksnode_vpcid: "vpc-069fbae2744f8a01c"

Resources:
    44 unchanged

Duration: 9s
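
For reference, a sketch of the two toggles from steps 10 and 11 (illustrative names and selector labels; the real values are in index.ts in the repo):

    import * as aws from "@pulumi/aws";
    import * as k8s from "@pulumi/kubernetes";

    // Launch template: switch the instance type back and remove the hard-coded
    // AMI, which bumps the launch template version again.
    const launchTemplate = new aws.ec2.LaunchTemplate("mydemo-launchtemplate", {
        // instanceType: "t3a.nano",          // value used while breaking the update
        instanceType: "t3a.small",            // switching back creates a new LT version
        // imageId: "ami-055c9a441998a8f28",  // step 11: comment the AMI back out
    });

    // PodDisruptionBudget: restore the working minAvailable value from step 10.
    const pdb = new k8s.policy.v1.PodDisruptionBudget("mydemo-pdb", {
        spec: {
            // minAvailable: "101%",  // invalid value used to force the failure
            minAvailable: "100%",     // working value again
            selector: { matchLabels: { app: "mydemo" } },
        },
    });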

@mnlumi mnlumi added the customer/feedback Feedback from customers label Jul 20, 2023
@mikhailshilkov mikhailshilkov removed the awaiting-feedback Blocked on input from the author label Jul 26, 2023
@mnlumi mnlumi added the customer/lighthouse Lighthouse customer bugs label Aug 2, 2023
@mikhailshilkov mikhailshilkov added this to the 0.93 milestone Aug 11, 2023
@bradyburke

Running into this same issue - are there any plans for a fix?

@mnlumi mnlumi modified the milestones: 0.93, 0.94 Aug 31, 2023
@mikhailshilkov mikhailshilkov modified the milestones: 0.94, 0.95 Oct 2, 2023
@rquitales
Member

Based on the testing I've done, I was able to reproduce this issue with the upstream AWS v5 provider. Re-running the test against v6 makes it pass, which indicates that upgrading to v6 should resolve this issue.
The repro/test I used to replicate this is linked here: https://github.com/rquitales/repro-eks

@lukehoban
Contributor

upgrading to v6 should help resolve this issue.

I believe #910 will fix this when it merges.

@mikhailshilkov mikhailshilkov modified the milestones: 0.95, 0.96 Oct 26, 2023
@lukehoban
Contributor

I believe #910 will fix this when it merges.

@rquitales @thomas11 Can this be closed out now that EKS 2.0.0 is available?

@rquitales
Member

I was able to confirm that this issue no longer occurs with v2 of the EKS provider, after running through the repro steps twice.
Repro steps:

  1. Create an EKS cluster and a ManagedNodeGroup
  2. Create a k8s Deployment on the cluster and a PDB with maxUnavailable=0% (sketched below)
  3. Update the launch template by changing the node size
  4. Update the MNG to point to the new launch template
  5. Expect the node group upgrade to fail after about 45 mins
  6. Update the PDB to allow disruptions (maxUnavailable=100%) and re-run pulumi up without issue
  7. Re-run steps 2-6 to confirm that repeated failed MNG upgrades in the same stack do not hit the issue

Repro Pulumi program: https://github.com/rquitales/repro-eks/tree/v2-test
Screenshot to verify successful updates after a failed update:
(attached: Screenshot 2023-11-13 at 6 52 03 PM)
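
For reference, a sketch of the Deployment/PDB part of step 2 above (illustrative names and labels; the kubernetes provider wiring to the cluster's kubeconfig is omitted, and the actual program is in the linked v2-test branch):

    import * as k8s from "@pulumi/kubernetes";

    const labels = { app: "repro" };

    // A small Deployment whose pods the PDB protects.
    const deployment = new k8s.apps.v1.Deployment("repro-deployment", {
        spec: {
            replicas: 2,
            selector: { matchLabels: labels },
            template: {
                metadata: { labels: labels },
                spec: { containers: [{ name: "nginx", image: "nginx" }] },
            },
        },
    });

    // maxUnavailable: "0%" blocks all voluntary evictions, so the managed node
    // group upgrade eventually fails (step 5). Step 6 flips this to "100%" and
    // re-runs pulumi up.
    const pdb = new k8s.policy.v1.PodDisruptionBudget("repro-pdb", {
        spec: {
            maxUnavailable: "0%",
            selector: { matchLabels: labels },
        },
    });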

@pulumi-bot pulumi-bot reopened this Nov 14, 2023
@rquitales rquitales added the resolution/fixed This issue was fixed label Nov 14, 2023
@pulumi pulumi deleted a comment from pulumi-bot Nov 14, 2023