
If a ManagedNodeGroup update fails, the subsequent pulumi up does not recognize that it needs to update the statefile. #875

Closed
tusharshahrs opened this issue Apr 13, 2023 · 8 comments
Labels
customer/feedback Feedback from customers customer/lighthouse Lighthouse customer bugs kind/bug Some behavior is incorrect or out of spec resolution/fixed This issue was fixed

Comments

@tusharshahrs

What happened?

Issue with EKS ManagedNodeGroups. I changed some PodDisruptionBudgets, which led to an update of a ManagedNodeGroup failing. However, after fixing the underlying issue, Pulumi didn't think it needed to do another update of the ManagedNodeGroup. I had to run a refresh; only then did it do the update again.

In short: there is an MNG update because of a LaunchTemplate change from version 1 to 2, the MNG update fails, and a subsequent pulumi up no longer tries to update the MNG. If an MNG rotation fails and I run another update afterwards, Pulumi doesn't pick up that the nodegroup rotation actually failed.
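
For context, a minimal sketch (TypeScript, illustrative names; subnets/VPC and IAM policy attachments omitted) of how a ManagedNodeGroup is typically wired to a launch template in a program like this. The actual repro program is linked later in the thread.

    import * as aws from "@pulumi/aws";
    import * as eks from "@pulumi/eks";

    // Illustrative sketch only - not the actual repro program.
    const nodeRole = new aws.iam.Role("mydemo-node-role", {
        assumeRolePolicy: aws.iam.assumeRolePolicyForPrincipal({ Service: "ec2.amazonaws.com" }),
        // Managed policy attachments (worker node, CNI, ECR read-only) omitted for brevity.
    });

    const cluster = new eks.Cluster("mydemo-eks", {
        skipDefaultNodeGroup: true,
        instanceRoles: [nodeRole],
    });

    const launchTemplate = new aws.ec2.LaunchTemplate("mydemo-launchtemplate", {
        instanceType: "t3a.small",
        // Any change here (instance type, imageId, ...) produces a new
        // launch template version, which should roll the node group.
    });

    const nodeGroup = new eks.ManagedNodeGroup("mydemo-eksNodeGroup", {
        cluster: cluster,
        nodeRole: nodeRole,
        scalingConfig: { desiredSize: 2, minSize: 1, maxSize: 3 },
        launchTemplate: {
            id: launchTemplate.id,
            // The node group pins a specific launch template version; this is
            // the value recorded in the statefile (e.g. "version": "11" in the
            // export later in this thread).
            version: launchTemplate.latestVersion.apply(v => `${v}`),
        },
    });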

Expected Behavior

If an MNG rotation fails and I run another update (pulumi up) afterwards, then Pulumi should pick up the failed rotation and update the state file.

Steps to reproduce

Pending at the time of filing; a full repro is provided in a comment below.

Output of pulumi about

Using pulumi/sdk/v3 v3.61.0 and CLI v3.60.1

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@tusharshahrs tusharshahrs added needs-triage Needs attention from the triage team kind/bug Some behavior is incorrect or out of spec labels Apr 13, 2023
@mikhailshilkov
Member

Hey @tusharshahrs thank you for this report. Is there a program that I could run to reproduce the issue?

@mikhailshilkov mikhailshilkov added awaiting-feedback Blocked on input from the author and removed needs-triage Needs attention from the triage team labels Apr 17, 2023
@rquitales
Member

@tusharshahrs Bumping to see if you have some code we could use to reproduce this issue. It'd be great to have a definitive program that causes the ManagedNodeGroup update to fail. Thanks!

@tusharshahrs
Author

Here is the reproduction of the issue.

pulumi about output:
pulumi about
CLI          
Version      3.70.0
Go Version   go1.20.4
Go Compiler  gc

Plugins
NAME        VERSION
aws         5.41.0
aws         5.31.0
awsx        1.0.2
docker      3.6.1
eks         1.0.2
kubernetes  3.28.1
nodejs      unknown

Host     
OS       darwin
Version  11.7.7
Arch     x86_64

Backend        
Name           pulumi.com
URL            https://app.pulumi.com/tushar-pulumi-corp
User           tushar-pulumi-corp
Organizations 

Dependencies:
NAME                VERSION
@pulumi/awsx        1.0.2
@pulumi/eks         1.0.2
@pulumi/kubernetes  3.28.1
@pulumi/pulumi      3.70.0
@types/node         16.18.35
@pulumi/aws         5.41.0

Steps

  1. git clone https://github.com/tusharshahrs/pulumi-home/tree/identityprofile

  2. cd aws-classic-ts-eks-launchtemplate-poddisruptionbudget
    Then bring up the stack as follows:

  3. Initialize the stack
    pulumi stack init dev

  4. Install dependencies
    npm install

  5. Set the region
    pulumi config set aws:region us-east-2 # any valid aws region

  6. Launch
    pulumi up -y

  7. Once the stack is up, make the following change:
    Uncomment the line that adds a hard-coded image ID to the launch template, i.e. add `imageId: "ami-055c9a441998a8f28"` (this change is sketched after the error output below).

  8. Run pulumi up -y, and wait about 7-10 minutes

  9. Run pulumi up -y and wait for it to time out with an error (about 10-25 minutes).
    This will create a NEW nodegroup, and you will see the following error:

        * waiting for EKS Node Group (mydemo-eks-f22490e:mydemo-eksNodeGroup-1908cec) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
        * i-0ef122bffbacb6606, i-0fc7c9d505eed0956: NodeCreationFailure: Instances failed to join the kubernetes cluster
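
For reference, this is roughly the kind of change step 7 makes, as a sketch with illustrative names (the real code is index.ts in the repo cloned in step 1):

    import * as aws from "@pulumi/aws";

    const launchTemplate = new aws.ec2.LaunchTemplate("mydemo-launchtemplate", {
        instanceType: "t3a.nano",
        // Step 7 uncomments this line. Hard-coding the AMI bumps the launch
        // template to a new version, which triggers the ManagedNodeGroup update
        // that then fails with the NodeCreationFailure error shown above (the
        // new instances never join the cluster).
        imageId: "ami-055c9a441998a8f28",
    });
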
  10. Now revert the changes so that the PodDisruptionBudget is fixed and the instances go back to the original size (these two toggles are sketched after the update output at the end of this comment):

    • Switch the instance type on the launch template by uncommenting t3a.small and commenting out t3a.nano - this changes the launch template version (which is what we want).
    • Update the PodDisruptionBudget settings by uncommenting minAvailable: "100%" and commenting out minAvailable: "101%", i.e. reverting minAvailable from the invalid value back to the working one. The PodDisruptionBudget will then be working again.
  11. Undo the imageId change from step 7 by commenting out the imageId block in the launch template, so that it looks like this: https://github.com/tusharshahrs/pulumi-home/blob/identityprofile/aws-classic-ts-eks-launchtemplate-poddisruptionbudget/index.ts#L60

  12. Delete the CREATE_FAILED nodegroup from the AWS console.

  13. Run pulumi refresh -y once you are done.

  14. Then run pulumi up -y and the original nodegroup will get new instances.

  15. Check the outputs for the nodegroup and it will show something like this:

pulumi stack output eksnodegroup_name

Results:

mydemo-eksNodeGroup-d9647fc
  16. Now check the state file: pulumi stack export --file mystack3.json (the attachment below was renamed to .json.txt in order to upload it; rename it back to .json).
    mystack3.json.txt

It shows that the launch template for the instances is:

                    "launchTemplate": {
                        "__defaults": [],
                        "id": "lt-075b435416b002519",
                        "version": "11"
                    },

The new launch template does not show up in the state file at all. For example, lt-06c37a2ece3a2afeb is NOT found anywhere in the file.

The AWS console for the EKS managed node groups (screenshot: Compute tab of the mydemo-eks-f22490e cluster in Amazon EKS) now shows that the node group is stuck on launch template version 11. From this point on, no matter what change I make to the launch template - for example, swapping out the instance size and saving the file - pulumi up reports no changes and no new instances are launched, even though the AWS console shows the nodegroup is using a different launch template whose versioning has started over from 1. Any change made to the launch template in the code is no longer reflected.

  17. pulumi up now shows no changes:
View in Browser (Ctrl+O): https://app.pulumi.com/tushar-pulumi-corp/aws-classic-ts-eks-launchtemplate-poddisruptionbudget/dev/updates/54

     Type                 Name                                                       Status     
     pulumi:pulumi:Stack  aws-classic-ts-eks-launchtemplate-poddisruptionbudget-dev             


Outputs:
    eksnodegroup_name          : "mydemo-eksNodeGroup-d9647fc"
    kubeconfig                 : [secret]
    mylaunchTemplate_id        : "lt-075b435416b002519"
    mylaunchTemplate_version   : 11
    mynamespace_name           : "mydemo-namespace-e8caec1e"
    myvpc_id                   : "vpc-069fbae2744f8a01c"
    myvpc_private_subnets      : [
        [0]: "subnet-0ea9d0a1aabafba51"
        [1]: "subnet-08c56d65e36341360"
        [2]: "subnet-0db7449c629300e0d"
    ]
    myvpc_public_subnets       : [
        [0]: "subnet-0eba3d99f12d41a25"
        [1]: "subnet-0d9a113e3b7162d87"
        [2]: "subnet-0d9a5fff69cdf1539"
    ]
    pdb_name                   : "mydemo-pdb-60cab9b2"
    securitygroup_eksnode_id   : "sg-0399be12b463618ad"
    securitygroup_eksnode_name : "mydemo-eks-cluster-sg-a6d06e1"
    securitygroup_eksnode_tags : {
        Name: "mydemo-eks-cluster-sg"
    }
    securitygroup_eksnode_vpcid: "vpc-069fbae2744f8a01c"

Resources:
    44 unchanged

Duration: 9s
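
For reference, a sketch of the two toggles from steps 10 and 11 (illustrative names and selector labels; the real values are in index.ts in the repo):

    import * as aws from "@pulumi/aws";
    import * as k8s from "@pulumi/kubernetes";

    // Launch template: switch the instance type back and remove the hard-coded
    // AMI, which bumps the launch template version again.
    const launchTemplate = new aws.ec2.LaunchTemplate("mydemo-launchtemplate", {
        // instanceType: "t3a.nano",          // value used while breaking the update
        instanceType: "t3a.small",            // switching back creates a new LT version
        // imageId: "ami-055c9a441998a8f28",  // step 11: comment the AMI back out
    });

    // PodDisruptionBudget: restore the working minAvailable value from step 10.
    const pdb = new k8s.policy.v1.PodDisruptionBudget("mydemo-pdb", {
        spec: {
            // minAvailable: "101%",  // invalid value used to force the failure
            minAvailable: "100%",     // working value again
            selector: { matchLabels: { app: "mydemo" } },
        },
    });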

@mnlumi mnlumi added the customer/feedback Feedback from customers label Jul 20, 2023
@mikhailshilkov mikhailshilkov removed the awaiting-feedback Blocked on input from the author label Jul 26, 2023
@mnlumi mnlumi added the customer/lighthouse Lighthouse customer bugs label Aug 2, 2023
@mikhailshilkov mikhailshilkov added this to the 0.93 milestone Aug 11, 2023
@bradyburke

Running into this same issue - are there any plans for a fix?

@mnlumi mnlumi modified the milestones: 0.93, 0.94 Aug 31, 2023
@mikhailshilkov mikhailshilkov modified the milestones: 0.94, 0.95 Oct 2, 2023
@rquitales
Member

Based on the testing I've done, I was able to reproduce this issue with the upstream AWS v5 provider. Re-running the test against v6 makes it pass, which indicates that upgrading to v6 should resolve this issue.
The repro/test I used to replicate this is linked here: https://github.com/rquitales/repro-eks

@lukehoban
Contributor

upgrading to v6 should help resolve this issue.

I believe #910 will fix this when it merges.

@mikhailshilkov mikhailshilkov modified the milestones: 0.95, 0.96 Oct 26, 2023
@lukehoban
Contributor

I believe #910 will fix this when it merges.

@rquitales @thomas11 Can this be closed out now that EKS 2.0.0 is available?

@rquitales
Member

I was able to confirm that this issue no longer occurs with v2 of the EKS provider, after running through the repro steps twice.
Repro steps:

  1. Create an EKS cluster and a ManagedNodeGroup
  2. Create a k8s Deployment on the cluster and a PDB with maxUnavailable=0% (sketched below)
  3. Update the launch template by changing the node size
  4. Update the MNG to point to the new launch template
  5. Expect the node group upgrade to fail after about 45 mins
  6. Update the PDB to allow disruptions (maxUnavailable=100%) and re-run pulumi up without issue
  7. Re-run steps 2-6 to confirm that repeated failed MNG upgrades in the same stack do not hit the issue

Repro Pulumi program: https://github.com/rquitales/repro-eks/tree/v2-test
Screenshot to verify successful updates after a failed update:
(attached: Screenshot 2023-11-13 at 6 52 03 PM)
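
For reference, a sketch of the Deployment/PDB part of step 2 above (illustrative names and labels; the kubernetes provider wiring to the cluster's kubeconfig is omitted, and the actual program is in the linked v2-test branch):

    import * as k8s from "@pulumi/kubernetes";

    const labels = { app: "repro" };

    // A small Deployment whose pods the PDB protects.
    const deployment = new k8s.apps.v1.Deployment("repro-deployment", {
        spec: {
            replicas: 2,
            selector: { matchLabels: labels },
            template: {
                metadata: { labels: labels },
                spec: { containers: [{ name: "nginx", image: "nginx" }] },
            },
        },
    });

    // maxUnavailable: "0%" blocks all voluntary evictions, so the managed node
    // group upgrade eventually fails (step 5). Step 6 flips this to "100%" and
    // re-runs pulumi up.
    const pdb = new k8s.policy.v1.PodDisruptionBudget("repro-pdb", {
        spec: {
            maxUnavailable: "0%",
            selector: { matchLabels: labels },
        },
    });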

@pulumi-bot pulumi-bot reopened this Nov 14, 2023
@rquitales rquitales added the resolution/fixed This issue was fixed label Nov 14, 2023
@pulumi pulumi deleted a comment from pulumi-bot Nov 14, 2023