AWSMachinePool reconciliation stuck if ASG could not be created #4655

Closed
AndiDog opened this issue Nov 23, 2023 · 1 comment · Fixed by #4660 or #4662
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-priority triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@AndiDog
Contributor

AndiDog commented Nov 23, 2023

/kind bug

What steps did you take and what happened:

We created a new cluster with one AWSMachinePool. When we added a second AWSMachinePool, the IAM instance profile was, for some reason, not yet ready or not found, and the ASG creation failed. That failure is out of scope for this issue; the error is mentioned here only for completeness and visibility:

failed to create AWSMachinePool: failed to create autoscaling group: ValidationError: You must use a valid fully-formed launch template. Value (nodes-nodepool2-theclustername) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name

This means that the AWSMachinePool reconciler had successfully created the launch template (and set the IAM instance profile name on the launch template), but then failed on ASG creation.

The problem is that reconciliation is now stuck due to a bug: on the next Reconcile attempt, ReconcileLaunchTemplate detects a diff and calls canUpdateLaunchTemplate, which in turn calls CanStartASGInstanceRefresh. There, the AWS request DescribeInstanceRefreshesInput{AutoScalingGroupName: aws.String(scope.Name())} is made and fails because the ASG doesn't exist yet. The error is:

E1123 13:38:44.326581       1 logger.go:83] "failed to reconcile launch template" err=<
	ValidationError: AutoScalingGroup name not found - AutoScalingGroup theclustername-nodepool2 not found
		status code: 400, request id: 17090c90-0275-40dd-9a1d-2e2994105934

We must handle, log and ignore that error. If the ASG does not exist, CanStartASGInstanceRefresh should return false. Alternatively, ReconcileLaunchTemplate should find some other way to avoid considering an instance refresh at all in this situation.

What did you expect to happen:

Reconciliation continues and eventually succeeds.

Environment:

  • Cluster-api-provider-aws version: roughly v2.2.4
@richardcase
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 28, 2023