Azure: Fast nodegroup backoff on failed provisioning #5548

Merged 1 commit on Apr 25, 2023


domenicbozzuto
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a case to the Azure VMSS instance status check to cover an instance that failed provisioning. If the instance fails to provision (generally because Azure cannot provide capacity for the instance type), the error percolates to the cluster state registry and puts the nodegroup into backoff, sooner than if it had to wait for maxNodeProvisionTime.

This is inspired by the other cloudProviders that implement similar behavior, like #4489
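
A minimal sketch of the kind of status mapping described above, not the actual diff. The types are local stand-ins that only mirror the shape of the `cloudprovider` instance status types, the helper name `instanceStatusFromProvisioningState` and the provisioning state strings are assumptions, and reporting a failed instance as still "creating" is an illustrative choice. The error class and error code are taken from the log output in the testing section.

```go
package main

import "fmt"

// Local stand-ins mirroring the shape of the cluster-autoscaler
// cloudprovider status types, so that this sketch compiles on its own.
type InstanceState int

const (
	InstanceRunning InstanceState = iota + 1
	InstanceCreating
	InstanceDeleting
)

type InstanceErrorClass int

const (
	OutOfResourcesErrorClass InstanceErrorClass = iota + 1
	OtherErrorClass
)

type InstanceErrorInfo struct {
	ErrorClass   InstanceErrorClass
	ErrorCode    string
	ErrorMessage string
}

type InstanceStatus struct {
	State     InstanceState
	ErrorInfo *InstanceErrorInfo
}

// Provisioning state strings (assumed values for this sketch).
const (
	provisioningStateCreating = "Creating"
	provisioningStateDeleting = "Deleting"
	provisioningStateFailed   = "Failed"
)

// instanceStatusFromProvisioningState is a hypothetical helper that maps a
// VMSS instance provisioning state to an instance status.
func instanceStatusFromProvisioningState(state *string) *InstanceStatus {
	if state == nil {
		return nil
	}
	status := &InstanceStatus{}
	switch *state {
	case provisioningStateDeleting:
		status.State = InstanceDeleting
	case provisioningStateCreating:
		status.State = InstanceCreating
	case provisioningStateFailed:
		// The new case: keep the instance in the "creating" state, but attach
		// error info so the cluster state registry can react immediately.
		status.State = InstanceCreating
		status.ErrorInfo = &InstanceErrorInfo{
			ErrorClass:   OutOfResourcesErrorClass,
			ErrorCode:    "provisioning-state-failed",
			ErrorMessage: "Azure failed to provision a node for this node group",
		}
	default:
		status.State = InstanceRunning
	}
	return status
}

func main() {
	failed := provisioningStateFailed
	status := instanceStatusFromProvisioningState(&failed)
	fmt.Println(status.State == InstanceCreating, status.ErrorInfo.ErrorCode)
}
```

Attaching the error to the instance status, rather than returning an error from the status check, lets the existing cluster state logic decide what to do with the nodegroup.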

Testing Methodology

A nodegroup using an instance type that was actively experiencing Azure capacity issues was selected. A scale-up was triggered for that nodegroup, which promptly failed. The nodegroup was put into backoff in ~20s, and another compatible nodegroup could be upscaled instead, without having to wait for maxNodeProvisionTime (15m):


2023-02-28T14:14:49.615Z: virtualMachineScaleSetsClient.WaitForCreateOrUpdateResult - updateVMSSCapacity for scale set ""<NODEGROUP>"" failed: Code=""ZonalAllocationFailed"" Message=""Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance"" Target=""53""
2023-02-28T14:14:49.615Z: Failed to update the capacity for <NODEGROUP> with error Code=""ZonalAllocationFailed"" Message=""Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance"" Target=""53"", invalidate the cache so as to get the real size from API
2023-02-28T14:14:49.619Z: Provisioning has failed for VM: azure:///subscriptions/<SUBSCRIPTION>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Compute/virtualMachineScaleSets/<NODEGROUP>/virtualMachines/53
2023-02-28T14:14:49.621Z: Nodegroup is nil for azure:///subscriptions/<SUBSCRIPTION>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Compute/virtualMachineScaleSets/<NODEGROUP>/virtualMachines/53
2023-02-28T14:14:49.621Z: Disabling scale-up for node group <NODEGROUP> until 2023-02-28 14:19:49.558793898 +0000 UTC m=+59470.373573234; errorClass=OutOfResource; errorCode=provisioning-state-failed

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/cluster-autoscaler labels Feb 28, 2023
@domenicbozzuto domenicbozzuto marked this pull request as ready for review February 28, 2023 17:50
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 28, 2023
@gandhipr
Contributor

gandhipr commented Mar 1, 2023

/lgtm
/assign @tallaxes

@domenicbozzuto
Contributor Author

@tallaxes a gentle bump if you have a chance to take a look at this

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 22, 2023
…p sooner

When Azure fails to provision a node for a nodegroup due to an instance capacity issue ((Zonal)AllocationFailed) or for another reason, the VMSS size increase is still reflected but the new instance gets the status `ProvisioningStateFailed`. This now bubbles up the error to the `cloudprovider.Instance`, where it can be used by `clusterstate` to put the nodegroup into backoff sooner.
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 24, 2023
@bpineau
Contributor

bpineau commented Apr 25, 2023

@tallaxes, @feiskyer or @nilo19 any chance one of you could have a look please? 🙇

@tallaxes
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: domenicbozzuto, tallaxes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2023
@tallaxes
Contributor

/area provider/azure

@k8s-ci-robot k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label Apr 25, 2023
@k8s-ci-robot k8s-ci-robot merged commit 0142a57 into kubernetes:master Apr 25, 2023