Azure: Fast nodegroup backoff on failed provisioning #5548
Conversation
Force-pushed 13b27ac to ad7695a
/lgtm
@tallaxes a gentle bump, if you have a chance to take a look at this.
…p sooner

When Azure fails to provision a node for a nodegroup due to an instance capacity issue ((Zonal)AllocationFailed) or another reason, the VMSS size increase is still reflected, but the new instance gets the status `ProvisioningStateFailed`. This now bubbles the error up to the `cloudprovider.Instance`, where it can be used in `clusterstate` to put the nodegroup into backoff sooner.
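A minimal sketch of that mapping, assuming the cluster-autoscaler `cloudprovider` types (`InstanceStatus`, `InstanceErrorInfo`, `OutOfResourcesErrorClass`); the helper name `instanceStatusFromProvisioningState` and the `provisioningStateFailed` constant are illustrative, not the PR's exact code:

```go
package azure

import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// provisioningStateFailed mirrors the VMSS provisioning state string; the
// constant name and value here are illustrative.
const provisioningStateFailed = "Failed"

// instanceStatusFromProvisioningState maps a VMSS VM provisioning state onto
// a cloudprovider.InstanceStatus. A failed provisioning (e.g.
// (Zonal)AllocationFailed) is reported as a still-creating instance with
// ErrorInfo attached, which clusterstate treats as a scale-up failure.
func instanceStatusFromProvisioningState(provisioningState string) *cloudprovider.InstanceStatus {
	status := &cloudprovider.InstanceStatus{}
	switch provisioningState {
	case "Succeeded":
		status.State = cloudprovider.InstanceRunning
	case provisioningStateFailed:
		status.State = cloudprovider.InstanceCreating
		status.ErrorInfo = &cloudprovider.InstanceErrorInfo{
			ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
			ErrorCode:    "provisioning-state-failed", // illustrative error code
			ErrorMessage: "Azure failed to provision the instance",
		}
	default:
		status.State = cloudprovider.InstanceCreating
	}
	return status
}
```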
Force-pushed ad7695a to 066315c
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: domenicbozzuto, tallaxes. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/area provider/azure
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds a case to the Azure VMSS instance status check to cover an instance that failed provisioning. If the instance fails to provision (generally because Azure is unable to provide capacity for the instance type), the error percolates to the cluster state registry and puts the nodegroup into backoff, faster than if it had to wait for `maxNodeProvisionTime`. This is inspired by other cloud providers that implement similar behavior, like #4489. A sketch of how the failed instance could surface to clusterstate follows below.
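As a rough illustration of how the failed instance could reach the cluster state registry, here is a sketch of a `Nodes()` implementation that attaches the status computed by the helper sketched above; `scaleSetVM` and `exampleNodeGroup` are made-up types for the example, not the provider's real ones:

```go
package azure

import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// scaleSetVM is an illustrative stand-in for the VMSS VM data the provider
// already tracks; it is not the real provider type.
type scaleSetVM struct {
	ID                string
	ProvisioningState string
}

type exampleNodeGroup struct {
	vms []scaleSetVM
}

// Nodes returns one cloudprovider.Instance per scale set VM, reusing the
// instanceStatusFromProvisioningState helper sketched earlier. A VM stuck in
// a "Failed" provisioning state carries ErrorInfo, which the cluster state
// registry reads to back the nodegroup off well before maxNodeProvisionTime.
func (ng *exampleNodeGroup) Nodes() ([]cloudprovider.Instance, error) {
	instances := make([]cloudprovider.Instance, 0, len(ng.vms))
	for _, vm := range ng.vms {
		instances = append(instances, cloudprovider.Instance{
			Id:     "azure://" + vm.ID,
			Status: instanceStatusFromProvisioningState(vm.ProvisioningState),
		})
	}
	return instances, nil
}
```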
Testing Methodology
A nodegroup using an instance type that was actively experiencing Azure capacity issues was selected. A scale-up was triggered for that nodegroup, which promptly failed. The nodegroup was put into backoff in ~20s, and another compatible nodegroup could be scaled up instead, without having to wait for `maxNodeProvisionTime` (15m).

Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: