Azure: Fast nodegroup backoff on failed provisioning #5548
Conversation
Force-pushed 13b27ac to ad7695a
/lgtm
@tallaxes a gentle bump, if you have a chance to take a look at this.
…p sooner

When Azure fails to provision a node for a nodegroup due to an instance capacity issue ((Zonal)AllocationFailed) or another reason, the VMSS size increase is still reflected, but the new instance gets the status `ProvisioningStateFailed`. This now bubbles the error up to the `cloudprovider.Instance`, where it can be used in `clusterstate` to put the nodegroup into backoff sooner.
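A minimal sketch of that mapping, assuming the cluster-autoscaler `cloudprovider` types (`InstanceStatus`, `InstanceErrorInfo`, `OutOfResourcesErrorClass`); the helper name `instanceStatusFromProvisioningState` and the `provisioningStateFailed` constant are illustrative, not the PR's exact code:

```go
package azure

import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// provisioningStateFailed mirrors the VMSS provisioning state string; the
// constant name and value here are illustrative.
const provisioningStateFailed = "Failed"

// instanceStatusFromProvisioningState maps a VMSS VM provisioning state onto
// a cloudprovider.InstanceStatus. A failed provisioning (e.g.
// (Zonal)AllocationFailed) is reported as a still-creating instance with
// ErrorInfo attached, which clusterstate treats as a scale-up failure.
func instanceStatusFromProvisioningState(provisioningState string) *cloudprovider.InstanceStatus {
	status := &cloudprovider.InstanceStatus{}
	switch provisioningState {
	case "Succeeded":
		status.State = cloudprovider.InstanceRunning
	case provisioningStateFailed:
		status.State = cloudprovider.InstanceCreating
		status.ErrorInfo = &cloudprovider.InstanceErrorInfo{
			ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
			ErrorCode:    "provisioning-state-failed", // illustrative error code
			ErrorMessage: "Azure failed to provision the instance",
		}
	default:
		status.State = cloudprovider.InstanceCreating
	}
	return status
}
```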
Force-pushed ad7695a to 066315c
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: domenicbozzuto, tallaxes. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/area provider/azure
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds a case to the Azure VMSS instance status check to cover an instance that failed provisioning. If the instance fails to provision (generally because Azure is unable to provide capacity for the instance type), the error percolates to the cluster state registry and puts the nodegroup into backoff, faster than if it had to wait for `maxNodeProvisionTime`. This is inspired by other cloud providers that implement similar behavior, like #4489. A sketch of how the failed instance could surface to clusterstate follows below.
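As a rough illustration of how the failed instance could reach the cluster state registry, here is a sketch of a `Nodes()` implementation that attaches the status computed by the helper sketched above; `scaleSetVM` and `exampleNodeGroup` are made-up types for the example, not the provider's real ones:

```go
package azure

import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// scaleSetVM is an illustrative stand-in for the VMSS VM data the provider
// already tracks; it is not the real provider type.
type scaleSetVM struct {
	ID                string
	ProvisioningState string
}

type exampleNodeGroup struct {
	vms []scaleSetVM
}

// Nodes returns one cloudprovider.Instance per scale set VM, reusing the
// instanceStatusFromProvisioningState helper sketched earlier. A VM stuck in
// a "Failed" provisioning state carries ErrorInfo, which the cluster state
// registry reads to back the nodegroup off well before maxNodeProvisionTime.
func (ng *exampleNodeGroup) Nodes() ([]cloudprovider.Instance, error) {
	instances := make([]cloudprovider.Instance, 0, len(ng.vms))
	for _, vm := range ng.vms {
		instances = append(instances, cloudprovider.Instance{
			Id:     "azure://" + vm.ID,
			Status: instanceStatusFromProvisioningState(vm.ProvisioningState),
		})
	}
	return instances, nil
}
```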
Testing Methodology
A nodegroup using an instance type that was actively experiencing Azure capacity issues was selected. A scale-up was triggered for that nodegroup, which promptly failed. The nodegroup was put into backoff in ~20s, and another compatible nodegroup could be scaled up instead, without having to wait for `maxNodeProvisionTime` (15m).

Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: