oci provider: fail fast, recover fast, when instance-pool/node-group is out of capacity #5335

jlamillan · 2022-11-28T23:38:39Z

Which component this PR applies to?

cluster-autoscaler (OCI provider)

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR lets the OCI provider (Instance Pools) fail fast, and potentially recover fast, when the instance-pool backing a node-group has run out of quota or capacity.

It takes an approach similar to this PR: it monitors the cloud provider's work-request queue for capacity/quota errors and quickly fails any ongoing scale-up operation when one is detected. Additionally, it creates "placeholder instances" for cloud instances that have not yet been fulfilled. This lets the Cluster Autoscaler short-circuit the --max-node-provision-time timeout and gives it (unfulfilled) instances to delete. As a result, the Cluster Autoscaler can (1) detect the failure fast and (2) try to scale a different node-group to fulfill the missing instances (provided that node-group meets the node requirements).
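For reviewers who want a quick mental model, here is a rough sketch of the two ideas above: classifying work-request errors as capacity/quota failures, and creating placeholder instances for the unfulfilled part of a scale-up. This is not the code in this PR. The types (`WorkRequestError`, `Instance`), the functions, the marker strings, and the `"OutOfResources"` label are all hypothetical names chosen for illustration, and the real OCI error text may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// WorkRequestError is a hypothetical, simplified stand-in for an error
// entry attached to a cloud-provider work request.
type WorkRequestError struct {
	Code    string
	Message string
}

// Instance is a hypothetical, simplified view of a node-group instance.
type Instance struct {
	ID          string
	Placeholder bool   // true if no cloud instance backs this entry yet
	ErrorClass  string // "" or "OutOfResources" (illustrative label)
}

// capacityMarkers are illustrative substrings only (an assumption);
// the actual error text returned by the cloud provider may differ.
var capacityMarkers = []string{"out of host capacity", "quota", "limit exceeded"}

// isCapacityError reports whether a work-request error looks like a
// capacity or quota failure, using a case-insensitive substring match.
func isCapacityError(e WorkRequestError) bool {
	text := strings.ToLower(e.Code + " " + e.Message)
	for _, marker := range capacityMarkers {
		if strings.Contains(text, marker) {
			return true
		}
	}
	return false
}

// hasCapacityError reports whether any error in the list is capacity-related.
func hasCapacityError(errs []WorkRequestError) bool {
	for _, e := range errs {
		if isCapacityError(e) {
			return true
		}
	}
	return false
}

// placeholders returns one placeholder per requested-but-unfulfilled
// instance. When outOfCapacity is true, each placeholder carries an error
// class so the caller can fail the scale-up without waiting for a timeout.
func placeholders(poolID string, target, actual int, outOfCapacity bool) []Instance {
	var out []Instance
	for i := actual; i < target; i++ {
		inst := Instance{
			ID:          fmt.Sprintf("%s-placeholder-%d", poolID, i),
			Placeholder: true,
		}
		if outOfCapacity {
			inst.ErrorClass = "OutOfResources"
		}
		out = append(out, inst)
	}
	return out
}

func main() {
	// Pretend the work-request queue returned a capacity error while the
	// pool was scaling from 1 to 3 instances.
	errs := []WorkRequestError{{Code: "InternalError", Message: "Out of host capacity."}}
	failed := hasCapacityError(errs)
	for _, inst := range placeholders("pool-a", 3, 1, failed) {
		fmt.Printf("%s placeholder=%t error=%q\n", inst.ID, inst.Placeholder, inst.ErrorClass)
	}
}
```

In this sketch the placeholders are what the autoscaler would see as errored, deletable instances, so it can drop them and retry on another node-group instead of waiting out --max-node-provision-time.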

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

These changes have been running in production for a number of weeks.

Does this PR introduce a user-facing change?

Support monitoring instance-pool work-requests for capacity/quota issues during scale-up.

jlamillan · 2022-11-28T23:42:56Z

FYI @gjtempleton, @x13n: thanks for mentioning in the Autoscaling SIG the other week that AWS and GKE take a similar approach.

mwielgus

/lgtm
/approve

k8s-ci-robot · 2022-12-12T15:18:03Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlamillan, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [mwielgus]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the kind/feature (Categorizes issue or PR as related to a new feature.), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.), and size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.) labels Nov 28, 2022

k8s-ci-robot requested review from feiskyer and x13n November 28, 2022 23:39

jlamillan added 3 commits November 29, 2022 14:24

Handle pagination when looking through supported shapes.

f2ccfb5

Add OCI API files to handle OCI work-request operations.

c4c611e

Fail fast if OCI instance pool is out of capacity/quota.

fd3fbd0

jlamillan force-pushed the jlamillan/oci-provider-fail-fast-capacity branch from f0d25c6 to fd3fbd0 Compare November 29, 2022 22:24

jbartosik added the area/cluster-autoscaler label Dec 5, 2022

mwielgus approved these changes Dec 12, 2022


k8s-ci-robot assigned mwielgus Dec 12, 2022

k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label Dec 12, 2022

k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label Dec 12, 2022

k8s-ci-robot merged commit 3806348 into kubernetes:master Dec 12, 2022

jlamillan mentioned this pull request Dec 12, 2022

REQUEST: New membership for jlamillan kubernetes/org#3891

Closed

9 tasks
