diff --git a/enhancements/machine-api/machine-health-checking.md b/enhancements/machine-api/machine-health-checking.md
index a08dbeed655..978952a8c67 100644
--- a/enhancements/machine-api/machine-health-checking.md
+++ b/enhancements/machine-api/machine-health-checking.md
@@ -96,6 +96,12 @@ For a node notFound or a failed machine, the machine is considerable unrecoverab
 - The machine controller provider implementation deletes the cloud instance.
 - The machine controller deletes the machine resource.
 
+### User Stories
+
+- I want my worker machines to be remediated when the backed node has `ready=false` or `ready=Unknown` condition for more than 10m.
+- I want remediation to temporary short-circuit if the 40% or more of the targets of this pool are unhealthy at the same time.
+- I want no remediation to happen while my cluster is upgrading its machines / nodes.
+
 ### Implementation Details
 
 #### MachineHealthCheck CRD
@@ -104,11 +110,6 @@ For a node notFound or a failed machine, the machine is considerable unrecoverab
 - Enable setting a threshold of unhealthy nodes. If the current number is at or above this threshold no further remediation will take place. This can be expressed as an int or as a percentage of the total targets in the pool.
 - Enable pausing of remediation
 
-E.g:
-- I want my worker machines to be remediated when the backed node has `ready=false` or `ready=Unknown` condition for more than 10m.
-- I want remediation to temporary short-circuit if the 40% or more of the targets of this pool are unhealthy at the same time.
-- I want no remediation to happen while my cluster is upgrading its machines / nodes.
-
 ```yaml
 apiVersion: machine.openshift.io/v1beta1
 kind: MachineHealthCheck
@@ -190,6 +191,12 @@ This feature will be tested for public clouds in the e2e machine API suite as th
 ### Graduation Criteria
 An implementation of this feature is currently gated behind the `TechPreviewNoUpgrade` flag.
 This proposal wants to remove the gating flag and promote machine health check to a GA status with a beta API.
+#### Dev Preview -> Tech Preview
+
+#### Tech Preview -> GA
+
+#### Removing a deprecated feature
+
 ### Upgrade / Downgrade Strategy
 The machine health check controller lives in the machine-api-operator image so the upgrades will be driven by the CVO which will fetch the right image version as usual. See: