From 8ab6b6d91a9bcaa38383c1e429c109b7b1fdf26e Mon Sep 17 00:00:00 2001
From: "Dr. Stefan Schimanski"
Date: Thu, 14 Oct 2021 17:31:55 +0200
Subject: [PATCH] Add more examples

---
 guidelines/enhancement_template.md | 41 +++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/guidelines/enhancement_template.md b/guidelines/enhancement_template.md
index ceeec9ac43a..e2075e10aeb 100644
--- a/guidelines/enhancement_template.md
+++ b/guidelines/enhancement_template.md
@@ -280,10 +280,10 @@ enhancement:
 
 ### Impact of API Extensions
 
-Describe the API extensions here in details, especially their impact on a cluster:
+Describe the API extensions here in detail, especially their impact on a cluster:
 
-- what are the SLIs (Service Level Indicators) an operator can use to determine the health of
-  the API extensions
+- what are the SLIs (Service Level Indicators) an administrator or support can use to
+  determine the health of the API extensions
 
   Examples: metrics, alerts, operator conditions
 - which impact do these API extensions have on existing SLIs (e.g. scalability, API throughput,
@@ -310,12 +310,45 @@ Describe the API extensions here in details, especially their impact on a cluste
 describe how to
 - detect the failure modes in a support situation, describe possible symptoms (events, metrics,
   alerts, which log output in which component)
-- disable the API extension
+
+  Examples:
+  - if the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz".
+  - operator X will degrade with message "Failed to launch webhook server" and reason "WebhookServerFailed"
+  - the metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
+    will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.
+
+- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`)
+
   - which consequences does it have on the cluster health?
+
+    Examples:
+    - garbage collection in kube-controller-manager will stop working.
+    - quota will be wrongly computed.
+
   - which consequences does it have on existing, running workloads?
+
+    Examples:
+    - new namespaces won't get the finalizer "xyz" and hence might leak resource X
+      when deleted
+    - SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod
+      communication after some minutes.
+
   - which consequences does it have for newly created workloads?
+
+    Examples:
+    - new pods in namespace with Istio support will not get sidecars injected, breaking
+      their networking
+
   - does functionality fail gracefully and will resume work when re-enabled without risking
     consistency?
+
+    Examples:
+    - the mutating admission webhook "xyz" has FailPolicy=Ignore and hence
+      will not block the creation or updates on objects when it fails. And when the
+      webhook comes back online, there is a controller reconciling all objects, applying
+      labels that were not applied during admission while the webhook was down.
+    - namespace deletion will not delete all objects in etcd, leading to zombie
+      objects when an equally named namespace is created.
 
 ## Implementation History
 