From 8ab6b6d91a9bcaa38383c1e429c109b7b1fdf26e Mon Sep 17 00:00:00 2001
From: "Dr. Stefan Schimanski" <stefan.schimanski@gmail.com>
Date: Thu, 14 Oct 2021 17:31:55 +0200
Subject: [PATCH] Add more examples

---
 guidelines/enhancement_template.md | 41 +++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/guidelines/enhancement_template.md b/guidelines/enhancement_template.md
index ceeec9ac43a..e2075e10aeb 100644
--- a/guidelines/enhancement_template.md
+++ b/guidelines/enhancement_template.md
@@ -280,10 +280,10 @@ enhancement:
 
 ### Impact of API Extensions
 
-Describe the API extensions here in details, especially their impact on a cluster:
+Describe the API extensions here in detail, especially their impact on a cluster:
  
-- what are the SLIs (Service Level Indicators) an operator can use to determine the health of
-  the API extensions
+- what are the SLIs (Service Level Indicators) an administrator or support can use to
+  determine the health of the API extensions
 
   Examples: metrics, alerts, operator conditions
 - which impact do these API extensions have on existing SLIs (e.g. scalability, API throughput,
@@ -310,12 +310,45 @@ Describe the API extensions here in details, especially their impact on a cluste
 describe how to
 - detect the failure modes in a support situation, describe possible symptoms (events, metrics,
   alerts, which log output in which component)
-- disable the API extension
+
+  Examples:
+  - if the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz".
+  - operator X will degrade with message "Failed to launch webhook server" and reason "WebhookServerFailed"
+  - the metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
+    will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.
+
+- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`)
+    
     - which consequences does it have on the cluster health?
+    
+      Examples:
+      - garbage collection in kube-controller-manager will stop working.
+      - quota will be wrongly computed.
+
     - which consequences does it have on existing, running workloads?
+  
+      Examples:
+      - new namespaces won't get the finalizer "xyz" and hence might leak resource X
+        when deleted
+      - SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod
+        communication after some minutes.
+      
     - which consequences does it have for newly created workloads?
+  
+      Examples:
+      - new pods in namespaces with Istio support will not get sidecars injected, breaking
+        their networking
+
     - does functionality fail gracefully and will resume work when re-enabled without risking
       consistency?
+  
+      Examples:
+      - the mutating admission webhook "xyz" has failurePolicy=Ignore and hence
+        will not block the creation or update of objects when it fails. When the
+        webhook comes back online, a controller reconciles all objects, applying
+        labels that were not applied during admission while it was down.
+      - namespace deletion will not delete all objects in etcd, leading to zombie
+        objects when another namespace with the same name is created.
 
 ## Implementation History