dns: Add configurable-dns-pod-placement enhancement

openshift · Feb 23, 2021 · 1dea562
1 parent 8090781
commit 1dea562
Showing 1 changed file with 306 additions and 0 deletions.
diff --git a/enhancements/dns/configurable-dns-pod-placement.md b/enhancements/dns/configurable-dns-pod-placement.md
@@ -0,0 +1,306 @@
+---
+title: configurable-dns-pod-placement
+authors:
+  - "@Miciah"
+reviewers:
+  - "@candita"
+  - "@danehans"
+  - "@frobware"
+  - "@knobunc"
+  - "@miheer"
+  - "@rfredette"
+  - "@sgreene570"
+approvers:
+  - "@danehans"
+  - "@frobware"
+  - "@knobunc"
+creation-date: 2021-02-23
+last-updated: 2021-02-23
+status: implementable
+see-also: 
+replaces:
+superseded-by:
+---
+
+# Configurable DNS Pod Placement
+
+This enhancement enables cluster administrators to configure the placement of
+the CoreDNS Pods that provide cluster DNS service.
+
+## Release Signoff Checklist
+
+- [X] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [X] Test plan is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+The DNS operator in OpenShift 4.7 and prior versions manages a DaemonSet that
+serves two functions: running CoreDNS and managing node hosts' `/etc/hosts`
+files.  This enhancement, in OpenShift 4.8, replaces this single DaemonSet with
+two: one DaemonSet for CoreDNS and one DaemonSet for managing `/etc/hosts`.
+Additionally, this enhancement adds an API to enable cluster administrators to
+configure the placement of the CoreDNS Pods.
+
+## Motivation
+
+OpenShift 4.7 uses a single DaemonSet for both CoreDNS and for managing node
+hosts' `/etc/hosts` files.  Specifically, this DaemonSet has a container that
+adds an entry for the cluster image registry to `/etc/hosts` to enable the
+container runtime (which does not use the cluster DNS service) to resolve and
+thus pull from the cluster image registry.
+
+Because `/etc/hosts` needs to be managed on every node host, this DaemonSet must
+run on every node host.  Moreover, management of `/etc/hosts` is a critical
+service because the node host may fail to pull images (including those of core
+components) absent an entry for the cluster image registry in `/etc/hosts`.
+Consequently the DaemonSet has a toleration for all taints so that the DNS Pod
+always runs on all nodes.
+
+Some cluster administrators require the ability to configure DNS not to run on
+certain nodes.  For example, security policies may prohibit communication
+between certain pairs of nodes; a DNS query from an arbitrary Pod on some node A
+to the DNS Pod on some other node B might fail if some security policy prohibits
+communication between node A and node B.
+
+Splitting CoreDNS and management of `/etc/hosts` into separate DaemonSets makes
+it possible to remove the blanket toleration for all taints from the CoreDNS
+DaemonSet while keeping the blanket toleration on the DaemonSet that manages
+`/etc/hosts`.  Splitting the DaemonSet also makes it possible to enable use of a
+custom node selector on the CoreDNS DaemonSet.
+
+### Goals
+
+1. Separate CoreDNS and management of `/etc/hosts` into separate DaemonSets.
+2. Enable cluster administrators to control where the CoreDNS DaemonSet is scheduled.
+
+### Non-Goals
+
+1. Enable cluster administrators to control the placement of the DaemonSet that manages `/etc/hosts`.
+2. Enforce security policies.
+
+## Proposal
+
+This enhancement has two distinct parts.  First, the DNS operator, which manages
+the "dns-default" DaemonSet, is modified to manage an additional "node-resolver"
+DaemonSet, and the "dns-node-resolver" container, which manages `/etc/hosts`, is
+moved from the "dns-default" DaemonSet to a new "node-resolver" DaemonSet.  As
+part of this change, the toleration for all taints is removed from the
+"dns-default" DaemonSet.  From the cluster administrator's perspective, this
+DaemonSet split is an internal change.
+
+Second, a new API is provided to enable cluster administrators to specify the
+desired placement of the "dns-default" DaemonSet's Pods, which, due to the first
+change, only run CoreDNS and no longer must be scheduled to every node.  This
+new API is the user-facing part of this enhancement.
+
+The DNS operator API is extended by adding an optional `NodePlacement` field
+with type `DNSNodePlacement` to `DNSSpec`:
+
+```go
+// DNSSpec is the specification of the desired behavior of the DNS.
+type DNSSpec struct {
+	// ...
+
+	// nodePlacement enables explicit control over the scheduling of DNS pods.
+	//
+	// If unset, defaults are used. See nodePlacement for more details.
+	//
+	// +optional
+	NodePlacement DNSNodePlacement `json:"nodePlacement,omitempty"`
+}
+```
+
+The `DNSNodePlacement` type has fields to specify a node selector and
+tolerations:
+
+```go
+// DNSNodePlacement describes node scheduling configuration for DNS pods.
+type DNSNodePlacement struct {
+	// nodeSelector is the node selector applied to DNS pods.
+	//
+	// If unset, the default is the following:
+	//
+	//   beta.kubernetes.io/os: linux
+	//
+	// If set, the specified selector is used and replaces the default.
+	//
+	// +optional
+	NodeSelector *metav1.LabelSelector `json:"nodeSelector,omitempty"`
+
+	// tolerations is a list of tolerations applied to DNS pods.
+	//
+	// The default is an empty list.
+	//
+	// See https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
+	//
+	// +optional
+	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
+}
+```
+
+By default, DNS Pods run on untainted Linux nodes.  The `NodePlacement` field
+enables cluster administrators to specify alternative parameters.  For example,
+the following DNS specifies that DNS Pods should run only on "infra" nodes
+(i.e., nodes that have the "node-role.kubernetes.io/infra" label):
+
+```yaml
+apiVersion: operator.openshift.io/v1
+kind: DNS
+metadata:
+  name: default
+spec:
+  nodePlacement:
+    nodeSelector:
+      matchLabels:
+        node-role.kubernetes.io/infra: ""
+```
+
+### Validation
+
+Omitting `spec.nodePlacement` or its subfields specifies the default behavior.
+
+The API validates that `spec.nodePlacement.nodeSelector`, if specified, is a
+valid node selector and that `spec.nodePlacement.tolerations`, if specified, is
+a list of valid tolerations.
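+
+As an illustration of the defaults, and assuming they are exactly those
+documented in the `DNSNodePlacement` comments above, the following
+specification should be equivalent to omitting `spec.nodePlacement` entirely:
+
+```yaml
+apiVersion: operator.openshift.io/v1
+kind: DNS
+metadata:
+  name: default
+spec:
+  nodePlacement:
+    # Same selector that is documented as the default when nodeSelector is unset.
+    nodeSelector:
+      matchLabels:
+        beta.kubernetes.io/os: linux
+    # An empty list, which is the documented default for tolerations.
+    tolerations: []
+```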
+
+### User Stories
+
+#### As a cluster administrator, I must comply with a security policy that prohibits communication among worker nodes
+
+To satisfy this use-case, the cluster administrator can specify a node selector
+that includes only control-plane nodes, using the new
+`spec.nodePlacement.nodeSelector` API field as follows:
+
+```yaml
+apiVersion: operator.openshift.io/v1
+kind: DNS
+metadata:
+  name: default
+spec:
+  nodePlacement:
+    nodeSelector:
+      matchLabels:
+        node-role.kubernetes.io/master: ""
+```
+
+#### As a cluster administrator, I want to allow DNS Pods to run on nodes that have a taint that has key "dns-only" and effect `NoSchedule`
+
+To satisfy this use-case, the cluster administrator can specify a toleration for
+the taint in question as follows:
+
+```yaml
+apiVersion: operator.openshift.io/v1
+kind: DNS
+metadata:
+  name: default
+spec:
+  nodePlacement:
+    tolerations:
+    - effect: NoSchedule
+      key: "dns-only"
+      operator: Exists
+```
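+
+The two fields can also be combined.  As a sketch, suppose that infra nodes
+carry both the "node-role.kubernetes.io/infra" label and a `NoSchedule` taint
+with the same key (the taint is an assumption made for this example; it is not
+something this enhancement defines).  A DNS that restricts DNS Pods to those
+nodes and tolerates that taint might then look as follows:
+
+```yaml
+apiVersion: operator.openshift.io/v1
+kind: DNS
+metadata:
+  name: default
+spec:
+  nodePlacement:
+    # Schedule DNS Pods only to nodes that have the infra role label.
+    nodeSelector:
+      matchLabels:
+        node-role.kubernetes.io/infra: ""
+    # Tolerate the (assumed) taint that keeps other workloads off infra nodes.
+    tolerations:
+    - effect: NoSchedule
+      key: node-role.kubernetes.io/infra
+      operator: Exists
+```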
+
+### Implementation Details
+
+Implementing this enhancement requires changes in the following repositories:
+
+* openshift/api
+* openshift/cluster-dns-operator
+
+The DNS operator is modified to manage both the "dns-default" DaemonSet and the
+"node-resolver" DaemonSet, both in the "openshift-dns" namespace.  If the
+"dns-default" DaemonSet already has the "dns-node-resolver" container, the
+operator removes it, along with any tolerations and node selector that are not
+configured per the new API.  The operator then applies the configured
+tolerations and node selector to the "dns-default" DaemonSet.  The
+"node-resolver" DaemonSet is configured to tolerate all taints (as the
+"dns-default" DaemonSet does in OpenShift 4.7 and earlier) and to run on all
+Linux nodes.
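+
+A minimal sketch of how the scheduling-related parts of the "node-resolver"
+DaemonSet could look under this design.  The fragment is illustrative and is
+not copied from the operator: the node selector key is an assumption that
+mirrors the default documented for DNS Pods, and required fields such as the
+Pod selector, labels, and container image are omitted.
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: node-resolver
+  namespace: openshift-dns
+spec:
+  template:
+    spec:
+      # Assumed label key; restricts the Pods to Linux nodes.
+      nodeSelector:
+        beta.kubernetes.io/os: linux
+      # A toleration with no key and operator Exists matches every taint.
+      tolerations:
+      - operator: Exists
+      containers:
+      - name: dns-node-resolver
+```
+
+The "dns-default" DaemonSet would carry no such blanket toleration; its node
+selector and tolerations would come from `spec.nodePlacement` instead.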
+
+### Risks and Mitigations
+
+A cluster administrator could configure a node selector, or taint all nodes, in
+such a way that DNS Pods could not be scheduled to any node, rendering the DNS
+service unavailable.
+
+Because the DNS service is critical to other cluster components, including
+OAuth, the cluster administrator might be unable to fix misconfigured DNS Pod
+placement parameters once the DNS service is unavailable.
+
+As a mitigation to this risk, the DNS operator could verify the desired node
+placement parameters before applying them by listing nodes and verifying that at
+least one node matches the specified criteria.  This mitigation has the
+drawback that it does not help if nodes are tainted or relabeled, after the DNS
+Pod placement has been configured, such that the DNS Pods are removed.
+
+As an alternative or complementary mitigation, the DNS operator could revert the
+"dns-default" DaemonSet to the default node selector and a blanket toleration
+for all taints if it detected that no DNS Pods were scheduled to any node.
+
+## Design Details
+
+### Test Plan
+
+Unit tests are added to verify the functionality of the new API.  Additionally,
+an end-to-end test is added that configures a node selector that matches no
+nodes; the test then verifies that the operator detects that the node selector
+prevents any DNS Pods from being scheduled and that the operator reverts the
+change.
+
+### Graduation Criteria
+
+N/A.
+
+### Upgrade / Downgrade Strategy
+
+On upgrade, the DNS operator removes the "dns-node-resolver" container from the
+existing "dns-default" DaemonSet and creates a new "node-resolver" DaemonSet.
+
+On downgrade, the DNS operator may restore the "dns-node-resolver" container in
+the "dns-default" DaemonSet and leave the "node-resolver" DaemonSet, which would
+redundantly update `/etc/hosts`.  However, both "dns-node-resolver" containers
+should write the same content to `/etc/hosts`, and they write the file
+atomically, so the redundant updates should not cause conflicts.
+
+### Version Skew Strategy
+
+N/A.
+
+## Implementation History
+
+- 2018-10-05, in OCP 4.0, [openshift/cluster-dns-operator#34 update resources to
+  avoid openshift cycles by
+  deads2k](https://github.com/openshift/cluster-dns-operator/pull/34) added a
+  blanket toleration for all taints.
+- 2019-11-06, in OCP 4.3, [openshift/cluster-dns-operator#140 Bug 1753059: Don't
+  start DNS on NotReady nodes by
+  ironcladlou](https://github.com/openshift/cluster-dns-operator/pull/140)
+  changed the blanket toleration to a toleration for a narrower set of taints in
+  order to avoid scheduling the DNS Pod on nodes without networking.
+- 2020-05-29, in OCP 4.5, [openshift/cluster-dns-operator#171 Bug 1813479:
+  Tolerate all taints by
+  Miciah](https://github.com/openshift/cluster-dns-operator/pull/171) reverted
+  #140 and restored the blanket toleration.  This change was then backported to
+  OCP 4.4 with
+  [#179](https://github.com/openshift/cluster-dns-operator/pull/179) and to OCP
+  4.3 with [#186](https://github.com/openshift/cluster-dns-operator/pull/186),
+  with the result that DNS Pods tolerate all taints with the latest z-stream
+  release of every OpenShift release up to and including OpenShift 4.7.
+
+## Alternatives
+
+Approaches to configure the DNS service to prefer a node-local DNS Pod have been
+investigated.  However, preferring a node-local endpoint would not prevent
+inter-node traffic if no node-local endpoint were available (for example, during
+a rolling upgrade of the DNS Pods) and would not address other use-cases where
+a cluster administrator does not want DNS Pods running on certain nodes.
+
+Configuring the container runtime to use the cluster DNS service has been
+considered.  If the container runtime used the cluster DNS service, then no
+entry for the cluster image registry would be needed in `/etc/hosts`, and the
+"dns-node-resolver" container could be removed entirely.  However, avoiding a
+bootstrap problem would be difficult with this approach: The container runtime
+requires DNS to pull images, but the DNS operator and DNS Pods cannot start
+until the container runtime has pulled their images.