---
title: configurable-dns-pod-placement
authors:
- "@Miciah"
reviewers:
- "@candita"
- "@danehans"
- "@frobware"
- "@knobunc"
- "@miheer"
- "@rfredette"
- "@sgreene570"
approvers:
- "@danehans"
- "@frobware"
- "@knobunc"
creation-date: 2021-02-23
last-updated: 2021-02-23
status: implementable
see-also:
replaces:
superseded-by:
---

# Configurable DNS Pod Placement

This enhancement enables cluster administrators to configure the placement of
the CoreDNS Pods that provide cluster DNS service.

## Release Signoff Checklist

- [X] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [X] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

The DNS operator in OpenShift 4.7 and prior versions manages a DaemonSet that
serves two functions: running CoreDNS and managing node hosts' `/etc/hosts`
files. This enhancement, in OpenShift 4.8, replaces this single DaemonSet with
two: one DaemonSet for CoreDNS and one DaemonSet for managing `/etc/hosts`.
Additionally, this enhancement adds an API to enable cluster administrators to
configure the placement of the CoreDNS Pods.

## Motivation

OpenShift 4.7 uses a single DaemonSet for both CoreDNS and for managing node
hosts' `/etc/hosts` files. Specifically, this DaemonSet has a container that
adds an entry for the cluster image registry to `/etc/hosts` to enable the
container runtime (which does not use the cluster DNS service) to resolve and
thus pull from the cluster image registry.

Because `/etc/hosts` needs to be managed on every node host, this DaemonSet must
run on every node host. Moreover, management of `/etc/hosts` is a critical
service because the node host may fail to pull images (including those of core
components) absent an entry for the cluster image registry in `/etc/hosts`.
Consequently, the DaemonSet has a toleration for all taints so that a DNS Pod
always runs on every node.

Some cluster administrators require the ability to configure DNS not to run on
certain nodes. For example, security policies may prohibit communication
between certain pairs of nodes; a DNS query from an arbitrary Pod on some node A
to the DNS Pod on some other node B might fail if some security policy prohibits
communication between node A and node B.

Splitting CoreDNS and management of `/etc/hosts` into separate DaemonSets makes
it possible to remove the blanket toleration for all taints from the CoreDNS
DaemonSet while keeping the blanket toleration on the DaemonSet that manages
`/etc/hosts`. Splitting the DaemonSet also makes it possible to support a
custom node selector on the CoreDNS DaemonSet.

### Goals

1. Separate CoreDNS and management of `/etc/hosts` into separate DaemonSets.
2. Enable cluster administrators to control where the CoreDNS DaemonSet is scheduled.

### Non-Goals

1. Enable cluster administrators to control the placement of the DaemonSet that manages `/etc/hosts`.
2. Enforce security policies.

## Proposal

This enhancement has two distinct parts. First, the DNS operator, which manages
the "dns-default" DaemonSet, is modified to manage an additional "node-resolver"
DaemonSet, and the "dns-node-resolver" container, which manages `/etc/hosts`, is
moved from the "dns-default" DaemonSet to the new "node-resolver" DaemonSet. As
part of this change, the toleration for all taints is removed from the
"dns-default" DaemonSet. From the cluster administrator's perspective, this
DaemonSet split is an internal change.

Second, a new API is provided to enable cluster administrators to specify the
desired placement of the "dns-default" DaemonSet's Pods, which, as a result of
the first change, run only CoreDNS and no longer need to be scheduled on every
node. This new API is the user-facing part of this enhancement.

The DNS operator API is extended by adding an optional `NodePlacement` field
with type `DNSNodePlacement` to `DNSSpec`:

```go
// DNSSpec is the specification of the desired behavior of the DNS.
type DNSSpec struct {
	// ...

	// nodePlacement enables explicit control over the scheduling of DNS pods.
	//
	// If unset, defaults are used. See nodePlacement for more details.
	//
	// +optional
	NodePlacement DNSNodePlacement `json:"nodePlacement,omitempty"`
}
```

The `DNSNodePlacement` type has fields to specify a node selector and
tolerations:

```go
// DNSNodePlacement describes node scheduling configuration for DNS pods.
type DNSNodePlacement struct {
	// nodeSelector is the node selector applied to DNS pods.
	//
	// If unset, the default is the following:
	//
	//   beta.kubernetes.io/os: linux
	//
	// If set, the specified selector is used and replaces the default.
	//
	// +optional
	NodeSelector *metav1.LabelSelector `json:"nodeSelector,omitempty"`

	// tolerations is a list of tolerations applied to DNS pods.
	//
	// The default is an empty list.
	//
	// See https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
	//
	// +optional
	Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
}
```

By default, DNS Pods run on untainted Linux nodes. The `NodePlacement` field
enables cluster administrators to specify alternative parameters. For example,
the following DNS configuration specifies that DNS Pods should run only on "infra" nodes
(i.e., nodes that have the "node-role.kubernetes.io/infra" label):

```yaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
```

### Validation

Omitting `spec.nodePlacement` or its subfields specifies the default behavior.

The API validates that `spec.nodePlacement.nodeSelector`, if specified, is a
valid node selector and that `spec.nodePlacement.tolerations`, if specified, is
a list of valid tolerations.
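
The enhancement does not prescribe where this validation runs, but a minimal
sketch of the checks is shown below. It assumes the selector is parsed with
`metav1.LabelSelectorAsSelector` and that tolerations are checked for internal
consistency; the helper name `validateNodePlacement` is illustrative and not
part of the proposed API.

```go
package validation

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// validateNodePlacement is a hypothetical helper illustrating the checks
// described above; it is not part of the proposed API.
func validateNodePlacement(nodeSelector *metav1.LabelSelector, tolerations []corev1.Toleration) error {
	// An unset selector means "use the default", so only validate it if set.
	if nodeSelector != nil {
		if _, err := metav1.LabelSelectorAsSelector(nodeSelector); err != nil {
			return fmt.Errorf("invalid spec.nodePlacement.nodeSelector: %w", err)
		}
	}
	for i, t := range tolerations {
		// A toleration that uses the "Exists" operator must not set a value.
		if t.Operator == corev1.TolerationOpExists && t.Value != "" {
			return fmt.Errorf("spec.nodePlacement.tolerations[%d]: value must be empty when operator is Exists", i)
		}
	}
	return nil
}
```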

### User Stories

#### As a cluster administrator, I must comply with a security policy that prohibits communication among worker nodes

To satisfy this use-case, the cluster administrator can specify a node selector
that includes only control-plane nodes, using the new
`spec.nodePlacement.nodeSelector` API field as follows:

```yaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/master: ""
```

#### As a cluster administrator, I want to allow DNS Pods to run on nodes that have a taint that has key "dns-only" and effect `NoSchedule`

To satisfy this use-case, the cluster administrator can specify a toleration for
the taint in question as follows:

```yaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  nodePlacement:
    tolerations:
    - effect: NoSchedule
      key: "dns-only"
      operator: Exists
```

### Implementation Details

Implementing this enhancement requires changes in the following repositories:

* openshift/api
* openshift/cluster-dns-operator

The DNS operator is modified to manage both the "dns-default" DaemonSet and the
"node-resolver" DaemonSet, both in the "openshift-dns" namespace. If the
"dns-default" DaemonSet already exists, the "dns-node-resolver" container is
removed from it, as are any tolerations and node selectors that are not
configured through the new API. The operator is modified to apply the
configured tolerations and node selector to the "dns-default" DaemonSet. The
"node-resolver" DaemonSet is configured to tolerate all taints (as the
"dns-default" DaemonSet does in OpenShift 4.7 and earlier) and to run on all
Linux nodes.
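
The following is a minimal sketch of how the operator might project the new API
onto the "dns-default" Pod template. It assumes the `DNSNodePlacement` type
proposed above is in scope, honors only `matchLabels` (a selector that uses
`matchExpressions` would instead need to be translated into node affinity), and
uses the illustrative helper name `applyDNSPodPlacement`.

```go
package operator

import (
	corev1 "k8s.io/api/core/v1"
)

// applyDNSPodPlacement is a hypothetical helper that copies the configured
// node placement onto the "dns-default" Pod template, falling back to the
// defaults described in the API godoc when the fields are unset.
func applyDNSPodPlacement(placement DNSNodePlacement, podSpec *corev1.PodSpec) {
	// Default: schedule only onto Linux nodes.
	nodeSelector := map[string]string{"beta.kubernetes.io/os": "linux"}
	if placement.NodeSelector != nil && len(placement.NodeSelector.MatchLabels) > 0 {
		// Assumption: only matchLabels is honored in this sketch.
		nodeSelector = placement.NodeSelector.MatchLabels
	}
	podSpec.NodeSelector = nodeSelector

	// Default: an empty toleration list, so tainted nodes are excluded.
	podSpec.Tolerations = placement.Tolerations
}
```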

### Risks and Mitigations

A cluster administrator could configure a node selector, or taint all nodes, in
such a way that DNS Pods could not be scheduled to any node, rendering the DNS
service unavailable.

Because the DNS service is critical to other cluster components, including
OAuth, it could be impossible for the cluster administrator to fix
misconfigured DNS Pod placement parameters.

To mitigate this risk, the DNS operator could verify the desired node placement
parameters before applying them by listing nodes and checking that at least one
matches the specified criteria. This mitigation has the drawback that it would
not help if nodes were tainted or relabeled such that DNS Pods were removed
after the DNS Pod placement had been configured.

As an alternative or complementary mitigation, the DNS operator could revert the
"dns-default" DaemonSet to the default node selector and a blanket toleration
for all taints if it detected that no DNS Pods were scheduled to any node.
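
Both mitigations rest on the same underlying check: whether at least one node
matches the configured (or default) node selector. The sketch below illustrates
that check under the assumption that the operator already has the node list
from an informer; it deliberately ignores taints, which a complete check would
also need to consider, and the helper name `nodesMatchingPlacement` is
illustrative.

```go
package operator

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// nodesMatchingPlacement counts the nodes that match the configured node
// selector (or the default selector if none is configured). A count of zero
// would trigger the mitigation: refuse to apply the change, or revert the
// "dns-default" DaemonSet to its defaults.
func nodesMatchingPlacement(nodes []corev1.Node, nodeSelector *metav1.LabelSelector) (int, error) {
	selector := labels.Set{"beta.kubernetes.io/os": "linux"}.AsSelector()
	if nodeSelector != nil {
		var err error
		selector, err = metav1.LabelSelectorAsSelector(nodeSelector)
		if err != nil {
			return 0, err
		}
	}
	matching := 0
	for _, node := range nodes {
		if selector.Matches(labels.Set(node.Labels)) {
			matching++
		}
	}
	return matching, nil
}
```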

## Design Details

### Test Plan

Unit tests are added to verify the functionality of the new API. Additionally,
an end-to-end test is added that configures an invalid node selector; the test
then verifies that the operator detects that the node selector prohibits
scheduling any Pods and that the operator reverts the change.
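
The end-to-end test described above might look roughly like the following
sketch, assuming a controller-runtime client whose scheme includes the
`operator.openshift.io/v1` and core Kubernetes types, and assuming the API is
extended as proposed in this enhancement; the label key and polling intervals
are arbitrary.

```go
package e2e

import (
	"context"
	"testing"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	operatorv1 "github.com/openshift/api/operator/v1"
)

// testInvalidNodeSelectorIsReverted configures a node selector that matches no
// nodes and expects the operator to notice and revert the change rather than
// leave the cluster without DNS Pods.
func testInvalidNodeSelectorIsReverted(t *testing.T, cl client.Client) {
	ctx := context.Background()

	dns := &operatorv1.DNS{}
	if err := cl.Get(ctx, types.NamespacedName{Name: "default"}, dns); err != nil {
		t.Fatalf("failed to get dns 'default': %v", err)
	}

	// Assumption: the DNS API has the nodePlacement field proposed above.
	dns.Spec.NodePlacement.NodeSelector = &metav1.LabelSelector{
		MatchLabels: map[string]string{"e2e.openshift.test/no-such-label": "true"},
	}
	if err := cl.Update(ctx, dns); err != nil {
		t.Fatalf("failed to update dns 'default': %v", err)
	}

	// Wait for the operator to revert the change and for DNS Pods to be
	// scheduled again.
	err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		ds := &appsv1.DaemonSet{}
		name := types.NamespacedName{Namespace: "openshift-dns", Name: "dns-default"}
		if err := cl.Get(ctx, name, ds); err != nil {
			return false, nil
		}
		_, stillSet := ds.Spec.Template.Spec.NodeSelector["e2e.openshift.test/no-such-label"]
		return !stillSet && ds.Status.NumberAvailable > 0, nil
	})
	if err != nil {
		t.Fatalf("operator did not revert the invalid node selector: %v", err)
	}
}
```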

### Graduation Criteria

N/A.

### Upgrade / Downgrade Strategy

On upgrade, the DNS operator removes the "dns-node-resolver" container from the
existing "dns-default" DaemonSet and creates a new "node-resolver" DaemonSet.

On downgrade, the DNS operator may restore the "dns-node-resolver" container in
the "dns-default" DaemonSet and leave the "node-resolver" DaemonSet, which would
redundantly update `/etc/hosts`. However, both "dns-node-resolver" containers
should write the same content to `/etc/hosts`, and they write the file
atomically, so the redundant updates should not cause conflicts.
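
The reasoning above relies on atomic replacement of `/etc/hosts`. The sketch
below shows the generic write-temp-then-rename pattern that provides this
property; it is only an illustration of why concurrent updaters never expose a
partially written file, not a description of how the "dns-node-resolver"
container actually performs the update.

```go
package main

import (
	"os"
	"path/filepath"
)

// updateHostsAtomically replaces the file at path with content in one step.
// A real implementation would also need to preserve the file's mode and
// ownership.
func updateHostsAtomically(path string, content []byte) error {
	// Write the new content to a temporary file in the same directory so
	// that the final rename stays on the same filesystem.
	tmp, err := os.CreateTemp(filepath.Dir(path), ".hosts-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op if the rename below succeeds

	if _, err := tmp.Write(content); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// rename(2) atomically replaces the target, so readers always see either
	// the old file or the new file, never a mix of the two.
	return os.Rename(tmp.Name(), path)
}
```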

### Version Skew Strategy

N/A.

## Implementation History

- 2018-10-05, in OCP 4.0, [openshift/cluster-dns-operator#34 update resources to
avoid openshift cycles by
deads2k](https://github.com/openshift/cluster-dns-operator/pull/34) added a
blanket toleration for all taints.
- 2019-11-06, in OCP 4.3, [openshift/cluster-dns-operator#140 Bug 1753059: Don't
start DNS on NotReady nodes by
ironcladlou](https://github.com/openshift/cluster-dns-operator/pull/140)
changed the blanket toleration to a toleration for a narrower set of taints in
order to avoid scheduling the DNS Pod on nodes without networking.
- 2020-05-29, in OCP 4.5, [openshift/cluster-dns-operator#171 Bug 1813479:
Tolerate all taints by
Miciah](https://github.com/openshift/cluster-dns-operator/pull/171) reverted
#140 and restored the blanket toleration. This change was then backported to
OCP 4.4 with
[#179](https://github.com/openshift/cluster-dns-operator/pull/179) and to OCP
4.3 with [#186](https://github.com/openshift/cluster-dns-operator/pull/186),
with the result that DNS Pods tolerate all taints with the latest z-stream
release of every OpenShift release up to and including OpenShift 4.7.

## Alternatives

Approaches to configure the DNS service to prefer a node-local DNS Pod have been
investigated. However, preferring a node-local endpoint would not prevent
inter-node traffic if no node-local endpoint were available (for example, during
a rolling upgrade of the DNS Pods) and would not address other use-cases where
a cluster administrator does not want DNS Pods running on certain nodes.

Configuring the container runtime to use the cluster DNS service has been
considered. If the container runtime used the cluster DNS service, then no
entry for the cluster image registry would be needed in `/etc/hosts`, and the
"dns-node-resolver" container could be removed entirely. However, avoiding a
bootstrap problem would be difficult with this approach: The container runtime
requires DNS to pull images, but the DNS operator and DNS Pods cannot start
until the container runtime has pulled their images.
