---
title: configurable-dns-pod-placement
authors:
- "@Miciah"
reviewers:
- "@candita"
- "@danehans"
- "@frobware"
- "@knobunc"
- "@miheer"
- "@rfredette"
- "@sgreene570"
approvers:
- "@danehans"
- "@frobware"
- "@knobunc"
creation-date: 2021-02-23
last-updated: 2021-03-29
status: implementable
see-also:
replaces:
superseded-by:
---

# Configurable DNS Pod Placement

This enhancement enables cluster administrators to configure the placement of
the CoreDNS Pods that provide cluster DNS service.

## Release Signoff Checklist

- [X] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [X] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

The DNS operator in OpenShift 4.7 and prior versions manages a DaemonSet that
serves two functions: running CoreDNS and managing node hosts' `/etc/hosts`
files. This enhancement, in OpenShift 4.8, replaces this single DaemonSet with
two: one DaemonSet for CoreDNS and one DaemonSet for managing `/etc/hosts`.
Additionally, this enhancement adds an API to enable cluster administrators to
configure the placement of the CoreDNS Pods.

## Motivation

OpenShift 4.7 uses a single DaemonSet both for running CoreDNS and for managing
node hosts' `/etc/hosts` files. Specifically, this DaemonSet has a container that
adds an entry for the cluster image registry to `/etc/hosts` to enable the
container runtime (which does not use the cluster DNS service) to resolve and
thus pull from the cluster image registry.

Because `/etc/hosts` needs to be managed on every node host, this DaemonSet must
run on every node host. Moreover, management of `/etc/hosts` is a critical
service because the node host may fail to pull images (including those of core
components) unless `/etc/hosts` has an entry for the cluster image registry.
Consequently, the DaemonSet has a toleration for all taints so that the DNS Pod
always runs on all nodes.

Some cluster administrators require the ability to configure DNS not to run on
certain nodes. For example, security policies may prohibit communication
between certain pairs of nodes; a DNS query from an arbitrary Pod on some node A
to the DNS Pod on some other node B might fail if some security policy prohibits
communication between node A and node B.

Splitting CoreDNS and management of `/etc/hosts` into separate DaemonSets makes
it possible to remove the blanket toleration for all taints from the CoreDNS
DaemonSet while keeping the blanket toleration on the DaemonSet that manages
`/etc/hosts`. Splitting the DaemonSet also makes it possible to support a
custom node selector on the CoreDNS DaemonSet.

Another advantage of using separate DaemonSets is that the DaemonSet that
updates `/etc/hosts` can use the host network to avoid consuming SR-IOV devices
on nodes that have Smart NICs.

### Goals

1. Separate CoreDNS and management of `/etc/hosts` into separate DaemonSets.
2. Enable cluster administrators to control where the CoreDNS DaemonSet is scheduled.

### Non-Goals

1. Enable cluster administrators to control the placement of the DaemonSet that manages `/etc/hosts`.
2. Enforce security policies.

## Proposal

This enhancement has two distinct parts. First, the DNS operator, which manages
the "dns-default" DaemonSet, is modified to manage an additional "node-resolver"
DaemonSet, and the "dns-node-resolver" container, which manages `/etc/hosts`, is
moved from the "dns-default" DaemonSet to a new "node-resolver" DaemonSet. As
part of this change, the toleration for all taints is removed from the
"dns-default" DaemonSet. The DaemonSet that manages `/etc/hosts` has blanket
toleration and uses the host network. From the cluster administrator's
perspective, this DaemonSet split is an internal change.
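
For illustration only, the following is a minimal sketch (not the operator's
actual manifest code; the label key and function name are assumptions) of the
pod-spec fields that would give the "node-resolver" DaemonSet its blanket
toleration and host networking:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// nodeResolverPodSpec sketches the pod-spec fields that give the
// "node-resolver" DaemonSet a blanket toleration and host networking.
// This is an illustrative assumption, not the operator's actual manifest.
func nodeResolverPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		// Use the host network so the pod does not consume an SR-IOV
		// device on nodes that have Smart NICs.
		HostNetwork: true,
		// Run on all Linux nodes.
		NodeSelector: map[string]string{"kubernetes.io/os": "linux"},
		// An empty key with operator Exists tolerates every taint, so
		// /etc/hosts is managed on every node.
		Tolerations: []corev1.Toleration{{Operator: corev1.TolerationOpExists}},
	}
}
```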

Second, a new API is provided to enable cluster administrators to specify the
desired placement of the "dns-default" DaemonSet's Pods, which, due to the first
change, only run CoreDNS and no longer must be scheduled to every node. This
new API is the user-facing part of this enhancement.

The DNS operator API is extended by adding an optional `NodePlacement` field
with type `DNSNodePlacement` to `DNSSpec`:

```go
// DNSSpec is the specification of the desired behavior of the DNS.
type DNSSpec struct {
    // ...

    // nodePlacement provides explicit control over the scheduling of DNS
    // pods.
    //
    // Generally, it is useful to run a DNS pod on every node so that DNS
    // queries are always handled by a local DNS pod instead of going over
    // the network to a DNS pod on another node. However, security policies
    // may require restricting the placement of DNS pods to specific nodes.
    // For example, if a security policy prohibits pods on arbitrary nodes
    // from communicating with the API, a node selector can be specified to
    // restrict DNS pods to nodes that are permitted to communicate with the
    // API. Conversely, if running DNS pods on nodes with a particular
    // taint is desired, a toleration can be specified for that taint.
    //
    // If unset, defaults are used. See nodePlacement for more details.
    //
    // +optional
    NodePlacement DNSNodePlacement `json:"nodePlacement,omitempty"`
}
```

The `DNSNodePlacement` type has fields to specify a node selector and
tolerations:

```go
// DNSNodePlacement describes the node scheduling configuration for DNS pods.
type DNSNodePlacement struct {
    // nodeSelector is the node selector applied to DNS pods.
    //
    // If empty, the default is used, which is currently the following:
    //
    //   beta.kubernetes.io/os: linux
    //
    // This default is subject to change.
    //
    // If set, the specified selector is used and replaces the default.
    //
    // +optional
    NodeSelector map[string]string `json:"nodeSelector,omitempty"`

    // tolerations is a list of tolerations applied to DNS pods.
    //
    // The default is an empty list. This default is subject to change.
    //
    // See https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
    //
    // +optional
    Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
}
```

By default, DNS Pods run on untainted Linux nodes. The `NodePlacement` field
enables cluster administrators to specify alternative parameters. For example,
the following DNS configuration specifies that DNS Pods should run only on
"infra" nodes
(i.e., nodes that have the "node-role.kubernetes.io/infra" label and taint):

```yaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  nodePlacement:
    nodeSelector:
      kubernetes.io/os: linux
      node-role.kubernetes.io/infra: ""
    tolerations:
    - effect: NoSchedule
      key: "node-role.kubernetes.io/infra"
      operator: Exists
```

Note that the new `NodeSelector` field has type `map[string]string`, rather
than `*metav1.LabelSelector`, to match the type of the DaemonSet's
`spec.template.spec.nodeSelector` field. Thus the user can specify label
key-value pairs to match but cannot specify arbitrary match expressions, as
using `metav1.LabelSelector` would allow.

### Validation

Omitting `spec.nodePlacement` or its subfields specifies the default behavior.

The API validates that `spec.nodePlacement.nodeSelector`, if specified, is a
valid node selector and that `spec.nodePlacement.tolerations`, if specified, is
a list of valid tolerations.
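
As a rough sketch of the kind of checks involved (not the actual API
validation code; the helper name `validateNodeSelector` is made up for
illustration), label keys and values in a node selector can be checked with
the upstream apimachinery validation helpers:

```go
package sketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation"
)

// validateNodeSelector illustrates the shape of the node-selector check: every
// key must be a qualified label name and every value a valid label value.
func validateNodeSelector(nodeSelector map[string]string) error {
	for k, v := range nodeSelector {
		if errs := validation.IsQualifiedName(k); len(errs) != 0 {
			return fmt.Errorf("invalid node selector key %q: %v", k, errs)
		}
		if errs := validation.IsValidLabelValue(v); len(errs) != 0 {
			return fmt.Errorf("invalid node selector value %q: %v", v, errs)
		}
	}
	return nil
}
```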

### User Stories

#### As a cluster administrator, I must comply with a security policy that prohibits communication among worker nodes

To satisfy this use-case, the cluster administrator can specify a node selector
that includes only control-plane nodes, using the new
`spec.nodePlacement.nodeSelector` API field as follows:

```yaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  nodePlacement:
    nodeSelector:
      kubernetes.io/os: linux
      node-role.kubernetes.io/master: ""
```

#### As a cluster administrator, I want to allow DNS Pods to run on nodes that have a taint that has key "dns-only" and effect `NoSchedule`

To satisfy this use-case, the cluster administrator can specify a toleration for
the taint in question as follows:

```yaml
apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  nodePlacement:
    tolerations:
    - effect: NoSchedule
      key: "dns-only"
      operator: Exists
```

### Implementation Details

Implementing this enhancement requires changes in the following repositories:

* openshift/api
* openshift/cluster-dns-operator

The DNS operator is modified to manage both the "dns-default" DaemonSet and the
"node-resolver" DaemonSet, both in the "openshift-dns" namespace. If the
"dns-default" DaemonSet already exists, the "dns-node-resolver" container is
removed from it, as are any tolerations and node selectors that are not
configured through the new API. The operator is modified to apply the
configured tolerations and node selectors to the "dns-default" DaemonSet. The
"node-resolver" DaemonSet is configured to tolerate all taints (as the
"dns-default" DaemonSet does in OpenShift 4.7 and earlier) and to run on all
Linux nodes.
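
The following is a simplified sketch, assuming the `DNSNodePlacement` type is
added to `github.com/openshift/api/operator/v1` as proposed above, of how the
operator might apply the configured placement to the "dns-default" DaemonSet
while falling back to the defaults (the function name and structure are
illustrative, not the operator's actual code):

```go
package sketch

import (
	operatorv1 "github.com/openshift/api/operator/v1"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// applyNodePlacement copies the configured node placement onto the
// "dns-default" DaemonSet's pod template, using the documented defaults when
// the corresponding field is unset.
func applyNodePlacement(ds *appsv1.DaemonSet, placement operatorv1.DNSNodePlacement) {
	// Default node selector (currently beta.kubernetes.io/os: linux per the
	// API godoc above; subject to change). A configured selector replaces
	// the default rather than being merged with it.
	nodeSelector := map[string]string{"beta.kubernetes.io/os": "linux"}
	if len(placement.NodeSelector) > 0 {
		nodeSelector = placement.NodeSelector
	}
	ds.Spec.Template.Spec.NodeSelector = nodeSelector

	// Default tolerations: an empty list. A configured list replaces it.
	var tolerations []corev1.Toleration
	if len(placement.Tolerations) > 0 {
		tolerations = placement.Tolerations
	}
	ds.Spec.Template.Spec.Tolerations = tolerations
}
```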

### Risks and Mitigations

A cluster administrator could configure a node selector, or taint all nodes, in
such a way that DNS Pods could not be scheduled to any node, rendering the DNS
service unavailable. Because the DNS service is critical to other cluster
components, including OAuth, it could then be impossible for the cluster
administrator to fix the misconfigured DNS Pod placement parameters.

As a mitigation for this risk, this enhancement adds logic in the DNS operator
to validate the desired node-placement parameters, before applying them, by
listing nodes and verifying that at least one matches the specified criteria.
If the desired node-placement parameters would prevent any DNS Pods from being
scheduled to any node, the operator does not apply the new parameters to the
DaemonSet.
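
A minimal sketch of that check, assuming the operator already has the node
list in hand (the function names are illustrative, and a more faithful check
might also skip `PreferNoSchedule` taints):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// placementMatchesSomeNode reports whether at least one node matches the
// desired node selector and has all of its taints tolerated. The operator
// would decline to apply placement parameters for which this returns false.
func placementMatchesSomeNode(nodes []corev1.Node, nodeSelector map[string]string, tolerations []corev1.Toleration) bool {
	selector := labels.SelectorFromSet(nodeSelector)
	for _, node := range nodes {
		if !selector.Matches(labels.Set(node.Labels)) {
			continue
		}
		if allTaintsTolerated(node.Spec.Taints, tolerations) {
			return true
		}
	}
	return false
}

// allTaintsTolerated reports whether every taint on a node is covered by some
// toleration.
func allTaintsTolerated(taints []corev1.Taint, tolerations []corev1.Toleration) bool {
	for i := range taints {
		tolerated := false
		for j := range tolerations {
			if tolerations[j].ToleratesTaint(&taints[i]) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}
```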

This mitigation has the drawback that it cannot prevent DNS Pods from being
removed if the cluster administrator removes, relabels, or taints nodes where
DNS Pods are already scheduled.

## Design Details

### Test Plan

Unit tests are added to verify the functionality of the new API.

Additionally, an end-to-end test is added that configures a node selector to
select master nodes and verifies that the operator implements the change, then
configures a node selector to select no nodes and verifies that the operator
ignores the change.
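
For illustration, the polling step of such a test might look like the
following sketch (the client setup, namespace and DaemonSet names, interval,
and timeout are assumptions, not the actual test code):

```go
package sketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDNSNodeSelector polls the "dns-default" DaemonSet until its pod
// template carries the expected node selector. In the "selects no nodes"
// case, the test instead asserts that this times out because the operator
// ignores the change.
func waitForDNSNodeSelector(ctx context.Context, client kubernetes.Interface, expected map[string]string) error {
	return wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		ds, err := client.AppsV1().DaemonSets("openshift-dns").Get(ctx, "dns-default", metav1.GetOptions{})
		if err != nil {
			return false, nil // tolerate transient errors and retry
		}
		for k, v := range expected {
			if ds.Spec.Template.Spec.NodeSelector[k] != v {
				return false, nil
			}
		}
		return true, nil
	})
}
```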

### Graduation Criteria

N/A.

### Upgrade / Downgrade Strategy

On upgrade, the DNS operator removes the "dns-node-resolver" container from the
existing "dns-default" DaemonSet and creates a new "node-resolver" DaemonSet.

On downgrade, the DNS operator leaves the "node-resolver" DaemonSet running,
which results in each node's having two "dns-node-resolver" containers: one in
the "dns-default" DaemonSet and one in the "node-resolver" DaemonSet. However,
both "dns-node-resolver" containers write the same content to `/etc/hosts`, and
they write the file atomically, so the redundant updates should not cause
conflicts.
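
For illustration, the usual write-temp-then-rename pattern is one way to
achieve this kind of atomic replacement (a generic sketch only, not the actual
"dns-node-resolver" implementation):

```go
package sketch

import (
	"os"
	"path/filepath"
)

// writeFileAtomically writes data to a temporary file and then renames it over
// the target, so readers always see either the old or the new content, never a
// partial write. This is why two writers producing identical content do not
// conflict.
func writeFileAtomically(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".hosts-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup; a no-op after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// rename(2) atomically replaces the target when both paths are on the
	// same filesystem.
	return os.Rename(tmp.Name(), path)
}
```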

### Version Skew Strategy

N/A.

## Implementation History

- 2018-10-05, in OCP 4.0, [openshift/cluster-dns-operator#34 update resources to
avoid openshift cycles by
deads2k](https://github.com/openshift/cluster-dns-operator/pull/34) added a
blanket toleration for all taints.
- 2019-11-06, in OCP 4.3, [openshift/cluster-dns-operator#140 Bug 1753059: Don't
start DNS on NotReady nodes by
ironcladlou](https://github.com/openshift/cluster-dns-operator/pull/140)
changed the blanket toleration to a toleration for a narrower set of taints in
order to avoid scheduling the DNS Pod on nodes without networking.
- 2020-05-29, in OCP 4.5, [openshift/cluster-dns-operator#171 Bug 1813479:
Tolerate all taints by
Miciah](https://github.com/openshift/cluster-dns-operator/pull/171) reverted
#140 and restored the blanket toleration. This change was then backported to
OCP 4.4 with
[#179](https://github.com/openshift/cluster-dns-operator/pull/179) and to OCP
4.3 with [#186](https://github.com/openshift/cluster-dns-operator/pull/186),
with the result that DNS Pods tolerate all taints in the latest z-stream
release of every OpenShift release up to and including OpenShift 4.7.
- In OCP 4.8, [openshift/cluster-dns-operator#209 Add node-resolver
daemonset](https://github.com/openshift/cluster-dns-operator/pull/209) splits
the DNS DaemonSet into two DaemonSets and implements the DNS Pod placement
API.

## Alternatives

Approaches to configure the DNS service to prefer a node-local DNS Pod have been
investigated. However, preferring a node-local endpoint would not prevent
inter-node traffic if no node-local endpoint were available (for example, during
a rolling upgrade of the DNS Pods) and would not address other use-cases where
a cluster administrator does not want DNS Pods running on certain nodes.

Configuring the container runtime to use the cluster DNS service has been
considered. If the container runtime used the cluster DNS service, then no
entry for the cluster image registry would be needed in `/etc/hosts`, and the
"dns-node-resolver" container could be removed entirely. However, avoiding a
bootstrap problem would be difficult with this approach: The container runtime
requires DNS to pull images, but the DNS operator and DNS Pods cannot start
until the container runtime has pulled their images.

As described in "Upgrade / Downgrade Strategy", downgrades from OpenShift 4.8 to
4.7 restore the "dns-node-resolver" container in the "dns-default" DaemonSet and
leave the "node-resolver" DaemonSet with its own "dns-node-resolver" container
running, which redundantly updates `/etc/hosts`. We could backport a change to
OpenShift 4.7 to clean up any "node-resolver" DaemonSet that a downgrade may
have left. Even if this enhancement is implemented as currently planned, this
option to backport logic to clean up any old DaemonSet remains available to us
should the need for it arise.

As described in "Risks and Mitigations", the operator validates any provided
placement parameters before applying them to the DNS DaemonSet. However, an
administrator could configure valid placement parameters (meaning the parameters
allow some DNS Pods to be scheduled) and then remove, relabel, or taint nodes in
such a way that no DNS Pods would remain running. We considered making the DNS
operator revert user-specified placement parameters to the default node selector
and a blanket toleration for all taints if it detected that no DNS Pods were
scheduled to any node. However, this would be difficult to do in a safe and
transparent way. We may consider it as a follow-on improvement.