Add ClusterOperator Scheduling #650

Merged
108 changes: 108 additions & 0 deletions enhancements/scheduling/clusteroperator-scheduling.md
---
title: clusteroperator-scheduling
authors:
- "@michaelgugino"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2021-02-12
last-updated: 2021-02-12
status: implementable
---

# ClusterOperator Scheduling


## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement ensures that system components, especially ClusterOperators,
have the correct set of tolerations.
Member:

Possibly PR with adjustments to this existing enhancement's content, instead of creating a new, parallel enhancement?

Contributor Author:

I should have used grep! I looked in the scheduling folder but didn't find this obviously.


## Motivation

The motivation is to create an easy-to-follow reference for developing and
updating ClusterOperators and similar system components.

### Goals

* List best-practices for tolerations

### Non-Goals

* Non-standard cluster profiles such as single node
* Adding all practices here today; we can add them over time.

## Proposal

### User Stories

#### Story 1

As a cluster administrator, I want upgrades to proceed safely and without
intervention.

#### Story 2

As an OpenShift developer, I want to be able to reference a guide to ensuring
my ClusterOperators are properly defined for scheduling purposes.

### Implementation Details/Notes/Constraints [optional]

Not all elements of scheduling are defined here. There are several such
elements, and we can update this document as needed.

### Risks and Mitigations

Not all ClusterOperators behave identically. Careful consideration of each
ClusterOperator is paramount.

## Design Details

### NoSchedule Tolerations

#### node.kubernetes.io/unschedulable

There are various `NoSchedule` taints that are commonly applied to nodes.
ClusterOperators should not tolerate NoSchedule taints with the key
`node.kubernetes.io/unschedulable`.

This is a special key that is applied when a node is cordoned, such as during
a drain. Since this is a NoSchedule taint, existing pods will not be evicted
automatically; the taint only prevents future scheduling.

During upgrades, we cordon and drain all nodes. Sometimes the scheduler will
place pods tolerating NoSchedule onto a node more quickly than the drain loop
for that node can complete. This results in hot-looping between the drain and
the scheduler.

Any other NoSchedule toleration is fine.
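
As a rough illustration only (not part of the proposal), a ClusterOperator's pod
spec following this guidance might carry a toleration set like the sketch below,
expressed with the `k8s.io/api/core/v1` types. The control-plane key and the
grace period are assumptions chosen for the example; the important point is the
deliberate absence of a `node.kubernetes.io/unschedulable` toleration.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// exampleTolerations sketches a toleration set for a ClusterOperator pod that
// follows the guidance above: it tolerates the control-plane NoSchedule taint
// and the standard NoExecute taints, but deliberately omits any toleration for
// node.kubernetes.io/unschedulable, so cordoned nodes can finish draining.
func exampleTolerations() []corev1.Toleration {
	gracePeriod := int64(120) // illustrative value only

	return []corev1.Toleration{
		{
			// Allow scheduling onto control-plane nodes.
			Key:      "node-role.kubernetes.io/master",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoSchedule,
		},
		{
			// Survive brief readiness blips instead of being evicted immediately.
			Key:               "node.kubernetes.io/not-ready",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &gracePeriod,
		},
		{
			Key:               "node.kubernetes.io/unreachable",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &gracePeriod,
		},
		// Note: no toleration for node.kubernetes.io/unschedulable (NoSchedule),
		// so the scheduler will not place this pod back onto a cordoned node.
	}
}

func main() {
	for _, t := range exampleTolerations() {
		fmt.Printf("%s (%s)\n", t.Key, t.Effect)
	}
}
```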
Contributor:

In the spirit of Chesterton’s fence, I’d like to see a review of the components that thought they should set this and why (representative sample).

That ensures we correct misconceptions about when the toleration is appropriate.

If it was accidental, what would have prevented the accident?

Contributor Author:

The one I know of for sure is the CVO. I don't have data on whether anything else is doing it. I will take a look.

Contributor Author:

Looks like the CVO is the only non-DaemonSet pod that has this toleration, at least from a typical CI run. DaemonSets get that toleration dynamically, so everything else looks correct.

Contributor:

There was a discussion some time ago on whether the CVO should tolerate the NoSchedule taint, and the answer was yes. Has something changed now? I was always in favour of specific tolerations instead of blanket tolerations. How about adding a taint called drain-in-progress to the node with the NoSchedule effect? That would not be tolerated by the CVO, which should help the drain proceed smoothly.
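
For concreteness, the taint being suggested might look roughly like the sketch
below. The `drain-in-progress` key is hypothetical (it is the reviewer's
proposal, not an existing well-known taint), and the machinery that would apply
it is out of scope here.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// addDrainTaint sketches the reviewer's idea: a dedicated NoSchedule taint
// that a drain controller would place on the node and that the CVO would
// simply not tolerate. The key is hypothetical; no such well-known taint
// exists today.
func addDrainTaint(node *corev1.Node) {
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "example.openshift.io/drain-in-progress", // hypothetical key
		Effect: corev1.TaintEffectNoSchedule,
	})
	// Persisting the change would still require an Update call against the API.
}

func main() {
	node := &corev1.Node{}
	addDrainTaint(node)
	fmt.Println(node.Spec.Taints)
}
```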

Contributor Author:

Thanks for sharing that discussion. There is more than one 'NoSchedule' taint. Based on the discussion you linked, it seems node.kubernetes.io/unschedulable was included by mistake. It is not discussed directly; the discussion revolved around unreachable and unready nodes. node.kubernetes.io/unschedulable does not affect those scenarios and is only used for administrative cordons (e.g., node.spec.unschedulable=True).

Contributor:

I am not entirely sure of that; IIUC, I got a firm yes on unschedulable. However, if you think this is not needed, I can change the tolerations of the CVO.

Contributor Author:

There are many taints that have the 'NoSchedule' effect, but there is only one 'node.kubernetes.io/unschedulable' taint with the 'NoSchedule' effect. It appears that in the previous discussion only the NoSchedule effect was considered, rather than the specific taints (at the least, node.kubernetes.io/unschedulable was not itself considered directly).

Contributor:

Any NoSchedule taint != the unschedulable condition. The unschedulable condition == the node.kubernetes.io/unschedulable taint. unschedulable is a node condition that is represented via a special, well-known taint called node.kubernetes.io/unschedulable.

Contributor Author:

I'm trying to use precise k8s language. There are taints, and those taints have effects. The effect we're discussing is NoSchedule. That effect is caused by one or more taints.

I did not see the taint node.kubernetes.io/unschedulable discussed in the link you provided. There were only discussions around the NoSchedule effect.
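
For illustration only, here is the distinction in `core/v1` terms: both taints
below carry the NoSchedule effect, but only the first is the
`node.kubernetes.io/unschedulable` taint applied by cordoning. The control-plane
key is shown purely for contrast.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Two different taints that share the NoSchedule effect. Only the first is
	// the node.kubernetes.io/unschedulable taint that cordoning applies.
	taints := []corev1.Taint{
		{Key: "node.kubernetes.io/unschedulable", Effect: corev1.TaintEffectNoSchedule},
		{Key: "node-role.kubernetes.io/master", Effect: corev1.TaintEffectNoSchedule},
	}
	for _, t := range taints {
		fmt.Printf("key=%s effect=%s\n", t.Key, t.Effect)
	}
}
```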


### Open Questions [optional]



### Test Plan

Contributor:

I assumed we would enforce this via a test? Or did I miss that above?

Contributor:

I think we do

Member:

Looks like the `ensure control plane operators do not make themselves unevictable` test does not cover node.kubernetes.io/unschedulable today? Just unreachable and not-ready.

Mostly, everything should work identically in CI.
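
As a sketch of how the reviewers' point about enforcement could be covered, a
conformance-style check might look roughly like the following. This assumes a
client-go clientset initialized elsewhere in the suite; the test name is
illustrative, and the DaemonSet exclusion reflects the discussion above
(DaemonSets receive the toleration dynamically).

```go
package conformance

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clientset is assumed to be initialized elsewhere in the test suite.
var clientset kubernetes.Interface

// TestNoUnschedulableTolerations is an illustrative sketch: it fails if any
// non-DaemonSet pod tolerates the node.kubernetes.io/unschedulable taint,
// which would let the scheduler place it back onto a cordoned node mid-drain.
func TestNoUnschedulableTolerations(t *testing.T) {
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		t.Fatalf("listing pods: %v", err)
	}

	for _, pod := range pods.Items {
		if len(pod.OwnerReferences) > 0 && pod.OwnerReferences[0].Kind == "DaemonSet" {
			continue // DaemonSets receive this toleration automatically.
		}
		for _, tol := range pod.Spec.Tolerations {
			if tol.Key == "node.kubernetes.io/unschedulable" {
				t.Errorf("pod %s/%s tolerates node.kubernetes.io/unschedulable",
					pod.Namespace, pod.Name)
			}
		}
	}
}
```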

## Implementation History


## Drawbacks

Adjusting scheduling parameters can cause undesired effects in some cases.

## Alternatives

Without these guidelines, we might face difficult situations for automated
operations such as upgrades.