---
title: Volume Scheduling Limits
authors:
- "@jsafrane"
owning-sig: sig-storage
participating-sigs:
- sig-scheduling
reviewers:
- "@bsalamat"
- "@gnufied"
- "@davidz627"
approvers:
- "@bsalamat"
- "@davidz627"
editor: TBD
creation-date: 2019-04-08
last-updated: 2019-04-08
status: implementable
see-also:
- https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md
replaces: https://github.com/kubernetes/enhancements/pull/730
superseded-by:
---

# Volume Scheduling Limits

## Table of Contents

* [Volume Scheduling Limits](#volume-scheduling-limits)
  * [Table of Contents](#table-of-contents)
  * [Release Signoff Checklist](#release-signoff-checklist)
  * [Summary](#summary)
  * [Motivation](#motivation)
    * [Goals](#goals)
    * [Non-Goals](#non-goals)
  * [Proposal](#proposal)
    * [New API](#new-api)
    * [User Stories](#user-stories)
    * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    * [Risks and Mitigations](#risks-and-mitigations)
  * [Design Details](#design-details)
    * [Test Plan](#test-plan)
    * [Graduation Criteria](#graduation-criteria)
      * [Alpha -> Beta Graduation](#alpha---beta-graduation)
      * [Beta -> GA Graduation](#beta---ga-graduation)
      * [Removing a deprecated flag](#removing-a-deprecated-flag)
    * [Upgrade / Downgrade / Version Skew Strategy](#upgrade--downgrade--version-skew-strategy)
  * [Implementation History](#implementation-history)
  * [Alternatives](#alternatives)

## Release Signoff Checklist

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those approvers are no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).

## Summary

The number of volumes of a certain type that can be attached to a node should be easy to configure and should be based on the node type. This proposal implements dynamic attachable volume limits on a per-node basis rather than the cluster-wide defaults that exist today. It also adds a way of configuring volume limits for CSI volumes.

This proposal replaces [#730](https://github.com/kubernetes/enhancements/pull/730) and integrates volume limits for in-tree volumes (AWS EBS, GCE PD, Azure DD, OpenStack Cinder) and CSI into one predicate. As a result, in-tree volumes and the corresponding CSI driver can share the same volume limit.

## Motivation

The current scheduler predicates for scheduling pods with volumes are based on `node.status.capacity` and `node.status.allocatable`. This works well for the hardcoded volume limit predicates on AWS (`MaxEBSVolumeCount`), GCE (`MaxGCEPDVolumeCount`), Azure (`MaxAzureDiskVolumeCount`) and OpenStack (`MaxCinderVolumeCount`).

It is problematic for CSI (`MaxCSIVolumeCountPred`), as outlined in [#730](https://github.com/kubernetes/enhancements/pull/730):

- `ResourceName` is limited to 63 characters. We must prefix `ResourceName` with a unique string (such as `attachable-volumes-csi-<driver name>`) so it cannot collide with existing resources like `cpu` or `memory`. But `<driver name>` itself can be up to 63 characters long, so we ended up using SHA hashes of the driver name to keep the `ResourceName` unique, which is not human readable.
- A CSI driver cannot share its limits with an in-tree volume plugin, e.g. when running pods with AWS EBS in-tree volumes and the `ebs.csi.aws.com` CSI driver on the same node.
- `node.status` size increases with each installed CSI driver. Node objects are big enough already.

### Goals

- Users can use PVs with both in-tree volume plugins and CSI, and they will share their limits. There is only one scheduler predicate that handles both kinds of volumes.

- Existing predicates for in-tree volumes `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` are removed (with deprecation period).
- When both a deprecated in-tree predicate and the CSI predicate are enabled, only one of them does useful work and the other is a no-op, to save CPU.

- Scheduler does not increase its CPU consumption. Any regression must be approved by sig-scheduling.

### Non-Goals


## Proposal

* Track volume limits for both in-tree volume plugins and CSI drivers in `CSINode` objects instead of `Node`.
  * To get rid of the prefix + SHA for the `ResourceName` of CSI volumes.
  * So an in-tree volume plugin can share limits with a CSI driver that uses the same storage backend.

* Kubelet will create the `CSINode` instance during initial node registration, together with the `Node` object.
  * Limits of each in-tree volume plugin will be added to `CSINode.status.allocatable`, as reported by the plugin.
  * The name of the CSI driver corresponding to an in-tree volume plugin will be used as the key for that plugin, so the limits are shared between the CSI driver and the in-tree volume plugin when both run on the same node.
  * If a CSI driver is registered for an in-tree volume plugin and reports a different volume limit than the in-tree volume plugin, the limit reported by the CSI driver is used and kubelet logs a warning.
  * Users may NOT change `CSINode.allocatable` to override volume plugin / CSI driver values, e.g. to "reserve" some attachments for the operating system. Kubelet will periodically reconcile `CSINode` and overwrite the value.
  * Kubelet will continue filling `Node.status.allocatable` and `Node.status.capacity` for both in-tree and CSI volumes during the deprecation period. After the deprecation period, it will stop filling them completely.
  * The scheduler will ignore `Node.status.allocatable` and `Node.status.capacity` if `CSINode.status.allocatable` is present.
  * To support old kubelets, `MaxCSIVolumeCountPred` (or any deprecated volume limit predicate) falls back to `Node.status.allocatable` when `CSINode` does not contain any limits for a volume plugin (see the sketch after this list).
  * After the deprecation period, the scheduler won't schedule any pods that use volumes to a node with a missing `CSINode` instance. This is expected to happen only during node registration, when `Node` exists but `CSINode` doesn't yet, and it self-heals quickly.
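
To make the lookup order concrete, here is a minimal sketch in Go. It only illustrates the fallback described above; the type and function names are stand-ins, not the real scheduler or API types.

```go
package main

import "fmt"

// Illustrative stand-ins for the proposed CSINode.status.allocatable and the
// deprecated Node.status.allocatable; the real Kubernetes types differ.
type csiNodeStatus struct {
    Allocatable map[string]int64 // CSI driver name -> volume limit
}

type nodeStatus struct {
    Allocatable map[string]int64 // deprecated per-node resource name -> limit
}

// volumeLimitFor returns the volume limit to use for a CSI driver name
// (in-tree plugins are translated to their CSI driver name first).
// It prefers CSINode and falls back to Node for nodes running an old kubelet.
func volumeLimitFor(driver, deprecatedResource string, csi *csiNodeStatus, node *nodeStatus) (int64, bool) {
    if csi != nil {
        if limit, ok := csi.Allocatable[driver]; ok {
            return limit, true // 0 means no volumes of this kind can be attached
        }
    }
    if limit, ok := node.Allocatable[deprecatedResource]; ok {
        return limit, true // fallback for kubelets that predate CSINode limits
    }
    return 0, false // no limit known for this driver on this node
}

func main() {
    csi := &csiNodeStatus{Allocatable: map[string]int64{"ebs.csi.aws.com": 39}}
    node := &nodeStatus{Allocatable: map[string]int64{"attachable-volumes-aws-ebs": 39}}
    fmt.Println(volumeLimitFor("ebs.csi.aws.com", "attachable-volumes-aws-ebs", csi, node))
}
```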

* The existing scheduler predicates `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` are already deprecated.
  * If any of them is enabled together with `MaxCSIVolumeCountPred`, the deprecated predicate will do nothing (`MaxCSIVolumeCountPred` does the job of counting both in-tree and CSI volumes).
  * The deprecated predicates will be active only when the `MaxCSIVolumeCountPred` predicate is disabled.
  * This way, we save CPU by running only one volume limit predicate during the deprecation period (see the sketch after this list).
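
The no-op behavior could look roughly like the sketch below. This is a hedged illustration with made-up names and signatures, not the actual predicate code.

```go
package main

import "fmt"

// deprecatedEBSPredicate sketches the deprecated in-tree volume-count check.
// When the CSI volume-count predicate is enabled, it returns "fits" right away
// so that volumes are counted only once, by MaxCSIVolumeCountPred.
func deprecatedEBSPredicate(csiPredicateEnabled bool, requested, attached, limit int) bool {
    if csiPredicateEnabled {
        return true // no-op: MaxCSIVolumeCountPred also counts in-tree EBS volumes
    }
    return attached+requested <= limit
}

func main() {
    fmt.Println(deprecatedEBSPredicate(true, 5, 38, 39))  // true: deferred to the CSI predicate
    fmt.Println(deprecatedEBSPredicate(false, 5, 38, 39)) // false: over the node's EBS limit
}
```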

* The scheduler must check both `CSINode.spec.drivers` and `CSINode.status.allocatable` to determine whether a driver is installed at all and whether a limit is configured for it. Checking `CSINode.status.allocatable` alone won't tell whether the driver is installed, or whether its limit was evaluated and deliberately not set in `CSINode.status.allocatable`.

TODO: what about in-tree drivers? They won't have an item in `CSINode.spec.drivers`! Use a pointer in `CSINode.status.allocatable` (see the sketch after this list):
* `"ebs.csi.aws.com": nil`: the plugin / driver exists and has no limit (can attach any number of volumes).
* `"ebs.csi.aws.com": 0`: the limit is zero (no volume can be attached).
* `"ebs.csi.aws.com"` key does not exist: the driver is not installed (incl. the in-tree one).
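
If the `allocatable` values become pointers as suggested above, a consumer (e.g. the scheduler predicate) could distinguish the three states as in the following sketch. It is only an illustration under that assumption; the type shapes and names are not the real API.

```go
package main

import "fmt"

// limitState interprets a pointer-valued entry in a map shaped like the
// proposed CSINode.status.allocatable. Names and types are illustrative only.
func limitState(allocatable map[string]*int64, driver string) string {
    limit, present := allocatable[driver]
    switch {
    case !present:
        return "driver not installed on this node (incl. the in-tree plugin)"
    case limit == nil:
        return "driver installed, no limit: any number of volumes can be attached"
    case *limit == 0:
        return "driver installed, limit is zero: no volumes can be attached"
    default:
        return fmt.Sprintf("driver installed, up to %d volumes can be attached", *limit)
    }
}

func main() {
    limit := int64(39)
    allocatable := map[string]*int64{
        "ebs.csi.aws.com":       &limit,
        "pd.csi.storage.gke.io": nil,
    }
    fmt.Println(limitState(allocatable, "ebs.csi.aws.com"))
    fmt.Println(limitState(allocatable, "pd.csi.storage.gke.io"))
    fmt.Println(limitState(allocatable, "cinder.csi.openstack.org"))
}
```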

CSINode example:

```
apiVersion: storage.k8s.io/v1beta1
kind: CSINode
metadata:
  name: ip-172-18-4-112.ec2.internal
spec:
status:
  allocatable:
    "ebs.csi.aws.com": 39
```

### New API

CSINode gets `Status` struct with `Allocatable`, holding limit of volumes for each volume plugin and CSI driver that can be scheduled to the node.

```go
type CSINode struct {
    ...

    // status is the status of CSINode.
    Status CSINodeStatus `json:"status" protobuf:"bytes,3,opt,name=status"`
}

// VolumeLimits is a map of CSI driver name -> maximum count of volumes for the driver on the node.
// For in-tree volume plugins, the name of the corresponding CSI driver is used.
// The value can be either:
// - A positive integer: that's the volume limit.
// - Zero: such volumes cannot be used on the node.
// - Driver missing in the map: there is no limit of volumes for the driver.
type VolumeLimits map[string]int64

// CSINodeStatus holds information about the status of all CSI drivers installed on a node.
type CSINodeStatus struct {
    // allocatable is a list of volume limits for each volume plugin and CSI driver on the node.
    // +patchMergeKey=name
    // +patchStrategy=merge
    Allocatable []VolumeLimits `json:"allocatable" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,1,rep,name=allocatable"`
}
```

TODO: better name for `allocatable`. `VolumeAttachLimit` implies only attachable volumes, which is too restrictive. `VolumeLimit`?

TODO: add `capacity` + `--system-reserved-volumes=` and `--kube-reserved-volumes=` so the admin can reserve some capacity? Or can we expect that CSI drivers report a limit that is already reduced by the amount reserved for the system?

### User Stories

TODO: fill? Is there any interesting user story?

### Implementation Details/Notes/Constraints

The [CSI migration library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib) is used to find the CSI driver name for in-tree volume plugins. This CSI driver name is used as the key in the `CSINode.status.capacity` and `CSINode.status.allocatable` lists.
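
For illustration, such a lookup might look like the sketch below. The package-level helper `GetCSINameFromInTreeName` and its import path reflect the library at the time of writing and may differ between releases, so treat this as an assumption rather than the final implementation.

```go
package main

import (
    "fmt"

    csitranslation "k8s.io/csi-translation-lib"
)

func main() {
    // Translate the in-tree AWS EBS plugin name to the CSI driver name that
    // is used as the key in CSINode.status.allocatable (e.g. "ebs.csi.aws.com").
    driverName, err := csitranslation.GetCSINameFromInTreeName("kubernetes.io/aws-ebs")
    if err != nil {
        fmt.Println("no CSI counterpart for this in-tree plugin:", err)
        return
    }
    fmt.Println(driverName)
}
```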

### Risks and Mitigations

* This KEP depends on [CSI migration library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib). It can happen that CSI migration is redesigned / cancelled.
  * Countermeasure: [CSI migration](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md) and this KEP should graduate together.

## Design Details

Existing feature gate `AttachVolumeLimit` will be re-used for implementation of this KEP. The feature is already beta and is enabled by default.

### Test Plan

* Run the [scheduler benchmark](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scheduling/scheduler_benchmarking.md) with a matrix composed of:
  * Predicates:
    * All volume predicates enabled.
    * Only the deprecated `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` predicates enabled.
    * Only the `MaxCSIVolumeCountPred` predicate enabled.
  * API objects:
    * Both CSINode and Node containing `status.allocatable` for a volume plugin (to simulate kubelet during the deprecation period).
    * Only CSINode containing `status.allocatable` for a volume plugin (to simulate kubelet after the deprecation period).
    * Only Node containing `status.allocatable` for a volume plugin (to simulate an old kubelet).
  * Test results should ideally be the same as before this KEP.
    * Any deviation needs to be approved by sig-scheduling.

* Run e2e tests and kubelet version skew tests to check that the scheduler picks the right values from CSINode or Node.

* Add an e2e test that runs pods with both in-tree volumes and the CSI driver for the same storage backend and checks that they share the same volume limits.

### Graduation Criteria

#### Alpha -> Beta Graduation

N/A (`AttachVolumeLimit` feature is already beta).

#### Beta -> GA Graduation

It must graduate together with CSI migration.

#### Removing a deprecated flag

- Announce deprecation and support policy of the existing flag
- Two versions passed since introducing the functionality which deprecates the flag (to address version skew)
- Address feedback on usage/changed behavior, provided on GitHub issues
- Deprecate the flag

### Upgrade / Downgrade / Version Skew Strategy


During upgrade, downgrade or version skew, kubelet may be older than the scheduler. Such a kubelet will not fill `CSINode.status` with volume limits; it will fill volume limits into `Node.status` instead. The scheduler must fall back to `Node.status` when `CSINode` is not available or its `status` does not contain a volume plugin / CSI driver.

## Implementation History


## Alternatives

In https://github.com/kubernetes/enhancements/pull/730 we tried to merge volume limits into `Node.status.capacity` and `Node.status.attachable`. We discovered these issues:

* We cannot use a plain CSI driver name as the resource name in `Node.status.attachable`, as it could collide with other resources (e.g. "memory"), so we added a volume-specific prefix.
* Since a CSI driver name can be [up to 63 characters long](https://github.com/container-storage-interface/spec/blob/master/spec.md#getplugininfo), the prefix + driver name cannot fit into the 63-character resource name limit. We ended up hashing the driver name to save space.

By moving volume limits to CSINode, we fix both issues.
