KEP-1472: storage capacity: update drawbacks #3233
---
You should document here what kube-scheduler does in the absence/presence of the field, and how it relates to `maximumVolumeSize`.
But... I think we can discuss this in the code PR.
---
It's easier to discuss here because then I only need to change one file vs. many.
I added a section above about the behavior of kube-scheduler for the entire `CSIStorageCapacity` struct. That seemed like a more natural place to discuss the relationship between the different fields and how kube-scheduler handles an object. Are you saying that I should attach that to each field?
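For readers of this thread, here is an abbreviated Go paraphrase of the `CSIStorageCapacity` type under discussion (storage.k8s.io/v1beta1 at the time); the generated API reference is the authoritative version, and `ObjectMeta` plus some details are omitted:

```go
package storagev1beta1

import (
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CSIStorageCapacity, abbreviated. One object describes the capacity of
// one storage class in one topology segment.
type CSIStorageCapacity struct {
	// NodeTopology selects the topology segment this object describes.
	NodeTopology *metav1.LabelSelector

	// StorageClassName is the storage class this capacity applies to.
	StorageClassName string

	// Capacity is the loosely defined value reported by the CSI driver
	// in GetCapacityResponse.available_capacity.
	Capacity *resource.Quantity

	// MaximumVolumeSize is the largest size an individual volume may
	// currently have; kube-scheduler prefers it over Capacity when set.
	MaximumVolumeSize *resource.Quantity
}
```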
---
Ah, missed that. I think that is also part of the generated documentation in https://kubernetes.io/docs/reference/kubernetes-api/, so it should be fine.
---
Can you somehow quantify this in the KEP?
Could we consider that the conclusion of the testing?
Although:
That doesn't seem like a particularly difficult scenario. kube-scheduler wouldn't schedule more than 100 pods per node anyway, plus it spreads pods by default, so I wouldn't expect that many retries.
Can you try something like `--capacity=fast=20Gi`?
---
What conclusion? That scheduling pods with volumes is slower than scheduling pods without volumes by a factor of 1.58? That depends on so many other factors that I am not comfortable putting it into the KEP.
Also note that csi-driver-host-path has essentially no internal overhead for volumes. That is not realistic and (in this experiment) intentionally stresses the control plane and sidecar more than other drivers would.
Scheduling 10000 pods (100 per node, 100 nodes) with 10000 volumes is not difficult? That sounds like a lot of volumes to me.
And which volume size? If I keep it at 1Gi then I also need to reduce the number of pods, otherwise not all of them can be started. That will make the problem simpler, not harder.
I'm not sure what you are trying to achieve here.
---
After thinking about this I think I understand what you mean: your theory is that the 100 pods per node limit influences the outcome and therefore you want to test with 20 pods per node. Makes sense. It doesn't matter whether we reduce the capacity per node or increase the volume size, so I'll test with 5Gi volumes (simpler to configure).
Could it be that the Azure cloud is configured to allow more than 100 pods per node? If there really is such a limit, scheduling without storage capacity should have worked, but it didn't: retries were high for 10 nodes and the test timed out for 100 nodes.
---
Yes, I want to test the case where storage allows significantly fewer pods than the actual pod count limit.
I think the hard limit is 110 pods per node, unless they have a patched kubelet?
---
So with 100 pods per node I was still below that limit. Then the test timeout for 100 nodes without storage capacity tracking makes sense, because the scheduler may have tried to put more than 100 pods onto a single node.
I ran tests with 10 nodes and 20 pods per node. Now the setup without storage capacity tracking failed. The baseline was 29s; with volumes and capacity tracking it was 49s, a factor of 1.69, similar to what was observed before. See kubernetes-csi/csi-driver-host-path#351 for details.
---
Thanks.
The numbers look acceptable. Maybe a bigger, busier cluster could have more issues, but that's harder to prove.
The real problem will be when we need to scale up. I hope you can still bring this discussion back to CSI.
/lgtm
---
I don't think network-attached storage should be much of a problem.
When we are scaling up a nodepool or a nodegroup, we know which topology it will end up in, so we can query the `CSIStorageCapacity` that matches.
---
For scale up from zero, network-attached storage has a similar problem: because the CSI driver hasn't provided topology information yet for the non-existent nodes, the cluster admin must set labels manually in the configuration of the node pool.
---
I guess this is for the case where the nodepool is a new topology domain.
I think it should be the responsibility of the CSI driver to set that up even if no nodes exist. It's network-attached after all. It has to come from somewhere, and the driver should know about it. No?
---
But how does the CSI driver know that the information is needed? Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?
There's also a practical problem here: the CSI spec doesn't have support for listing topology segments for nodes where no CSI driver is running. If we want something like this, we need a new API.
---
Sure, why not? It may not work in every case but it seems like it could work in some?
---
Scalability and performance issues? It'll depend on the number of those additional segments, of course.
Yes. I just wanted to point out some of the details (conceptual and practical) that'll have to be addressed for this to work. I agree that it can work.
---
A separate issue is that Cluster Autoscaler simulates scheduling of all the pods (binpacking) before making any autoscaling decisions. The way that capacity is defined here (and the way kubernetes/autoscaler#3887 uses the same `CSIStorageCapacity` to represent the local storage of potentially hundreds of nodes) means, I think, that CA will not be able to calculate how many nodes are needed in any use case where available storage is the limiting factor for scheduling.
For example: assume I have 1000 pending pods that have relatively low cpu/memory requests, but each requires all the local storage of a node. CA would think all those pods can fit on just a few nodes (on a single node if the cpu/memory requests are low enough) and only add that many nodes. Admittedly, once the nodes are created some pods will be scheduled, CA will see that the remaining pods are unschedulable again, and trigger another (equally small) scale-up. So theoretically the situation will eventually recover. In practice the latency can be arbitrarily long, making the feature completely unusable in sufficiently large clusters. A toy illustration follows below.
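A minimal, invented model of that underestimate (this is not CA's actual binpacking code; all numbers are made up):

```go
package main

import "fmt"

// If reported storage capacity is treated as a per-volume maximum instead
// of a budget that simulated volumes draw down, the node-count estimate
// is driven by cpu/memory alone.
func main() {
	const (
		pendingPods     = 1000
		podCPUMillis    = 100
		nodeCPUMillis   = 4000 // 40 pods per node by cpu alone
		volumePerPodGiB = 100
		nodeStorageGiB  = 100 // each pod consumes a whole node's storage
	)

	// Estimate when storage is never decremented during simulation.
	byCPU := (pendingPods*podCPUMillis + nodeCPUMillis - 1) / nodeCPUMillis

	// Estimate when storage is the real limiting factor: one pod per node.
	byStorage := pendingPods * volumePerPodGiB / nodeStorageGiB

	fmt.Printf("simulated scale-up: %d nodes; actually needed: %d nodes\n",
		byCPU, byStorage)
}
```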
---
I understand that's how the CSI specification works and we may not be able to change it, but it seems that the capacity specification isn't strictly defined. We could say that for the purposes of autoscaling we treat it as the sum of all volumes that can be created. That is not a perfect prediction, as it doesn't take fragmentation etc. into account, but it seems much more realistic than assuming we can create an infinite number of volumes and the only limitation is on the size of each individual volume.
We could further explicitly say that any driver for which this assumption is too far off is incompatible with autoscaling. Given how popular autoscaling is in cloud environments, this can hopefully create some incentive for driver authors to interpret capacity that way.
---
My concern is that the API is even problematic for kube-scheduler itself. On a sudden scale-up, multiple pods will end up failing; then we need to wait for them to fail before creating replacement pods. If you are using the Job API, you could hit the `backoffLimit` and the Job would fail. Similarly, other third-party APIs (Argo, Airflow) are less resilient to failures.
I agree that such an interpretation would be better. Even if it's not a perfect prediction, it would at least generate fewer failed pods.
As the API stands today, I don't think I'm confident graduating it to GA, especially if changing the interpretation of a field is under consideration. Maybe we instead need to add a new field for the total capacity that the scheduler needs to respect in combination with the existing `.status.availableCapacity`. If the field is not set, it means that the capacity is unlimited, and we can assume that drivers who choose to leave it empty understand the consequences.
Do you think adding that field is possible @pohly? Do you think that's better for scheduling/autoscaling @MaciekPytel?
If you agree, we should go with a v1beta2.
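Purely as illustration, the floated field might look something like this; the name `TotalCapacity` is invented here and nothing like it was agreed on:

```go
// Hypothetical v1beta2 sketch; TotalCapacity is an invented name for the
// "total capacity" field discussed above and is not part of any
// Kubernetes API.
type CSIStorageCapacityV1beta2 struct {
	// ...the existing v1beta1 fields...

	// TotalCapacity would report the capacity as if no volumes had been
	// provisioned. Leaving it unset would mean "unlimited", and drivers
	// that do so are assumed to understand the consequences.
	TotalCapacity *resource.Quantity
}
```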
---
FYI @liggitt @roycaihw (API reviewers)
---
That's why I added a more strictly defined "maximum volume size" to the CSI spec. That is what Kubernetes uses if available, instead of the more loosely defined "capacity".
I don't think that making assumptions about how CSI drivers handle storage capacity is going to help. We've discussed that at length in SIG Storage and in CSI meetings, and the consensus there was that there are always some vendor-specific exceptions that cannot be handled in a generic way.
Failing how? They won't get scheduled unless their volumes were successfully created. Until then they remain in the queue and will eventually land on some node.
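To make that precedence concrete, a rough sketch (this is not the actual scheduler source, just the rule described above applied to the abbreviated struct from earlier):

```go
// volumeFits checks one topology segment: use the strictly defined
// MaximumVolumeSize when the driver reports it, otherwise fall back to
// the loosely defined Capacity.
func volumeFits(requested resource.Quantity, c *CSIStorageCapacity) bool {
	if c.MaximumVolumeSize != nil {
		return requested.Cmp(*c.MaximumVolumeSize) <= 0
	}
	return c.Capacity != nil && requested.Cmp(*c.Capacity) <= 0
}
```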
---
I don't think we need to go that far. Both existing fields are intentionally marked as optional. We can always add new ones and stop populating the old ones if we find that this will work better. Nothing in the current discussion indicates that there'll be a need for that, though, so I am confident that this API is good for GA.
---
I had an idea how we can support simulated volume provisioning in the autoscaler without (at least initially) extending this API here or the CSI spec:

- the CSI driver must implement `GetCapacityResponse.available_capacity` and support `maximum_volume_size`
- for a simulated new node, the `MaximumVolumeSize` field gets dropped: what remains is the total capacity without any volumes allocated (see the sketch below)

There are some implementation challenges, for example the volume binder code that sets "selected node" might not get called at the moment by the autoscaler. The in-tree volume binder code must be made a bit more flexible to accept stores instead of creating informers itself, but other than that it doesn't need any special logic. It's also not a generic solution, so we must have a good understanding of which CSI drivers are going to use this and check in advance that they fit the assumptions.
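A minimal sketch of that second step, assuming the generated `DeepCopy` of the real API type:

```go
// capacityForSimulatedNode derives the object for a pristine simulated
// node from an existing node's object by dropping MaximumVolumeSize;
// the remaining Capacity is then read as "no volumes allocated yet".
func capacityForSimulatedNode(existing *CSIStorageCapacity) *CSIStorageCapacity {
	c := existing.DeepCopy()
	c.MaximumVolumeSize = nil
	return c
}
```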
---
If `MaximumVolumeSize` is simply removed, Cluster Autoscaler may decide to provision a node even though a workload can't get scheduled there because its request exceeds the max size (while being less than the total capacity). Proper CA support needs to somehow get this information.
Also, this doesn't really help in the scale-from-0 scenario. CA needs to somehow know what new nodes will look like, without an existing sample node to clone.
---
I probably shouldn't have posted that idea because we are now back to discussing a solution that doesn't belong in this KEP 😅 Let's continue anyway, but let's be aware that this isn't blocking the merging of this PR.
The entire approach of modeling volume provisioning in the autoscaler has to make assumptions about the storage system. The key assumption here is "storage is linear" in which case "maximum volume size" becomes irrelevant because any volume whose size is smaller than the remaining capacity can get created. If that isn't close enough to reality, a more complex model will be needed which then might include some information about "volumes have a minimum size, a maximum size, size must be aligned with X GB, etc.".
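The "storage is linear" assumption, as a minimal sketch:

```go
// linearPool models linear storage: any volume no larger than the
// remaining capacity can be created, so a separate per-volume maximum
// size adds no information.
type linearPool struct{ remainingBytes int64 }

func (p *linearPool) provision(sizeBytes int64) bool {
	if sizeBytes > p.remainingBytes {
		return false
	}
	p.remainingBytes -= sizeBytes
	return true
}
```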
The CSI driver for local storage cannot help with scale-from-0 because it isn't running. The cloud provider or some other central component would have to be extended to provide information for such a scenario. That's a different approach that gives up flexibility (cannot use arbitrary CSI drivers) in favor of scale-from-0 support.
---
Sorry for starting a solution discussion here, happy to take it elsewhere 🙂
Yes, if we want proper autoscaling support, CA has to use a more complex model, which will make sense for most/all storage implementations out there.
I think this could work through a separate API: different driver extensions would create different objects for CA to use for resource predictions.
---
I think the concern here is signing off on this as an agreed-upon solution. Instead, let's word this to indicate that further design and investigation are still needed to solve these problems.
And then park the existing discussion under "Alternatives" (I don't want to lose the great discussion that's happening here).
---
Isn't it meant to be the other way around: the CSI driver provides information to the autoscaler and the autoscaler does the modeling?
I've pushed an update that under "drawbacks" describes the problem and that further work is needed, then under "alternatives" the description of the current prototype.
---
I think there may need to be information passed both ways. In order for a CSI driver to estimate a capacity without being able to actually inspect a node, it's going to need information about the future node, such as how many disks are going to be there and their capacities.
---
Kubernetes is already notoriously hard to use, and by pushing our implementation challenges onto cluster admins we're only making it that much worse. We should instead think about how we can change our implementation to make it as easy as we can.
We have already solved the exact same issue for the node object itself: the autoscaler doesn't require the user to create a "template node" object manually based on observation; we automatically infer it from existing nodes. Can we do the same for `CSIStorageCapacity`? We could identify all `CSIStorageCapacity` objects for an existing node in a node group and copy them.
Of course, the issue is passing those in-memory objects to the plugin. But that can be solved, for example by refactoring the scheduler code to include `CSIStorageCapacity` in the scheduler snapshot rather than fetching it from an informer in the plugin. We already do exactly that for Node objects (an additional benefit of this solution is that the entire scheduling cycle would use a consistent snapshot of the cluster, eliminating any potential races caused by a `CSIStorageCapacity` object being created mid-scheduling).
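An invented sketch of what that snapshot might look like (the type and field names here are made up for illustration, not the scheduler's actual snapshot):

```go
// clusterSnapshot gives one scheduling cycle a consistent view instead
// of per-plugin informer lookups mid-cycle.
type clusterSnapshot struct {
	nodes      []*v1.Node            // k8s.io/api/core/v1
	capacities []*CSIStorageCapacity // captured once per cycle
}

// capacitiesFor returns the snapshot's objects for one storage class.
func (s *clusterSnapshot) capacitiesFor(storageClass string) []*CSIStorageCapacity {
	var out []*CSIStorageCapacity
	for _, c := range s.capacities {
		if c.StorageClassName == storageClass {
			out = append(out, c)
		}
	}
	return out
}
```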
---
The information reported by the CSI driver for existing nodes is about remaining capacity. We would have to extend the CSI spec to report capacity as if no volumes had been created.
That may be doable, but then what about scale up from zero? That can't work that way.
---
What if we had an `initialCapacity` field in `CSIStorageCapacity`, set it to be the same as the capacity when the object is first created, and never modified it later on? That'd require a bit of extra logic to set the field, but I don't think it requires changes in the CSI spec?
Agreed that scale-from-zero is more tricky. I don't see how we can solve it for the general case without some extra information coming from somewhere. It does feel very similar to custom resources and GPUs, though, and for those we defer to cloud providers to handle the problem. I think a solution that gives more opportunity to either automate or otherwise expose an API consistent with the given provider would be preferred (e.g. custom resources are specified in ASG tags).
---
That assumes that the driver gets to see the node in a "pristine" state and that nothing changes later on. But it isn't unusual for disks to fail or for new ones to get attached at runtime. It also doesn't help when the driver gets uninstalled completely (which, depending on the deployment approach, also clears the CSIStorageCapacity objects) and then gets reinstalled while there are still some volumes. When the driver then comes up, the sidecar doesn't get the full capacity.
---
Is this talking about local storage? Can you clarify?
---
I was mostly thinking of local storage here. You are right, that needs to be called out here because for network-attached storage the topology labels don't need to be modified. I'll revise.
---
Would we copy the available capacity from another node? What if that node's capacity is partially used?
---
That's why we cannot copy existing CSIStorageCapacity objects.
---
What if there was a CSIStorageCapacityClass object that defines which topologies would get a new CSIStorageCapacity and how much capacity they have?
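Purely hypothetically, such an object might look like this (no such type exists; all names are invented):

```go
// CSIStorageCapacityClass sketch: declares which topology segments
// should get a CSIStorageCapacity object before any node exists there,
// and with how much capacity.
type CSIStorageCapacityClass struct {
	// NodeTopology selects the topology segments to pre-populate.
	NodeTopology *metav1.LabelSelector

	// StorageClassName is the storage class the capacity applies to.
	StorageClassName string

	// Capacity that each matching new segment is assumed to offer.
	Capacity resource.Quantity
}
```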
---
I had considered whether additional information in CSIDriver could be used to avoid this manual configuration step. In the end I decided against it because it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler.
I'm not against doing something like this if there is sufficient demand. In that case we should create a new KEP about "auto-configuration of capacity handling for autoscaling" (or whatever we want to call it) and in that KEP introduce new alpha-level functionality.
This is similar to how other features get improved over time: first the base functionality gets introduced, then improvements are added later. I see this KEP as the base functionality that is sufficient for several use cases and therefore worth graduating to GA.
---
I agree: this could be a resource specific for CA (it could even be a CRD) and, yes, that would be a candidate for a follow up KEP.
I guess we can focus on the main problem: how to predict how many pods can fit in a node :)
---
I think that's better than having every user figure out independently how each CSIDriver behaves and feed that information to CA. Hopefully Kubernetes is big enough to create an incentive for CSI driver developers.
+1
---
IMO, this is starting to feel like a good thing. Pushing this onto users really doesn't feel good.
---
I don't disagree. But as I said elsewhere, we then need some buy-in from a CSI driver vendor who wants to support this. Otherwise such a new KEP will be stuck at the "get feedback from the field" promotion criteria.
---
If there were two pending pods, then given the semantics of available capacity, the autoscaler might think that they both fit on the same node when in reality they don't. One of the pods would succeed, and the second one would likely fail once it reaches kubelet. This would really hurt scale-up end-to-end scheduling time.
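A toy illustration of that double counting, with invented numbers (one node reporting 10Gi, two pods each wanting a 6Gi volume):

```go
package main

import "fmt"

func main() {
	const availableGiB = int64(10) // reported for the simulated node
	requests := []int64{6, 6}      // one volume per pending pod

	fits, sum := 0, int64(0)
	for _, r := range requests {
		if r <= availableGiB { // per-pod check; nothing is decremented
			fits++
		}
		sum += r
	}
	fmt.Printf("binpacking accepts %d pods, but they need %dGi of %dGi\n",
		fits, sum, availableGiB)
}
```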
---
This "will fail" seems to be a misunderstanding, because the second Pod will not reach kubelet. I agree that autoscaler will not make the right decision immediately (create two new nodes instead of one), but it also will not make a wrong decision and eventually the cluster grows sufficiently.
---
Agreed that it will reach the correct size eventually and any "failure" will be in the scheduler, not kubelet. That being said, the problem with this argument is that "eventually" may take a really long time (adding tens or hundreds of nodes one at a time), and for the vast majority of use cases autoscaling is only useful if you can get the nodes when you need them. In my experience scale-up/down latency is the single most important requirement for autoscaling, and a solution that completely ignores latency isn't really a solution for large-scale systems.
---
I guess we could have a timeout similar to the one we have for readiness when there are ignored taints, right @MaciekPytel?
---
Yeah, and also for GPUs. As long as we can predict the CSIStorageCapacity objects the node will eventually have (which we need to do anyway), our existing solution can be easily adapted for this.
---
The approach suggested here is indeed based on the solution for GPUs.
---
My issue is that it is very much a PoC. I can see how it could work for a simple test scenario, but I don't think it's a suitable final solution.
---
What about the other drawbacks listed, like modeling of capacity usage, fragmentation, and prioritization? Was there any progress on them? At the very least, were you able to quantify how bad the situation is in a busy cluster?
---
For PMEM-CSI I ran tests that completely filled up a cluster with Pods that have volumes. That worked well in my experience.
---
Can you summarize those results in the KEP?
---
The test with csi-driver-host-path is more suitable for the KEP because it's easier to reproduce and scale up. I can summarize the results from that in the KEP.
---
I think the summary of kubernetes-csi/csi-driver-host-path#350 in the revised KEP is sufficient. I just need to update the link once that PR is merged. Whoever wants specific numbers can then look up the report.