KEP-1472: storage capacity: update drawbacks #3233

Merged Mar 17, 2022 (10 commits)
51 changes: 48 additions & 3 deletions keps/sig-storage/1472-storage-capacity-tracking/README.md
@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
Member

I don't think network-attached storage should be much of a problem.

When we are scaling up a nodepool or nodegroup, we know which topology it will end up in, so we can query the CSIStorageCapacity objects that match.
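
A minimal sketch of such a lookup with client-go, assuming the GA storage.k8s.io/v1 API; the namespace and the topology key/value are illustrative placeholders, not anything defined by the KEP:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig in the default location; in-cluster config would work too.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Topology labels that nodes of the pool being scaled up are expected to report.
	// The key and value are made up for this example.
	poolTopology := labels.Set{"topology.example.csi/zone": "zone-a"}

	// CSIStorageCapacity objects are namespaced; "kube-system" is just an example.
	capacities, err := client.StorageV1().CSIStorageCapacities("kube-system").
		List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, c := range capacities.Items {
		selector, err := metav1.LabelSelectorAsSelector(c.NodeTopology)
		if err != nil || !selector.Matches(poolTopology) {
			continue
		}
		fmt.Printf("%s: class=%s capacity=%v\n", c.Name, c.StorageClassName, c.Capacity)
	}
}
```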

Contributor Author

For scale up from zero, network-attached storage has a similar problem: because the CSI driver hasn't provided topology information yet for the non-existent nodes, the cluster admin must set labels manually in the configuration of the node pool.

Member

I guess this is for the case where the nodepool is a new topology domain.

I think it should be the responsibility of the CSI driver to set that up even if no nodes exist. It's network-attached after all; it should come from somewhere and the driver should know about it. No?

Contributor Author

But how does the CSI driver know that the information is needed? Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?

There's also a practical problem here: the CSI spec doesn't have support for listing topology segments for nodes where no CSI driver is running. If we want something like this, we need a new API.

Member

Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?

Sure, why not? It may not work in every case but it seems like it could work in some?

Contributor Author

Sure, why not?

Scalability and performance issues? It'll depend on the number of those additional segments, of course.

It may not work in every case but it seems like it could work in some?

Yes. I just wanted to point out some of the details (conceptual and practical) that'll have to be addressed for this to work. I agree that it can work.

simulation may decide otherwise.


A separate issue is that Cluster Autoscaler simulates scheduling of all the pods (binpacking) before making any autoscaling decisions. Given how capacity is defined here (and how kubernetes/autoscaler#3887 uses the same CSIStorageCapacity object to represent the local storage of potentially hundreds of nodes), I think CA will not be able to calculate how many nodes are needed for any use case where available storage is the limiting factor for scheduling.

For example, assume I have 1,000 pending pods that have relatively low CPU/memory requests but require all the local storage of a node. CA would think all those pods can fit on just a few nodes (on a single node if the CPU/memory requests are low enough) and only add that many nodes. Admittedly, once the nodes are created some pods will be scheduled, CA will see that the remaining pods are still unschedulable, and it will trigger another (equally small) scale-up. So theoretically the situation will eventually recover. In practice the latency can be arbitrarily long, making the feature completely unusable in sufficiently large clusters.


I understand that's how the CSI specification works and we may not be able to change it, but it seems that the capacity specification isn't strictly defined. We could say that for the purposes of autoscaling we treat it as the sum of all volumes that can be created. This is not a perfect prediction, as it doesn't take fragmentation etc. into account, but it seems much more realistic than assuming we can create an infinite number of volumes and the only limitation is the size of an individual volume.

We could further explicitly say that any driver where this assumption is too far off is incompatible with autoscaling. Given how popular autoscaling is in cloud environments, this can hopefully create some incentive for driver authors to interpret capacity that way.
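
To make the two readings concrete, here is a small illustrative calculation (plain Go, not Kubernetes code) contrasting the current semantics with the proposed "sum of all volumes" interpretation:

```go
// Illustrative only: contrasts the two readings of "capacity" discussed above.
package main

import "fmt"

func main() {
	capacityGiB := int64(100)      // value reported in CSIStorageCapacity.capacity
	maxVolumeSizeGiB := int64(100) // value reported in maximumVolumeSize
	requestGiB := int64(10)        // size requested per volume

	// Current semantics: the fields only promise that a single volume of up to
	// maximumVolumeSize can still be created; they say nothing about how many.
	fmt.Println("one more volume fits:", requestGiB <= capacityGiB && requestGiB <= maxVolumeSizeGiB)

	// Proposed autoscaling interpretation: capacity is the sum of all volumes
	// that can still be created (ignoring fragmentation, alignment, etc.).
	fmt.Println("estimated volumes that fit:", capacityGiB/requestGiB) // 10, not "unlimited"
}
```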

Member

My concern is that the API is even problematic for kube-scheduler itself. On a sudden scale-up, multiple pods will end up failing; then we need to wait for them to fail before creating replacement pods. If you are using the Job API, you could hit the backoffLimit and the Job would fail. Similarly, other third-party APIs (Argo, Airflow) are less resistant to failures.

We could say that for the purposes of autoscaling we treat it as the sum of all volumes that can be created. This is not a perfect prediction, as it doesn't take fragmentation etc. into account, but it seems much more realistic than assuming we can create an infinite number of volumes and the only limitation is the size of an individual volume.

I agree that such an interpretation would be better. Even if it's not a perfect prediction, it would at least generate fewer failed pods.

As the API stands today, I don't think I'm confident graduating it to GA, especially if changing the interpretation of a field is under consideration. Maybe we instead need to add a new field for the total capacity, which the scheduler needs to respect in combination with the existing .status.availableCapacity. If the field is not set, it means that the capacity is unlimited and we can assume that drivers who choose to leave it empty understand the consequences.

Do you think adding that field is possible @pohly? Do you think that's better for scheduling/autoscaling @MaciekPytel?

If you agree, we should go with a v1beta2.
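
Purely as an illustration of that suggestion, the addition might look roughly like this; TotalCapacity is hypothetical and does not exist in the current API, while Capacity and MaximumVolumeSize do:

```go
package sketch

import "k8s.io/apimachinery/pkg/api/resource"

// Hypothetical sketch, not the real Go type: the actual CSIStorageCapacity
// object already carries Capacity and MaximumVolumeSize at the top level;
// TotalCapacity is the addition being discussed here and does not exist today.
type StorageCapacitySketch struct {
	Capacity          *resource.Quantity // existing: remaining capacity reported by the driver
	MaximumVolumeSize *resource.Quantity // existing: largest single volume that can still be created
	TotalCapacity     *resource.Quantity // hypothetical: upper bound on the sum of all volume sizes
}
```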

Member

FYI @liggitt @roycaihw (API reviewers)

Contributor Author

I understand that's how the CSI specification works and we may not be able to change it, but it seems that the capacity specification isn't strictly defined.

That's why I added a more strictly defined "maximum volume size" to the spec. That is what Kubernetes uses if available, instead of the more loosely defined "capacity".

We could say that for the purposes of autoscaling we treat it as the sum of all volumes that can be created. This is not a perfect prediction, as it doesn't take fragmentation etc. into account, but it seems much more realistic than assuming we can create an infinite number of volumes and the only limitation is the size of an individual volume.

I don't think that making assumptions about how CSI drivers handle storage capacity is going to help. We've discussed that at length in SIG Storage and in CSI meetings, and the consensus there was that there are always some vendor-specific exceptions that cannot be handled in a generic way.

On a sudden scale-up, multiple pods will end up failing; then we need to wait for them to fail before creating replacement pods.

Failing how? They won't get scheduled unless their volumes were successfully created. Until then they remain in the queue and will eventually land on some node.

Contributor Author

I don't think we need to go that far. Both existing fields are intentionally marked as optional. We can always add new ones and stop populating the old ones if we find that this will work better. Nothing in the current discussion indicates that there'll be a need for that, though, so I am confident that this API is good for GA.

Contributor Author

I had an idea for how we can support simulated volume provisioning in the autoscaler without (at least initially) extending this API or the CSI spec:

  • the CSI driver must report "total capacity" in GetCapacityResponse.available_capacity and support maximum_volume_size
  • the CSI driver must be local to nodes and have a single topology key with the Kubernetes node name as value
  • the CSIDriver object must have a special annotation for the autoscaler ("linear, local storage")
  • nothing changes for kube-scheduler
  • autoscaler simulates scheduling like this:
    • when cloning an existing node to create a node group, it replaces the node name in the CSI driver's topology label
    • it clones the CSIStorageCapacity objects for each storage class, changes their topology and removes the MaximumVolumeSize field: what remains is the total capacity without any volumes allocated
    • when the "selected node" label gets set for a PVC while scheduling a pod, the "AvailableCapacity" gets reduces by the size of that PVC: this simulates volume creation on that node and ensures that eventually pods no longer fit the node

There are some implementation challenges; for example, the volume binder code that sets "selected node" might not get called by the autoscaler at the moment. The in-tree volume binder code must be made a bit more flexible to accept stores instead of creating informers itself, but other than that it doesn't need any special logic. It's also not a generic solution, so we must have a good understanding of which CSI drivers are going to use this and check in advance that they fit the assumptions.
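
A rough sketch of the cloning and capacity bookkeeping outlined in the list above, assuming storage.k8s.io/v1 types; the helper names are invented for illustration, and the topology key follows the hostpath example used elsewhere in this KEP:

```go
package sketch

import (
	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cloneCapacityForTemplateNode copies a CSIStorageCapacity object from an
// existing node to a fictional template node: the topology selector is rewritten
// to the template node's name and MaximumVolumeSize is dropped, leaving the
// total capacity with no volumes allocated ("linear, local storage" assumption).
func cloneCapacityForTemplateNode(orig *storagev1.CSIStorageCapacity, templateNode string) *storagev1.CSIStorageCapacity {
	clone := orig.DeepCopy()
	clone.Name = orig.Name + "-" + templateNode
	clone.NodeTopology = &metav1.LabelSelector{
		MatchLabels: map[string]string{"topology.hostpath.csi/node": templateNode},
	}
	clone.MaximumVolumeSize = nil
	return clone
}

// simulateVolumeCreation reduces the simulated capacity by the PVC request once
// the volume binder selects this node, so that later pods eventually stop fitting.
// It returns false if the volume would no longer fit.
func simulateVolumeCreation(capacity *storagev1.CSIStorageCapacity, request resource.Quantity) bool {
	if capacity.Capacity == nil || capacity.Capacity.Cmp(request) < 0 {
		return false
	}
	capacity.Capacity.Sub(request)
	return true
}
```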

Member

If MaximumVolumeSize is simply removed, Cluster Autoscaler may decide to provision a node even though a workload can't get scheduled there due to its request exceeding the max size (but being less than total capacity). Proper CA support needs to somehow get this information.

Also, this doesn't really help in the scale-from-0 scenario. CA needs to somehow know what new nodes will look like, without an existing sample node to clone.

Contributor Author

I probably shouldn't have posted that idea because we are now back to discussing a solution that doesn't belong in this KEP 😅 Let's continue anyway, but let's keep in mind that this isn't blocking the merging of this PR.

The entire approach of modeling volume provisioning in the autoscaler has to make assumptions about the storage system. The key assumption here is "storage is linear" in which case "maximum volume size" becomes irrelevant because any volume whose size is smaller than the remaining capacity can get created. If that isn't close enough to reality, a more complex model will be needed which then might include some information about "volumes have a minimum size, a maximum size, size must be aligned with X GB, etc.".

Also, this doesn't really help in the scale-from-0 scenario. CA needs to somehow know what new nodes will look like, without an existing sample node to clone.

The CSI driver for local storage cannot help with scale-from-0 because it isn't running. The cloud provider or some other central component would have to be extended to provide information for such a scenario. That's a different approach that gives up flexibility (cannot use arbitrary CSI drivers) in favor of scale-from-0 support.

Member

Sorry for starting a solution discussion here, happy to take it elsewhere 🙂

The entire approach of modeling volume provisioning in the autoscaler has to make assumptions about the storage system. The key assumption here is "storage is linear" in which case "maximum volume size" becomes irrelevant because any volume whose size is smaller than the remaining capacity can get created. If that isn't close enough to reality, a more complex model will be needed which then might include some information about "volumes have a minimum size, a maximum size, size must be aligned with X GB, etc.".

Yes, if we want proper autoscaling support, CA has to use a more complex model, which will make sense for most/all storage implementations out there.

The CSI driver for local storage cannot help with scale-from-0 because it isn't running. The cloud provider or some other central component would have to be extended to provide information for such a scenario. That's a different approach that gives up flexibility (cannot use arbitrary CSI drivers) in favor of scale-from-0 support.

I think this could work through a separate API: different driver extensions would create different objects for CA to use for resource predictions.


It may be possible to solve this by pre-configuring some information
(local storage capacity of future nodes and their CSI topology). This
needs to be explored further.
This gets further complicated by the independent development of CSI drivers,
autoscaler, and cloud provider: autoscaler and cloud provider don't know which
kinds of volumes a CSI driver will be able to make available on nodes because
that logic is implemented inside the CSI driver. The CSI driver doesn't know
about hardware that hasn't been provisioned yet and doesn't know about
autoscaling.

Member

I think the concern here is signing off on this as an agreed-upon solution. Instead, let's word this to indicate that further design and investigation are still needed to solve these problems:

  • How a CSI driver can report capacity on a pristine, simulated node
  • How the autoscaler can provide a CSI driver with the storage information it needs to be able to model capacity based on the environment it's modeling.
  • How to improve the batch scheduling use case

And then park the existing discussion under "Alternatives" (don't want to lose the great discussion that's happening here).

Contributor Author

How the autoscaler can provide a CSI driver with the storage information it needs to be able to model capacity based on the environment it's modeling.

Isn't it meant to be the other way around: the CSI driver provides information to the autoscaler and the autoscaler does the modeling?

And then park the existing discussion under "Alternatives" (don't want to lose the great discussion that's happening here).

I've pushed an update that describes the problem under "Drawbacks" and notes that further work is needed, and then describes the current prototype under "Alternatives".

Member

Isn't it meant to be the other way around: the CSI driver provides information to the autoscaler and the autoscaler does the modeling?

I think there may need to be information passed both ways. In order for a CSI driver to estimate a capacity without being able to actually inspect a node, it's going to need information about the future node, such as how many disks are going to be there and their capacities.

This problem can be solved by the cluster administrator. They can find out how


Kubernetes is already notoriously hard to use, and by pushing our implementation challenges onto the cluster admin we're only making it that much worse. We should instead think about how we can change our implementation to make it as easy as we can.

We have already solved the exact same issue for the Node object itself: autoscaler doesn't require the user to create a 'template node' object manually based on observation. We automatically infer it from existing nodes. Can we do the same for CSIStorageCapacity? We can identify all CSIStorageCapacity objects for an existing node in a node group and copy them.
Of course, the issue is passing those in-memory objects to the plugin. But that can be solved, for example by refactoring the scheduler code to include CSIStorageCapacity in the scheduler snapshot rather than fetching it from an informer in the plugin. We already do exactly that for Node objects (an additional benefit of this solution is that the entire scheduling cycle would use a consistent snapshot of the cluster, eliminating any potential races caused by a CSIStorageCapacity object being created mid-scheduling).

Contributor Author

We automatically infer it from existing nodes. Can we do the same for CSIStorageCapacity?

The information reported by the CSI driver for existing nodes is about remaining capacity. We would have to extend the CSI spec to report capacity as if no volumes had been created.

That may be doable, but then what about scale up from zero? That can't work that way.


We would have to extend the CSI spec to report capacity as if no volumes had been created

What if we had an initialCapacity field in CSIStorageCapacity, set it to be the same as the capacity when the object is first created, and never modified it later on? That'd require a bit of extra logic to set the field, but I don't think it requires changes to the CSI spec?

Agreed that scale-from-zero is trickier. I don't see how we can solve it for the general case without some extra information coming from somewhere. It does feel very similar to custom resources and GPUs, though, and for those we defer to cloud providers to handle the problem. I think a solution that gives more opportunity to either automate or otherwise expose an API consistent with the given provider would be preferred (e.g. custom resources are specified in ASG tags).
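
A small sketch of the "set once, never modify" semantics being proposed; initialCapacity is hypothetical and the helper is illustrative, not existing sidecar code:

```go
package sketch

import "k8s.io/apimachinery/pkg/api/resource"

// capacityRecord stands in for a CSIStorageCapacity object; InitialCapacity is
// the hypothetical write-once field, Capacity the existing, regularly refreshed one.
type capacityRecord struct {
	Capacity        *resource.Quantity
	InitialCapacity *resource.Quantity
}

// publish refreshes Capacity on every report but sets InitialCapacity only the
// first time, so later volume creations never change it.
func publish(rec *capacityRecord, reported resource.Quantity) {
	rec.Capacity = &reported
	if rec.InitialCapacity == nil {
		initial := reported.DeepCopy()
		rec.InitialCapacity = &initial
	}
}
```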

Contributor Author

What if we had an initialCapacity field in CSIStorageCapacity, set it to be the same as the capacity when the object is first created, and never modified it later on?

That assumes that the driver gets to see the node in a "pristine" state and that nothing changes later on. But it isn't unusual for disks to fail or for new ones to get attached at runtime. It also doesn't help when the driver gets uninstalled completely (which, depending on the deployment approach, also clears the CSIStorageCapacity objects) and then gets reinstalled while there are still some volumes. When the driver then comes up, the sidecar doesn't get the full capacity.

much storage will be made available by new nodes, for example by running
experiments, and then configure the cluster so that this information is
available to the autoscaler. This can be done with the existing
CSIStorageCapacity API as follows:

- When creating a fictional Node object from an existing Node in
Member

Is this talking about local storage? Can you clarify?

Contributor Author

I was mostly thinking of local storage here. You are right, that needs to be called out here because for network-attached storage the topology labels don't need to be modified. I'll revise.

a node group, autoscaler must modify the topology labels of the CSI
driver(s) in the cluster so that they define a new topology segment.
Member

Would we copy the available capacity from another node? What if that node's capacity is partially used?

Contributor Author

That's why we cannot copy existing CSIStorageCapacity objects.

For example, topology.hostpath.csi/node=aks-workerpool.* has to
Member

What if there were a CSIStorageCapacityClass object that defines which topologies would get a new CSIStorageCapacity and how much capacity they have?

Contributor Author

I had considered whether additional information in CSIDriver could be used to avoid this manual configuration step. In the end I decided against it because it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler.

I'm not against doing something like this if there is sufficient demand. In that case we should create a new KEP about "auto-configuration of capacity handling for autoscaling" (or whatever we want to call it) and in that KEP introduce new alpha-level functionality.

This is similar to how other features get improved over time: first the base functionality gets introduced, then later improvements are added. I see this KEP as the base functionality that is sufficient for several use cases and therefore worth graduating to GA.

Member

I agree: this could be a resource specific to CA (it could even be a CRD) and, yes, that would be a candidate for a follow-up KEP.

I guess we can focus on the main problem: how to predict how many pods can fit in a node :)


I decided against it because it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler.

I think that's better than having every user figure out independently how each CSIDriver behaves and feed that information to CA. Hopefully Kubernetes is big enough to create an incentive for CSI driver developers.

In that case we should create a new KEP about "auto-configuration of capacity handling for autoscaling" (or whatever we want to call it) and in that KEP introduce new alpha-level functionality.

+1

Member

it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler

IMO, this is starting to feel like a good thing. Pushing this onto users really doesn't feel good.

Contributor Author

I don't disagree. But as I said elsewhere, we then need some buy-in from a CSI driver vendor who wants to support this. Otherwise such a new KEP will be stuck at the "get feedback from the field" promotion criteria.

be replaced with topology.hostpath.csi/node=aks-workerpool-template.
Because these labels are opaque to the autoscaler, the cluster
administrator must configure these transformations, for example
via regular expression search/replace.
- For scale up from zero, a label like
topology.hostpath.csi/node=aks-workerpool-template must be added to the
configuration of the node pool.
- For each storage class, the cluster administrator can then create
CSIStorageCapacity objects that provide the capacity information for these
fictional topology segments.
- When the volume binder plugin for the scheduler runs inside the autoscaler,
Member

If there were 2 pending pods, given the semantics of available capacity, autoscaler might think that they both fit on the same node, when in reality they don't. One of the pods would succeed, and the second one would likely fail once it reaches the kubelet. This would really hurt scale-up E2E scheduling time.

Contributor Author

This "will fail" seems to be a misunderstanding, because the second Pod will not reach kubelet. I agree that autoscaler will not make the right decision immediately (create two new nodes instead of one), but it also will not make a wrong decision and eventually the cluster grows sufficiently.


Agreed that it will reach the correct size eventually and any 'failure' will be in the scheduler, not the kubelet. That being said, the problem with this argument is that 'eventually' may take a really long time (adding tens or hundreds of nodes one at a time), and for the vast majority of use cases autoscaling is only useful if you can get the nodes when you need them. In my experience, scale-up/down latency is the single most important requirement for autoscaling, and a solution that completely ignores latency isn't really a solution for large-scale systems.

it works exactly as in the scheduler and will accept nodes where the manually
created CSIStorageCapacity objects indicate that sufficient storage is (or rather,
will be) available.
- Because the CSI driver will not run immediately on new nodes, autoscaler has
to wait for it before considering the node ready. If it doesn't do that, it
might incorrectly scale up further because storage capacity checks will fail
for a new, unused node until the CSI driver provides CSIStorageCapacity
objects for it. This can be implemented in a generic way for all CSI drivers
Member

I guess we could have a timeout similar to the one we have for readiness when there are ignored taints, right @MaciekPytel?


Yeah, and also for GPUs. As long as we can predict the CSIStorageCapacity objects the node will eventually have (which we need to do anyway), our existing solution can be easily adapted for this.

Contributor Author

The approach suggested here is indeed based on the solution for GPUs.

by adding a readiness check to the autoscaler that compares the existing
CSIStorageCapacity objects against the expected ones for the fictional node.

A proof-of-concept of this approach is available in


My issue is that it is very much a PoC. I can see how it can work for a simple test scenario, but I don't think it's a suitable final solution:

  • The UX is extremely complicated, requiring multiple complex steps from the cluster admin. Debugging this will be super hard for anyone not intimately familiar with CA internals.
  • I don't think this approach scales to large clusters (see my comment about capacity above).

https://github.com/kubernetes/autoscaler/pull/3887 and has been used
successfully to scale an Azure cluster up and down with csi-driver-host-path as
CSI driver.

The approach above preserves the separation between the different
components. Simpler solutions may be possible by adding support for specific
CSI drivers into custom autoscaler binaries or into operators that control the
cluster setup.
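
For illustration, a manually created object for such a fictional topology segment might look like this, using the template label from the example above; the namespace and sizes are assumptions that an administrator would replace with measured values:

```go
package sketch

import (
	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Example object for the fictional "template" topology segment of a node pool,
// created by the cluster administrator before any real node exists. The values
// are placeholders that would be replaced with experimentally determined ones.
var templateCapacity = &storagev1.CSIStorageCapacity{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "aks-workerpool-template-fast",
		Namespace: "kube-system", // assumption: wherever the driver's objects live
	},
	StorageClassName: "fast",
	NodeTopology: &metav1.LabelSelector{
		MatchLabels: map[string]string{
			"topology.hostpath.csi/node": "aks-workerpool-template",
		},
	},
	Capacity:          resource.NewQuantity(100*1024*1024*1024, resource.BinarySI), // 100Gi expected per new node
	MaximumVolumeSize: resource.NewQuantity(100*1024*1024*1024, resource.BinarySI),
}
```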

Member

What about the other "Drawbacks" listed, like modeling of capacity usage, fragmentation, and prioritization? Was there any progress on them? At the very least, were you able to quantify how bad the situation is in a busy cluster?

Contributor Author

For PMEM-CSI I ran tests that completely filled up a cluster with Pods that have volumes. That worked well in my experience.

Member

Can you summarize those results in the KEP?

Contributor Author

The test with csi-driver-host-path is more suitable for the KEP because it's easier to reproduce and scale up. I can summarize the results from that in the KEP.

Contributor Author

I think the summary of kubernetes-csi/csi-driver-host-path#350 in the revised KEP is sufficient. I just need to update the link once that PR is merged. Whoever wants specific numbers can then look up the report.

### Alternative solutions
