KEP-1472: storage capacity: update drawbacks #3233
Conversation
/cc @alculquicondor @xing-yang @x13n @MaciekPytel @mwielgus: just for completeness, can you confirm here that the conclusion from the 2022-02-21 SIG Autoscaling meeting was that this feature can go to GA without kubernetes/autoscaler#3887 being merged?
The approach above preserves the separation between the different
components. Simpler solutions may be possible by adding support for specific
CSI drivers into custom autoscaler binaries or into operators that control the
cluster setup.
What about the other "Drawbacks" listed, like modeling of capacity usage, fragmentation and prioritization. Was there any progress on them? At the very least, were you able to quantify how bad the situation is in a busy cluster?
For PMEM-CSI I ran tests that completely filled up a cluster with Pods that have volumes. That worked well in my experience.
Can you summarize those results in the KEP?
The test with csi-driver-host-path is more suitable for the KEP because it's easier to reproduce and scale up. I can summarize the results from that in the KEP.
I think the summary of kubernetes-csi/csi-driver-host-path#350 in the revised KEP is sufficient. I just need to update the link once that PR is merged. Whoever wants specific numbers then can look up the report.
@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
I don't think network-attached storage should be much of a problem. When we are scaling up a node pool or node group, we know which topology it will end up in, so we can query the CSIStorageCapacity that matches.
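The lookup described here can be sketched as a toy model (field names loosely mirror the CSIStorageCapacity API; the topology key, zones, and sizes are invented for illustration):

```python
# Toy model: a CSIStorageCapacity object applies to a node if its
# nodeTopology selector matches the node's labels, so a node group's
# capacity can be looked up before any of its nodes exist.

def topology_matches(node_topology: dict, node_labels: dict) -> bool:
    """matchLabels-style check: every selector key/value must be on the node."""
    return all(node_labels.get(k) == v for k, v in node_topology.items())

capacities = [
    {"storageClassName": "fast",
     "nodeTopology": {"topology.example.csi/zone": "zone-a"},
     "capacity_gib": 500},
    {"storageClassName": "fast",
     "nodeTopology": {"topology.example.csi/zone": "zone-b"},
     "capacity_gib": 200},
]

# The scaled-up node group is known to land in zone-b.
pool_labels = {"topology.example.csi/zone": "zone-b"}
matching = [c for c in capacities if topology_matches(c["nodeTopology"], pool_labels)]
```

This only works when a CSIStorageCapacity object for that topology already exists, which is the scale-from-zero gap discussed below.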
For scale up from zero, network-attached storage has a similar problem: because the CSI driver hasn't provided topology information yet for the non-existent nodes, the cluster admin must set labels manually in the configuration of the node pool.
I guess this is for the case where the nodepool is a new topology domain.
I think it should be the responsibility of the CSI driver to set that up even if no nodes exist. It's network-attached after all; the capacity has to come from somewhere, and the driver should know about it. No?
But how does the CSI driver know that the information is needed? Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?
There's also a practical problem here: the CSI spec doesn't have support for listing topology segments for nodes where no CSI driver is running. If we want something like this, we need a new API.
Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?
Sure, why not? It may not work in every case but it seems like it could work in some?
Sure, why not?
Scalability and performance issues? It'll depend on the number of those additional segments, of course.
It may not work in every case but it seems like it could work in some?
Yes. I just wanted to point out some of the details (conceptual and practical) that'll have to be addressed for this to work. I agree that it can work.
available to the autoscaler. This can be done with the existing
CSIStorageCapacity API as follows:

- When creating a fictional Node object from an existing Node in
Is this talking about local storage? Can you clarify?
I was mostly thinking of local storage here. You are right, that needs to be called out here because for network-attached storage the topology labels don't need to be modified. I'll revise.
- For each storage class, the cluster administrator can then create
  CSIStorageCapacity objects that provide the capacity information for these
  fictional topology segments.
- When the volume binder plugin for the scheduler runs inside the autoscaler,
If there were 2 pending pods, given the semantics of available capacity, autoscaler might think that they both fit in the same node, where in reality they don't. One of the pods would succeed, and the second one would likely fail once it reaches kubelet. This would really hurt scale up E2E scheduling time.
This "will fail" seems to be a misunderstanding, because the second Pod will not reach kubelet. I agree that autoscaler will not make the right decision immediately (create two new nodes instead of one), but it also will not make a wrong decision and eventually the cluster grows sufficiently.
Agreed that it will reach the correct size eventually and any 'failure' will be in the scheduler, not kubelet. That being said, the problem with this argument is that "eventually" may take a really long time (adding tens or hundreds of nodes one at a time), and for the vast majority of use cases autoscaling is only useful if you can get the nodes when you need them. In my experience, scale-up/down latency is the single most important requirement for autoscaling, and a solution that completely ignores latency isn't really a solution for large-scale systems.
to wait for it before considering the node ready. If it doesn't do that, it
might incorrectly scale up further because storage capacity checks will fail
for a new, unused node until the CSI driver provides CSIStorageCapacity
objects for it. This can be implemented in a generic way for all CSI drivers
I guess we could have a similar timeout as we have for readiness when there are ignored taints, right @MaciekPytel ?
Yeah and also for GPUs. As long as we can predict CSIStorageCapacity objects the node will eventually have (which we need to do anyway) our existing solution can be easily adapted for this.
The approach suggested here is indeed based on the solution for GPUs.
- When creating a fictional Node object from an existing Node in
  a node group, autoscaler must modify the topology labels of the CSI
  driver(s) in the cluster so that they define a new topology segment.
  For example, topology.hostpath.csi/node=aks-workerpool.* has to
What if there was a CSIStorageCapacityClass object that defines which topologies would get a new CSIStorageCapacity and how much capacity they have?
I had considered whether additional information in CSIDriver could be used to avoid this manual configuration step. In the end I decided against it because it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler.
I'm not against doing something like this if there is sufficient demand. In that case we should create a new KEP about "auto-configuration of capacity handling for autoscaling" (or whatever we want to call it) and in that KEP introduce new alpha-level functionality.
This is similar to how other features get improved over time: first the base functionality gets introduced, then later improvements are added. I see this KEP as the base functionality that is sufficient for several use cases and therefore worth graduating to GA.
I agree: this could be a resource specific for CA (it could even be a CRD) and, yes, that would be a candidate for a follow up KEP.
I guess we can focus on the main problem: how to predict how many pods can fit in a node :)
I decided against it because it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler.
I think that's better than having every user figure out independently how each CSIDriver behaves and feed that information to CA. Hopefully Kubernetes is big enough to create an incentive for CSI driver developers.
In that case we should create a new KEP about "auto-configuration of capacity handling for autoscaling" (or whatever we want to call it) and in that KEP introduce new alpha-level functionality.
+1
it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler
IMO, this is starting to feel like a good thing. Pushing this onto users really doesn't feel good.
I don't disagree. But as I said elsewhere, we then need some buy-in from a CSI driver vendor who wants to support this. Otherwise such a new KEP will be stuck at the "get feedback from the field" promotion criteria.
- When creating a fictional Node object from an existing Node in
  a node group, autoscaler must modify the topology labels of the CSI
  driver(s) in the cluster so that they define a new topology segment.
Would we copy the available capacity from another node? What if that node has the capacity partially used?
That's why we cannot copy existing CSIStorageCapacity objects.
@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
simulation may decide otherwise.
A separate issue is that Cluster Autoscaler simulates scheduling of all the pods (binpacking) before making any autoscaling decisions. The way that capacity is defined here (and how kubernetes/autoscaler#3887 uses the same CSIStorageCapacity to represent local storage of potentially hundreds of nodes) I think means that CA will not be able to calculate how many nodes are needed for any use case where available storage is the limiting factor for scheduling.
For example, assume I have 1000 pending pods that have relatively low cpu/memory requests but require all the local storage of a node. CA would think all those pods can fit on just a few nodes (on a single node if the cpu/memory requests are low enough) and only add that many nodes. Admittedly, once the nodes are created some pods will be scheduled, CA will see that the remaining pods are unschedulable again and trigger another (equally small) scale-up. So theoretically the situation will eventually recover. In practice the latency can be arbitrarily long, making the feature completely unusable in sufficiently large clusters.
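The latency concern above can be made concrete with a toy round-count model (all numbers invented): when each pending pod consumes essentially all of a node's local storage but the binpacking simulation concludes that the remaining pods fit on the next node, scale-up proceeds one node per autoscaler iteration:

```python
# Toy model of the one-node-at-a-time scale-up loop: each round the
# autoscaler adds a single node, exactly one pod lands on it, and the
# rest are found unschedulable again in the next round.

def rounds_to_schedule(pending_pods: int, nodes_per_round: int = 1) -> int:
    rounds = 0
    while pending_pods > 0:
        pending_pods -= nodes_per_round  # one pod schedules per new node
        rounds += 1
    return rounds
```

With 1000 such pods the cluster does converge, but only after 1000 scale-up rounds, which is what makes "eventually" unacceptable for latency-sensitive autoscaling.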
I understand that's how the CSI specification works and we may not be able to change it, but it seems that the capacity specification isn't strictly defined. We could say that for purposes of autoscaling we treat it as the sum of all volumes that can be created. This is not a perfect prediction as it doesn't take fragmentation, etc. into account, but it seems much more realistic than assuming we can create an infinite number of volumes and the only limitation is on the size of an individual volume.
We could further explicitly say that any driver where this assumption is too far off is incompatible with autoscaling. Given how popular autoscaling is in cloud environments, this can hopefully generate some incentive for driver authors to interpret capacity that way.
My concern is that the API is even problematic for kube-scheduler itself. On a sudden scale up, multiple pods will end up failing. Then we need to wait for them to fail to create replacement pods. If you are using the Job API, you could hit the backoffLimit and the Job would fail. Similarly, other third party APIs (argo, airflow) are less resistant to failures.
We could say that for purposes of autoscaling we treat it as sum of all volumes that can be created. This is not a perfect prediction as it doesn't take fragmentation, etc into account, but it seems much more realistic than assuming we can create infinite number of volumes and the only limitation is on the size of individual volume.
I agree that such an interpretation would be better. Even if it's not a perfect prediction, it would at least generate fewer failed pods.
As the API stands today, I don't think I'm confident graduating it to GA, especially if changing the interpretation of a field is under consideration. Maybe we instead need to add a new field for the total capacity, which the scheduler needs to respect in combination with the existing .status.availableCapacity. If the field is not set, it means that the capacity is unlimited, and we can assume that drivers who choose to leave it empty understand the consequences.
Do you think adding that field is possible @pohly? Do you think that's better for scheduling/autoscaling @MaciekPytel?
If you agree, we should go with a v1beta2.
I understand that's how CSI specification works and we may not be able to change it, but it seems that capacity specification isn't strictly defined.
That's why I added a more strictly defined "maximum volume size" to the spec. That is what Kubernetes uses if available, instead of the more loosely defined "capacity".
We could say that for purposes of autoscaling we treat it as sum of all volumes that can be created. This is not a perfect prediction as it doesn't take fragmentation, etc into account, but it seems much more realistic than assuming we can create infinite number of volumes and the only limitation is on the size of individual volume.
I don't think that making assumptions about how CSI drivers handle storage capacity is going to help. We've discussed that at length in SIG Storage and in CSI meetings and the consensus there was that there are always some vendor specific exceptions that cannot be handled in a generic way.
On a sudden scale up, multiple pods will end up failing. then we need to wait for them to fail to create replacement pods.
Failing how? They won't get scheduled unless their volumes were successfully created. Until then they remain in the queue and will eventually land on some node.
I don't think we need to go that far. Both existing fields are intentionally marked as optional. We can always add new ones and stop populating the old ones, if we find that this will work better. Nothing in the current discussion indicates that there'll be a need for that, though, therefore I am confident that this API is good for GA.
I had an idea how we can support simulated volume provisioning in the autoscaler without (at least initially) extending this API here or the CSI spec:
- the CSI driver must report "total capacity" in GetCapacityResponse.available_capacity and support maximum_volume_size
- the CSI driver must be local to nodes and have a single topology key with the Kubernetes node name as value
- the CSIDriver object must have a special annotation for the autoscaler ("linear, local storage")
- nothing changes for kube-scheduler
- autoscaler simulates scheduling like this:
  - when cloning an existing node to create a node group, it replaces the node name in the CSI driver's topology label
  - it clones the CSIStorageCapacity objects for each storage class, changes their topology and removes the MaximumVolumeSize field: what remains is the total capacity without any volumes allocated
  - when the "selected node" label gets set for a PVC while scheduling a pod, the AvailableCapacity gets reduced by the size of that PVC: this simulates volume creation on that node and ensures that eventually pods no longer fit the node

There are some implementation challenges, for example the volume binder code that sets "selected node" might not get called at the moment by autoscaler. The in-tree volume binder code must be made a bit more flexible to accept stores instead of creating informers itself, but other than that it doesn't need any special logic. It's also not a generic solution, so we must have a good understanding of which CSI drivers are going to use this to check in advance that they fit the assumptions.
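A minimal sketch of that simulation, under the stated "storage is linear" assumption (class and method names are invented for illustration, not actual autoscaler code):

```python
# Sketch of the proposed simulation: the cloned CSIStorageCapacity's
# available capacity is treated as total capacity (MaximumVolumeSize
# removed, no volumes allocated yet), and each simulated "selected node"
# decision for a PVC subtracts that PVC's size. Linear model: any
# request within the remaining capacity is assumed to provision.

class SimulatedNodeStorage:
    def __init__(self, total_gib: int):
        # Fictional node starts with the full (unallocated) capacity.
        self.available_gib = total_gib

    def bind_pvc(self, size_gib: int) -> bool:
        """Simulate setting 'selected node' for a PVC of size_gib."""
        if size_gib <= self.available_gib:
            self.available_gib -= size_gib
            return True
        return False  # pod no longer fits this fictional node

node = SimulatedNodeStorage(total_gib=100)
results = [node.bind_pvc(40), node.bind_pvc(40), node.bind_pvc(40)]
```

The third PVC no longer fits, so the simulation correctly concludes that a second fictional node is needed, which addresses the "infinite volumes per node" problem for drivers that fit the linearity assumption.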
If MaximumVolumeSize is simply removed, Cluster Autoscaler may decide to provision a node even though a workload can't get scheduled there due to its request exceeding the max size (but being less than total capacity). Proper CA support needs to somehow get this information.
Also, this doesn't really help in the scale-from-0 scenario. CA needs to somehow know what new nodes will look like, without an existing sample node to clone.
I probably shouldn't have posted that idea because we are now back to discussing a solution that doesn't belong in this KEP 😅 Let's continue anyway, but let's be aware that this isn't blocking merging of this PR.
The entire approach of modeling volume provisioning in the autoscaler has to make assumptions about the storage system. The key assumption here is "storage is linear" in which case "maximum volume size" becomes irrelevant because any volume whose size is smaller than the remaining capacity can get created. If that isn't close enough to reality, a more complex model will be needed which then might include some information about "volumes have a minimum size, a maximum size, size must be aligned with X GB, etc.".
Also, this doesn't really help in scale-from-0 scenario. CA needs to somehow know how new nodes will look like, without an existing sample node to clone.
The CSI driver for local storage cannot help with scale-from-0 because it isn't running. The cloud provider or some other central component would have to be extended to provide information for such a scenario. That's a different approach that gives up flexibility (cannot use arbitrary CSI drivers) in favor of scale-from-0 support.
Sorry for starting a solution discussion here, happy to take it elsewhere 🙂
The entire approach of modeling volume provisioning in the autoscaler has to make assumptions about the storage system. The key assumption here is "storage is linear" in which case "maximum volume size" becomes irrelevant because any volume whose size is smaller than the remaining capacity can get created. If that isn't close enough to reality, a more complex model will be needed which then might include some information about "volumes have a minimum size, a maximum size, size must be aligned with X GB, etc.".
Yes, if we want proper autoscaling support, CA has to use a more complex model, which will make sense for most/all storage implementations out there.
The CSI driver for local storage cannot help with scale-from-0 because it isn't running. The cloud provider or some other central component would have to be extended to provide information for such a scenario. That's a different approach that gives up flexibility (cannot use arbitrary CSI drivers) in favor of scale-from-0 support.
I think this could work through a separate API: different driver extensions would create different objects for CA to use for resource predictions.
about hardware that hasn't been provisioned yet and doesn't know about
autoscaling.

This problem can be solved by the cluster administrator. They can find out how
Kubernetes is already notoriously hard to use, and by pushing our implementation challenges onto the cluster admin we're only making it that much worse. We should instead try thinking about how we can change our implementation to make it as easy as we can.
We have already solved the exact same issue for the node object itself: autoscaler doesn't require the user to create a 'template node' object manually based on observation. We automatically infer it from existing nodes. Can we do the same for CSIStorageCapacity? We can identify all CSIStorageCapacity objects for an existing node in a node group and copy them.
Of course, the issue is passing those in-memory objects to the plugin. But that can be solved, for example by refactoring scheduler code to include CSIStorageCapacity in the scheduler snapshot rather than fetching it from the informer in the plugin. (An additional benefit of this solution is that the entire scheduling cycle would use a consistent snapshot of the cluster, eliminating any potential races caused by a CSIStorageCapacity object being created mid-scheduling.)
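The inference step suggested here could look roughly like this toy sketch (assuming a node-local driver with a per-node topology key; names and field layout are invented):

```python
# Toy version of inferring a template node's CSIStorageCapacity objects:
# copy those of an existing node in the group and rewrite the topology
# to the fictional node's name. Caveat raised later in this thread:
# this copies *remaining* capacity, which may already be partially
# consumed on the source node.
import copy

def clone_capacities(existing, topo_key, source_node, template_node):
    clones = []
    for cap in existing:
        if cap["nodeTopology"].get(topo_key) != source_node:
            continue  # belongs to a different node
        clone = copy.deepcopy(cap)
        clone["nodeTopology"][topo_key] = template_node
        clones.append(clone)
    return clones

caps = [{"storageClassName": "local",
         "nodeTopology": {"topology.example.csi/node": "node-1"},
         "capacity_gib": 80}]
template_caps = clone_capacities(caps, "topology.example.csi/node",
                                 "node-1", "node-template")
```

The deep copy keeps the real objects untouched; only the in-memory clones handed to the simulated scheduler carry the fictional topology.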
We automatically infer it from existing nodes. Can we do the same for CSIStorageCapacity?
The information reported by the CSI driver for existing nodes is about remaining capacity. We would have to extend the CSI spec to report capacity as if no volumes had been created.
That may be doable, but then what about scale up from zero? That can't work that way.
We would have to extend the CSI spec to report capacity as if no volumes had been created
What if we had an initialCapacity field in CSIStorageCapacity, set it to be the same as the capacity when the object is first created, and never modified it later on? That'd require a bit of extra logic to set the field, but I don't think it requires changes in the CSI spec?
Agreed that scale-from-zero is more tricky. I don't see how we can solve it for the general case without some extra information coming from somewhere. It does feel very similar to custom resources and GPUs though, and for those we defer to cloud providers to handle the problem. I think a solution that gives more opportunity to either automate or otherwise expose an API consistent with the given provider would be preferred (e.g. custom resources are specified in ASG tags).
What if we had initialCapacity field in CSIStorageCapacity, set it to be the same as the capacity when the object is first created and never modify it later on?
That assumes that the driver gets to see the node in a "pristine" state and then nothing changes later on. But it isn't unusual for disks to fail or new ones to get attached at runtime. It also doesn't help when the driver gets uninstalled completely which, depending on the deployment approach, also clears the CSIStorageCapacity objects and then gets reinstalled while there are still some volumes. When the driver then comes up, the sidecar doesn't get the full capacity.
by adding a readiness check to the autoscaler that compares the existing
CSIStorageCapacity objects against the expected ones for the fictional node.
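The readiness check described in the quoted KEP text could be sketched like this (hypothetical helper with illustrative field names; the real check would evaluate label selectors via the API):

```python
# Sketch of the generic readiness check: a new node is storage-ready
# once a CSIStorageCapacity object has been published for every
# (driver, storage class) pair expected for its node group. Until then
# the autoscaler should not treat the node as usable, or it will scale
# up further because capacity checks fail on the unready node.

def node_storage_ready(expected, published, topo_key, node_name):
    have = {
        (c["driver"], c["storageClassName"])
        for c in published
        if c["nodeTopology"].get(topo_key) == node_name
    }
    return expected <= have  # set inclusion: every expected pair exists

expected = {("hostpath.csi.k8s.io", "fast")}
published = [{"driver": "hostpath.csi.k8s.io", "storageClassName": "fast",
              "nodeTopology": {"topology.hostpath.csi/node": "node-2"}}]
ready = node_storage_ready(expected, published,
                           "topology.hostpath.csi/node", "node-2")
```

As with the GPU readiness logic mentioned earlier, a timeout would still be needed in case the driver never publishes the expected objects.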
A proof-of-concept of this approach is available in
My issue is that it is very much a PoC. I can see how it can work for a simple test scenario, but I don't think it's a suitable final solution:
- The UX is extremely complicated, requiring multiple complex steps from cluster admin. Debugging this will be super hard for anyone not intimately familiar with CA internals.
- I don't think this approach scales to large clusters (see my comment about capacity above).
At a high level, it sounds like some of the questions have answers or that there were clear conclusions from SIG Storage and CSI meetings. But the KEP is not up to date, so it's hard to make a decision on whether the API is ready for GA.
As discussed during PR review, the section was focused on local storage without explicitly saying that. Network attached storage needs to be dealt with differently and has a different problem (scale from zero and lacking capacity information).
API review during the implementation led to some changes compared to what had been proposed in the KEP. The separation into spec and status was removed. Later, "maximum volume size" was added once it became available in the CSI spec.
I've updated the autoscaler section and copied the actual API description back into the KEP. I appreciate that you are taking the time to review the autoscaler proposal. Your comments about node-local vs. network attached made me aware that scale from zero is still problematic for network attached storage. On the other hand, lack of storage capacity tracking isn't such a big deal for network attached storage and if there is interest, solutions seem possible also for scale up from zero. I agree that usability of that proposal is poor. But I also think that we will need feedback from specific storage providers and users who are facing that problem before we can come up with something simpler and that this shouldn't hold back this KEP. 😓
That sounds like the API needs more time in the beta stage.
Only if there is an expectation that future changes cannot be done as additions to a GA API. I don't see that, quite the opposite. Simplifying the autoscaler configuration is something that can be done with additional information, perhaps even with an out-of-tree CRD. In that case no changes to the in-tree API will be needed at all. Holding the API in beta penalizes those users who don't need autoscaling. I don't think that this is the right choice.
In my ideal world, we would have "TotalCapacity" and "MaxVolumeSize" and then I wouldn't see a point of having "AvailableCapacity", so I wouldn't like having that field in a GA API. But maybe this is not realistic. Let me take another look at the updated PR on Monday.
Can you update all the "Drawbacks" subsections with whatever context was gathered from SIG Storage and CSI meetings?
"Total capacity" is pretty much off the table. The storage community seems to dislike the idea of pretending that storage is linear, and such a "total capacity" only makes sense under that assumption. One of the design decisions during API review was that the CSIStorageCapacity should mirror what CSI provides. The logic then is in the consumer of that information. From that perspective it makes sense to have both values, even if the scheduler only needs one.
Done. That whole section was partially outdated, thanks for bringing it up. I also reran the scale tests because I hadn't properly written up the results from the first time and because I wanted to make sure that the recent external-provisioner still works. See https://github.com/kubernetes-csi/csi-driver-host-path/blob/d6d9639077691986d676984827ea4dd7ee0c5cce/docs/storage-capacity-tracking.md for the key results.
/retitle KEP-1472: storage capacity: update drawbacks
/assign @msau42
For approval.
I'm not saying we should block GA on this, but I'd like to request that we amend this PR to describe the problems with autoscaler integration (primarily the fact that CA can't predict the result of scheduling multiple pods) and mention that future work will be required to address those issues.
"Lack of storage capacity modeling will cause the autoscaler to scale up clusters more slowly because it cannot determine in advance that multiple new nodes are needed." is already called out as a drawback under "lack of storage capacity modeling". That something else would need to be done to address that is implied. We could of course mention the need for future work explicitly, but that's a slippery slope towards outlining such a technical solution in a KEP that isn't about that solution. I'd prefer to keep the KEP as it is now, but don't feel strongly about it. I'll let @msau42 decide that. |
Maybe clarify that the autoscaler support PoC needs a lot of work to be easily configurable (we talked about a CA specific CRD that I don't see included in the KEP) and it doesn't address the problem of "lack of storage capacity modeling". And then you can explicitly say: "the CSI API needs work to improve this situation". |
I've added some more text about enhancements for CA. @alculquicondor can you re-add LGTM if you agree with the change?
/lgtm
To reiterate my point: the current wording reads to me like the problem of integration with the autoscaler is mostly solved, with maybe a few gotchas. I disagree with that. The main part of the problem (the autoscaler's ability to predict the result of scheduling all the pending pods) is not addressed. I don't see how we can solve it without large changes, likely involving both extending the CSI spec and, possibly, changes in the scheduler. I'm fine with the fact that those solutions are out of scope. But since you're adding a section about integration with Cluster Autoscaler, I think it should clearly list all the problems.
I agree. It's just that the way it's mentioned suggests it's a limitation of the PoC implementation and downplays its significance. I'd like to shift the emphasis to make it clearer that the problem is not solved, but there is a workaround implementation that may work for some cases (small scale, or the user doesn't care about latency).
@MaciekPytel: I'm not trying to downplay the limitations. I've pushed another commit with "the current understanding is that further work is needed" as the conclusion for the prototype. I'd also be happy to get suggestions for the right wording.
/retest
still LGTM, but I'll leave the official one to @MaciekPytel
Opinion: the intersection of autoscaling and storage capacity will require more than just tweaks to this API. I'm not convinced that blocking this will produce any fundamentally different API, but I also admit I don't have all the autoscaler context.
I think this decision belongs to @msau42
- When creating a fictional Node object from an existing Node in
  a node group, autoscaler must modify the topology labels of the CSI
  driver(s) in the cluster so that they define a new topology segment.
  For example, topology.hostpath.csi/node=aks-workerpool.* has to
it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler
IMO, this is starting to feel like a good thing. Pushing this onto users really doesn't feel good.
@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?
Sure, why not? It may not work in every case but it seems like it could work in some?
@@ -1189,9 +1189,54 @@ based on storage capacity:
to available storage and thus could run on a new node, the
simulation may decide otherwise.
In terms of API stability, we cannot change the interpretation of existing fields. Any new semantic we want in the future would have to be a new field.
Not saying it's the way to go, but a new apiversion could repurpose a field name as long as the round-trip logic converts, right?
that logic is implemented inside the CSI driver. The CSI driver doesn't know
about hardware that hasn't been provisioned yet and doesn't know about
autoscaling.
I think the concern here is signing off on this as an agreed-upon solution. Instead, let's word this to indicate that further design and investigation is still needed to solve these problems:
- How a CSI driver can report capacity on a pristine, simulated node
- How the autoscaler can provide a CSI driver with the storage information it needs to be able to model capacity based on the environment it's modeling.
- How to improve the batch scheduling use case
And then park the existing discussion under "Alternatives" (don't want to lose the great discussion that's happening here).
How the autoscaler can provide a CSI driver with the storage information it needs to be able to model capacity based on the environment it's modeling.
Isn't it meant to be the other way around: the CSI driver provides information to the autoscaler and the autoscaler does the modeling?
And then park the existing discussion under "Alternatives" (don't want to lose the great discussion that's happening here).
I've pushed an update that describes the problem under "drawbacks" and states that further work is needed, with the description of the current prototype moved under "alternatives".
Isn't it meant to be the other way around: the CSI driver provides information to the autoscaler and the autoscaler does the modeling?
I think there may need to be information passed both ways. In order for a CSI driver to estimate a capacity without being able to actually inspect a node, it's going to need information about the future node, such as how many disks are going to be there and their capacities.
This makes it clear that the prototype is not the suggested solution.
/lgtm
The existing autoscaler ideas are parked under "alternatives" for future discussion. For now, the KEP notes that further investigation is needed, and autoscaler integration will be tracked in a separate future KEP.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: msau42, pohly
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
One-line PR description: document autoscaler support
Issue link: Storage Capacity Tracking #1472
Other comments: this is follow-up for KEP-1472: storage capacity tracking: GA #3229 (comment)