KEP-1472: storage capacity: update drawbacks #3233
---
You should document here what kube-scheduler does in the absence/presence of the field, and how it relates to `maximumVolumeSize`.
But... I think we can discuss this in the code PR.
---
It's easier to discuss here because then I only need to change one file vs. many.
I added a section above about the behavior of kube-scheduler for the entire `CSIStorageCapacity` struct. That seemed like a more natural place to discuss the relationship between the different fields and how kube-scheduler handles an object. Are you saying that I should attach that to each field?
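For readers of this thread, here is an abbreviated Go paraphrase of the `CSIStorageCapacity` type under discussion (storage.k8s.io/v1beta1 at the time); the generated API reference is the authoritative version, and `ObjectMeta` plus some details are omitted:

```go
package storagev1beta1

import (
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CSIStorageCapacity, abbreviated. One object describes the capacity of
// one storage class in one topology segment.
type CSIStorageCapacity struct {
	// NodeTopology selects the topology segment this object describes.
	NodeTopology *metav1.LabelSelector

	// StorageClassName is the storage class this capacity applies to.
	StorageClassName string

	// Capacity is the loosely defined value reported by the CSI driver
	// in GetCapacityResponse.available_capacity.
	Capacity *resource.Quantity

	// MaximumVolumeSize is the largest size an individual volume may
	// currently have; kube-scheduler prefers it over Capacity when set.
	MaximumVolumeSize *resource.Quantity
}
```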
---
Ah, missed that. I think that is also part of the generated documentation in https://kubernetes.io/docs/reference/kubernetes-api/, so it should be fine.
---
Can you somehow quantify this in the KEP?
Could we consider that the conclusion of the testing?
Although:
That doesn't seem like a particularly difficult scenario. kube-scheduler wouldn't schedule more than 100 pods per node anyway, plus it spreads pods by default, so I wouldn't expect that many retries.
Can you try something like `--capacity=fast=20Gi`?
---
What conclusion? That scheduling pods with volumes is slower than scheduling pods without volumes by a factor of 1.58? That depends on so many other factors that I am not comfortable putting it into the KEP.
Also note that csi-driver-host-path has essentially no internal overhead for volumes. That is not realistic and (in this experiment) intentionally stresses the control plane and sidecar more than other drivers would.
Scheduling 10000 pods (100 per node, 100 nodes) with 10000 volumes is not difficult? That sounds like a lot of volumes to me.
And which volume size? If I keep it at 1Gi then I also need to reduce the number of pods, otherwise not all of them can be started. That will make the problem simpler, not harder.
I'm not sure what you are trying to achieve here.
---
After thinking about this I think I understand what you mean: your theory is that the 100 pods per node limit influences the outcome and therefore you want to test with 20 pods per node. Makes sense. It doesn't matter whether we reduce the capacity per node or increase the volume size, so I'll test with 5Gi volumes (simpler to configure).
Could it be that the Azure cloud is configured to allow more than 100 pods per node? If there really is such a limit, scheduling without storage capacity should have worked, but it didn't: retries were high for 10 nodes and the test timed out for 100 nodes.
---
Yes, I want to test the case where storage allows significantly fewer pods than the actual pod count limit.
I think the hard limit is 110 pods per node, unless they have a patched kubelet?
---
So with 100 pods per node I was still below that limit. Then the test timeout for 100 nodes without storage capacity tracking makes sense, because the scheduler may have tried to put more than 100 pods onto a single node.
I ran tests with 10 nodes and 20 pods per node. Now the setup without storage capacity tracking failed. The baseline was 29s; with volumes and capacity tracking it was 49s, a factor of 1.69, similar to what was observed before. See kubernetes-csi/csi-driver-host-path#351 for details.
---
Thanks.
The numbers look acceptable. Maybe a bigger, busier cluster could have more issues, but that's harder to prove.
The real problem will be when we need to scale up. I hope you can still bring this discussion back to CSI.
/lgtm
---
I don't think network-attached storage should be much of a problem.
When we are scaling up a nodepool or a nodegroup, we know which topology it will end up in, so we can query the `CSIStorageCapacity` that matches.
---
For scale up from zero, network-attached storage has a similar problem: because the CSI driver hasn't provided topology information yet for the non-existent nodes, the cluster admin must set labels manually in the configuration of the node pool.
---
I guess this is for the case where the nodepool is a new topology domain.
I think it should be the responsibility of the CSI driver to set that up even if no nodes exist. It's network-attached after all. It has to come from somewhere, and the driver should know about it. No?
---
But how does the CSI driver know that the information is needed? Suppose the storage backend has many different topology segments, of which none is currently used by a specific Kubernetes cluster. Should the CSI driver report about all of them?
There's also a practical problem here: the CSI spec doesn't have support for listing topology segments for nodes where no CSI driver is running. If we want something like this, we need a new API.
---
Sure, why not? It may not work in every case but it seems like it could work in some?
---
Scalability and performance issues? It'll depend on the number of those additional segments, of course.
Yes. I just wanted to point out some of the details (conceptual and practical) that'll have to be addressed for this to work. I agree that it can work.
---
A separate issue is that Cluster Autoscaler simulates scheduling of all the pods (binpacking) before making any autoscaling decisions. The way that capacity is defined here (and the way kubernetes/autoscaler#3887 uses the same `CSIStorageCapacity` to represent the local storage of potentially hundreds of nodes) means, I think, that CA will not be able to calculate how many nodes are needed in any use case where available storage is the limiting factor for scheduling.
For example: assume I have 1000 pending pods that have relatively low cpu/memory requests, but each requires all the local storage of a node. CA would think all those pods can fit on just a few nodes (on a single node if the cpu/memory requests are low enough) and only add that many nodes. Admittedly, once the nodes are created some pods will be scheduled, CA will see that the remaining pods are unschedulable again, and trigger another (equally small) scale-up. So theoretically the situation will eventually recover. In practice the latency can be arbitrarily long, making the feature completely unusable in sufficiently large clusters. A toy illustration follows below.
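A minimal, invented model of that underestimate (this is not CA's actual binpacking code; all numbers are made up):

```go
package main

import "fmt"

// If reported storage capacity is treated as a per-volume maximum instead
// of a budget that simulated volumes draw down, the node-count estimate
// is driven by cpu/memory alone.
func main() {
	const (
		pendingPods     = 1000
		podCPUMillis    = 100
		nodeCPUMillis   = 4000 // 40 pods per node by cpu alone
		volumePerPodGiB = 100
		nodeStorageGiB  = 100 // each pod consumes a whole node's storage
	)

	// Estimate when storage is never decremented during simulation.
	byCPU := (pendingPods*podCPUMillis + nodeCPUMillis - 1) / nodeCPUMillis

	// Estimate when storage is the real limiting factor: one pod per node.
	byStorage := pendingPods * volumePerPodGiB / nodeStorageGiB

	fmt.Printf("simulated scale-up: %d nodes; actually needed: %d nodes\n",
		byCPU, byStorage)
}
```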
---
I understand that's how the CSI specification works and we may not be able to change it, but it seems that the capacity specification isn't strictly defined. We could say that for the purposes of autoscaling we treat it as the sum of all volumes that can be created. That is not a perfect prediction, as it doesn't take fragmentation etc. into account, but it seems much more realistic than assuming we can create an infinite number of volumes and the only limitation is on the size of each individual volume.
We could further explicitly say that any driver for which this assumption is too far off is incompatible with autoscaling. Given how popular autoscaling is in cloud environments, this can hopefully create some incentive for driver authors to interpret capacity that way.
---
My concern is that the API is even problematic for kube-scheduler itself. On a sudden scale-up, multiple pods will end up failing; then we need to wait for them to fail before creating replacement pods. If you are using the Job API, you could hit the `backoffLimit` and the Job would fail. Similarly, other third-party APIs (Argo, Airflow) are less resilient to failures.
I agree that such an interpretation would be better. Even if it's not a perfect prediction, it would at least generate fewer failed pods.
As the API stands today, I don't think I'm confident graduating it to GA, especially if changing the interpretation of a field is under consideration. Maybe we instead need to add a new field for the total capacity that the scheduler needs to respect in combination with the existing `.status.availableCapacity`. If the field is not set, it means that the capacity is unlimited, and we can assume that drivers who choose to leave it empty understand the consequences.
Do you think adding that field is possible @pohly? Do you think that's better for scheduling/autoscaling @MaciekPytel?
If you agree, we should go with a v1beta2.
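Purely as illustration, the floated field might look something like this; the name `TotalCapacity` is invented here and nothing like it was agreed on:

```go
// Hypothetical v1beta2 sketch; TotalCapacity is an invented name for the
// "total capacity" field discussed above and is not part of any
// Kubernetes API.
type CSIStorageCapacityV1beta2 struct {
	// ...the existing v1beta1 fields...

	// TotalCapacity would report the capacity as if no volumes had been
	// provisioned. Leaving it unset would mean "unlimited", and drivers
	// that do so are assumed to understand the consequences.
	TotalCapacity *resource.Quantity
}
```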
---
FYI @liggitt @roycaihw (API reviewers)
---
That's why I added a more strictly defined "maximum volume size" to the CSI spec. That is what Kubernetes uses if available, instead of the more loosely defined "capacity".
I don't think that making assumptions about how CSI drivers handle storage capacity is going to help. We've discussed that at length in SIG Storage and in CSI meetings, and the consensus there was that there are always some vendor-specific exceptions that cannot be handled in a generic way.
Failing how? They won't get scheduled unless their volumes were successfully created. Until then they remain in the queue and will eventually land on some node.
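To make that precedence concrete, a rough sketch (this is not the actual scheduler source, just the rule described above applied to the abbreviated struct from earlier):

```go
// volumeFits checks one topology segment: use the strictly defined
// MaximumVolumeSize when the driver reports it, otherwise fall back to
// the loosely defined Capacity.
func volumeFits(requested resource.Quantity, c *CSIStorageCapacity) bool {
	if c.MaximumVolumeSize != nil {
		return requested.Cmp(*c.MaximumVolumeSize) <= 0
	}
	return c.Capacity != nil && requested.Cmp(*c.Capacity) <= 0
}
```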
---
I don't think we need to go that far. Both existing fields are intentionally marked as optional. We can always add new ones and stop populating the old ones if we find that this will work better. Nothing in the current discussion indicates that there'll be a need for that, though, so I am confident that this API is good for GA.
---
I had an idea how we can support simulated volume provisioning in the autoscaler without (at least initially) extending this API here or the CSI spec:

- the CSI driver must implement `GetCapacityResponse.available_capacity` and support `maximum_volume_size`
- for a simulated new node, the `MaximumVolumeSize` field gets dropped: what remains is the total capacity without any volumes allocated (see the sketch below)

There are some implementation challenges, for example the volume binder code that sets "selected node" might not get called at the moment by the autoscaler. The in-tree volume binder code must be made a bit more flexible to accept stores instead of creating informers itself, but other than that it doesn't need any special logic. It's also not a generic solution, so we must have a good understanding of which CSI drivers are going to use this and check in advance that they fit the assumptions.
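A minimal sketch of that second step, assuming the generated `DeepCopy` of the real API type:

```go
// capacityForSimulatedNode derives the object for a pristine simulated
// node from an existing node's object by dropping MaximumVolumeSize;
// the remaining Capacity is then read as "no volumes allocated yet".
func capacityForSimulatedNode(existing *CSIStorageCapacity) *CSIStorageCapacity {
	c := existing.DeepCopy()
	c.MaximumVolumeSize = nil
	return c
}
```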
---
If `MaximumVolumeSize` is simply removed, Cluster Autoscaler may decide to provision a node even though a workload can't get scheduled there because its request exceeds the max size (while being less than the total capacity). Proper CA support needs to somehow get this information.
Also, this doesn't really help in the scale-from-0 scenario. CA needs to somehow know what new nodes will look like, without an existing sample node to clone.
---
I probably shouldn't have posted that idea because we are now back to discussing a solution that doesn't belong in this KEP 😅 Let's continue anyway, but let's be aware that this isn't blocking the merging of this PR.
The entire approach of modeling volume provisioning in the autoscaler has to make assumptions about the storage system. The key assumption here is "storage is linear" in which case "maximum volume size" becomes irrelevant because any volume whose size is smaller than the remaining capacity can get created. If that isn't close enough to reality, a more complex model will be needed which then might include some information about "volumes have a minimum size, a maximum size, size must be aligned with X GB, etc.".
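The "storage is linear" assumption, as a minimal sketch:

```go
// linearPool models linear storage: any volume no larger than the
// remaining capacity can be created, so a separate per-volume maximum
// size adds no information.
type linearPool struct{ remainingBytes int64 }

func (p *linearPool) provision(sizeBytes int64) bool {
	if sizeBytes > p.remainingBytes {
		return false
	}
	p.remainingBytes -= sizeBytes
	return true
}
```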
The CSI driver for local storage cannot help with scale-from-0 because it isn't running. The cloud provider or some other central component would have to be extended to provide information for such a scenario. That's a different approach that gives up flexibility (cannot use arbitrary CSI drivers) in favor of scale-from-0 support.
---
Sorry for starting a solution discussion here, happy to take it elsewhere 🙂
Yes, if we want proper autoscaling support, CA has to use a more complex model, which will make sense for most/all storage implementations out there.
I think this could work through a separate API: different driver extensions would create different objects for CA to use for resource predictions.
---
I think the concern here is signing off on this as an agreed-upon solution. Instead, let's word this to indicate that further design and investigation are still needed to solve these problems.
And then park the existing discussion under "Alternatives" (I don't want to lose the great discussion that's happening here).
---
Isn't it meant to be the other way around: the CSI driver provides information to the autoscaler and the autoscaler does the modeling?
I've pushed an update that under "drawbacks" describes the problem and that further work is needed, then under "alternatives" the description of the current prototype.
---
I think there may need to be information passed both ways. In order for a CSI driver to estimate a capacity without being able to actually inspect a node, it's going to need information about the future node, such as how many disks are going to be there and their capacities.
---
Kubernetes is already notoriously hard to use, and by pushing our implementation challenges onto cluster admins we're only making it that much worse. We should instead think about how we can change our implementation to make it as easy as we can.
We have already solved the exact same issue for the node object itself: the autoscaler doesn't require the user to create a "template node" object manually based on observation; we automatically infer it from existing nodes. Can we do the same for `CSIStorageCapacity`? We could identify all `CSIStorageCapacity` objects for an existing node in a node group and copy them.
Of course, the issue is passing those in-memory objects to the plugin. But that can be solved, for example by refactoring the scheduler code to include `CSIStorageCapacity` in the scheduler snapshot rather than fetching it from an informer in the plugin. We already do exactly that for Node objects (an additional benefit of this solution is that the entire scheduling cycle would use a consistent snapshot of the cluster, eliminating any potential races caused by a `CSIStorageCapacity` object being created mid-scheduling).
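An invented sketch of what that snapshot might look like (the type and field names here are made up for illustration, not the scheduler's actual snapshot):

```go
// clusterSnapshot gives one scheduling cycle a consistent view instead
// of per-plugin informer lookups mid-cycle.
type clusterSnapshot struct {
	nodes      []*v1.Node            // k8s.io/api/core/v1
	capacities []*CSIStorageCapacity // captured once per cycle
}

// capacitiesFor returns the snapshot's objects for one storage class.
func (s *clusterSnapshot) capacitiesFor(storageClass string) []*CSIStorageCapacity {
	var out []*CSIStorageCapacity
	for _, c := range s.capacities {
		if c.StorageClassName == storageClass {
			out = append(out, c)
		}
	}
	return out
}
```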
---
The information reported by the CSI driver for existing nodes is about remaining capacity. We would have to extend the CSI spec to report capacity as if no volumes had been created.
That may be doable, but then what about scale up from zero? That can't work that way.
---
What if we had an `initialCapacity` field in `CSIStorageCapacity`, set it to be the same as the capacity when the object is first created, and never modified it later on? That'd require a bit of extra logic to set the field, but I don't think it requires changes in the CSI spec?
Agreed that scale-from-zero is more tricky. I don't see how we can solve it for the general case without some extra information coming from somewhere. It does feel very similar to custom resources and GPUs, though, and for those we defer to cloud providers to handle the problem. I think a solution that gives more opportunity to either automate or otherwise expose an API consistent with the given provider would be preferred (e.g. custom resources are specified in ASG tags).
---
That assumes that the driver gets to see the node in a "pristine" state and that nothing changes later on. But it isn't unusual for disks to fail or for new ones to get attached at runtime. It also doesn't help when the driver gets uninstalled completely (which, depending on the deployment approach, also clears the CSIStorageCapacity objects) and then gets reinstalled while there are still some volumes. When the driver then comes up, the sidecar doesn't get the full capacity.
---
Is this talking about local storage? Can you clarify?
---
I was mostly thinking of local storage here. You are right, that needs to be called out here because for network-attached storage the topology labels don't need to be modified. I'll revise.
---
Would we copy the available capacity from another node? What if that node's capacity is partially used?
---
That's why we cannot copy existing CSIStorageCapacity objects.
---
What if there was a CSIStorageCapacityClass object that defines which topologies would get a new CSIStorageCapacity and how much capacity they have?
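Purely hypothetically, such an object might look like this (no such type exists; all names are invented):

```go
// CSIStorageCapacityClass sketch: declares which topology segments
// should get a CSIStorageCapacity object before any node exists there,
// and with how much capacity.
type CSIStorageCapacityClass struct {
	// NodeTopology selects the topology segments to pre-populate.
	NodeTopology *metav1.LabelSelector

	// StorageClassName is the storage class the capacity applies to.
	StorageClassName string

	// Capacity that each matching new segment is assumed to offer.
	Capacity resource.Quantity
}
```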
---
I had considered whether additional information in CSIDriver could be used to avoid this manual configuration step. In the end I decided against it because it would imply that CSI driver developers need to start doing something specifically for the Kubernetes autoscaler.
I'm not against doing something like this if there is sufficient demand. In that case we should create a new KEP about "auto-configuration of capacity handling for autoscaling" (or whatever we want to call it) and in that KEP introduce new alpha-level functionality.
This is similar to how other features get improved over time: first the base functionality gets introduced, then improvements are added later. I see this KEP as the base functionality that is sufficient for several use cases and therefore worth graduating to GA.
---
I agree: this could be a resource specific for CA (it could even be a CRD) and, yes, that would be a candidate for a follow up KEP.
I guess we can focus on the main problem: how to predict how many pods can fit in a node :)
---
I think that's better than having every user figure out independently how each CSIDriver behaves and feed that information to CA. Hopefully Kubernetes is big enough to create an incentive for CSI driver developers.
+1
---
IMO, this is starting to feel like a good thing. Pushing this onto users really doesn't feel good.
---
I don't disagree. But as I said elsewhere, we then need some buy-in from a CSI driver vendor who wants to support this. Otherwise such a new KEP will be stuck at the "get feedback from the field" promotion criteria.
---
If there were two pending pods, then given the semantics of available capacity, the autoscaler might think that they both fit on the same node when in reality they don't. One of the pods would succeed, and the second one would likely fail once it reaches kubelet. This would really hurt scale-up end-to-end scheduling time.
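A toy illustration of that double counting, with invented numbers (one node reporting 10Gi, two pods each wanting a 6Gi volume):

```go
package main

import "fmt"

func main() {
	const availableGiB = int64(10) // reported for the simulated node
	requests := []int64{6, 6}      // one volume per pending pod

	fits, sum := 0, int64(0)
	for _, r := range requests {
		if r <= availableGiB { // per-pod check; nothing is decremented
			fits++
		}
		sum += r
	}
	fmt.Printf("binpacking accepts %d pods, but they need %dGi of %dGi\n",
		fits, sum, availableGiB)
}
```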
---
This "will fail" seems to be a misunderstanding, because the second Pod will not reach kubelet. I agree that autoscaler will not make the right decision immediately (create two new nodes instead of one), but it also will not make a wrong decision and eventually the cluster grows sufficiently.
---
Agreed that it will reach the correct size eventually and any "failure" will be in the scheduler, not kubelet. That being said, the problem with this argument is that "eventually" may take a really long time (adding tens or hundreds of nodes one at a time), and for the vast majority of use cases autoscaling is only useful if you can get the nodes when you need them. In my experience scale-up/down latency is the single most important requirement for autoscaling, and a solution that completely ignores latency isn't really a solution for large-scale systems.
---
I guess we could have a timeout similar to the one we have for readiness when there are ignored taints, right @MaciekPytel?
---
Yeah, and also for GPUs. As long as we can predict the CSIStorageCapacity objects the node will eventually have (which we need to do anyway), our existing solution can be easily adapted for this.
---
The approach suggested here is indeed based on the solution for GPUs.
---
My issue is that it is very much a PoC. I can see how it could work for a simple test scenario, but I don't think it's a suitable final solution.
---
What about the other drawbacks listed, like modeling of capacity usage, fragmentation, and prioritization? Was there any progress on them? At the very least, were you able to quantify how bad the situation is in a busy cluster?
---
For PMEM-CSI I ran tests that completely filled up a cluster with Pods that have volumes. That worked well in my experience.
---
Can you summarize those results in the KEP?
---
The test with csi-driver-host-path is more suitable for the KEP because it's easier to reproduce and scale up. I can summarize the results from that in the KEP.
---
I think the summary of kubernetes-csi/csi-driver-host-path#350 in the revised KEP is sufficient. I just need to update the link once that PR is merged. Whoever wants specific numbers can then look up the report.