DefaultPodTopologySpread graduation to Beta #2011
Conversation
/assign @Huang-Wei @ahg-g
Will request PRR after sig approval
/hold
@@ -249,6 +258,9 @@ To ensure this feature to be rolled out in high quality. Following tests are mandatory:
- **Integration Tests**: One integration test for the default rules and one for custom rules.
- **Benchmark Tests**: A benchmark test that compares the default rules against `SelectorSpreadingPriority`.
  The performance should be as close as possible.
  [Beta] There should not be any significant degradation in the kubemark benchmark for vanilla workloads.
- **E2E/Conformance Tests**: Test "Multi-AZ Clusters should spread the pods of a {replication controller, service} across zones" should pass.
  This test is currently broken in 5k nodes.
Is there an issue/link we can share here?
I don't think this test runs in OSS, but GKE runs it internally, there is no public link.
I think we don't run multi-AZ tests in OSS. I did a check in https://k8s-testgrid.appspot.com/ and I couldn't find it.
In OSS we run tests in single-zone clusters only. It's on the roadmap to run nodes across multiple zone (hopefully should happen soon).
* **Does enabling the feature change any default behavior?**

  Yes, users might experience more spreading of Pods among Nodes and Zones in certain topology distributions.
clarify that more spreading will be more noticeable in large clusters (over 100 nodes).
Done
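For reference, the kind of spreading the default behavior provides can also be requested explicitly on an individual Pod via `topologySpreadConstraints`; the feature discussed here applies comparable soft constraints cluster-wide with no per-Pod configuration. A minimal sketch (field values are illustrative, not the KEP's defaults):

```yaml
# Illustrative Pod requesting zone spreading explicitly.
# whenUnsatisfiable: ScheduleAnyway makes this a soft (scoring)
# constraint rather than a hard filter.
apiVersion: v1
kind: Pod
metadata:
  name: example
  labels:
    app: web
spec:
  topologySpreadConstraints:
  - maxSkew: 1                               # max allowed Pod-count imbalance between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
  containers:
  - name: web
    image: nginx
```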
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

  For 100 nodes:
mention minimum master size (I think 4 cores)
Done
- Mitigations: Disable the Feature Gate DefaultPodTopologySpread in kube-scheduler.
- Diagnostics: N/A.
- Testing: There are performance dashboards.
- Node utilization is low when using cluster-autoscaler.
can you explain how this is related? seems too indirect to mention here.
I would replace this with: "pods of a service/replicaset/statefulset are not properly spread across nodes/zones"; this can be detected by observing the spread of the pods of a service across nodes.
Fair. Replaced.
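The mitigation discussed above (disabling the feature gate) comes down to one kube-scheduler flag. As a sketch, assuming a kubeadm-style static Pod deployment (the manifest path and surrounding fields are assumptions; only the `--feature-gates` flag is the point):

```yaml
# Fragment of an assumed kube-scheduler static Pod manifest
# (commonly /etc/kubernetes/manifests/kube-scheduler.yaml on kubeadm clusters).
spec:
  containers:
  - name: kube-scheduler
    command:
    - kube-scheduler
    - --feature-gates=DefaultPodTopologySpread=false  # fall back to legacy SelectorSpread behavior
```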
/lgtm
/assign @wojtek-t
* **Does enabling the feature change any default behavior?**

  Yes. Users might experience more spreading of Pods among Nodes and Zones in certain topology distributions.
What are the default spreading params? How do we group pods for spreading by default?
Also - is the new default spreading preferred or forced (i.e. predicate or priority)?
It's documented in the KEP already #default-constraints and it continues to be scoring (priority)
Got it - can you please cross-link here?
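As a rough sketch of what the KEP's #default-constraints section describes (treat the `maxSkew` values and API version here as illustrative; the KEP section is authoritative), the defaults are expressed as scheduler plugin configuration:

```yaml
# Sketch of cluster-level default constraints for the PodTopologySpread plugin.
# whenUnsatisfiable: ScheduleAnyway keeps this a scoring (priority) signal,
# never a hard filter, matching the comment above.
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: PodTopologySpread
    args:
      defaultConstraints:
      - maxSkew: 3
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
      - maxSkew: 5
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
```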
For 100 nodes, with a 4-core master:

- Latency for PreScore less than 15ms for the 99th percentile.
- Latency for Score less than 50ms for the 99th percentile.
Sounds like quite a lot, isn't it?
I agree, we don't need to break them down I guess, but the total overhead of scoring (pre and actual scoring) should be 15ms at the 99th percentile and 2ms at the 50th percentile.
I used the numbers from our current performance dashboards. I guess we can include 95th percentile as well. WDYT?
What are the 95th percentiles?
Updated
Scheduling time on clusters with more than 100 nodes. Smaller clusters are unaffected.
`SelectorSpreading` doesn't take into account all the Nodes in big clusters when calculating skew,
resulting in partial spreading at this scale.
On the contrary, `PodTopologySpreading` considers all nodes when using topologies bigger than a Node, like a Zone.
Do we have any numbers?
In synthetic unit level benchmarks, difference is negligible. But we will have more precise numbers from dashboards when we enable the feature by default (beta).
Re benchmarks: even in 5k clusters? Are those using the same config we will have in real clusters?
The benchmark is actually a bit outdated, but its CPU footprint is the same. And the benchmark is for 1000 nodes.
https://github.com/kubernetes/kubernetes/blob/6e3ef0be163566c08398e1e5ff43f87accfb036b/pkg/scheduler/framework/plugins/podtopologyspread/scoring_test.go#L754
This is the legacy counterpart https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/selectorspread/selector_spread_perf_test.go
OK - that's fine.
Can you please add a note to ensure to validate it during graduation?
Something like:
"Before graduation we will ensure that the latency increase is acceptable with Scalability SIG"
(or sth along those lines).
Done
* **Will enabling / using this feature result in non-negligible increase of
  resource usage (CPU, RAM, disk, IO, ...) in any components?**

  kube-scheduler needs to use more CPU to calculate Zone spreading.
Any numbers?
Not sure what to use. Is wall time from synthetic benchmarks enough?
If we don't have anything else - that's at least something.
Updated.
* **Are there any tests for feature enablement/disablement?**

  There are unit tests in `pkg/scheduler/algorithmprovider/registry_test.go` that exercise the configuration of `kube-scheduler` with the plugins that correspond to the Feature Gate enablement.
They are useful, but these aren't purely enablement/disablement - they are checking the configuration.
The feature enablement/disablement leads to a different configuration.
Let me know if the new wording is more clear, or you are asking for other tests.
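The point that enablement/disablement manifests as a different scheduler configuration can be illustrated with a sketch of the resulting profile (plugin names follow the scheduler framework; treat the fragment as an assumption about the shape, not an exact dump of the tested config):

```yaml
# With the DefaultPodTopologySpread gate enabled, the default plugin set
# scores with PodTopologySpread instead of the legacy SelectorSpread plugin;
# the unit tests referenced above assert the configuration produced for
# each gate setting.
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- plugins:
    score:
      enabled:
      - name: PodTopologySpread
      disabled:
      - name: SelectorSpread
```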
* **What are other known failure modes?**

  - Pod scheduling is slow
    - Detection: Pod creation latency is too high.
s/creation/startup time/ ?
Or scheduling?
startup time. That is the first symptom, and then you narrow down from there :)
ping @wojtek-t
One last minor comment - other than that LGTM.
Squashed. PRR is done: #2011 (review)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, alculquicondor. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/hold cancel
/sig scheduling
Refs #1258
Includes: