
DefaultPodTopologySpread graduation to Beta #2011

Merged: 1 commit merged into kubernetes:master from the default-spread branch on Oct 2, 2020

Conversation

alculquicondor (Member Author):

/sig scheduling

Refs #1258

Includes:

  • Updated API to account for disabling defaults (see the configuration sketch below)
  • Pending implementation details
  • Production Readiness Questionnaire
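
A minimal sketch of where those defaults live: the `PodTopologySpread` plugin args in a v1beta1 `KubeSchedulerConfiguration`. The values below are illustrative, and the exact knob for opting out of the built-in defaults is the API change this update describes, so treat this as a sketch rather than the final shape:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          # An explicitly configured list replaces the built-in defaults that the
          # DefaultPodTopologySpread feature gate would otherwise apply; see the
          # KEP for the exact opt-out semantics.
          defaultConstraints:
            - maxSkew: 3
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
```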

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 23, 2020
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Sep 23, 2020

alculquicondor (Member Author) commented Sep 23, 2020:

/assign @Huang-Wei @ahg-g

Will request PRR after sig approval

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2020
@alculquicondor alculquicondor force-pushed the default-spread branch 2 times, most recently from a19a80a to a41ef5c on September 23, 2020 22:01
@@ -249,6 +258,9 @@ To ensure this feature to be rolled out in high quality. Following tests are man
- **Integration Tests**: One integration test for the default rules and one for custom rules.
- **Benchmark Tests**: A benchmark test that compares the default rules against `SelectorSpreadingPriority`.
The performance should be as close as possible.
[Beta] There should not be any significant degradation in the kubemark benchmark for vanilla workloads.
- **E2E/Conformance Tests**: Test "Multi-AZ Clusters should spread the pods of a {replication controller, service} across zones" should pass.
This test is currently broken in 5k nodes.

Member:
Is there an issue/link we can share here?

Member:
I don't think this test runs in OSS, but GKE runs it internally, there is no public link.

alculquicondor (Member Author):
I think we don't run multi-AZ tests in OSS. I did a check in https://k8s-testgrid.appspot.com/ and I couldn't find it.

Member:
In OSS we run tests in single-zone clusters only. It's on the roadmap to run nodes across multiple zones (hopefully that will happen soon).

* **Does enabling the feature change any default behavior?**

Yes, users might experience more spreading of Pods among Nodes and Zones in certain topology distributions.

Member:
clarify that more spreading will be more noticeable in large clusters (over 100 nodes).

alculquicondor (Member Author):
Done
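
One nuance worth keeping next to this answer: the cluster-level defaults only apply to Pods that do not set `topologySpreadConstraints` themselves; a Pod that declares its own constraints keeps exactly what it asked for. A minimal sketch (the pod name, labels, and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-0            # illustrative
  labels:
    app: web             # illustrative
spec:
  # Because this Pod sets its own constraints, the scheduler's default
  # spreading constraints are not applied to it.
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web
  containers:
    - name: web
      image: k8s.gcr.io/pause:3.2   # placeholder image
```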


* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

For 100 nodes:

Member:
mention minimum master size (I think 4 cores)

alculquicondor (Member Author):
Done

- Mitigations: Disable the `DefaultPodTopologySpread` feature gate in kube-scheduler.
- Diagnostics: N/A.
- Testing: There are performance dashboards.
- Node utilization is low when using cluster-autoscaler.

Member:
can you explain how this is related? seems too indirect to mention here.

I would replace this with: "pods of a service/replicaset/statefulset are not properly spread across nodes/zones"; this can be detected by observing the spread of the pods of a service across nodes.

alculquicondor (Member Author):
Fair. Replaced.
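
The mitigation quoted above relies on the `DefaultPodTopologySpread` feature gate. A minimal sketch of turning it off on kube-scheduler via its `--feature-gates` flag, shown here as a kubeadm-style static Pod manifest (the file path, image tag, and surrounding fields are illustrative assumptions):

```yaml
# /etc/kubernetes/manifests/kube-scheduler.yaml (excerpt; path assumes a kubeadm cluster)
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
    - name: kube-scheduler
      image: k8s.gcr.io/kube-scheduler:v1.20.0   # illustrative version
      command:
        - kube-scheduler
        # Turn off the default spreading behavior while keeping the scheduler running.
        - --feature-gates=DefaultPodTopologySpread=false
```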

ahg-g (Member) commented Sep 24, 2020:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2020

alculquicondor (Member Author):

/assign @wojtek-t

* **Does enabling the feature change any default behavior?**

Yes. Users might experience more spreading of Pods among Nodes and Zones in certain topology distributions.

Member:
What are the default spreading params? How do we group pods for spreading by default?

Member:
Also - is the new default spreading preferred or forced (i.e. predicate or priority)?

alculquicondor (Member Author):
It's already documented in the KEP (#default-constraints), and it continues to be scoring (priority).

Member:
Got it - can you please cross-link here?
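
For readers following the cross-link: the defaults in the KEP's #default-constraints section spread pods across nodes and zones with soft (`ScheduleAnyway`) constraints, grouping pods by the selectors of their owning Service, ReplicationController, ReplicaSet, or StatefulSet, similarly to `SelectorSpread`. A sketch of the shape (the maxSkew values here are illustrative; the authoritative ones are in the KEP):

```yaml
# Cluster-level defaults applied by the PodTopologySpread plugin when the
# DefaultPodTopologySpread feature gate is enabled (sketch; see the KEP for exact values).
defaultConstraints:
  - maxSkew: 3
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
```

Because both entries use `ScheduleAnyway`, the behavior remains a scoring signal (priority) rather than a hard filter (predicate).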

For 100 nodes, with a 4-core master:

- Latency for PreScore less than 15ms at the 99th percentile.
- Latency for Score less than 50ms at the 99th percentile.

Member:
Sounds like quite a lot, isn't it?

ahg-g (Member) commented Sep 25, 2020:
I agree, we don't need to break them down I guess, but the total overhead of scoring (pre and actual scoring) should be 15ms at the 99th percentile and 2ms at the 50th percentile.

alculquicondor (Member Author):
I used the numbers from our current performance dashboards. I guess we can include 95th percentile as well. WDYT?

Member:
What are the 95th percentiles?

alculquicondor (Member Author):
Updated

Scheduling time on clusters with more than 100 nodes. Smaller clusters are unaffected.
`SelectorSpreading` doesn't take into account all the Nodes in big clusters when calculating skew,
resulting in partial spreading at this scale.
In contrast, `PodTopologySpreading` considers all Nodes when using topologies bigger than a Node, such as a Zone.

Member:
Do we have any numbers?

alculquicondor (Member Author):
In synthetic unit-level benchmarks, the difference is negligible. But we will have more precise numbers from dashboards when we enable the feature by default (beta).

Member:
Re benchmarks: even in 5k clusters? Are those using the same config we will have in real clusters?

Member:
OK - that's fine.

Can you please add a note to ensure to validate it during graduation?
Something like:
"Before graduation we will ensure that the latency increase is acceptable with Scalability SIG"
(or sth along those lines).

alculquicondor (Member Author):
Done

* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**

kube-scheduler needs to use more CPU to calculate Zone spreading.

Member:
Any numbers?

alculquicondor (Member Author):
Not sure what to use. Is wall time from synthetic benchmarks enough?

Member:
If we don't have anything else - that's at least something.

alculquicondor (Member Author):
Updated.


* **Are there any tests for feature enablement/disablement?**

There are unit tests in `pkg/scheduler/algorithmprovider/registry_test.go` that exercise the configuration of `kube-scheduler` with the plugins that correspond to the Feature Gate enablement.

Member:
They are useful, but these aren't purely enablement/disablement - they are checking the configuration.

alculquicondor (Member Author):
The feature enablement/disablement leads to a different configuration.

Let me know if the new wording is more clear, or you are asking for other tests.

* **What are other known failure modes?**

- Pod scheduling is slow
- Detection: Pod creation latency is too high.

Member:
s/creation/startup time/ ?
Or scheduling?

alculquicondor (Member Author):
startup time. That is the first symptom, and then you narrow down from there :)

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 28, 2020

alculquicondor (Member Author):

ping @wojtek-t

wojtek-t (Member) left a comment:
One last minor comment - other than that LGTM.

alculquicondor (Member Author):

squashed

PRR is done #2011 (review)

ahg-g (Member) commented Oct 2, 2020:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 2, 2020

k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, alculquicondor


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 2, 2020

alculquicondor (Member Author):

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 2, 2020
@k8s-ci-robot k8s-ci-robot merged commit 40ce94d into kubernetes:master Oct 2, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Oct 2, 2020