Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modified node-kubelet-flaky job and image spec #17687

Merged
merged 1 commit into from
Jun 10, 2020

Conversation

alejandrox1
Copy link
Contributor

@alejandrox1 alejandrox1 commented May 22, 2020

This patch aims to enable the performance benchmarks in the
node-kubelet-flaky e2e job to run on an instance with enough cpu and
memory to run performance workloads.
It also explicitly specifies the --server-start-timeout flag to tell
the kubelet how long to wait for healthchecks in between restarts.

xref: kubernetes/kubernetes#91263
Requires kubernetes/kubernetes#91363

Signed-off-by: alejandrox1 alarcj137@gmail.com

/sig node
/priority important-longterm
/kind failing-test

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 22, 2020
@k8s-ci-robot k8s-ci-robot added area/config Issues or PRs related to code in /config area/jobs sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 22, 2020
@alejandrox1
Copy link
Contributor Author

/assign @yujuhong @Random-Liu @dchen1107
ptal

jobs/e2e_node/perf-config.yaml Outdated Show resolved Hide resolved
jobs/e2e_node/perf-config.yaml Outdated Show resolved Hide resolved
@bart0sh
Copy link
Contributor

bart0sh commented May 26, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 26, 2020
cosbeta-resource1:
image: cos-beta-81-12871-44-0
project: cos-cloud
machine: n1-standard-16
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can see from kubernetes/kubernetes#65251 that it is supposed to be running on this machine type.

Copy link
Contributor

@MHBauer MHBauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the use of a separate image-config split out.

Are there other areas/tests that do performance checking of an individual node?

Original idea all the way back when kubernetes/kubernetes#65249

@@ -177,8 +177,8 @@ periodics:
- --deployment=node
- --gcp-project-type=node-e2e-project
- --gcp-zone=us-west1-b
- --node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml
- --node-test-args= --kubelet-flags="--cgroups-per-qos=true --cgroup-root=/"
- --node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/perf-config.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's better to use a separate config file as well, because I have no idea why we would need to run 3 copies of this on different operating systems.

jobs/e2e_node/perf-config.yaml Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 28, 2020
@alejandrox1 alejandrox1 force-pushed the fix-kubelet-flaky-job branch from fad68ce to c25a35d Compare May 30, 2020 22:53
@alejandrox1 alejandrox1 requested a review from MHBauer May 30, 2020 22:56
@alejandrox1
Copy link
Contributor Author

Thank you for the reviews @bart0sh @bsdnet @MHBauer
patched the pr based on your comments.

Copy link
Contributor

@MHBauer MHBauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we ever fully resolved the thoughts from kubernetes/kubernetes#91263 (comment) & #17669 (comment)

I have an action item to write an email to the mailing list and raise the issue elsewhere to see who might want to own these.

@@ -177,8 +177,8 @@ periodics:
- --deployment=node
- --gcp-project-type=node-e2e-project
- --gcp-zone=us-west1-b
- --node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml
- --node-test-args= --kubelet-flags="--cgroups-per-qos=true --cgroup-root=/"
- --node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/perf-cos-81-12871-103-0-config.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I'll be more explicit, I think the filename should be perf-image-config.yaml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha 😅 fixed it!

@alejandrox1 alejandrox1 force-pushed the fix-kubelet-flaky-job branch 2 times, most recently from 29e7575 to adb018a Compare June 6, 2020 20:03
@k8s-ci-robot k8s-ci-robot requested a review from vpickard June 9, 2020 15:30
Copy link
Contributor

@MHBauer MHBauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the flaky job:

➜  e2e_node git:(raise-dockershim-error) rg '\[Flaky\]' -r .
node_perf_test.go
60:var _ = SIGDescribe("Node Performance Testing [Serial] [Slow] .", func() {

It just turns out there's only the one in it right now. So it's a happy accident that these tests are isolated to a single job.

@@ -0,0 +1,9 @@
---
images:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know who monitors the output of this (clearly nobody), nor do I know if it surfaces in a dashboard.

machine: n1-standard-16
metadata: "user-data<test/e2e_node/jenkins/gci-init.yaml,gci-update-strategy=update_disabled"
tests:
- 'Node Performance Testing'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It happens to be the only test marked flaky left in node_e2e.

I think the idea of both Flaky regex and 'Node Performance Testing' is for this to be moved out of flaky and into it's own job eventually. (It should probably be it's own job already, because nothing else is backed by a n1-standard-16.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2020
@alejandrox1 alejandrox1 force-pushed the fix-kubelet-flaky-job branch from f0982e8 to 63892f5 Compare June 9, 2020 23:45
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 9, 2020
@alejandrox1 alejandrox1 force-pushed the fix-kubelet-flaky-job branch from 63892f5 to 3971dd0 Compare June 9, 2020 23:55
@alejandrox1
Copy link
Contributor Author

Thank you everyone for your comments!
I added an ubuntu image to run these tests on.

Copy link
Contributor

@bsdnet bsdnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/benchamrks/benchmarks/g

---
images:
cos-stable1:
image: cos-81-12871-103-0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cos-81-12871-119-0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 10, 2020
@alejandrox1 alejandrox1 force-pushed the fix-kubelet-flaky-job branch from 3971dd0 to 8b76cec Compare June 10, 2020 12:01
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 10, 2020
@alejandrox1
Copy link
Contributor Author

s/benchamrks/benchmarks/g

@bsdnet what do you mean?

@bart0sh
Copy link
Contributor

bart0sh commented Jun 10, 2020

s/benchamrks/benchmarks/g

@bsdnet what do you mean?

that means 'replace benchamrks with benchmarks everywhere', I guess. However, I could only find it in the commit message.

This patch aims to enable the performance benchmarks in the
node-kubelet-flaky e2e job to run on an instance with enough cpu and
memory to run performance workloads.
It also explicitly specifies the `--server-start-timeout` flag to tell
the kubelet how long to wait for healthchecks in between restarts.

Signed-off-by: alejandrox1 <alarcj137@gmail.com>
@alejandrox1 alejandrox1 force-pushed the fix-kubelet-flaky-job branch from 8b76cec to 86e15df Compare June 10, 2020 14:22
@alejandrox1
Copy link
Contributor Author

alejandrox1 commented Jun 10, 2020

haha thank you for the pointer!
Fixed the spelling of my commit.
I was worried I had edited some other file or something.

@spiffxp
Copy link
Member

spiffxp commented Jun 10, 2020

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 10, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alejandrox1, bsdnet, MHBauer, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 10, 2020
@k8s-ci-robot k8s-ci-robot merged commit a8d5af8 into kubernetes:master Jun 10, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jun 10, 2020
@k8s-ci-robot
Copy link
Contributor

@alejandrox1: Updated the job-config configmap in namespace default at cluster default using the following files:

  • key node-kubelet.yaml using file config/jobs/kubernetes/sig-node/node-kubelet.yaml

In response to this:

This patch aims to enable the performance benchmarks in the
node-kubelet-flaky e2e job to run on an instance with enough cpu and
memory to run performance workloads.
It also explicitly specifies the --server-start-timeout flag to tell
the kubelet how long to wait for healthchecks in between restarts.

xref: kubernetes/kubernetes#91263
Requires kubernetes/kubernetes#91363

Signed-off-by: alejandrox1 alarcj137@gmail.com

/sig node
/priority important-longterm
/kind failing-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alejandrox1 alejandrox1 deleted the fix-kubelet-flaky-job branch June 10, 2020 19:06
@MHBauer
Copy link
Contributor

MHBauer commented Jun 11, 2020

looks like tensorflow works okay, integersort works okay, and then embarrassingly parallel workload blows up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/config Issues or PRs related to code in /config area/jobs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

9 participants