-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modified node-kubelet-flaky job and image spec #17687
Modified node-kubelet-flaky job and image spec #17687
Conversation
/assign @yujuhong @Random-Liu @dchen1107 |
/lgtm |
jobs/e2e_node/perf-config.yaml
Outdated
cosbeta-resource1: | ||
image: cos-beta-81-12871-44-0 | ||
project: cos-cloud | ||
machine: n1-standard-16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can see from kubernetes/kubernetes#65251 that it is supposed to be running on this machine type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the use of a separate image-config split out.
Are there other areas/tests that do performance checking of an individual node?
Original idea all the way back when kubernetes/kubernetes#65249
@@ -177,8 +177,8 @@ periodics: | |||
- --deployment=node | |||
- --gcp-project-type=node-e2e-project | |||
- --gcp-zone=us-west1-b | |||
- --node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml | |||
- --node-test-args= --kubelet-flags="--cgroups-per-qos=true --cgroup-root=/" | |||
- --node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/perf-config.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's better to use a separate config file as well, because I have no idea why we would need to run 3 copies of this on different operating systems.
fad68ce
to
c25a35d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we ever fully resolved the thoughts from kubernetes/kubernetes#91263 (comment) & #17669 (comment)
I have an action item to write an email to the mailing list and raise the issue elsewhere to see who might want to own these.
@@ -177,8 +177,8 @@ periodics: | |||
- --deployment=node | |||
- --gcp-project-type=node-e2e-project | |||
- --gcp-zone=us-west1-b | |||
- --node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml | |||
- --node-test-args= --kubelet-flags="--cgroups-per-qos=true --cgroup-root=/" | |||
- --node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/perf-cos-81-12871-103-0-config.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I'll be more explicit, I think the filename should be perf-image-config.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha 😅 fixed it!
29e7575
to
adb018a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the flaky job:
➜ e2e_node git:(raise-dockershim-error) rg '\[Flaky\]' -r .
node_perf_test.go
60:var _ = SIGDescribe("Node Performance Testing [Serial] [Slow] .", func() {
It just turns out there's only the one in it right now. So it's a happy accident that these tests are isolated to a single job.
@@ -0,0 +1,9 @@ | |||
--- | |||
images: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know who monitors the output of this (clearly nobody), nor do I know if it surfaces in a dashboard.
machine: n1-standard-16 | ||
metadata: "user-data<test/e2e_node/jenkins/gci-init.yaml,gci-update-strategy=update_disabled" | ||
tests: | ||
- 'Node Performance Testing' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It happens to be the only test marked flaky left in node_e2e.
I think the idea of both Flaky
regex and 'Node Performance Testing'
is for this to be moved out of flaky and into it's own job eventually. (It should probably be it's own job already, because nothing else is backed by a n1-standard-16
.
f0982e8
to
63892f5
Compare
63892f5
to
3971dd0
Compare
Thank you everyone for your comments! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/benchamrks/benchmarks/g
jobs/e2e_node/perf-image-config.yaml
Outdated
--- | ||
images: | ||
cos-stable1: | ||
image: cos-81-12871-103-0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: cos-81-12871-119-0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
3971dd0
to
8b76cec
Compare
@bsdnet what do you mean? |
that means 'replace benchamrks with benchmarks everywhere', I guess. However, I could only find it in the commit message. |
This patch aims to enable the performance benchmarks in the node-kubelet-flaky e2e job to run on an instance with enough cpu and memory to run performance workloads. It also explicitly specifies the `--server-start-timeout` flag to tell the kubelet how long to wait for healthchecks in between restarts. Signed-off-by: alejandrox1 <alarcj137@gmail.com>
8b76cec
to
86e15df
Compare
haha thank you for the pointer! |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alejandrox1, bsdnet, MHBauer, spiffxp The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@alejandrox1: Updated the
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
looks like tensorflow works okay, integersort works okay, and then embarrassingly parallel workload blows up. |
This patch aims to enable the performance benchmarks in the
node-kubelet-flaky e2e job to run on an instance with enough cpu and
memory to run performance workloads.
It also explicitly specifies the
--server-start-timeout
flag to tellthe kubelet how long to wait for healthchecks in between restarts.
xref: kubernetes/kubernetes#91263
Requires kubernetes/kubernetes#91363
Signed-off-by: alejandrox1 alarcj137@gmail.com
/sig node
/priority important-longterm
/kind failing-test