cache mutation detector causes memory/cpu pressure at the end of long e2e runs (like pull-kubernetes-e2e-gce-etcd3) #47135
possible namespace deletion change ~2 days ago: #46765 |
It's possible it's the initializers as well, I'm delivering a second fix for that area in #47138 in case it's correlated |
@liggitt -- the linked PR seems like it would not have changed the e2e
behavior unless that e2e env did not previously have TPR enabled. It does
look like it removed static caching of discovery, though, but we need to do
that anyway now, I guess...
|
There was a typo in the enablement check that meant dynamic discovery was never run previously |
/cc. This is stopping every other batched merge. |
Agree, seems to be blocking most PRs I look at |
I spent a while looking at the PRs around the time of the spike and nothing jumped out at me. I'm really not sure which SIG to point this at. |
cc @kubernetes/sig-api-machinery-bugs @kubernetes/sig-node-bugs as the two sigs most involved in namespace and graceful deletion |
In https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/batch/pull-kubernetes-e2e-gce-etcd3/34991, the test timed out around 14:24 because the namespace could not be deleted.
From the apiserver.log, the kubelet deleted the only pod.
The controller manager also acknowledged the pod deletion in its log, but I have no idea why it said
|
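A quick way to see what is still blocking a terminating namespace like this is to dump its finalizers and whatever pods remain in it. Below is a minimal client-go sketch of that check; the package and helper name are hypothetical, and it assumes a client-go version whose API calls take a context.

```go
package nsdebug

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpStuckNamespace is a hypothetical diagnostic helper: it prints the
// namespace phase, its finalizers, and any pods that still exist in it,
// which is usually enough to see why deletion has not completed.
func dumpStuckNamespace(ctx context.Context, c kubernetes.Interface, name string) error {
	ns, err := c.CoreV1().Namespaces().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	fmt.Printf("namespace %s phase=%s finalizers=%v\n", name, ns.Status.Phase, ns.Spec.Finalizers)

	pods, err := c.CoreV1().Pods(name).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		fmt.Printf("remaining pod %s phase=%s deletionTimestamp=%v\n",
			p.Name, p.Status.Phase, p.DeletionTimestamp)
	}
	return nil
}
```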
seeing some odd things:
|
Ref #45666 (gzip PR). |
cc @kubernetes/sig-release-bugs @dchen1107 this has been blocking the SQ for 3 days and we still haven't pinned down the root cause. |
The batched merges haven't been blocked by this for the past 12 hours or so. The only PR that may have fixed this that I see is #47086 |
Yes, this is our top issue. So far, ~50% of batch merges have failed due to it, but I noticed that it has been getting better since this morning. |
Re: #47135 (comment) @liggitt Do we have more stats on the increasing latency of API requests? @gmarek @wojtek-t Given that #44003 was fixed for the cluster perf dashboard, can we have an A/B comparison of API latency between 1.6 and 1.7? cc/ @k8s-mirror-api-machinery-misc If there is no latency regression in those measurements, I suggest we close this issue for 1.7; there is no need to artificially change real production code to fix the test infrastructure, given that it is not that mature. On the other hand, we should open a new issue to figure out the bottleneck behind the namespace slowness and generate a benchmark for this release, so that we can easily decide whether there is a performance regression in future releases. |
@liggitt I updated at #47446 (comment) |
One of my PRs is affected too: #47469. Part of the build log:
I will follow this issue closely and retest the PR frequently. |
Please see my latest update at #47446 (comment). I am suggesting we close this one too, but I will leave it to @liggitt to make the final call. Thanks! |
Since the test infra changes have resolved the blocking issue, and the cause of the performance hit during this test is known, I'm fine with lowering the severity and moving this out of the milestone. I would like to retitle this and keep it open to briefly investigate whether we can make the mutation detector more of a constant-cost operation throughout the test run. |
Thanks, @liggitt |
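For background on the retitle: client-go's cache mutation detector (enabled in these CI runs via the KUBE_CACHE_MUTATION_DETECTOR environment variable) keeps a copy of every object added to the informer caches and periodically re-compares the whole accumulated set, so each pass gets more expensive the longer an e2e run goes. The toy Go program below is a sketch of that idea only, not the actual client-go implementation; all names in it are invented for illustration.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// cached pairs the live object with a snapshot taken when it was cached.
type cached struct {
	live interface{} // object handed out to cache consumers
	copy interface{} // snapshot taken at Add time
}

type mutationDetector struct {
	items []cached
}

// Add records the object and a snapshot of it. In this toy version the
// "deep copy" is just a copy of a string map; real API objects need real deep copies.
func (d *mutationDetector) Add(obj map[string]string) {
	snapshot := make(map[string]string, len(obj))
	for k, v := range obj {
		snapshot[k] = v
	}
	d.items = append(d.items, cached{live: obj, copy: snapshot})
}

// CompareAll walks every cached object; the cost is proportional to the total
// number of objects ever cached, so late passes in a long run cost more than early ones.
func (d *mutationDetector) CompareAll() {
	for _, c := range d.items {
		if !reflect.DeepEqual(c.live, c.copy) {
			panic(fmt.Sprintf("cached object mutated: %v != %v", c.live, c.copy))
		}
	}
}

func main() {
	d := &mutationDetector{}
	for i := 0; i < 1000; i++ {
		d.Add(map[string]string{"name": fmt.Sprintf("ns-%d", i)})
	}
	start := time.Now()
	d.CompareAll() // each periodic pass rescans everything accumulated so far
	fmt.Println("one detection pass over", len(d.items), "objects took", time.Since(start))
}
```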
This issue hasn't been active in 34 days. It will be closed in 55 days (Sep 18, 2017). You can add the 'keep-open' label to prevent this from happening, or add a comment to keep it open for another 90 days.
DefaultNamespaceDeletionTimeout was set to 10 minutes at some point because of kubernetes#47135. Another option would be just removing the TODO. Signed-off-by: Miguel Angel Ajo <majopela@redhat.com>
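To make that timeout concrete, here is a hedged sketch of the kind of helper an e2e suite can use to wait for a namespace to disappear, with the timeout as a parameter. The package and function names are hypothetical; the 10-minute constant mirrors the value mentioned above, and the sketch assumes a client-go version whose API calls take a context.

```go
package e2esketch

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// DefaultNamespaceDeletionTimeout mirrors the 10-minute value referenced above.
const DefaultNamespaceDeletionTimeout = 10 * time.Minute

// WaitForNamespaceDeleted polls until the namespace is gone or the timeout expires.
func WaitForNamespaceDeleted(ctx context.Context, c kubernetes.Interface, name string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := c.CoreV1().Namespaces().Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // the namespace is fully deleted
		}
		if err != nil {
			return false, err // unexpected API error: stop waiting
		}
		return false, nil // still terminating: keep polling
	})
}
```

A caller would typically invoke WaitForNamespaceDeleted(ctx, client, ns.Name, DefaultNamespaceDeletionTimeout) during test cleanup.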
seeing a spike in ns cleanup failure: https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&text=Couldn%27t%20delete%20ns#5c121dd9347b4a1cbd0d
looks like it started the morning of 6/6
currently failing 5% of gce-etcd3 jobs
cc @kubernetes/sig-release-misc @kubernetes/sig-testing-misc