
cache mutation detector causes memory/cpu pressure at the end of long e2e runs (like pull-kubernetes-e2e-gce-etcd3) #47135

Closed
liggitt opened this issue Jun 7, 2017 · 43 comments
Labels: area/test, kind/bug, kind/flake, sig/api-machinery

Comments

liggitt (Member) commented Jun 7, 2017

seeing a spike in ns cleanup failure: https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&text=Couldn%27t%20delete%20ns#5c121dd9347b4a1cbd0d

looks like it started the morning of 6/6

currently failing 5% of gce-etcd3 jobs

cc @kubernetes/sig-release-misc @kubernetes/sig-testing-misc

liggitt added the kind/flake and queue/blocks-others labels Jun 7, 2017
liggitt added this to the v1.7 milestone Jun 7, 2017
k8s-ci-robot added the sig/release and sig/testing labels Jun 7, 2017
liggitt (Member, Author) commented Jun 7, 2017

@smarterclayton @derekwaynecarr fyi

liggitt (Member, Author) commented Jun 7, 2017

@mikedanese

liggitt (Member, Author) commented Jun 7, 2017

possible namespace deletion change ~2 days ago: #46765

smarterclayton (Contributor) commented:

It's possible it's the initializers as well; I'm delivering a second fix for that area in #47138 in case it's correlated.

derekwaynecarr (Member) commented Jun 7, 2017 via email

liggitt (Member, Author) commented Jun 7, 2017

There was a typo in the enablement check that meant dynamic discovery was never run previously

foxish (Contributor) commented Jun 7, 2017

/cc. This is stopping every other batched merge.

liggitt added the priority/critical-urgent label Jun 7, 2017
liggitt (Member, Author) commented Jun 7, 2017

Agree, seems to be blocking most PRs I look at

liggitt (Member, Author) commented Jun 7, 2017

I spent a while looking at the PRs around the time of the spike and nothing jumped out at me. I'm really not sure which SIG to point this at.

liggitt (Member, Author) commented Jun 7, 2017

cc @kubernetes/sig-api-machinery-bugs @kubernetes/sig-node-bugs as the two sigs most involved in namespace and graceful deletion

k8s-ci-robot added the sig/api-machinery, kind/bug, and sig/node labels Jun 7, 2017
yujuhong (Contributor) commented Jun 8, 2017

In https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/batch/pull-kubernetes-e2e-gce-etcd3/34991, the test timed out around 14:24 because the namespace e2e-tests-clientset-hrxpx could not be deleted:

I0607 14:24:30.078] STEP: Destroying namespace "e2e-tests-clientset-hrxpx" for this suite.
I0607 14:24:30.078] Jun  7 14:24:29.543: INFO: namespace: e2e-tests-clientset-hrxpx, resource: bindings, ignored listing per whitelist
I0607 14:24:30.079] Jun  7 14:24:30.073: INFO: Couldn't delete ns: "e2e-tests-clientset-hrxpx": namespace e2e-tests-clientset-hrxpx was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace e2e-tests-clientset-hrxpx was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"})

From the apiserver.log, the kubelet deleted the only pod in the namespace, podfda938ed-4bc6-11e7-98d7-0242ac110002, around 21:21:27, well before the timeout.

I0607 21:19:29.167012       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-clientset-hrxpx: (46.038259ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/6301725] 104.198.229.226:48228]
I0607 21:21:25.167376       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-clientset-hrxpx/pods: (61.400101ms) 200 [[kube-controller-manager/v1.8.0 (linux/amd64) kubernetes/6301725/system:serviceaccount:kube-system:namespace-controller] [::1]:60878]
...
I0607 21:21:26.454099       5 wrap.go:42] DELETE /apis/autoscaling/v1/namespaces/e2e-tests-clientset-hrxpx/horizontalpodautoscalers: (424.983302ms) 200 [[kube-controller-manager/v1.8.0 (linux/amd64) kubernetes/6301725/system:serviceaccount:kube-system:namespace-controller] [::1]:60878]
I0607 21:21:27.083397       5 wrap.go:42] DELETE /apis/rbac.authorization.k8s.io/v1alpha1/namespaces/e2e-tests-clientset-hrxpx/rolebindings: (59.669782ms) 200 [[kube-controller-manager/v1.8.0 (linux/amd64) kubernetes/6301725/system:serviceaccount:kube-system:namespace-controller] [::1]:60878]
I0607 21:21:27.316572       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-clientset-hrxpx/pods/podfda938ed-4bc6-11e7-98d7-0242ac110002: (48.864897ms) 200 [[kubelet/v1.8.0 (linux/amd64) kubernetes/6301725] 35.192.22.196:35513]
I0607 21:21:27.332655       5 wrap.go:42] GET /api/v1/namespaces/e2e-tests-clientset-hrxpx/pods/podfda938ed-4bc6-11e7-98d7-0242ac110002: (8.507988ms) 404 [[kubelet/v1.8.0 (linux/amd64) kubernetes/6301725] 35.192.22.196:35513]

The controller manager also acknowledged the pod deletion in its log, but I have no idea why it said unexpected items still remain in namespace: e2e-tests-clientset-hrxpx for gvr: { v1 pods}, or whether that's something of interest. /cc @nikhiljindal who last touched this code path.

I0607 21:21:35.522238       5 taint_controller.go:341] Noticed pod deletion: types.NamespacedName{Namespace:"e2e-tests-clientset-hrxpx", Name:"podfda938ed-4bc6-11e7-98d7-0242ac110002"}
I0607 21:21:40.226220       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-clientset-hrxpx" (30.5502549s)
E0607 21:21:40.226281       5 namespace_controller.go:148] unexpected items still remain in namespace: e2e-tests-clientset-hrxpx for gvr: { v1 pods}
I0607 21:21:51.269383       5 serviceaccounts_controller.go:184] Finished syncing namespace "e2e-tests-clientset-hrxpx" (1.668µs)
I0607 21:21:51.818769       5 tokens_controller.go:249] syncServiceAccount(e2e-tests-clientset-hrxpx/default), service account deleted, removing tokens
I0607 21:24:00.418124       5 namespaced_resources_deleter.go:496] namespace controller - deleteAllContent - namespace: e2e-tests-clientset-hrxpx
I0607 21:24:29.536174       5 namespaced_resources_deleter.go:522] namespace controller - deleteAllContent - namespace: e2e-tests-clientset-hrxpx, estimate: 0
I0607 21:24:30.702227       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-clientset-hrxpx" (31.446592756s)
I0607 21:24:30.738082       5 serviceaccounts_controller.go:184] Finished syncing namespace "e2e-tests-clientset-hrxpx" (19.782µs)
I0607 21:27:05.484057       5 namespace_controller.go:171] Namespace has been deleted e2e-tests-clientset-hrxpx
I0607 21:27:05.484071       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-clientset-hrxpx" (30.992µs)

liggitt (Member, Author) commented Jun 8, 2017

seeing some odd things:

  • super slow namespace deletion (1-5 minutes spent on a single namespace)
  • the kubelet trying to report status on what is supposed to be a succeeded pod
  • the kubelet trying to mount volumes for a pod 2 minutes after it observed the pod being deleted

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46796/pull-kubernetes-e2e-gce-etcd3/35035/

the pod succeeds:
I0608 02:53:28.435290       5 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-configmap-nscgm/pods/pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002/status: (159.239308ms) 200 [[kubelet/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 35.192.22.196:54345]
I0608 02:53:28.456525    2580 status_manager.go:453] Status for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)" updated successfully: (2, {Phase:Succeeded Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-06-08 02:53:26 +0000 UTC Reason:PodCompleted Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-06-08 02:53:26 +0000 UTC Reason:PodCompleted Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-06-08 02:53:26 +0000 UTC Reason: Message:}] Message: Reason: HostIP:10.128.0.5 PodIP:10.100.3.182 StartTime:2017-06-08 02:53:26 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:configmap-volume-test State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2017-06-08 02:53:27 +0000 UTC,FinishedAt:2017-06-08 02:53:27 +0000 UTC,ContainerID:docker://8f4bb4e2e10ef374371ea9642e62544552105d3a8df455605fa5a17947f3740b,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:gcr.io/google_containers/mounttest:0.8 ImageID:docker://sha256:b1d4e98d91a39b7e50c2ccaf3e0c2dbcb19a3ca7b087d58f86a0b517f2915132 ContainerID:docker://8f4bb4e2e10ef374371ea9642e62544552105d3a8df455605fa5a17947f3740b}] QOSClass:BestEffort})
I0608 02:53:28.463717    2580 kuberuntime_manager.go:836] getSandboxIDByPodUID got sandbox IDs ["d4ade197d5042f72d17eb02a78b95464c8d20cd5395813793dd105fb28ef1244"] for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"
W0608 02:53:28.636109    2580 docker_sandbox.go:285] NetworkPlugin kubenet failed on the status hook for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm": Cannot find the network namespace, skipping pod network status for container {"docker" "d4ade197d5042f72d17eb02a78b95464c8d20cd5395813793dd105fb28ef1244"}
I0608 02:53:28.640180    2580 generic.go:345] PLEG: Write status for pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002/e2e-tests-configmap-nscgm: &container.PodStatus{ID:"a4ddbccc-4bf5-11e7-a943-42010a800002", Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002", Namespace:"e2e-tests-configmap-nscgm", IP:"", ContainerStatuses:[]*container.ContainerStatus{(*container.ContainerStatus)(0xc421ea62a0)}, SandboxStatuses:[]*v1alpha1.PodSandboxStatus{(*v1alpha1.PodSandboxStatus)(0xc4214558b0)}} (err: <nil>)
I0608 02:53:28.640581    2580 kubelet.go:1891] SyncLoop (PLEG): "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)", event: &pleg.PodLifecycleEvent{ID:"a4ddbccc-4bf5-11e7-a943-42010a800002", Type:"ContainerDied", Data:"d4ade197d5042f72d17eb02a78b95464c8d20cd5395813793dd105fb28ef1244"}
I0608 02:53:28.640638    2580 kubelet_pods.go:1177] Generating status for "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"
I0608 02:53:28.640667    2580 helpers.go:84] Already ran container "configmap-volume-test" of pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)", do nothing

it appears the kubelet still has the pod in the volume mount loop:
I0608 02:53:28.640793    2580 volume_manager.go:340] Waiting for volumes to attach and mount for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"

e2e test deletes the pod after it has succeeded:
I0608 02:53:29.432385       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-configmap-nscgm/pods/pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002: (395.468602ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426]

kubelet observes the deletion:
I0608 02:53:29.148132    2580 kubelet.go:1873] SyncLoop (DELETE, "api"): "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"
I0608 02:53:29.369733    2580 kubelet.go:1867] SyncLoop (REMOVE, "api"): "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"
I0608 02:53:29.369805    2580 kubelet.go:2051] Failed to delete pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)", err: pod not found

kubelet tries to update pod status and fails:
W0608 02:53:29.616132    2580 status_manager.go:448] Failed to update status for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)": Operation cannot be fulfilled on pods "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002": StorageError: invalid object, Code: 4, Key: /registry/pods/e2e-tests-configmap-nscgm/pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: a4ddbccc-4bf5-11e7-a943-42010a800002, UID in object meta: 

namespace delete is initiated by the e2e test:
I0608 02:53:30.736101       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-configmap-nscgm: (849.011439ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426] 

namespace controller is busy (slowly) syncing other namespaces:
I0608 02:53:52.796968       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-statefulset-5w7r3" (1m8.738234563s)
I0608 02:53:58.886617       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-multi-az-f397n" (1m2.69736313s)  
I0608 02:54:02.287418       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-svc-latency-qt2h7" (5m2.323954438s)
I0608 02:54:06.238942       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-init-container-hk0gg" (1m5.214912452s)
I0608 02:54:13.957240       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-job-dbtvt" (59.588699428s) 
I0608 02:54:52.189797       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-initializers-zsx5g" (53.303140508s)
I0608 02:55:01.578033       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-clientset-hj173" (55.33900797s) 
I0608 02:55:02.189367       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-network-vjznp" (59.901765047s)
I0608 02:55:11.229484       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-projected-9s9xv" (57.272057097s)
...

the kubelet is still trying to mount volumes for the pod it observed being deleted:
E0608 02:55:31.641116    2580 kubelet.go:1602] Unable to mount volumes for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "e2e-tests-configmap-nscgm"/"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]; skipping pod
E0608 02:55:31.641171    2580 pod_workers.go:182] Error syncing pod a4ddbccc-4bf5-11e7-a943-42010a800002 ("pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"), skipping: timeout expired waiting for volumes to attach/mount for pod "e2e-tests-configmap-nscgm"/"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]

kubelet cannot record events because the namespace is terminating:
I0608 02:55:31.641512    2580 server.go:351] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-configmap-nscgm", Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002", UID:"a4ddbccc-4bf5-11e7-a943-42010a800002", APIVersion:"v1", ResourceVersion:"13346", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "e2e-tests-configmap-nscgm"/"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]
I0608 02:55:31.641536    2580 server.go:351] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-configmap-nscgm", Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002", UID:"a4ddbccc-4bf5-11e7-a943-42010a800002", APIVersion:"v1", ResourceVersion:"13346", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "e2e-tests-configmap-nscgm"/"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]
I0608 02:55:32.108092       5 wrap.go:42] POST /api/v1/namespaces/e2e-tests-configmap-nscgm/events: (543.724µs) 403 [[kubelet/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 35.192.22.196:54345]
E0608 02:55:32.203810    2580 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002.14c60601f61e6d74", GenerateName:"", Namespace:"e2e-tests-configmap-nscgm", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-configmap-nscgm", Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002", UID:"a4ddbccc-4bf5-11e7-a943-42010a800002", APIVersion:"v1", ResourceVersion:"13346", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod \"e2e-tests-configmap-nscgm\"/\"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002\". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]", Source:v1.EventSource{Component:"kubelet", Host:"e2e-35035-minion-group-tkqf"}, FirstTimestamp:v1.Time{Time:time.Time{sec:63632487331, nsec:641191796, loc:(*time.Location)(0x57fe020)}}, LastTimestamp:v1.Time{Time:time.Time{sec:63632487331, nsec:641191796, loc:(*time.Location)(0x57fe020)}}, Count:1, Type:"Warning"}': 'events "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002.14c60601f61e6d74" is forbidden: unable to create new content in namespace e2e-tests-configmap-nscgm because it is being terminated.' (will not retry!)
I0608 02:55:32.336387       5 wrap.go:42] POST /api/v1/namespaces/e2e-tests-configmap-nscgm/events: (6.384372ms) 403 [[kubelet/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 35.192.22.196:54345]
E0608 02:55:32.352676    2580 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002.14c60601f61ce5d4", GenerateName:"", Namespace:"e2e-tests-configmap-nscgm", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-configmap-nscgm", Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002", UID:"a4ddbccc-4bf5-11e7-a943-42010a800002", APIVersion:"v1", ResourceVersion:"13346", FieldPath:""}, Reason:"FailedMount", Message:"Unable to mount volumes for pod \"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)\": timeout expired waiting for volumes to attach/mount for pod \"e2e-tests-configmap-nscgm\"/\"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002\". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]", Source:v1.EventSource{Component:"kubelet", Host:"e2e-35035-minion-group-tkqf"}, FirstTimestamp:v1.Time{Time:time.Time{sec:63632487331, nsec:641091540, loc:(*time.Location)(0x57fe020)}}, LastTimestamp:v1.Time{Time:time.Time{sec:63632487331, nsec:641091540, loc:(*time.Location)(0x57fe020)}}, Count:1, Type:"Warning"}': 'events "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002.14c60601f61ce5d4" is forbidden: unable to create new content in namespace e2e-tests-configmap-nscgm because it is being terminated.' (will not retry!)

the e2e test times out after 5 minutes waiting for the namespace to delete and starts scanning for remaining content to report on:
I0608 02:58:31.226107       5 wrap.go:42] GET /apis/rbac.authorization.k8s.io/v1alpha1/namespaces/e2e-tests-configmap-nscgm/roles: (7.108828ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426]
I0608 02:58:31.233988       5 wrap.go:42] GET /apis/extensions/v1beta1/namespaces/e2e-tests-configmap-nscgm/daemonsets: (6.01749ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426]
I0608 02:58:31.271034       5 wrap.go:42] GET /apis/extensions/v1beta1/namespaces/e2e-tests-configmap-nscgm/replicationcontrollers: (725.384µs) 404 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426]
I0608 02:58:31.343285       5 wrap.go:42] GET /api/v1/namespaces/e2e-tests-configmap-nscgm/persistentvolumeclaims: (28.94551ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426]

namespace controller finally starts to clean up:
I0608 02:59:05.237681       5 namespaced_resources_deleter.go:496] namespace controller - deleteAllContent - namespace: e2e-tests-configmap-nscgm
I0608 02:59:47.452937       5 namespaced_resources_deleter.go:522] namespace controller - deleteAllContent - namespace: e2e-tests-configmap-nscgm, estimate: 0
I0608 02:59:48.997657       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-configmap-nscgm: (1.13331104s) 200 [[kube-controller-manager/v1.8.0 (linux/amd64) kubernetes/a56f5b2/system:serviceaccount:kube-system:namespace-controller] [::1]:48750]
I0608 02:59:49.252454       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-configmap-nscgm" (44.516925479s) 

mikedanese (Member) commented Jun 8, 2017

Ref #45666 (gzip PR).

mikedanese (Member) commented:

cc @kubernetes/sig-release-bugs @dchen1107 this has been blocking the SQ for 3 days and we still haven't pinned the root cause.

foxish (Contributor) commented Jun 8, 2017

The batched merges haven't been blocked by this for the past 12 hours or so. The only PR I see that may have fixed this is #47086.

dchen1107 (Member) commented Jun 8, 2017

Yes, this is our top issue. So far, ~50% of batch merges have failed due to it, but I noticed that it has been getting better since this morning.

dchen1107 (Member) commented:

Re: #47135 (comment)

@liggitt Do we have more stats on the increasing latency of API requests? @gmarek @wojtek-t Given #44003 was fixed for the cluster perf dashboard, can we have an A/B comparison of API latency between 1.6 and 1.7? cc/ @k8s-mirror-api-machinery-misc

If those measurements show no latency regression, I suggest we close this issue for 1.7. There is no need to artificially change real production code to fix the test infrastructure, given it is not that mature.

On the other hand, we should open a new issue to figure out the bottleneck behind the namespace slowness and generate a benchmark for this release, so that we can easily decide whether there is a performance regression in future releases.

liggitt (Member, Author) commented Jun 14, 2017

@liggitt Do we have more stats on increasing latency on API requests?

Discovery API requests are taking ~5-10ms at the beginning of the e2e run and ~200-500ms at the end. I'm trying to capture memory and CPU profiles throughout the e2e run in #47527 to help determine the root cause.

dchen1107 (Member) commented:

@liggitt I updated at #47446 (comment)

crimsonfaith91 (Contributor) commented:

One of my PRs is affected too: #47469

Part of the build log:

I0614 16:28:49.906] • Failure in Spec Teardown (AfterEach) [611.751 seconds]
I0614 16:28:49.906] [k8s.io] InitContainer
I0614 16:28:49.906] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:619
I0614 16:28:49.906]   should not start app containers and fail the pod if init containers fail on a RestartNever pod [AfterEach]
I0614 16:28:49.907]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/init_container.go:452
I0614 16:28:49.907] 
I0614 16:28:49.907]   Jun 14 16:28:49.898: Couldn't delete ns: "e2e-tests-init-container-91xzd": namespace e2e-tests-init-container-91xzd was not deleted with limit: timed out waiting for the condition, pods remaining: 2, pods missing deletion timestamp: 2 (&errors.errorString{s:"namespace e2e-tests-init-container-91xzd was not deleted with limit: timed out waiting for the condition, pods remaining: 2, pods missing deletion timestamp: 2"})

I will follow this issue closely and retest the PR frequently.

dchen1107 (Member) commented:

Please see my latest update at #47446 (comment)

I am suggesting to close this one too, but I will leave @liggitt to make the final call. Thanks!

liggitt removed the priority/critical-urgent and queue/blocks-others labels Jun 17, 2017
liggitt (Member, Author) commented Jun 17, 2017

Since the test infra changes have resolved the blocking issue, and the cause of the performance hit during this test is known, I'm fine with lowering the severity and moving this out of the milestone. I would like to retitle this and keep it open to briefly investigate whether we can make the mutation detector more of a constant-cost operation throughout the test run.

@liggitt liggitt removed this from the v1.7 milestone Jun 17, 2017
@liggitt liggitt changed the title pull-kubernetes-e2e-gce-etcd3 fails with "Couldn't delete ns" error cache mutation detector causes memory/cpu pressure at the end of long e2e runs (like pull-kubernetes-e2e-gce-etcd3) Jun 17, 2017
dchen1107 (Member) commented:

Thanks, @liggitt

@yujuhong yujuhong removed their assignment Jun 23, 2017
liggitt added the area/test label and removed the sig/node, sig/release, and sig/scalability labels Jun 24, 2017
k8s-github-robot commented:

This issue hasn't been active in 34 days. It will be closed in 55 days (Sep 18, 2017).

cc @liggitt @mikedanese

You can add the 'keep-open' label to prevent this from happening, or add a comment to keep it open for another 90 days.

@liggitt liggitt closed this as completed Jul 26, 2017
mangelajo added a commit to mangelajo/kubernetes that referenced this issue Jul 3, 2019
DefaultNamespaceDeletionTimeout was set to 10 minutes at some
point because of kubernetes#47135.

Another option would be just removing the TODO.

Signed-off-by: Miguel Angel Ajo <majopela@redhat.com>