
cache mutation detector causes memory/cpu pressure at the end of long e2e runs (like pull-kubernetes-e2e-gce-etcd3) #47135

Closed
liggitt opened this issue Jun 7, 2017 · 43 comments
Labels: area/test, kind/bug, kind/flake, sig/api-machinery

Comments

liggitt (Member) commented Jun 7, 2017

seeing a spike in ns cleanup failure: https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&text=Couldn%27t%20delete%20ns#5c121dd9347b4a1cbd0d

looks like it started the morning of 6/6

currently failing 5% of gce-etcd3 jobs

cc @kubernetes/sig-release-misc @kubernetes/sig-testing-misc

liggitt added the kind/flake and queue/blocks-others labels Jun 7, 2017
liggitt added this to the v1.7 milestone Jun 7, 2017
k8s-ci-robot added the sig/release and sig/testing labels Jun 7, 2017
liggitt (Member, Author) commented Jun 7, 2017

@smarterclayton @derekwaynecarr fyi

liggitt (Member, Author) commented Jun 7, 2017

@mikedanese

liggitt (Member, Author) commented Jun 7, 2017

possible namespace deletion change ~2 days ago: #46765

smarterclayton (Contributor) commented:

It's possible it's the initializers as well; I'm delivering a second fix for that area in #47138 in case it's correlated.

derekwaynecarr (Member) commented Jun 7, 2017 via email

liggitt (Member, Author) commented Jun 7, 2017

There was a typo in the enablement check that meant dynamic discovery was never run previously

foxish (Contributor) commented Jun 7, 2017

/cc. This is stopping every other batched merge.

liggitt added the priority/critical-urgent label Jun 7, 2017
liggitt (Member, Author) commented Jun 7, 2017

Agree, seems to be blocking most PRs I look at

liggitt (Member, Author) commented Jun 7, 2017

I spent a while looking at the PRs around the time of the spike and nothing jumped out at me. I'm really not sure which SIG to point this at.

liggitt (Member, Author) commented Jun 7, 2017

cc @kubernetes/sig-api-machinery-bugs @kubernetes/sig-node-bugs as the two sigs most involved in namespace and graceful deletion

k8s-ci-robot added the sig/api-machinery, kind/bug, and sig/node labels Jun 7, 2017
yujuhong (Contributor) commented Jun 8, 2017

In https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/batch/pull-kubernetes-e2e-gce-etcd3/34991, the test timed out around 14:24 because the namespace e2e-tests-clientset-hrxpx could not be deleted:

I0607 14:24:30.078] STEP: Destroying namespace "e2e-tests-clientset-hrxpx" for this suite.
I0607 14:24:30.078] Jun  7 14:24:29.543: INFO: namespace: e2e-tests-clientset-hrxpx, resource: bindings, ignored listing per whitelist
I0607 14:24:30.079] Jun  7 14:24:30.073: INFO: Couldn't delete ns: "e2e-tests-clientset-hrxpx": namespace e2e-tests-clientset-hrxpx was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace e2e-tests-clientset-hrxpx was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"})

From the apiserver.log, the kubelet deleted the only pod in the namespace, podfda938ed-4bc6-11e7-98d7-0242ac110002, around 21:21:27, well before the timeout.

I0607 21:19:29.167012       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-clientset-hrxpx: (46.038259ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/6301725] 104.198.229.226:48228]
I0607 21:21:25.167376       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-clientset-hrxpx/pods: (61.400101ms) 200 [[kube-controller-manager/v1.8.0 (linux/amd64) kubernetes/6301725/system:serviceaccount:kube-system:namespace-controller] [::1]:60878]
...
I0607 21:21:26.454099       5 wrap.go:42] DELETE /apis/autoscaling/v1/namespaces/e2e-tests-clientset-hrxpx/horizontalpodautoscalers: (424.983302ms) 200 [[kube-controller-manager/v1.8.0 (linux/amd64) kubernetes/6301725/system:serviceaccount:kube-system:namespace-controller] [::1]:60878]
I0607 21:21:27.083397       5 wrap.go:42] DELETE /apis/rbac.authorization.k8s.io/v1alpha1/namespaces/e2e-tests-clientset-hrxpx/rolebindings: (59.669782ms) 200 [[kube-controller-manager/v1.8.0 (linux/amd64) kubernetes/6301725/system:serviceaccount:kube-system:namespace-controller] [::1]:60878]
I0607 21:21:27.316572       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-clientset-hrxpx/pods/podfda938ed-4bc6-11e7-98d7-0242ac110002: (48.864897ms) 200 [[kubelet/v1.8.0 (linux/amd64) kubernetes/6301725] 35.192.22.196:35513]
I0607 21:21:27.332655       5 wrap.go:42] GET /api/v1/namespaces/e2e-tests-clientset-hrxpx/pods/podfda938ed-4bc6-11e7-98d7-0242ac110002: (8.507988ms) 404 [[kubelet/v1.8.0 (linux/amd64) kubernetes/6301725] 35.192.22.196:35513]

The controller manager also acknowledged the pod deletion in its log, but I have no idea why it said unexpected items still remain in namespace: e2e-tests-clientset-hrxpx for gvr: { v1 pods}, or whether that's something of interest. /cc @nikhiljindal who last touched this code path.

I0607 21:21:35.522238       5 taint_controller.go:341] Noticed pod deletion: types.NamespacedName{Namespace:"e2e-tests-clientset-hrxpx", Name:"podfda938ed-4bc6-11e7-98d7-0242ac110002"}
I0607 21:21:40.226220       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-clientset-hrxpx" (30.5502549s)
E0607 21:21:40.226281       5 namespace_controller.go:148] unexpected items still remain in namespace: e2e-tests-clientset-hrxpx for gvr: { v1 pods}
I0607 21:21:51.269383       5 serviceaccounts_controller.go:184] Finished syncing namespace "e2e-tests-clientset-hrxpx" (1.668µs)
I0607 21:21:51.818769       5 tokens_controller.go:249] syncServiceAccount(e2e-tests-clientset-hrxpx/default), service account deleted, removing tokens
I0607 21:24:00.418124       5 namespaced_resources_deleter.go:496] namespace controller - deleteAllContent - namespace: e2e-tests-clientset-hrxpx
I0607 21:24:29.536174       5 namespaced_resources_deleter.go:522] namespace controller - deleteAllContent - namespace: e2e-tests-clientset-hrxpx, estimate: 0
I0607 21:24:30.702227       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-clientset-hrxpx" (31.446592756s)
I0607 21:24:30.738082       5 serviceaccounts_controller.go:184] Finished syncing namespace "e2e-tests-clientset-hrxpx" (19.782µs)
I0607 21:27:05.484057       5 namespace_controller.go:171] Namespace has been deleted e2e-tests-clientset-hrxpx
I0607 21:27:05.484071       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-clientset-hrxpx" (30.992µs)

liggitt (Member, Author) commented Jun 8, 2017

seeing some odd things:

  • super slow namespace deletion (1-5 minutes spent on a single namespace)
  • the kubelet trying to report status on what is supposed to be a succeeded pod
  • the kubelet trying to mount volumes for a pod 2 minutes after it observed the pod being deleted

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/46796/pull-kubernetes-e2e-gce-etcd3/35035/

the pod succeeds:
I0608 02:53:28.435290       5 wrap.go:42] PUT /api/v1/namespaces/e2e-tests-configmap-nscgm/pods/pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002/status: (159.239308ms) 200 [[kubelet/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 35.192.22.196:54345]
I0608 02:53:28.456525    2580 status_manager.go:453] Status for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)" updated successfully: (2, {Phase:Succeeded Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-06-08 02:53:26 +0000 UTC Reason:PodCompleted Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-06-08 02:53:26 +0000 UTC Reason:PodCompleted Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2017-06-08 02:53:26 +0000 UTC Reason: Message:}] Message: Reason: HostIP:10.128.0.5 PodIP:10.100.3.182 StartTime:2017-06-08 02:53:26 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:configmap-volume-test State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:0,Signal:0,Reason:Completed,Message:,StartedAt:2017-06-08 02:53:27 +0000 UTC,FinishedAt:2017-06-08 02:53:27 +0000 UTC,ContainerID:docker://8f4bb4e2e10ef374371ea9642e62544552105d3a8df455605fa5a17947f3740b,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:gcr.io/google_containers/mounttest:0.8 ImageID:docker://sha256:b1d4e98d91a39b7e50c2ccaf3e0c2dbcb19a3ca7b087d58f86a0b517f2915132 ContainerID:docker://8f4bb4e2e10ef374371ea9642e62544552105d3a8df455605fa5a17947f3740b}] QOSClass:BestEffort})
I0608 02:53:28.463717    2580 kuberuntime_manager.go:836] getSandboxIDByPodUID got sandbox IDs ["d4ade197d5042f72d17eb02a78b95464c8d20cd5395813793dd105fb28ef1244"] for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"
W0608 02:53:28.636109    2580 docker_sandbox.go:285] NetworkPlugin kubenet failed on the status hook for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm": Cannot find the network namespace, skipping pod network status for container {"docker" "d4ade197d5042f72d17eb02a78b95464c8d20cd5395813793dd105fb28ef1244"}
I0608 02:53:28.640180    2580 generic.go:345] PLEG: Write status for pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002/e2e-tests-configmap-nscgm: &container.PodStatus{ID:"a4ddbccc-4bf5-11e7-a943-42010a800002", Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002", Namespace:"e2e-tests-configmap-nscgm", IP:"", ContainerStatuses:[]*container.ContainerStatus{(*container.ContainerStatus)(0xc421ea62a0)}, SandboxStatuses:[]*v1alpha1.PodSandboxStatus{(*v1alpha1.PodSandboxStatus)(0xc4214558b0)}} (err: <nil>)
I0608 02:53:28.640581    2580 kubelet.go:1891] SyncLoop (PLEG): "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)", event: &pleg.PodLifecycleEvent{ID:"a4ddbccc-4bf5-11e7-a943-42010a800002", Type:"ContainerDied", Data:"d4ade197d5042f72d17eb02a78b95464c8d20cd5395813793dd105fb28ef1244"}
I0608 02:53:28.640638    2580 kubelet_pods.go:1177] Generating status for "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"
I0608 02:53:28.640667    2580 helpers.go:84] Already ran container "configmap-volume-test" of pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)", do nothing

it appears the kubelet still has the pod in the volume mount loop:
I0608 02:53:28.640793    2580 volume_manager.go:340] Waiting for volumes to attach and mount for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"

e2e test deletes the pod after it has succeeded:
I0608 02:53:29.432385       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-configmap-nscgm/pods/pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002: (395.468602ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426]

kubelet observes the deletion:
I0608 02:53:29.148132    2580 kubelet.go:1873] SyncLoop (DELETE, "api"): "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"
I0608 02:53:29.369733    2580 kubelet.go:1867] SyncLoop (REMOVE, "api"): "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"
I0608 02:53:29.369805    2580 kubelet.go:2051] Failed to delete pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)", err: pod not found

kubelet tries to update pod status and fails:
W0608 02:53:29.616132    2580 status_manager.go:448] Failed to update status for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)": Operation cannot be fulfilled on pods "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002": StorageError: invalid object, Code: 4, Key: /registry/pods/e2e-tests-configmap-nscgm/pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: a4ddbccc-4bf5-11e7-a943-42010a800002, UID in object meta: 

namespace delete is initiated by the e2e test:
I0608 02:53:30.736101       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-configmap-nscgm: (849.011439ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426] 

namespace controller is busy (slowly) syncing other namespaces:
I0608 02:53:52.796968       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-statefulset-5w7r3" (1m8.738234563s)
I0608 02:53:58.886617       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-multi-az-f397n" (1m2.69736313s)  
I0608 02:54:02.287418       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-svc-latency-qt2h7" (5m2.323954438s)
I0608 02:54:06.238942       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-init-container-hk0gg" (1m5.214912452s)
I0608 02:54:13.957240       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-job-dbtvt" (59.588699428s) 
I0608 02:54:52.189797       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-initializers-zsx5g" (53.303140508s)
I0608 02:55:01.578033       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-clientset-hj173" (55.33900797s) 
I0608 02:55:02.189367       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-network-vjznp" (59.901765047s)
I0608 02:55:11.229484       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-projected-9s9xv" (57.272057097s)
...

the kubelet is still trying to mount volumes for the pod it observed being deleted:
E0608 02:55:31.641116    2580 kubelet.go:1602] Unable to mount volumes for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "e2e-tests-configmap-nscgm"/"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]; skipping pod
E0608 02:55:31.641171    2580 pod_workers.go:182] Error syncing pod a4ddbccc-4bf5-11e7-a943-42010a800002 ("pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)"), skipping: timeout expired waiting for volumes to attach/mount for pod "e2e-tests-configmap-nscgm"/"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]

kubelet cannot record events because the namespace is terminating:
I0608 02:55:31.641512    2580 server.go:351] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-configmap-nscgm", Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002", UID:"a4ddbccc-4bf5-11e7-a943-42010a800002", APIVersion:"v1", ResourceVersion:"13346", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "e2e-tests-configmap-nscgm"/"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]
I0608 02:55:31.641536    2580 server.go:351] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-configmap-nscgm", Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002", UID:"a4ddbccc-4bf5-11e7-a943-42010a800002", APIVersion:"v1", ResourceVersion:"13346", FieldPath:""}): type: 'Warning' reason: 'FailedMount' Unable to mount volumes for pod "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)": timeout expired waiting for volumes to attach/mount for pod "e2e-tests-configmap-nscgm"/"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]
I0608 02:55:32.108092       5 wrap.go:42] POST /api/v1/namespaces/e2e-tests-configmap-nscgm/events: (543.724µs) 403 [[kubelet/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 35.192.22.196:54345]
E0608 02:55:32.203810    2580 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002.14c60601f61e6d74", GenerateName:"", Namespace:"e2e-tests-configmap-nscgm", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-configmap-nscgm", Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002", UID:"a4ddbccc-4bf5-11e7-a943-42010a800002", APIVersion:"v1", ResourceVersion:"13346", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod \"e2e-tests-configmap-nscgm\"/\"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002\". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]", Source:v1.EventSource{Component:"kubelet", Host:"e2e-35035-minion-group-tkqf"}, FirstTimestamp:v1.Time{Time:time.Time{sec:63632487331, nsec:641191796, loc:(*time.Location)(0x57fe020)}}, LastTimestamp:v1.Time{Time:time.Time{sec:63632487331, nsec:641191796, loc:(*time.Location)(0x57fe020)}}, Count:1, Type:"Warning"}': 'events "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002.14c60601f61e6d74" is forbidden: unable to create new content in namespace e2e-tests-configmap-nscgm because it is being terminated.' (will not retry!)
I0608 02:55:32.336387       5 wrap.go:42] POST /api/v1/namespaces/e2e-tests-configmap-nscgm/events: (6.384372ms) 403 [[kubelet/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 35.192.22.196:54345]
E0608 02:55:32.352676    2580 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002.14c60601f61ce5d4", GenerateName:"", Namespace:"e2e-tests-configmap-nscgm", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-configmap-nscgm", Name:"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002", UID:"a4ddbccc-4bf5-11e7-a943-42010a800002", APIVersion:"v1", ResourceVersion:"13346", FieldPath:""}, Reason:"FailedMount", Message:"Unable to mount volumes for pod \"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002_e2e-tests-configmap-nscgm(a4ddbccc-4bf5-11e7-a943-42010a800002)\": timeout expired waiting for volumes to attach/mount for pod \"e2e-tests-configmap-nscgm\"/\"pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002\". list of unattached/unmounted volumes=[configmap-volume default-token-62kcn]", Source:v1.EventSource{Component:"kubelet", Host:"e2e-35035-minion-group-tkqf"}, FirstTimestamp:v1.Time{Time:time.Time{sec:63632487331, nsec:641091540, loc:(*time.Location)(0x57fe020)}}, LastTimestamp:v1.Time{Time:time.Time{sec:63632487331, nsec:641091540, loc:(*time.Location)(0x57fe020)}}, Count:1, Type:"Warning"}': 'events "pod-configmaps-a4d08477-4bf5-11e7-be23-0242ac110002.14c60601f61ce5d4" is forbidden: unable to create new content in namespace e2e-tests-configmap-nscgm because it is being terminated.' (will not retry!)

the e2e test times out after 5 minutes waiting for the namespace to delete and starts scanning for remaining content to report on:
I0608 02:58:31.226107       5 wrap.go:42] GET /apis/rbac.authorization.k8s.io/v1alpha1/namespaces/e2e-tests-configmap-nscgm/roles: (7.108828ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426]
I0608 02:58:31.233988       5 wrap.go:42] GET /apis/extensions/v1beta1/namespaces/e2e-tests-configmap-nscgm/daemonsets: (6.01749ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426]
I0608 02:58:31.271034       5 wrap.go:42] GET /apis/extensions/v1beta1/namespaces/e2e-tests-configmap-nscgm/replicationcontrollers: (725.384µs) 404 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426]
I0608 02:58:31.343285       5 wrap.go:42] GET /api/v1/namespaces/e2e-tests-configmap-nscgm/persistentvolumeclaims: (28.94551ms) 200 [[e2e.test/v1.8.0 (linux/amd64) kubernetes/a56f5b2] 104.197.171.12:41426]

namespace controller finally starts to clean up:
I0608 02:59:05.237681       5 namespaced_resources_deleter.go:496] namespace controller - deleteAllContent - namespace: e2e-tests-configmap-nscgm
I0608 02:59:47.452937       5 namespaced_resources_deleter.go:522] namespace controller - deleteAllContent - namespace: e2e-tests-configmap-nscgm, estimate: 0
I0608 02:59:48.997657       5 wrap.go:42] DELETE /api/v1/namespaces/e2e-tests-configmap-nscgm: (1.13331104s) 200 [[kube-controller-manager/v1.8.0 (linux/amd64) kubernetes/a56f5b2/system:serviceaccount:kube-system:namespace-controller] [::1]:48750]
I0608 02:59:49.252454       5 namespace_controller.go:166] Finished syncing namespace "e2e-tests-configmap-nscgm" (44.516925479s) 

mikedanese (Member) commented Jun 8, 2017

Ref #45666 (gzip PR).

mikedanese (Member) commented:

cc @kubernetes/sig-release-bugs @dchen1107 this has been blocking the SQ for 3 days and we still haven't pinned the root cause.

foxish (Contributor) commented Jun 8, 2017

The batched merges haven't been blocked by this for the past 12 hours or so. The only PR I see that may have fixed this is #47086.

dchen1107 (Member) commented Jun 8, 2017

Yes, this is our top issue. So far, ~50% of batch merges have failed due to it, but I noticed that it has been getting better since this morning.

dchen1107 (Member) commented:

Re: #47135 (comment)

@liggitt Do we have more stats on the increasing latency of API requests? @gmarek @wojtek-t Given #44003 was fixed for the cluster perf dashboard, can we have an A/B comparison of API latency between 1.6 and 1.7? cc/ @k8s-mirror-api-machinery-misc

If those measurements show no latency regression, I suggest we close this issue for 1.7. There is no need to artificially change real production code to fix the test infrastructure, given it is not that mature.

On the other hand, we should open a new issue to figure out the bottleneck behind the namespace slowness and generate a benchmark for this release, so that we can easily decide whether there is a performance regression in future releases.

liggitt (Member, Author) commented Jun 14, 2017

@liggitt Do we have more stats on increasing latency on API requests?

Discovery API requests are taking ~5-10ms at the beginning of the e2e run and ~200-500ms at the end. I'm trying to capture memory and CPU profiles throughout the e2e run in #47527 to help determine the root cause.

dchen1107 (Member) commented:

@liggitt I updated at #47446 (comment)

crimsonfaith91 (Contributor) commented:

One of my PRs is affected too: #47469

Part of the build log:

I0614 16:28:49.906] • Failure in Spec Teardown (AfterEach) [611.751 seconds]
I0614 16:28:49.906] [k8s.io] InitContainer
I0614 16:28:49.906] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/framework/framework.go:619
I0614 16:28:49.906]   should not start app containers and fail the pod if init containers fail on a RestartNever pod [AfterEach]
I0614 16:28:49.907]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/common/init_container.go:452
I0614 16:28:49.907] 
I0614 16:28:49.907]   Jun 14 16:28:49.898: Couldn't delete ns: "e2e-tests-init-container-91xzd": namespace e2e-tests-init-container-91xzd was not deleted with limit: timed out waiting for the condition, pods remaining: 2, pods missing deletion timestamp: 2 (&errors.errorString{s:"namespace e2e-tests-init-container-91xzd was not deleted with limit: timed out waiting for the condition, pods remaining: 2, pods missing deletion timestamp: 2"})

I will follow this issue closely and retest the PR frequently.

dchen1107 (Member) commented:

Please see my latest update at #47446 (comment)

I am suggesting to close this one too, but I will leave @liggitt to make the final call. Thanks!

liggitt removed the priority/critical-urgent and queue/blocks-others labels Jun 17, 2017
liggitt (Member, Author) commented Jun 17, 2017

Since the test infra changes have resolved the blocking issue, and the cause of the performance hit during this test is known, I'm fine with lowering the severity and moving this out of the milestone. I would like to retitle this and keep it open to briefly investigate whether we can make the mutation detector more of a constant-cost operation throughout the test run.

@liggitt liggitt removed this from the v1.7 milestone Jun 17, 2017
@liggitt liggitt changed the title pull-kubernetes-e2e-gce-etcd3 fails with "Couldn't delete ns" error cache mutation detector causes memory/cpu pressure at the end of long e2e runs (like pull-kubernetes-e2e-gce-etcd3) Jun 17, 2017
dchen1107 (Member) commented:

Thanks, @liggitt

@yujuhong yujuhong removed their assignment Jun 23, 2017
liggitt added the area/test label and removed the sig/node, sig/release, and sig/scalability labels Jun 24, 2017
k8s-github-robot commented:

This issue hasn't been active in 34 days. It will be closed in 55 days (Sep 18, 2017).

cc @liggitt @mikedanese

You can add the 'keep-open' label to prevent this from happening, or add a comment to keep it open for another 90 days.

@liggitt liggitt closed this as completed Jul 26, 2017
mangelajo added a commit to mangelajo/kubernetes that referenced this issue Jul 3, 2019
DefaultNamespaceDeletionTimeout was set to 10 minutes at some
point because of kubernetes#47135.

Another option would be just removing the TODO.

Signed-off-by: Miguel Angel Ajo <majopela@redhat.com>