
unable to detach ENI that was created by aws-k8s-tester due to permissions issue #70

Closed
jaypipes opened this issue Jan 2, 2020 · 14 comments

jaypipes commented Jan 2, 2020

When attempting to delete a cluster that was successfully created using aws-k8s-tester eks create cluster, I am getting "failed to detach ENI: AuthFailure: You do not have permission to access the specified resource" errors.

Here is the log output from the time in question, showing the deletion of the VPC CloudFormation stack taking too long, force-deletion of the subnets failing because of dependency violations, and then the attempts to detach the ENIs blocking those subnets failing due to permission issues:

{"level":"info","ts":"2020-01-02T09:35:17.542-0500","caller":"eks/vpc.go:367","msg":"deleting VPC CFN stack","vpc-cfn-stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a"}
{"level":"info","ts":"2020-01-02T09:35:18.222-0500","caller":"cloudformation/cloudformation.go:40","msg":"polling stack","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","want":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-02T09:35:38.623-0500","caller":"cloudformation/cloudformation.go:112","msg":"sleeping","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","initial-wait":"1m30s"}
{"level":"info","ts":"2020-01-02T09:37:08.624-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"1 minute ago"}
{"level":"info","ts":"2020-01-02T09:37:09.090-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"1 minute ago"}
{"level":"info","ts":"2020-01-02T09:37:18.627-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"2 minutes ago"}
{"level":"info","ts":"2020-01-02T09:37:38.625-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"2 minutes ago"}
{"level":"info","ts":"2020-01-02T09:37:58.629-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"2 minutes ago"}
{"level":"info","ts":"2020-01-02T09:38:18.748-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"3 minutes ago"}
{"level":"warn","ts":"2020-01-02T09:38:18.748-0500","caller":"eks/vpc.go:398","msg":"deleting VPC for longer than 3 minutes; initiating force deletion","vpc-id":"vpc-0ac41feb956cec704"}
{"level":"warn","ts":"2020-01-02T09:38:19.344-0500","caller":"eks/vpc.go:408","msg":"tried force-delete subnet","subnet-id":"subnet-07c710757186382b4","error":"DependencyViolation: The subnet 'subnet-07c710757186382b4' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 5382ce4a-7e12-4d22-816b-052329cc54b4"}
{"level":"warn","ts":"2020-01-02T09:38:19.789-0500","caller":"eks/vpc.go:408","msg":"tried force-delete subnet","subnet-id":"subnet-0be1986fa232621a5","error":"DependencyViolation: The subnet 'subnet-0be1986fa232621a5' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 81078339-d414-440c-9b73-41cb1f8627b1"}
{"level":"warn","ts":"2020-01-02T09:38:20.237-0500","caller":"eks/vpc.go:408","msg":"tried force-delete subnet","subnet-id":"subnet-01bd134698a8c42b7","error":"DependencyViolation: The subnet 'subnet-01bd134698a8c42b7' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: cdc65440-5a33-48de-95f7-17304ae3a5ce"}
{"level":"warn","ts":"2020-01-02T09:38:20.237-0500","caller":"eks/vpc.go:422","msg":"cleaning VPC dependencies","vpc-id":"vpc-0ac41feb956cec704"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-0afc8302280e22af7"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-0f263269a369cd713"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-020ecdf50af3a410f"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-06ae894f5173ffb23"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-0958fe720f0008ec1"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-097418533ad55149d"}
{"level":"warn","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:449","msg":"detaching ENI","eni-id":"eni-0afc8302280e22af7"}
{"level":"error","ts":"2020-01-02T09:38:21.150-0500","caller":"eks/vpc.go:476","msg":"failed to detach ENI","eni-id":"eni-0afc8302280e22af7","error":"AuthFailure: You do not have permission to access the specified resource.\n\tstatus code: 400, request id: 8be91608-f92e-4e1d-85b5-588bbefe4631","stacktrace":"github.com/aws/aws-k8s-tester/eks.(*Tester).deleteVPC\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/vpc.go:476\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:526\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).Down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:546\ngithub.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks.deleteClusterFunc\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks/delete.go:48\ngithub.com/spf13/cobra.(*Command).execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864\nmain.main\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
{"level":"error","ts":"2020-01-02T09:38:26.652-0500","caller":"eks/vpc.go:476","msg":"failed to detach ENI","eni-id":"eni-0afc8302280e22af7","error":"AuthFailure: You do not have permission to access the specified resource.\n\tstatus code: 400, request id: 40bb0efd-8f1b-472f-9daa-0a2bdc321a0c","stacktrace":"github.com/aws/aws-k8s-tester/eks.(*Tester).deleteVPC\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/vpc.go:476\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:526\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).Down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:546\ngithub.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks.deleteClusterFunc\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks/delete.go:48\ngithub.com/spf13/cobra.(*Command).execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864\nmain.main\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
{"level":"error","ts":"2020-01-02T09:38:32.156-0500","caller":"eks/vpc.go:476","msg":"failed to detach ENI","eni-id":"eni-0afc8302280e22af7","error":"AuthFailure: You do not have permission to access the specified resource.\n\tstatus code: 400, request id: dc6721a1-77b6-4f27-b6a3-f0eadb13b73d","stacktrace":"github.com/aws/aws-k8s-tester/eks.(*Tester).deleteVPC\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/vpc.go:476\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:526\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).Down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:546\ngithub.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks.deleteClusterFunc\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks/delete.go:48\ngithub.com/spf13/cobra.(*Command).execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864\nmain.main\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
{"level":"error","ts":"2020-01-02T09:38:37.716-0500","caller":"eks/vpc.go:476","msg":"failed to detach ENI","eni-id":"eni-0afc8302280e22af7","error":"AuthFailure: You do not have permission to access the specified resource.\n\tstatus code: 400, request id: 91b030ef-4120-4d85-9dd9-f574475eb1d5","stacktrace":"github.com/aws/aws-k8s-tester/eks.(*Tester).deleteVPC\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/vpc.go:476\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:526\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).Down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:546\ngithub.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks.deleteClusterFunc\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks/delete.go:48\ngithub.com/spf13/cobra.(*Command).execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864\nmain.main\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

I have no idea why the user would not have permission to access a resource that it had created just a few minutes earlier when creating the cluster itself... :(


jaypipes commented Jan 2, 2020

I think that the managed node group stuff is not working properly during deletion. I'm going to re-run my tests (after cleaning up all these orphaned objects) without the managed node group stuff and see if it is indeed isolated to the setup of the MNG.


jaypipes commented Jan 2, 2020

After looking at the orphaned objects, it looks like all ENIs that were left around after an aws-k8s-tester delete cluster have descriptions prefixed with "ELB net/" or "ELB app/". Perhaps the load balancers that are created as part of the NLB hello world service/deployment setup are not being properly destroyed? Or perhaps they are being created with a different role than what is trying to delete them?


jaypipes commented Jan 2, 2020

Yeah, it looks like what is happening is that the NLB or ALB created by the Service/Deployment in the NLB hello world and ALB2048 tests are left around even after the Kubernetes cluster is deleted. This means that when attempting to delete subnets during the deleteVPC stage, aws-k8s-tester is getting failures due to the subnets having active ENIs on them. And aws-k8s-tester is not able to detach these ENIs because they were created by the IAM role that was running the Kubernetes cluster's LoadBalancer AWS driver, and the Kubernetes cluster has already been deleted at this point, along with the IAM role that created the NLB and ALB. :(

jaypipes added the bug label Jan 2, 2020

jaypipes commented Jan 2, 2020

I think a potential solution to this -- or at least a workaround for the time being, until there is a way to prevent a Kubernetes cluster or namespace from being deleted while there are active cloud LB resources still associated with a deleted k8s LoadBalancer object -- is to simply add more of a wait before deleting the managed node group role and starting the deletion of the VPCs. I'll push up a PR that does just that.
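A minimal sketch of what such a wait could look like, assuming we poll for ENIs remaining in the test VPC instead of sleeping for a fixed duration; the waitForENIRelease helper, its timeout, and its poll interval are illustrative and not the actual aws-k8s-tester implementation:

	// Hypothetical sketch: before tearing down the node group role and VPC,
	// poll for ENIs still present in the test VPC so the cloud provider has
	// time to release the load balancer resources.
	package main

	import (
		"fmt"
		"time"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/aws/session"
		"github.com/aws/aws-sdk-go/service/ec2"
	)

	func waitForENIRelease(svc *ec2.EC2, vpcID string, timeout time.Duration) error {
		deadline := time.Now().Add(timeout)
		for {
			out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
				Filters: []*ec2.Filter{
					{Name: aws.String("vpc-id"), Values: []*string{aws.String(vpcID)}},
				},
			})
			if err != nil {
				return err
			}
			if len(out.NetworkInterfaces) == 0 {
				return nil // all ENIs released; safe to delete subnets and the VPC
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("%d ENIs still present in %s after %v", len(out.NetworkInterfaces), vpcID, timeout)
			}
			time.Sleep(15 * time.Second)
		}
	}

	func main() {
		svc := ec2.New(session.Must(session.NewSession()))
		// VPC ID taken from the log output above, purely as an example.
		if err := waitForENIRelease(svc, "vpc-0ac41feb956cec704", 10*time.Minute); err != nil {
			fmt.Println("wait failed:", err)
		}
	}

Polling like this only helps if the cloud provider eventually releases the ENIs on its own; it would not help if the ENIs are never cleaned up at all.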

jaypipes self-assigned this Jan 2, 2020

jaypipes commented Jan 2, 2020

Note that LB finalizer support was only alpha in 1.15: kubernetes/kubernetes#78262.

jaypipes added a commit to jaypipes/aws-k8s-tester that referenced this issue Jan 2, 2020
Wait before trying to delete the k8s namespace on cluster down when MNG
and the ALB/NLB testers have been run.

We ask Kubernetes to delete the NLB hello world and ALB2048
Deployment/Service above, and both of these interact with the underlying
Kubernetes AWS cloud provider to clean up the cloud load balancer
backing the Service of type LoadBalancer. The calls to delete the
Service return immediately (successfully) but the cloud load balancer
resources may not have been deleted yet, including the ENIs that were
associated with the cloud load balancer. When, later, aws-k8s-tester
tries deleting the VPC associated with the test cluster, it will run
into permissions issues because the IAM role that created the ENIs in
the subnets associated with the cloud load balancers will no longer
exist.

Issue aws#70

jaypipes commented Jan 3, 2020

Still not fixed. Using v0.5.3 this morning, EKS cluster came up properly, ran NLB and ALB probes successfully, ran integration tests and then started to delete the cluster and got the same output:

{"level":"info","ts":"2020-01-03T08:18:07.851-0500","caller":"eks/eks.go:262","msg":"Up completed","config-path":"/tmp/cni-test/cluster-cni-test-17428/cni-test-17428.yaml","KUBECONFIG":"/tmp/cni-test/cluster-cni-test-17428/kubeconfig","cluster-arn":"arn:aws:eks:us-west-2:750630568209:cluster/eks-2020010312-pdx-us-west-2-tafjy","request-started":"31 minutes ago"}
'aws-k8s-tester eks create cluster' success
{"level":"info","ts":"2020-01-03T08:18:33.750-0500","caller":"eks/eks.go:109","msg":"aws version","aws-cli-path":"/usr/local/bin/aws","aws-version":"aws-cli/1.16.189 Python/3.7.5 Linux/5.3.0-24-generic botocore/1.12.179\n"}
{"level":"info","ts":"2020-01-03T08:18:33.750-0500","caller":"eks/eks.go:115","msg":"mkdir","kubectl-path-dir":"/tmp/aws-k8s-tester"}
{"level":"info","ts":"2020-01-03T08:18:33.750-0500","caller":"eks/eks.go:119","msg":"downloading kubectl","kubectl-path":"/tmp/aws-k8s-tester/kubectl"}
{"level":"info","ts":"2020-01-03T08:18:33.756-0500","caller":"eks/utils.go:44","msg":"downloading","url":"https://amazon-eks.s3-us-west-2.amazonaws.com/1.14.6/2019-08-22/bin/linux/amd64/kubectl"}
{"level":"info","ts":"2020-01-03T08:18:39.318-0500","caller":"eks/utils.go:60","msg":"downloaded","url":"https://amazon-eks.s3-us-west-2.amazonaws.com/1.14.6/2019-08-22/bin/linux/amd64/kubectl","file-path":"/tmp/aws-k8s-tester/kubectl"}
{"level":"info","ts":"2020-01-03T08:18:39.358-0500","caller":"eks/eks.go:152","msg":"kubectl version","kubectl-path":"/tmp/aws-k8s-tester/kubectl","kubectl-version":"Client Version: version.Info{Major:\"1\", Minor:\"14+\", GitVersion:\"v1.14.7-eks-1861c5\", GitCommit:\"1861c597586f84f1498a9f2151c78d8a6bf47814\", GitTreeState:\"clean\", BuildDate:\"2019-09-24T22:12:08Z\", GoVersion:\"go1.12.9\", Compiler:\"gc\", Platform:\"linux/amd64\"}\n"}
{"level":"info","ts":"2020-01-03T08:18:39.358-0500","caller":"awsapi/aws.go:69","msg":"creating session from AWS cred file","path":"/home/jaypipes/.aws/credentials"}
{"level":"info","ts":"2020-01-03T08:18:39.752-0500","caller":"awsapi/aws.go:99","msg":"creating AWS session","account-id":"750630568209","user-id":"AIDA25RIRQEI4RIUYVNYD","arn":"arn:aws:iam::750630568209:user/jaypipes"}
{"level":"info","ts":"2020-01-03T08:18:39.752-0500","caller":"awsapi/aws.go:69","msg":"creating session from AWS cred file","path":"/home/jaypipes/.aws/credentials"}
{"level":"info","ts":"2020-01-03T08:18:39.813-0500","caller":"awsapi/aws.go:99","msg":"creating AWS session","account-id":"750630568209","user-id":"AIDA25RIRQEI4RIUYVNYD","arn":"arn:aws:iam::750630568209:user/jaypipes"}
{"level":"warn","ts":"2020-01-03T08:18:39.824-0500","caller":"eks/eks.go:418","msg":"starting Down","name":"eks-2020010312-pdx-us-west-2-tafjy","config-path":"/tmp/cni-test/cluster-cni-test-17428/cni-test-17428.yaml","cluster-arn":"arn:aws:eks:us-west-2:750630568209:cluster/eks-2020010312-pdx-us-west-2-tafjy"}
{"level":"info","ts":"2020-01-03T08:18:39.824-0500","caller":"eks/key-pair.go:69","msg":"deleting a key pair","key-pair-name":"eks-2020010312-pdx-us-west-2-tafjy-ssh-key-pair"}
{"level":"info","ts":"2020-01-03T08:18:39.824-0500","caller":"eks/cluster-role.go:165","msg":"deleting role CFN stack","cluster-role-cfn-stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-cluster-role/11e6c4c0-2e27-11ea-b176-029123b1cc42"}
{"level":"info","ts":"2020-01-03T08:18:40.414-0500","caller":"cloudformation/cloudformation.go:40","msg":"polling stack","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-cluster-role/11e6c4c0-2e27-11ea-b176-029123b1cc42","want":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-03T08:18:41.686-0500","caller":"eks/key-pair.go:102","msg":"deleted a key pair","key-pair-name":"eks-2020010312-pdx-us-west-2-tafjy-ssh-key-pair"}
{"level":"info","ts":"2020-01-03T08:18:41.686-0500","caller":"eks/eks.go:485","msg":"waited delete goroutine with no error","name":"EC2 key pair","waited-goroutines":1}
{"level":"info","ts":"2020-01-03T08:18:50.821-0500","caller":"cloudformation/cloudformation.go:112","msg":"sleeping","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-cluster-role/11e6c4c0-2e27-11ea-b176-029123b1cc42","initial-wait":"25s"}
{"level":"info","ts":"2020-01-03T08:19:15.821-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-cluster-role/11e6c4c0-2e27-11ea-b176-029123b1cc42","stack-name":"eks-2020010312-pdx-us-west-2-tafjy-cluster-role","current":"DELETE_COMPLETE","want":"DELETE_COMPLETE","reason":"","request-started":"35 seconds ago"}
{"level":"info","ts":"2020-01-03T08:19:15.822-0500","caller":"cloudformation/cloudformation.go:166","msg":"became desired stack status; exiting","stack-status":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-03T08:19:15.822-0500","caller":"eks/cluster-role.go:197","msg":"deleted a role","cluster-role-cfn-stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-cluster-role/11e6c4c0-2e27-11ea-b176-029123b1cc42","cluster-role-arn":"arn:aws:iam::750630568209:role/eks-2020010312-pdx-us-west-2-tafjy-cluster-role","cluster-role-name":"eks-2020010312-pdx-us-west-2-tafjy-cluster-role"}
{"level":"info","ts":"2020-01-03T08:19:15.834-0500","caller":"eks/eks.go:485","msg":"waited delete goroutine with no error","name":"cluster role","waited-goroutines":2}
{"level":"warn","ts":"2020-01-03T08:19:15.834-0500","caller":"eks/namespace.go:36","msg":"skipping namespace delete; empty k8s client"}
{"level":"info","ts":"2020-01-03T08:19:15.834-0500","caller":"eks/managed-node-group.go:595","msg":"deleting managed node group","managed-node-group-name":"eks-2020010312-pdx-us-west-2-tafjy-managed-node-group"}
{"level":"info","ts":"2020-01-03T08:19:16.324-0500","caller":"cloudformation/cloudformation.go:40","msg":"polling stack","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group/845c4dc0-2e29-11ea-aef0-06be751c1bd6","want":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-03T08:19:31.723-0500","caller":"cloudformation/cloudformation.go:112","msg":"sleeping","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group/845c4dc0-2e29-11ea-aef0-06be751c1bd6","initial-wait":"2m0s"}
{"level":"info","ts":"2020-01-03T08:21:31.724-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group/845c4dc0-2e29-11ea-aef0-06be751c1bd6","stack-name":"eks-2020010312-pdx-us-west-2-tafjy-managed-node-group","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"2 minutes ago"}
...snip...
{"level":"info","ts":"2020-01-03T08:28:16.718-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group/845c4dc0-2e29-11ea-aef0-06be751c1bd6","stack-name":"eks-2020010312-pdx-us-west-2-tafjy-managed-node-group","current":"DELETE_COMPLETE","want":"DELETE_COMPLETE","reason":"","request-started":"9 minutes ago"}
{"level":"info","ts":"2020-01-03T08:28:16.718-0500","caller":"cloudformation/cloudformation.go:166","msg":"became desired stack status; exiting","stack-status":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-03T08:28:16.730-0500","caller":"eks/managed-node-group.go:661","msg":"deleted a managed node group","managed-node-group-cfn-stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group/845c4dc0-2e29-11ea-aef0-06be751c1bd6","managed-node-group-name":"eks-2020010312-pdx-us-west-2-tafjy-managed-node-group"}
{"level":"info","ts":"2020-01-03T08:28:16.741-0500","caller":"eks/eks.go:525","msg":"sleeping before node group role deletion","wait":"5s"}
{"level":"info","ts":"2020-01-03T08:28:21.741-0500","caller":"eks/managed-node-group-role.go:163","msg":"deleting managed node group role CFN stack","managed-node-group-role-cfn-stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group-role/59dd65c0-2e29-11ea-8369-020934000518"}
{"level":"info","ts":"2020-01-03T08:28:21.940-0500","caller":"cloudformation/cloudformation.go:40","msg":"polling stack","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group-role/59dd65c0-2e29-11ea-8369-020934000518","want":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-03T08:28:32.346-0500","caller":"cloudformation/cloudformation.go:112","msg":"sleeping","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group-role/59dd65c0-2e29-11ea-8369-020934000518","initial-wait":"1m0s"}
{"level":"info","ts":"2020-01-03T08:29:32.346-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group-role/59dd65c0-2e29-11ea-8369-020934000518","stack-name":"eks-2020010312-pdx-us-west-2-tafjy-managed-node-group-role","current":"DELETE_FAILED","want":"DELETE_COMPLETE","reason":"The following resource(s) failed to delete: [ManagedNodeGroupRole]. ","request-started":"1 minute ago"}

The interesting thing is this line:

{"level":"info","ts":"2020-01-03T08:19:15.834-0500","caller":"eks/eks.go:485","msg":"waited delete goroutine with no error","name":"cluster role","waited-goroutines":2}

What that corresponds to is the following code (aws-k8s-tester/eks/eks.go, lines 439 to 487 at commit 456eccc):

	waits++
	go func() {
		ch <- errorData{name: "cluster role", err: ts.deleteClusterRole()}
	}()
	if ts.cfg.Parameters.ManagedNodeGroupCreate {
		waits++
		go func() {
			ch <- errorData{name: "EC2 key pair", err: ts.deleteKeyPair()}
		}()
		if ts.cfg.AddOnJobEcho.Enable && ts.jobEchoTester != nil {
			waits++
			go func() {
				ch <- errorData{name: "Job echo", err: ts.jobEchoTester.Delete()}
			}()
		}
		if ts.cfg.AddOnJobPerl.Enable && ts.jobPiTester != nil {
			waits++
			go func() {
				ch <- errorData{name: "Job Pi", err: ts.jobPiTester.Delete()}
			}()
		}
		if ts.cfg.AddOnALB2048.Enable && ts.alb2048Tester != nil {
			waits++
			go func() {
				ch <- errorData{name: "ALB", err: ts.alb2048Tester.Delete()}
			}()
		}
		if ts.cfg.AddOnNLBHelloWorld.Enable && ts.nlbHelloWorldTester != nil {
			waits++
			go func() {
				ch <- errorData{name: "NLB", err: ts.nlbHelloWorldTester.Delete()}
			}()
		}
	}
	var errs []string
	for i := 0; i < waits; i++ {
		if d := <-ch; d.err != nil {
			ts.lg.Warn("failed to delete",
				zap.String("name", d.name),
				zap.Int("waited-goroutines", i+1),
				zap.Error(d.err),
			)
			errs = append(errs, d.err.Error())
		} else {
			ts.lg.Info("waited delete goroutine with no error", zap.String("name", d.name), zap.Int("waited-goroutines", i+1))
		}
	}

which, unless I'm mistaken, SHOULD HAVE read "waited-goroutines": 6, since the cluster role and EC2 key pair needed to be deleted and the echo tester, Pi tester, ALB and NLB testers all ran successfully. For some reason those testers' Delete() methods aren't being called...

What that means is that this conditional:

if ts.cfg.Parameters.ManagedNodeGroupCreate

evaluated to true, but conditionals like this one:

if ts.cfg.AddOnALB2048.Enable && ts.alb2048Tester != nil {

did not.


jaypipes commented Jan 3, 2020

Think I figured out what's going on. The individual echo/NLB/ALB tester instances are nil when Tester.down() has been called from an invocation of aws-k8s-tester delete cluster.

When aws-k8s-tester create cluster is called, the Tester.up() method ends up creating these sub-tester objects, like so:

		if ts.cfg.AddOnJobEcho.Enable {
			ts.jobEchoTester, err = jobs.New(jobs.Config{
				Logger:    ts.lg,
				Stopc:     ts.stopCreationCh,
				Sig:       ts.interruptSig,
				K8SClient: ts.k8sClientSet,
				Namespace: ts.cfg.Name,
				JobName:   jobs.JobNameEcho,
				Completes: ts.cfg.AddOnJobEcho.Completes,
				Parallels: ts.cfg.AddOnJobEcho.Parallels,
				EchoSize:  ts.cfg.AddOnJobEcho.Size,
			})
			if err != nil {
				return err
			}
			if err := catchInterrupt(ts.lg, ts.stopCreationCh, ts.stopCreationChOnce, ts.interruptSig, ts.jobEchoTester.Create); err != nil {
				return err
			}
		}

However, when Tester.down() is called, there is a check to see if the sub-tester instances are non-nil:

		if ts.cfg.AddOnJobEcho.Enable && ts.jobEchoTester != nil {
			waits++
			go func() {
				ch <- errorData{name: "Job echo", err: ts.jobEchoTester.Delete()}
			}()
		}

and when Tester.down() is called from aws-k8s-tester delete cluster, those sub-tester instances are all nil.

The solution is to always create the sub-tester instances (at Tester construction time) but not create the Job in Kubernetes.
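Here is a self-contained sketch of that idea with hypothetical types (subTester, echoJobTester, and tester are stand-ins, not the real aws-k8s-tester structs): the parent constructs every sub-tester eagerly so that down() can always call Delete(), while Create() is only invoked from up():

	package main

	import "fmt"

	// subTester is an illustrative stand-in for the jobs/alb/nlb tester interfaces.
	type subTester interface {
		Create() error
		Delete() error
	}

	type echoJobTester struct{ name string }

	func (t *echoJobTester) Create() error { fmt.Println("creating", t.name); return nil }
	func (t *echoJobTester) Delete() error { fmt.Println("deleting", t.name); return nil }

	type tester struct {
		subs []subTester // constructed eagerly, created in Kubernetes lazily
	}

	func newTester() *tester {
		// Eager construction: down() can iterate subs even if up() never ran.
		return &tester{subs: []subTester{&echoJobTester{name: "Job echo"}}}
	}

	func (ts *tester) up() error {
		for _, s := range ts.subs {
			if err := s.Create(); err != nil {
				return err
			}
		}
		return nil
	}

	func (ts *tester) down() error {
		for _, s := range ts.subs {
			if err := s.Delete(); err != nil {
				return err
			}
		}
		return nil
	}

	func main() {
		ts := newTester()
		_ = ts.down() // works even without ts.up(), unlike the nil sub-tester case
	}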


gyuho commented Jan 3, 2020

and when Tester.down() is called from aws-k8s-tester delete cluster, those sub-tester instances are all nil.

Right... we would need to update ts.jobEchoTester when down is called via the sub-command.

jaypipes added a commit to jaypipes/aws-k8s-tester that referenced this issue Jan 3, 2020
When tearing down Kubernetes resources created by the sub-testers
(e.g. NLB hello world), Tester.down() was skipping the deletion of those
resources when `aws-k8s-tester delete cluster` was being run. This was
because the sub-tester instances were only constructed when Tester.up()
was called.

This patch fixes that by always creating the sub-tester instances on
construction of the main Tester object.

Issue aws#70
jaypipes added a commit to jaypipes/aws-k8s-tester that referenced this issue Jan 3, 2020
Instead of creating a K8sClient object when building the sub-tester
configs, pass a pointer to the main Tester struct, which has a
KubernetesClientSet() method that returns a pointer to
clientset.ClientSet. This ensures that (a) each sub-tester object uses a
single clientset instance and (b) the clientset can be constructed even
if the Tester has not yet been initialized (by calling
updateK8sClientSet() from Tester.KubernetesClientSet()).

Issue aws#70
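A hedged, self-contained sketch of the accessor pattern described in that commit message, using placeholder names (clientSet stands in for the real clientset.ClientSet, and the real updateK8sClientSet() would build the clientset from the written kubeconfig rather than return an empty struct):

	package main

	import "sync"

	// clientSet is an illustrative stand-in for *kubernetes.Clientset.
	type clientSet struct{}

	type tester struct {
		mu           sync.Mutex
		k8sClientSet *clientSet
	}

	// KubernetesClientSet returns the shared client set, constructing it on first use,
	// so it works even when down() runs without up() having been called.
	func (ts *tester) KubernetesClientSet() *clientSet {
		ts.mu.Lock()
		defer ts.mu.Unlock()
		if ts.k8sClientSet == nil {
			ts.k8sClientSet = ts.updateK8sClientSet()
		}
		return ts.k8sClientSet
	}

	// updateK8sClientSet is a placeholder for building the real clientset.
	func (ts *tester) updateK8sClientSet() *clientSet { return &clientSet{} }

	// subTesterConfig shows sub-testers receiving the accessor instead of a raw client.
	type subTesterConfig struct {
		K8SClient interface{ KubernetesClientSet() *clientSet }
	}

	func main() {
		ts := &tester{}
		cfg := subTesterConfig{K8SClient: ts}
		_ = cfg.K8SClient.KubernetesClientSet() // same instance shared by every sub-tester
	}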
jaypipes added a commit to jaypipes/amazon-vpc-cni-k8s that referenced this issue Jan 3, 2020
The names of the environment variables accepted by aws-k8s-tester changed
when the managed node group functionality was introduced. This commit
updates the integration testing scripts to call aws-k8s-tester (v0.5.4,
the release that contains the fix for aws/aws-k8s-tester#70) with
these updated environment variables.

We decrease the number of parallel runs of the echo job from 100 to
3 and the number of completions for that job from 1000 to 30. This
decreases the setup time of the cluster by about 10 minutes.

Finally, I added a short-circuit to prevent double-deprovisioning of
the cluster if, say, a stacktrace occurred when running the
aws-k8s-tester tool.

Issue aws#686
Issue aws#784
Issue aws#786
jaypipes added a commit to jaypipes/amazon-vpc-cni-k8s that referenced this issue Jan 17, 2020
mogren pushed a commit to aws/amazon-vpc-cni-k8s that referenced this issue Jan 20, 2020

gyuho commented Jan 30, 2020

I am still seeing this same issue... Deleting the Service somehow does not delete the ENIs.

gyuho added a commit to gyuho/aws-k8s-tester that referenced this issue Jan 30, 2020
ref. aws#70

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
mogren pushed a commit to mogren/amazon-vpc-cni-k8s that referenced this issue Jan 31, 2020
jaypipes added a commit to aws/amazon-vpc-cni-k8s that referenced this issue Jan 31, 2020
gyuho added a commit to gyuho/aws-k8s-tester that referenced this issue Jan 31, 2020
ref. aws#70

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho added a commit to gyuho/aws-k8s-tester that referenced this issue Feb 1, 2020
ref. aws#70

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho added a commit that referenced this issue Feb 1, 2020
ref. #70

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>

gyuho commented Feb 3, 2020

Discussed with @mogren in person. I will try adding more wait-time between ENI detach and delete with retries.
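A minimal sketch of that flow using the aws-sdk-go EC2 API; the retry count and sleep intervals are illustrative, not the values aws-k8s-tester ended up using:

	package main

	import (
		"fmt"
		"time"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/aws/session"
		"github.com/aws/aws-sdk-go/service/ec2"
	)

	func detachAndDeleteENI(svc *ec2.EC2, eniID string) error {
		out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
			NetworkInterfaceIds: []*string{aws.String(eniID)},
		})
		if err != nil {
			return err
		}
		if len(out.NetworkInterfaces) == 1 && out.NetworkInterfaces[0].Attachment != nil {
			// Force-detach, then give EC2 time to settle before deleting.
			_, err = svc.DetachNetworkInterface(&ec2.DetachNetworkInterfaceInput{
				AttachmentId: out.NetworkInterfaces[0].Attachment.AttachmentId,
				Force:        aws.Bool(true),
			})
			if err != nil {
				return err
			}
			time.Sleep(10 * time.Second)
		}
		// Deletion can still fail transiently (e.g. "currently in use"), so retry.
		for i := 0; i < 5; i++ {
			_, err = svc.DeleteNetworkInterface(&ec2.DeleteNetworkInterfaceInput{
				NetworkInterfaceId: aws.String(eniID),
			})
			if err == nil {
				return nil
			}
			time.Sleep(10 * time.Second)
		}
		return fmt.Errorf("failed to delete %s after retries: %v", eniID, err)
	}

	func main() {
		svc := ec2.New(session.Must(session.NewSession()))
		// ENI ID taken from the log output above, purely as an example.
		if err := detachAndDeleteENI(svc, "eni-0afc8302280e22af7"); err != nil {
			fmt.Println(err)
		}
	}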


gyuho commented Feb 4, 2020

Made some progress in the latest code changes (https://github.com/aws/aws-k8s-tester/blob/master/CHANGELOG-0.5.md):

  • Add more waits
  • Add more retries
  • Delete the ELB on the second try if a dependent target group is found (see the sketch after the link below)

https://github.com/aws/aws-k8s-tester/blob/master/eks/elb/elb.go
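A hedged sketch of the third bullet above, using the elbv2 API; this is not the code in eks/elb/elb.go, only an illustration of deleting a load balancer and then retrying deletion of its target groups, which can fail while listeners still reference them:

	package main

	import (
		"fmt"
		"time"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/aws/session"
		"github.com/aws/aws-sdk-go/service/elbv2"
	)

	func deleteLBAndTargetGroups(svc *elbv2.ELBV2, lbARN string) error {
		// Find the target groups bound to this load balancer before it disappears.
		tgs, err := svc.DescribeTargetGroups(&elbv2.DescribeTargetGroupsInput{
			LoadBalancerArn: aws.String(lbARN),
		})
		if err != nil {
			return err
		}

		if _, err := svc.DeleteLoadBalancer(&elbv2.DeleteLoadBalancerInput{
			LoadBalancerArn: aws.String(lbARN),
		}); err != nil {
			return err
		}

		// Target groups may still report themselves in use right after the LB
		// delete call returns, so retry a few times.
		for _, tg := range tgs.TargetGroups {
			for i := 0; i < 5; i++ {
				if _, err = svc.DeleteTargetGroup(&elbv2.DeleteTargetGroupInput{
					TargetGroupArn: tg.TargetGroupArn,
				}); err == nil {
					break
				}
				time.Sleep(10 * time.Second)
			}
			if err != nil {
				return fmt.Errorf("failed to delete target group %s: %v", aws.StringValue(tg.TargetGroupArn), err)
			}
		}
		return nil
	}

	func main() {
		svc := elbv2.New(session.Must(session.NewSession()))
		// The ARN here is a placeholder, not a value from this issue.
		if err := deleteLBAndTargetGroups(svc, "arn:aws:elasticloadbalancing:example"); err != nil {
			fmt.Println(err)
		}
	}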


gyuho commented Feb 6, 2020

Closing for now. After adding more retries and waits, the resource clean-up operation is stable. Will reopen if it happens again.

Thanks @jaypipes !

gyuho closed this as completed Feb 6, 2020