
unable to detach ENI that was created by aws-k8s-tester due to permissions issue #70

Closed
jaypipes opened this issue Jan 2, 2020 · 14 comments

jaypipes commented Jan 2, 2020

When attempting to delete a cluster that was successfully created using aws-k8s-tester eks create cluster, I am getting "failed to detach ENI: AuthFailure: You do not have permission to access the specified resource" errors.

Here is the log output from the time in question, showing the deletion of the VPC CloudFormation stack taking too long, force-deletion of the subnets failing because of dependency violations, and then the attempts to detach the ENIs blocking those subnets failing due to permission issues:

{"level":"info","ts":"2020-01-02T09:35:17.542-0500","caller":"eks/vpc.go:367","msg":"deleting VPC CFN stack","vpc-cfn-stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a"}
{"level":"info","ts":"2020-01-02T09:35:18.222-0500","caller":"cloudformation/cloudformation.go:40","msg":"polling stack","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","want":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-02T09:35:38.623-0500","caller":"cloudformation/cloudformation.go:112","msg":"sleeping","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","initial-wait":"1m30s"}
{"level":"info","ts":"2020-01-02T09:37:08.624-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"1 minute ago"}
{"level":"info","ts":"2020-01-02T09:37:09.090-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"1 minute ago"}
{"level":"info","ts":"2020-01-02T09:37:18.627-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"2 minutes ago"}
{"level":"info","ts":"2020-01-02T09:37:38.625-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"2 minutes ago"}
{"level":"info","ts":"2020-01-02T09:37:58.629-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"2 minutes ago"}
{"level":"info","ts":"2020-01-02T09:38:18.748-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010213-pdx-us-west-2-belcb-vpc/04d0d020-2d62-11ea-a307-02ca0801967a","stack-name":"eks-2020010213-pdx-us-west-2-belcb-vpc","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"3 minutes ago"}
{"level":"warn","ts":"2020-01-02T09:38:18.748-0500","caller":"eks/vpc.go:398","msg":"deleting VPC for longer than 3 minutes; initiating force deletion","vpc-id":"vpc-0ac41feb956cec704"}
{"level":"warn","ts":"2020-01-02T09:38:19.344-0500","caller":"eks/vpc.go:408","msg":"tried force-delete subnet","subnet-id":"subnet-07c710757186382b4","error":"DependencyViolation: The subnet 'subnet-07c710757186382b4' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 5382ce4a-7e12-4d22-816b-052329cc54b4"}
{"level":"warn","ts":"2020-01-02T09:38:19.789-0500","caller":"eks/vpc.go:408","msg":"tried force-delete subnet","subnet-id":"subnet-0be1986fa232621a5","error":"DependencyViolation: The subnet 'subnet-0be1986fa232621a5' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 81078339-d414-440c-9b73-41cb1f8627b1"}
{"level":"warn","ts":"2020-01-02T09:38:20.237-0500","caller":"eks/vpc.go:408","msg":"tried force-delete subnet","subnet-id":"subnet-01bd134698a8c42b7","error":"DependencyViolation: The subnet 'subnet-01bd134698a8c42b7' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: cdc65440-5a33-48de-95f7-17304ae3a5ce"}
{"level":"warn","ts":"2020-01-02T09:38:20.237-0500","caller":"eks/vpc.go:422","msg":"cleaning VPC dependencies","vpc-id":"vpc-0ac41feb956cec704"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-0afc8302280e22af7"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-0f263269a369cd713"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-020ecdf50af3a410f"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-06ae894f5173ffb23"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-0958fe720f0008ec1"}
{"level":"info","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:438","msg":"found ENI","eni-id":"eni-097418533ad55149d"}
{"level":"warn","ts":"2020-01-02T09:38:20.722-0500","caller":"eks/vpc.go:449","msg":"detaching ENI","eni-id":"eni-0afc8302280e22af7"}
{"level":"error","ts":"2020-01-02T09:38:21.150-0500","caller":"eks/vpc.go:476","msg":"failed to detach ENI","eni-id":"eni-0afc8302280e22af7","error":"AuthFailure: You do not have permission to access the specified resource.\n\tstatus code: 400, request id: 8be91608-f92e-4e1d-85b5-588bbefe4631","stacktrace":"github.com/aws/aws-k8s-tester/eks.(*Tester).deleteVPC\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/vpc.go:476\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:526\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).Down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:546\ngithub.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks.deleteClusterFunc\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks/delete.go:48\ngithub.com/spf13/cobra.(*Command).execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864\nmain.main\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
{"level":"error","ts":"2020-01-02T09:38:26.652-0500","caller":"eks/vpc.go:476","msg":"failed to detach ENI","eni-id":"eni-0afc8302280e22af7","error":"AuthFailure: You do not have permission to access the specified resource.\n\tstatus code: 400, request id: 40bb0efd-8f1b-472f-9daa-0a2bdc321a0c","stacktrace":"github.com/aws/aws-k8s-tester/eks.(*Tester).deleteVPC\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/vpc.go:476\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:526\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).Down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:546\ngithub.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks.deleteClusterFunc\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks/delete.go:48\ngithub.com/spf13/cobra.(*Command).execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864\nmain.main\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
{"level":"error","ts":"2020-01-02T09:38:32.156-0500","caller":"eks/vpc.go:476","msg":"failed to detach ENI","eni-id":"eni-0afc8302280e22af7","error":"AuthFailure: You do not have permission to access the specified resource.\n\tstatus code: 400, request id: dc6721a1-77b6-4f27-b6a3-f0eadb13b73d","stacktrace":"github.com/aws/aws-k8s-tester/eks.(*Tester).deleteVPC\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/vpc.go:476\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:526\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).Down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:546\ngithub.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks.deleteClusterFunc\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks/delete.go:48\ngithub.com/spf13/cobra.(*Command).execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864\nmain.main\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
{"level":"error","ts":"2020-01-02T09:38:37.716-0500","caller":"eks/vpc.go:476","msg":"failed to detach ENI","eni-id":"eni-0afc8302280e22af7","error":"AuthFailure: You do not have permission to access the specified resource.\n\tstatus code: 400, request id: 91b030ef-4120-4d85-9dd9-f574475eb1d5","stacktrace":"github.com/aws/aws-k8s-tester/eks.(*Tester).deleteVPC\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/vpc.go:476\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:526\ngithub.com/aws/aws-k8s-tester/eks.(*Tester).Down\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/eks/eks.go:546\ngithub.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks.deleteClusterFunc\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/eks/delete.go:48\ngithub.com/spf13/cobra.(*Command).execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\t/Users/leegyuho/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864\nmain.main\n\t/Users/leegyuho/go/src/github.com/aws/aws-k8s-tester/cmd/aws-k8s-tester/main.go:37\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

I have no idea why the user would not have permission to access a resource that it had created just a few minutes earlier when creating the cluster itself... :(


jaypipes commented Jan 2, 2020

I think that the managed node group stuff is not working properly during deletion. I'm going to re-run my tests (after cleaning up all these orphaned objects) without the managed node group stuff and see if it is indeed isolated to the setup of the MNG.


jaypipes commented Jan 2, 2020

After looking at the orphaned objects, it looks like all ENIs that were left around after an aws-k8s-tester delete cluster have descriptions prefixed with "ELB net/" or "ELB app/". Perhaps the load balancers that are created as part of the NLB hello world service/deployment setup are not being properly destroyed? Or perhaps they are being created with a different role than what is trying to delete them?


jaypipes commented Jan 2, 2020

Yeah, it looks like what is happening is that the NLB or ALB created by the Service/Deployment in the NLB hello world and ALB2048 tests are left around even after the Kubernetes cluster is deleted. This means that when attempting to delete subnets during the deleteVPC stage, aws-k8s-tester is getting failures due to the subnets having active ENIs on them. And aws-k8s-tester is not able to detach these ENIs because they were created by the IAM role that was running the Kubernetes cluster's LoadBalancer AWS driver, and the Kubernetes cluster has already been deleted at this point, along with the IAM role that created the NLB and ALB. :(

jaypipes added the bug label Jan 2, 2020

jaypipes commented Jan 2, 2020

I think a potential solution to this -- or at least a workaround for the time being, until there is a way to prevent a Kubernetes cluster or namespace from being deleted while there are active cloud LB resources still associated with a deleted k8s LoadBalancer object -- is to simply add more of a wait before deleting the managed node group role and starting the deletion of the VPCs. I'll push up a PR that does just that.
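A minimal sketch of what such a wait could look like, assuming we poll for ENIs remaining in the test VPC instead of sleeping for a fixed duration; the waitForENIRelease helper, its timeout, and its poll interval are illustrative and not the actual aws-k8s-tester implementation:

	// Hypothetical sketch: before tearing down the node group role and VPC,
	// poll for ENIs still present in the test VPC so the cloud provider has
	// time to release the load balancer resources.
	package main

	import (
		"fmt"
		"time"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/aws/session"
		"github.com/aws/aws-sdk-go/service/ec2"
	)

	func waitForENIRelease(svc *ec2.EC2, vpcID string, timeout time.Duration) error {
		deadline := time.Now().Add(timeout)
		for {
			out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
				Filters: []*ec2.Filter{
					{Name: aws.String("vpc-id"), Values: []*string{aws.String(vpcID)}},
				},
			})
			if err != nil {
				return err
			}
			if len(out.NetworkInterfaces) == 0 {
				return nil // all ENIs released; safe to delete subnets and the VPC
			}
			if time.Now().After(deadline) {
				return fmt.Errorf("%d ENIs still present in %s after %v", len(out.NetworkInterfaces), vpcID, timeout)
			}
			time.Sleep(15 * time.Second)
		}
	}

	func main() {
		svc := ec2.New(session.Must(session.NewSession()))
		// VPC ID taken from the log output above, purely as an example.
		if err := waitForENIRelease(svc, "vpc-0ac41feb956cec704", 10*time.Minute); err != nil {
			fmt.Println("wait failed:", err)
		}
	}

Polling like this only helps if the cloud provider eventually releases the ENIs on its own; it would not help if the ENIs are never cleaned up at all.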

jaypipes self-assigned this Jan 2, 2020

jaypipes commented Jan 2, 2020

Note that LB finalizer support was only alpha in 1.15: kubernetes/kubernetes#78262.

jaypipes added a commit to jaypipes/aws-k8s-tester that referenced this issue Jan 2, 2020
Wait before trying to delete the k8s namespace on cluster down when MNG
and the ALB/NLB testers have been run.

We ask Kubernetes to delete the NLB hello world and ALB2048
Deployment/Service above, and both of these interact with the underlying
Kubernetes AWS cloud provider to clean up the cloud load balancer
backing the Service of type LoadBalancer. The calls to delete the
Service return immediately (successfully) but the cloud load balancer
resources may not have been deleted yet, including the ENIs that were
associated with the cloud load balancer. When, later, aws-k8s-tester
tries deleting the VPC associated with the test cluster, it will run
into permissions issues because the IAM role that created the ENIs in
the subnets associated with the cloud load balancers will no longer
exist.

Issue aws#70

jaypipes commented Jan 3, 2020

Still not fixed. Using v0.5.3 this morning, EKS cluster came up properly, ran NLB and ALB probes successfully, ran integration tests and then started to delete the cluster and got the same output:

{"level":"info","ts":"2020-01-03T08:18:07.851-0500","caller":"eks/eks.go:262","msg":"Up completed","config-path":"/tmp/cni-test/cluster-cni-test-17428/cni-test-17428.yaml","KUBECONFIG":"/tmp/cni-test/cluster-cni-test-17428/kubeconfig","cluster-arn":"arn:aws:eks:us-west-2:750630568209:cluster/eks-2020010312-pdx-us-west-2-tafjy","request-started":"31 minutes ago"}
'aws-k8s-tester eks create cluster' success
{"level":"info","ts":"2020-01-03T08:18:33.750-0500","caller":"eks/eks.go:109","msg":"aws version","aws-cli-path":"/usr/local/bin/aws","aws-version":"aws-cli/1.16.189 Python/3.7.5 Linux/5.3.0-24-generic botocore/1.12.179\n"}
{"level":"info","ts":"2020-01-03T08:18:33.750-0500","caller":"eks/eks.go:115","msg":"mkdir","kubectl-path-dir":"/tmp/aws-k8s-tester"}
{"level":"info","ts":"2020-01-03T08:18:33.750-0500","caller":"eks/eks.go:119","msg":"downloading kubectl","kubectl-path":"/tmp/aws-k8s-tester/kubectl"}
{"level":"info","ts":"2020-01-03T08:18:33.756-0500","caller":"eks/utils.go:44","msg":"downloading","url":"https://amazon-eks.s3-us-west-2.amazonaws.com/1.14.6/2019-08-22/bin/linux/amd64/kubectl"}
{"level":"info","ts":"2020-01-03T08:18:39.318-0500","caller":"eks/utils.go:60","msg":"downloaded","url":"https://amazon-eks.s3-us-west-2.amazonaws.com/1.14.6/2019-08-22/bin/linux/amd64/kubectl","file-path":"/tmp/aws-k8s-tester/kubectl"}
{"level":"info","ts":"2020-01-03T08:18:39.358-0500","caller":"eks/eks.go:152","msg":"kubectl version","kubectl-path":"/tmp/aws-k8s-tester/kubectl","kubectl-version":"Client Version: version.Info{Major:\"1\", Minor:\"14+\", GitVersion:\"v1.14.7-eks-1861c5\", GitCommit:\"1861c597586f84f1498a9f2151c78d8a6bf47814\", GitTreeState:\"clean\", BuildDate:\"2019-09-24T22:12:08Z\", GoVersion:\"go1.12.9\", Compiler:\"gc\", Platform:\"linux/amd64\"}\n"}
{"level":"info","ts":"2020-01-03T08:18:39.358-0500","caller":"awsapi/aws.go:69","msg":"creating session from AWS cred file","path":"/home/jaypipes/.aws/credentials"}
{"level":"info","ts":"2020-01-03T08:18:39.752-0500","caller":"awsapi/aws.go:99","msg":"creating AWS session","account-id":"750630568209","user-id":"AIDA25RIRQEI4RIUYVNYD","arn":"arn:aws:iam::750630568209:user/jaypipes"}
{"level":"info","ts":"2020-01-03T08:18:39.752-0500","caller":"awsapi/aws.go:69","msg":"creating session from AWS cred file","path":"/home/jaypipes/.aws/credentials"}
{"level":"info","ts":"2020-01-03T08:18:39.813-0500","caller":"awsapi/aws.go:99","msg":"creating AWS session","account-id":"750630568209","user-id":"AIDA25RIRQEI4RIUYVNYD","arn":"arn:aws:iam::750630568209:user/jaypipes"}
{"level":"warn","ts":"2020-01-03T08:18:39.824-0500","caller":"eks/eks.go:418","msg":"starting Down","name":"eks-2020010312-pdx-us-west-2-tafjy","config-path":"/tmp/cni-test/cluster-cni-test-17428/cni-test-17428.yaml","cluster-arn":"arn:aws:eks:us-west-2:750630568209:cluster/eks-2020010312-pdx-us-west-2-tafjy"}
{"level":"info","ts":"2020-01-03T08:18:39.824-0500","caller":"eks/key-pair.go:69","msg":"deleting a key pair","key-pair-name":"eks-2020010312-pdx-us-west-2-tafjy-ssh-key-pair"}
{"level":"info","ts":"2020-01-03T08:18:39.824-0500","caller":"eks/cluster-role.go:165","msg":"deleting role CFN stack","cluster-role-cfn-stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-cluster-role/11e6c4c0-2e27-11ea-b176-029123b1cc42"}
{"level":"info","ts":"2020-01-03T08:18:40.414-0500","caller":"cloudformation/cloudformation.go:40","msg":"polling stack","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-cluster-role/11e6c4c0-2e27-11ea-b176-029123b1cc42","want":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-03T08:18:41.686-0500","caller":"eks/key-pair.go:102","msg":"deleted a key pair","key-pair-name":"eks-2020010312-pdx-us-west-2-tafjy-ssh-key-pair"}
{"level":"info","ts":"2020-01-03T08:18:41.686-0500","caller":"eks/eks.go:485","msg":"waited delete goroutine with no error","name":"EC2 key pair","waited-goroutines":1}
{"level":"info","ts":"2020-01-03T08:18:50.821-0500","caller":"cloudformation/cloudformation.go:112","msg":"sleeping","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-cluster-role/11e6c4c0-2e27-11ea-b176-029123b1cc42","initial-wait":"25s"}
{"level":"info","ts":"2020-01-03T08:19:15.821-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-cluster-role/11e6c4c0-2e27-11ea-b176-029123b1cc42","stack-name":"eks-2020010312-pdx-us-west-2-tafjy-cluster-role","current":"DELETE_COMPLETE","want":"DELETE_COMPLETE","reason":"","request-started":"35 seconds ago"}
{"level":"info","ts":"2020-01-03T08:19:15.822-0500","caller":"cloudformation/cloudformation.go:166","msg":"became desired stack status; exiting","stack-status":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-03T08:19:15.822-0500","caller":"eks/cluster-role.go:197","msg":"deleted a role","cluster-role-cfn-stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-cluster-role/11e6c4c0-2e27-11ea-b176-029123b1cc42","cluster-role-arn":"arn:aws:iam::750630568209:role/eks-2020010312-pdx-us-west-2-tafjy-cluster-role","cluster-role-name":"eks-2020010312-pdx-us-west-2-tafjy-cluster-role"}
{"level":"info","ts":"2020-01-03T08:19:15.834-0500","caller":"eks/eks.go:485","msg":"waited delete goroutine with no error","name":"cluster role","waited-goroutines":2}
{"level":"warn","ts":"2020-01-03T08:19:15.834-0500","caller":"eks/namespace.go:36","msg":"skipping namespace delete; empty k8s client"}
{"level":"info","ts":"2020-01-03T08:19:15.834-0500","caller":"eks/managed-node-group.go:595","msg":"deleting managed node group","managed-node-group-name":"eks-2020010312-pdx-us-west-2-tafjy-managed-node-group"}
{"level":"info","ts":"2020-01-03T08:19:16.324-0500","caller":"cloudformation/cloudformation.go:40","msg":"polling stack","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group/845c4dc0-2e29-11ea-aef0-06be751c1bd6","want":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-03T08:19:31.723-0500","caller":"cloudformation/cloudformation.go:112","msg":"sleeping","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group/845c4dc0-2e29-11ea-aef0-06be751c1bd6","initial-wait":"2m0s"}
{"level":"info","ts":"2020-01-03T08:21:31.724-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group/845c4dc0-2e29-11ea-aef0-06be751c1bd6","stack-name":"eks-2020010312-pdx-us-west-2-tafjy-managed-node-group","current":"DELETE_IN_PROGRESS","want":"DELETE_COMPLETE","reason":"","request-started":"2 minutes ago"}
...snip...
{"level":"info","ts":"2020-01-03T08:28:16.718-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group/845c4dc0-2e29-11ea-aef0-06be751c1bd6","stack-name":"eks-2020010312-pdx-us-west-2-tafjy-managed-node-group","current":"DELETE_COMPLETE","want":"DELETE_COMPLETE","reason":"","request-started":"9 minutes ago"}
{"level":"info","ts":"2020-01-03T08:28:16.718-0500","caller":"cloudformation/cloudformation.go:166","msg":"became desired stack status; exiting","stack-status":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-03T08:28:16.730-0500","caller":"eks/managed-node-group.go:661","msg":"deleted a managed node group","managed-node-group-cfn-stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group/845c4dc0-2e29-11ea-aef0-06be751c1bd6","managed-node-group-name":"eks-2020010312-pdx-us-west-2-tafjy-managed-node-group"}
{"level":"info","ts":"2020-01-03T08:28:16.741-0500","caller":"eks/eks.go:525","msg":"sleeping before node group role deletion","wait":"5s"}
{"level":"info","ts":"2020-01-03T08:28:21.741-0500","caller":"eks/managed-node-group-role.go:163","msg":"deleting managed node group role CFN stack","managed-node-group-role-cfn-stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group-role/59dd65c0-2e29-11ea-8369-020934000518"}
{"level":"info","ts":"2020-01-03T08:28:21.940-0500","caller":"cloudformation/cloudformation.go:40","msg":"polling stack","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group-role/59dd65c0-2e29-11ea-8369-020934000518","want":"DELETE_COMPLETE"}
{"level":"info","ts":"2020-01-03T08:28:32.346-0500","caller":"cloudformation/cloudformation.go:112","msg":"sleeping","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group-role/59dd65c0-2e29-11ea-8369-020934000518","initial-wait":"1m0s"}
{"level":"info","ts":"2020-01-03T08:29:32.346-0500","caller":"cloudformation/cloudformation.go:141","msg":"polling","stack-id":"arn:aws:cloudformation:us-west-2:750630568209:stack/eks-2020010312-pdx-us-west-2-tafjy-managed-node-group-role/59dd65c0-2e29-11ea-8369-020934000518","stack-name":"eks-2020010312-pdx-us-west-2-tafjy-managed-node-group-role","current":"DELETE_FAILED","want":"DELETE_COMPLETE","reason":"The following resource(s) failed to delete: [ManagedNodeGroupRole]. ","request-started":"1 minute ago"}

The interesting thing is this line:

{"level":"info","ts":"2020-01-03T08:19:15.834-0500","caller":"eks/eks.go:485","msg":"waited delete goroutine with no error","name":"cluster role","waited-goroutines":2}

What that corresponds to is the following code (aws-k8s-tester/eks/eks.go, lines 439 to 487 at commit 456eccc):

	waits++
	go func() {
		ch <- errorData{name: "cluster role", err: ts.deleteClusterRole()}
	}()
	if ts.cfg.Parameters.ManagedNodeGroupCreate {
		waits++
		go func() {
			ch <- errorData{name: "EC2 key pair", err: ts.deleteKeyPair()}
		}()
		if ts.cfg.AddOnJobEcho.Enable && ts.jobEchoTester != nil {
			waits++
			go func() {
				ch <- errorData{name: "Job echo", err: ts.jobEchoTester.Delete()}
			}()
		}
		if ts.cfg.AddOnJobPerl.Enable && ts.jobPiTester != nil {
			waits++
			go func() {
				ch <- errorData{name: "Job Pi", err: ts.jobPiTester.Delete()}
			}()
		}
		if ts.cfg.AddOnALB2048.Enable && ts.alb2048Tester != nil {
			waits++
			go func() {
				ch <- errorData{name: "ALB", err: ts.alb2048Tester.Delete()}
			}()
		}
		if ts.cfg.AddOnNLBHelloWorld.Enable && ts.nlbHelloWorldTester != nil {
			waits++
			go func() {
				ch <- errorData{name: "NLB", err: ts.nlbHelloWorldTester.Delete()}
			}()
		}
	}
	var errs []string
	for i := 0; i < waits; i++ {
		if d := <-ch; d.err != nil {
			ts.lg.Warn("failed to delete",
				zap.String("name", d.name),
				zap.Int("waited-goroutines", i+1),
				zap.Error(d.err),
			)
			errs = append(errs, d.err.Error())
		} else {
			ts.lg.Info("waited delete goroutine with no error", zap.String("name", d.name), zap.Int("waited-goroutines", i+1))
		}
	}

which, unless I'm mistaken, SHOULD HAVE read "waited-goroutines": 6, since the cluster role and EC2 key pair needed to be deleted and the echo tester, Pi tester, ALB and NLB testers all ran successfully. For some reason those testers' Delete() methods aren't being called...

What that means is that this conditional:

if ts.cfg.Parameters.ManagedNodeGroupCreate

evaluated to true, but conditionals like this one:

if ts.cfg.AddOnALB2048.Enable && ts.alb2048Tester != nil {

did not.


jaypipes commented Jan 3, 2020

Think I figured out what's going on. The individual echo/NLB/ALB tester instances are nil when Tester.down() has been called from an invocation of aws-k8s-tester delete cluster.

When aws-k8s-tester create cluster is called, the Tester.up() method ends up creating these sub-tester objects, like so:

		if ts.cfg.AddOnJobEcho.Enable {
			ts.jobEchoTester, err = jobs.New(jobs.Config{
				Logger:    ts.lg,
				Stopc:     ts.stopCreationCh,
				Sig:       ts.interruptSig,
				K8SClient: ts.k8sClientSet,
				Namespace: ts.cfg.Name,
				JobName:   jobs.JobNameEcho,
				Completes: ts.cfg.AddOnJobEcho.Completes,
				Parallels: ts.cfg.AddOnJobEcho.Parallels,
				EchoSize:  ts.cfg.AddOnJobEcho.Size,
			})
			if err != nil {
				return err
			}
			if err := catchInterrupt(ts.lg, ts.stopCreationCh, ts.stopCreationChOnce, ts.interruptSig, ts.jobEchoTester.Create); err != nil {
				return err
			}
		}

However, when Tester.down() is called, there is a check to see if the sub-tester instances are non-nil:

		if ts.cfg.AddOnJobEcho.Enable && ts.jobEchoTester != nil {
			waits++
			go func() {
				ch <- errorData{name: "Job echo", err: ts.jobEchoTester.Delete()}
			}()
		}

and when Tester.down() is called from aws-k8s-tester delete cluster, those sub-tester instances are all nil.

The solution is to always create the sub-tester instances (at Tester construction time) but not create the Job in Kubernetes.
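Here is a self-contained sketch of that idea with hypothetical types (subTester, echoJobTester, and tester are stand-ins, not the real aws-k8s-tester structs): the parent constructs every sub-tester eagerly so that down() can always call Delete(), while Create() is only invoked from up():

	package main

	import "fmt"

	// subTester is an illustrative stand-in for the jobs/alb/nlb tester interfaces.
	type subTester interface {
		Create() error
		Delete() error
	}

	type echoJobTester struct{ name string }

	func (t *echoJobTester) Create() error { fmt.Println("creating", t.name); return nil }
	func (t *echoJobTester) Delete() error { fmt.Println("deleting", t.name); return nil }

	type tester struct {
		subs []subTester // constructed eagerly, created in Kubernetes lazily
	}

	func newTester() *tester {
		// Eager construction: down() can iterate subs even if up() never ran.
		return &tester{subs: []subTester{&echoJobTester{name: "Job echo"}}}
	}

	func (ts *tester) up() error {
		for _, s := range ts.subs {
			if err := s.Create(); err != nil {
				return err
			}
		}
		return nil
	}

	func (ts *tester) down() error {
		for _, s := range ts.subs {
			if err := s.Delete(); err != nil {
				return err
			}
		}
		return nil
	}

	func main() {
		ts := newTester()
		_ = ts.down() // works even without ts.up(), unlike the nil sub-tester case
	}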


gyuho commented Jan 3, 2020

and when Tester.down() is called from aws-k8s-tester delete cluster, those sub-tester instances are all nil.

Right... we would need to update ts.jobEchoTester when down is called via the sub-command.

jaypipes added a commit to jaypipes/aws-k8s-tester that referenced this issue Jan 3, 2020
When tearing down Kubernetes resources created by the sub-testers
(e.g. NLB hello world), Tester.down() was skipping the deletion of those
resources when `aws-k8s-tester delete cluster` was being run. This was
because the sub-tester instances were only constructed when Tester.up()
was called.

This patch fixes that by always creating the sub-tester instances on
construction of the main Tester object.

Issue aws#70
jaypipes added a commit to jaypipes/aws-k8s-tester that referenced this issue Jan 3, 2020
Instead of creating a K8sClient object when building the sub-tester
configs, pass a pointer to the main Tester struct, which has a
KubernetesClientSet() method that returns a pointer to
clientset.ClientSet. This ensures that (a) each sub-tester object uses a
single clientset instance and (b) the clientset can be constructed even
if the Tester has not yet been initialized (by calling
updateK8sClientSet() from Tester.KubernetesClientSet()).

Issue aws#70
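A hedged, self-contained sketch of the accessor pattern described in that commit message, using placeholder names (clientSet stands in for the real clientset.ClientSet, and the real updateK8sClientSet() would build the clientset from the written kubeconfig rather than return an empty struct):

	package main

	import "sync"

	// clientSet is an illustrative stand-in for *kubernetes.Clientset.
	type clientSet struct{}

	type tester struct {
		mu           sync.Mutex
		k8sClientSet *clientSet
	}

	// KubernetesClientSet returns the shared client set, constructing it on first use,
	// so it works even when down() runs without up() having been called.
	func (ts *tester) KubernetesClientSet() *clientSet {
		ts.mu.Lock()
		defer ts.mu.Unlock()
		if ts.k8sClientSet == nil {
			ts.k8sClientSet = ts.updateK8sClientSet()
		}
		return ts.k8sClientSet
	}

	// updateK8sClientSet is a placeholder for building the real clientset.
	func (ts *tester) updateK8sClientSet() *clientSet { return &clientSet{} }

	// subTesterConfig shows sub-testers receiving the accessor instead of a raw client.
	type subTesterConfig struct {
		K8SClient interface{ KubernetesClientSet() *clientSet }
	}

	func main() {
		ts := &tester{}
		cfg := subTesterConfig{K8SClient: ts}
		_ = cfg.K8SClient.KubernetesClientSet() // same instance shared by every sub-tester
	}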
jaypipes added a commit to jaypipes/amazon-vpc-cni-k8s that referenced this issue Jan 3, 2020
The names of the environment variables accepted by aws-k8s-tester changed
when the managed node group functionality was introduced. This commit
updates the integration testing scripts to call aws-k8s-tester (v0.5.4,
the release that contains the fix for aws/aws-k8s-tester#70) with
these updated environment variables.

We decrease the number of parallel runs of the echo job from 100 to
3 and the number of completions for that job from 1000 to 30. This
decreases the setup time of the cluster by about 10 minutes.

Finally, I added a short-circuit to prevent double-deprovisioning of
the cluster if, say, a stacktrace occurred when running the
aws-k8s-tester tool.

Issue aws#686
Issue aws#784
Issue aws#786
jaypipes added a commit to jaypipes/amazon-vpc-cni-k8s that referenced this issue Jan 17, 2020
mogren pushed a commit to aws/amazon-vpc-cni-k8s that referenced this issue Jan 20, 2020

gyuho commented Jan 30, 2020

I am still seeing this same issue... Deleting the Service somehow does not delete the ENIs.

gyuho added a commit to gyuho/aws-k8s-tester that referenced this issue Jan 30, 2020
ref. aws#70

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
mogren pushed a commit to mogren/amazon-vpc-cni-k8s that referenced this issue Jan 31, 2020
jaypipes added a commit to aws/amazon-vpc-cni-k8s that referenced this issue Jan 31, 2020
gyuho added a commit to gyuho/aws-k8s-tester that referenced this issue Jan 31, 2020
ref. aws#70

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho added a commit to gyuho/aws-k8s-tester that referenced this issue Feb 1, 2020
ref. aws#70

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
gyuho added a commit that referenced this issue Feb 1, 2020
ref. #70

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>

gyuho commented Feb 3, 2020

Discussed with @mogren in person. I will try adding more wait-time between ENI detach and delete with retries.
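A minimal sketch of that flow using the aws-sdk-go EC2 API; the retry count and sleep intervals are illustrative, not the values aws-k8s-tester ended up using:

	package main

	import (
		"fmt"
		"time"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/aws/session"
		"github.com/aws/aws-sdk-go/service/ec2"
	)

	func detachAndDeleteENI(svc *ec2.EC2, eniID string) error {
		out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
			NetworkInterfaceIds: []*string{aws.String(eniID)},
		})
		if err != nil {
			return err
		}
		if len(out.NetworkInterfaces) == 1 && out.NetworkInterfaces[0].Attachment != nil {
			// Force-detach, then give EC2 time to settle before deleting.
			_, err = svc.DetachNetworkInterface(&ec2.DetachNetworkInterfaceInput{
				AttachmentId: out.NetworkInterfaces[0].Attachment.AttachmentId,
				Force:        aws.Bool(true),
			})
			if err != nil {
				return err
			}
			time.Sleep(10 * time.Second)
		}
		// Deletion can still fail transiently (e.g. "currently in use"), so retry.
		for i := 0; i < 5; i++ {
			_, err = svc.DeleteNetworkInterface(&ec2.DeleteNetworkInterfaceInput{
				NetworkInterfaceId: aws.String(eniID),
			})
			if err == nil {
				return nil
			}
			time.Sleep(10 * time.Second)
		}
		return fmt.Errorf("failed to delete %s after retries: %v", eniID, err)
	}

	func main() {
		svc := ec2.New(session.Must(session.NewSession()))
		// ENI ID taken from the log output above, purely as an example.
		if err := detachAndDeleteENI(svc, "eni-0afc8302280e22af7"); err != nil {
			fmt.Println(err)
		}
	}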


gyuho commented Feb 4, 2020

Made some progress in the latest code changes (https://github.com/aws/aws-k8s-tester/blob/master/CHANGELOG-0.5.md):

  • Add more waits
  • Add more retries
  • Delete the ELB on the second try if a dependent target group is found (see the sketch after the link below)

https://github.com/aws/aws-k8s-tester/blob/master/eks/elb/elb.go
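A hedged sketch of the third bullet above, using the elbv2 API; this is not the code in eks/elb/elb.go, only an illustration of deleting a load balancer and then retrying deletion of its target groups, which can fail while listeners still reference them:

	package main

	import (
		"fmt"
		"time"

		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/aws/session"
		"github.com/aws/aws-sdk-go/service/elbv2"
	)

	func deleteLBAndTargetGroups(svc *elbv2.ELBV2, lbARN string) error {
		// Find the target groups bound to this load balancer before it disappears.
		tgs, err := svc.DescribeTargetGroups(&elbv2.DescribeTargetGroupsInput{
			LoadBalancerArn: aws.String(lbARN),
		})
		if err != nil {
			return err
		}

		if _, err := svc.DeleteLoadBalancer(&elbv2.DeleteLoadBalancerInput{
			LoadBalancerArn: aws.String(lbARN),
		}); err != nil {
			return err
		}

		// Target groups may still report themselves in use right after the LB
		// delete call returns, so retry a few times.
		for _, tg := range tgs.TargetGroups {
			for i := 0; i < 5; i++ {
				if _, err = svc.DeleteTargetGroup(&elbv2.DeleteTargetGroupInput{
					TargetGroupArn: tg.TargetGroupArn,
				}); err == nil {
					break
				}
				time.Sleep(10 * time.Second)
			}
			if err != nil {
				return fmt.Errorf("failed to delete target group %s: %v", aws.StringValue(tg.TargetGroupArn), err)
			}
		}
		return nil
	}

	func main() {
		svc := elbv2.New(session.Must(session.NewSession()))
		// The ARN here is a placeholder, not a value from this issue.
		if err := deleteLBAndTargetGroups(svc, "arn:aws:elasticloadbalancing:example"); err != nil {
			fmt.Println(err)
		}
	}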


gyuho commented Feb 6, 2020

Closing for now. After adding more retries and waits, the resource clean-up operation is stable. Will reopen if it happens again.

Thanks @jaypipes !

gyuho closed this as completed Feb 6, 2020