unable to detach ENI that was created by aws-k8s-tester due to permissions issue #70
Comments
After looking at the orphaned objects, it looks like all ENIs that were left around after an
Yeah, it looks like what is happening is that the NLB or ALB created by the Service/Deployment in the NLB hello world and ALB2048 tests is left around even after the Kubernetes cluster is deleted. This means that when attempting to delete subnets during the deleteVPC stage, aws-k8s-tester gets failures because the subnets still have active ENIs on them. And aws-k8s-tester is not able to detach these ENIs because they were created by the IAM role that was running the Kubernetes cluster's AWS LoadBalancer driver, and at this point the Kubernetes cluster has already been deleted, along with the IAM role that created the NLB and ALB. :(
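For anyone reproducing this, a quick way to see which ENIs are holding a subnet is to query EC2 directly. A minimal sketch using aws-sdk-go (the subnet ID is a placeholder and this helper is not part of aws-k8s-tester):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	subnetID := "subnet-0123456789abcdef0" // placeholder subnet ID
	out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
		Filters: []*ec2.Filter{
			{Name: aws.String("subnet-id"), Values: []*string{aws.String(subnetID)}},
		},
	})
	if err != nil {
		log.Fatalf("describe network interfaces: %v", err)
	}
	for _, eni := range out.NetworkInterfaces {
		// ENIs owned by an NLB/ALB typically carry an "ELB ..." description
		// and a requester ID belonging to the ELB service.
		fmt.Printf("%s status=%s description=%q\n",
			aws.StringValue(eni.NetworkInterfaceId),
			aws.StringValue(eni.Status),
			aws.StringValue(eni.Description))
	}
}
```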
I think a potential solution to this -- or at least a workaround for the time being, until there is a way to prevent a Kubernetes cluster or namespace from being deleted while there are still active cloud LB resources associated with a deleted k8s LoadBalancer object -- is to simply add more of a wait before deleting the managed node group role and starting the deletion of the VPCs. I'll push up a PR that does just that.
Note that LB finalizer support was only alpha in 1.15: kubernetes/kubernetes#78262.
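As a rough illustration of what that finalizer support would allow, a teardown path could check whether the Service still carries the load-balancer cleanup finalizer before assuming the backing NLB/ALB is gone. This is an assumed sketch using client-go, not aws-k8s-tester code; the kubeconfig path and Service name are placeholders, and the finalizer name is the one used by the (alpha-in-1.15) ServiceLoadBalancerFinalizer feature:

```go
package main

import (
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// lbFinalizerPresent reports whether the Service still carries the
// load-balancer cleanup finalizer, i.e. the cloud provider has not yet
// finished deleting the backing NLB/ALB.
func lbFinalizerPresent(cs kubernetes.Interface, namespace, name string) (bool, error) {
	// Pre-0.18 client-go signature; newer versions also take a context.
	svc, err := cs.CoreV1().Services(namespace).Get(name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, f := range svc.Finalizers {
		if f == "service.kubernetes.io/load-balancer-cleanup" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// "/path/to/kubeconfig" and the Service name are placeholders.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	present, err := lbFinalizerPresent(cs, "default", "hello-world-service")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("load-balancer cleanup finalizer still present:", present)
}
```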
Wait before trying to delete the k8s namespace on cluster down when the MNG and ALB/NLB testers have been run. We asked Kubernetes to delete the NLB hello world and ALB2048 Deployment/Service above, and both of these interact with the underlying Kubernetes AWS cloud provider to clean up the cloud load balancer backing the Service of type LoadBalancer. The calls to delete the Service return immediately (successfully), but the cloud load balancer resources may not have been deleted yet, including the ENIs that were associated with the cloud load balancer. When aws-k8s-tester later tries to delete the VPC associated with the test cluster, it runs into permissions issues because the IAM role that created those ENIs in the subnets associated with the cloud load balancers will no longer exist. Issue aws#70
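A minimal sketch of the kind of wait that commit describes, assuming the aws-sdk-go elbv2 API is polled until no load balancer remains in the test VPC before VPC teardown continues; the function name, VPC ID, and timeout below are illustrative, not the actual aws-k8s-tester code:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

// waitForLoadBalancersGone polls until no ALB/NLB remains in the given VPC or
// the timeout elapses.
func waitForLoadBalancersGone(svc *elbv2.ELBV2, vpcID string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		out, err := svc.DescribeLoadBalancers(&elbv2.DescribeLoadBalancersInput{})
		if err != nil {
			return err
		}
		remaining := 0
		for _, lb := range out.LoadBalancers {
			if aws.StringValue(lb.VpcId) == vpcID {
				remaining++
			}
		}
		if remaining == 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%d load balancer(s) still in VPC %s after %s", remaining, vpcID, timeout)
		}
		time.Sleep(15 * time.Second)
	}
}

func main() {
	svc := elbv2.New(session.Must(session.NewSession()))
	// "vpc-0123456789abcdef0" is a placeholder VPC ID.
	if err := waitForLoadBalancersGone(svc, "vpc-0123456789abcdef0", 10*time.Minute); err != nil {
		log.Fatal(err)
	}
}
```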
Still not fixed. Using v0.5.3 this morning, EKS cluster came up properly, ran NLB and ALB probes successfully, ran integration tests and then started to delete the cluster and got the same output:
The interesting thing is this line:
What that corresponds to is the following code (lines 439 to 487 at commit 456eccc):
which, unless I'm mistaken, SHOULD HAVE read... What that means is that this conditional, `if ts.cfg.Parameters.ManagedNodeGroupCreate`, evaluated to true, but these, `if ts.cfg.AddOnALB2048.Enable && ts.alb2048Tester != nil {`, did not.
Think I figured out what's going on. The individual echo/NLB/ALB tester instances are nil when `aws-k8s-tester delete cluster` runs as its own invocation, because they are only constructed inside `up()`. When `up()` runs, each tester is created like this:

```go
if ts.cfg.AddOnJobEcho.Enable {
	ts.jobEchoTester, err = jobs.New(jobs.Config{
		Logger:    ts.lg,
		Stopc:     ts.stopCreationCh,
		Sig:       ts.interruptSig,
		K8SClient: ts.k8sClientSet,
		Namespace: ts.cfg.Name,
		JobName:   jobs.JobNameEcho,
		Completes: ts.cfg.AddOnJobEcho.Completes,
		Parallels: ts.cfg.AddOnJobEcho.Parallels,
		EchoSize:  ts.cfg.AddOnJobEcho.Size,
	})
	if err != nil {
		return err
	}
	if err := catchInterrupt(ts.lg, ts.stopCreationCh, ts.stopCreationChOnce, ts.interruptSig, ts.jobEchoTester.Create); err != nil {
		return err
	}
}
```

However, when `down()` runs, each deletion is guarded by a nil check:

```go
if ts.cfg.AddOnJobEcho.Enable && ts.jobEchoTester != nil {
	waits++
	go func() {
		ch <- errorData{name: "Job echo", err: ts.jobEchoTester.Delete()}
	}()
}
```

and when the tester instance is nil, the deletion is silently skipped. The solution is to create the sub-tester instances when the main Tester is constructed, but not create the underlying Kubernetes resources until `up()` is called.
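A simplified, self-contained sketch of that approach (hypothetical types, not the real aws-k8s-tester structs): the sub-testers are built together with the main Tester, so a fresh delete-cluster process still has non-nil testers whose Delete() can run.

```go
package main

import "fmt"

type subTester interface {
	Create() error
	Delete() error
}

type echoTester struct{}

func (t *echoTester) Create() error { fmt.Println("create echo job"); return nil }
func (t *echoTester) Delete() error { fmt.Println("delete echo job"); return nil }

type Tester struct {
	enableEcho    bool
	jobEchoTester subTester
}

// New constructs every enabled sub-tester up front; Create()/Delete() are the
// only methods that actually touch the cluster.
func New(enableEcho bool) *Tester {
	ts := &Tester{enableEcho: enableEcho}
	if enableEcho {
		ts.jobEchoTester = &echoTester{}
	}
	return ts
}

func (ts *Tester) down() error {
	// With construction moved to New, this nil check no longer silently skips
	// deletion in a process that never called up().
	if ts.enableEcho && ts.jobEchoTester != nil {
		return ts.jobEchoTester.Delete()
	}
	return nil
}

func main() {
	ts := New(true)
	_ = ts.down() // prints "delete echo job" even though Create was never called
}
```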
Right... we would need to update the sub-tester configs so they don't require a clientset at construction time.
When tearing down Kubernetes resources created during the job sub-testers (e.g. NLB hello world), Tester.down() was skipping the deletion of the resources when `aws-k8s-tester delete cluster` was being run. This was due to the construction of the sub-tester instances only happening when Tester.up() was called. This patch fixes this by always creating the sub-tester instances on construction of the main Tester object. Issue aws#70
Instead of creating a K8sClient object when creating the subtester configs, pass a pointer to the main Tester struct, which has a KubernetesClientSet() method that returns a pointer to clientset.ClientSet. This ensures that a) each subtester object uses a single clientset instance and b) we can construct the clientset if the Tester has not yet been initialized (by calling updateK8sClientSet() from Tester.KubernetesClientSet). Issue aws#70
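As a hedged sketch of what that could look like (the interface, field names, and lazy-construction details here are assumptions for illustration, not the actual patch), each sub-tester config holds something that can hand it a clientset on demand, and the Tester builds and caches that clientset the first time it is asked:

```go
package main

import (
	"log"
	"sync"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// K8sClientGetter is what each sub-tester config would hold instead of a
// concrete clientset (name assumed for illustration).
type K8sClientGetter interface {
	KubernetesClientSet() (*kubernetes.Clientset, error)
}

type Tester struct {
	kubeconfigPath string

	mu sync.Mutex
	cs *kubernetes.Clientset
}

// KubernetesClientSet builds the clientset lazily and caches it, so
// sub-testers constructed before the cluster (and its kubeconfig) exist can
// still obtain a working client later, and they all share one instance.
func (ts *Tester) KubernetesClientSet() (*kubernetes.Clientset, error) {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	if ts.cs != nil {
		return ts.cs, nil
	}
	cfg, err := clientcmd.BuildConfigFromFlags("", ts.kubeconfigPath)
	if err != nil {
		return nil, err
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	ts.cs = cs
	return cs, nil
}

func main() {
	ts := &Tester{kubeconfigPath: "/path/to/kubeconfig"} // placeholder path
	var getter K8sClientGetter = ts
	if _, err := getter.KubernetesClientSet(); err != nil {
		log.Fatal(err)
	}
}
```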
The names of environment variables accepted by aws-k8s-tester changed when the managed node group functionality was introduced. This commit updates the integration testing scripts to call aws-k8s-tester (v0.5.4, which is the release needed with the fix for aws/aws-k8s-tester#70) with these updated environment variables. We decrease the number of parallel builds of the echo job from 100 to 3 and the number of completions for that job from 1000 to 30. This decreases the setup time of the cluster by about 10 minutes. Finally, I added in a short-circuit to prevent double-deprovisioning of the cluster if, say, a stacktrace occurred when running the aws-k8s-tester tool. Issue aws#686 Issue aws#784 Issue aws#786
I am seeing this same issue still... Deleting the service does not delete the ENIs somehow.
ref. aws#70 Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
The names of environment variables accepted by aws-k8s-tester changed when the managed node group functionality was introduced. This commit updates the integration testing scripts to call aws-k8s-tester (v0.5.4, which is the release needed with the fix for aws/aws-k8s-tester#70) with these updated environment variables. We decrease the number of parallel builds of the echo job from 100 to 3 and the number of completions for that job from 1000 to 30. This decreases the setup time of the cluster by about 10 minutes. Finally, I added in a short-circuit to prevent double-deprovisioning of the cluster if, say, a stacktrace occurred when running the aws-k8s-tester tool. Issue aws#686 Issue aws#784 Issue aws#786 (cherry picked from commit 8291df3)
Discussed with @mogren in person. I will try adding more wait-time between ENI detach and delete, with retries.
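For reference, a rough sketch of detach-then-delete with pauses and retries, along the lines described above; the helper name, retry counts, and ENI ID are placeholders rather than the actual aws-k8s-tester implementation:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// detachAndDeleteENI force-detaches an ENI if it is still attached, then
// retries the delete with pauses, since the detach completes asynchronously.
func detachAndDeleteENI(svc *ec2.EC2, eniID string) error {
	out, err := svc.DescribeNetworkInterfaces(&ec2.DescribeNetworkInterfacesInput{
		NetworkInterfaceIds: []*string{aws.String(eniID)},
	})
	if err != nil {
		return fmt.Errorf("describe ENI %s: %v", eniID, err)
	}
	if len(out.NetworkInterfaces) == 0 {
		return nil // already gone
	}
	eni := out.NetworkInterfaces[0]

	if eni.Attachment != nil && eni.Attachment.AttachmentId != nil {
		if _, err := svc.DetachNetworkInterface(&ec2.DetachNetworkInterfaceInput{
			AttachmentId: eni.Attachment.AttachmentId,
			Force:        aws.Bool(true),
		}); err != nil {
			return fmt.Errorf("detach ENI %s: %v", eniID, err)
		}
	}

	for i := 0; i < 10; i++ {
		time.Sleep(15 * time.Second) // give the detach time to settle
		if _, err = svc.DeleteNetworkInterface(&ec2.DeleteNetworkInterfaceInput{
			NetworkInterfaceId: aws.String(eniID),
		}); err == nil {
			return nil
		}
	}
	return fmt.Errorf("delete ENI %s: %v", eniID, err)
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	if err := detachAndDeleteENI(svc, "eni-0123456789abcdef0"); err != nil { // placeholder ENI ID
		log.Fatal(err)
	}
}
```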
Made some progress in the latest code changes: https://github.com/aws/aws-k8s-tester/blob/master/CHANGELOG-0.5.md.
https://github.com/aws/aws-k8s-tester/blob/master/eks/elb/elb.go
https://circleci.com/gh/aws/amazon-vpc-cni-k8s/107 had no issue :)
Closing for now. After adding more retries and waits, the resource clean-up operation is stable. Will reopen when it happens again. Thanks @jaypipes!
When attempting to delete a cluster that was successfully created using `aws-k8s-tester eks create cluster`, I am getting "failed to detach ENI: AuthFailure: You do not have permission to access the specified resource" errors. Here is the log output of the time in question, showing deletion of the VPC CloudFormation stack failing, force-deleting subnets failing because of dependency violations, and then failure to detach the ENIs that those subnets depend on due to permissions issues:
I have no idea why the user would not have permission to access a resource that it had created just a few minutes earlier when creating the cluster itself... :(