
Fix managed suite failures #5696

Merged: 5 commits merged into eksctl-io:main on Sep 16, 2022

Conversation

@Himangini (Collaborator) commented on Sep 14, 2022

Description

Closes https://github.com/weaveworks/eksctl-ci/issues/57, #5682, #5683, #5684, #5685, #5686

The managed suite had been failing consistently for the past two weeks. Debugging logs are below.

Failing aws-node

Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       56m                   default-scheduler  Successfully assigned kube-system/aws-node-crvvl to ip-192-168-54-117.us-west-2.compute.internal
  Normal   Pulling         56m                   kubelet            Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.11.3-eksbuild.1"
  Normal   Pulled          56m                   kubelet            Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.11.3-eksbuild.1" in 13.563162713s
  Normal   Created         56m                   kubelet            Created container aws-vpc-cni-init
  Normal   Started         56m                   kubelet            Started container aws-vpc-cni-init
  Normal   Pulling         56m                   kubelet            Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.11.3-eksbuild.1"
  Normal   Pulled          56m                   kubelet            Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.11.3-eksbuild.1" in 2.992180181s
  Normal   Started         56m                   kubelet            Started container aws-node
  Normal   Created         56m                   kubelet            Created container aws-node
  Warning  Unhealthy       56m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:45:02.616Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  NodeNotReady    52m                   node-controller    Node is not ready
  Normal   SandboxChanged  51m                   kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          51m                   kubelet            Container image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.11.3-eksbuild.1" already present on machine
  Normal   Created         51m                   kubelet            Created container aws-vpc-cni-init
  Normal   Started         51m                   kubelet            Started container aws-vpc-cni-init
  Normal   Started         51m                   kubelet            Started container aws-node
  Warning  Unhealthy       51m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:00.337Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:05.407Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:10.472Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:20.332Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:30.338Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:40.340Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:50.336Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:51:00.312Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Liveness probe failed: {"level":"info","ts":"2022-09-12T15:51:00.201Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Normal   Killing         49m                   kubelet            Container aws-node failed liveness probe, will be restarted
  Normal   Pulled          49m (x2 over 51m)     kubelet            Container image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.11.3-eksbuild.1" already present on machine
  Normal   Created         49m (x2 over 51m)     kubelet            Created container aws-node
  Warning  Unhealthy       6m7s (x197 over 49m)  kubelet            (combined from similar events): Readiness probe failed: {"level":"info","ts":"2022-09-12T16:34:55.376Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  BackOff         66s (x133 over 42m)   kubelet            Back-off restarting failed container

Logs from aws-node

=== logs, pod=aws-node-crvvl, container=aws-node

{"level":"info","ts":"2022-09-12T16:18:00.379Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-09-12T16:18:00.380Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-09-12T16:18:00.399Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-09-12T16:18:00.401Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-09-12T16:18:02.410Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-09-12T16:18:04.417Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
....

Events from the DNS pod stuck in the ContainerCreating stage

Events:
  Type     Reason                  Age                   From     Message
  ----     ------                  ----                  ----     -------
  Warning  FailedCreatePodSandBox  103s (x341 over 74m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "21d456099493a14e2ccafec779049b7999ff955f14bc4735483841db0ee5f240": add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

Consulting the troubleshooting guides and increasing the liveness and readiness probe timeouts didn't help; the test suite was still failing (a sketch of the kind of probe tweak tried is shown after the links below).
https://aws.amazon.com/premiumsupport/knowledge-center/eks-failed-create-pod-sandbox/
aws/amazon-vpc-cni-k8s#1847
aws/amazon-vpc-cni-k8s#1055
aws/amazon-vpc-cni-k8s#1425
https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#updating-vpc-cni-eks-add-on
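
For reference, the probe tweak tried was roughly the patch below; the timeout value here is illustrative and this change was not kept as part of the fix:

kubectl -n kube-system patch daemonset aws-node \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"aws-node","livenessProbe":{"timeoutSeconds":10},"readinessProbe":{"timeoutSeconds":10}}]}}}}'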

Following the troubleshooting guide https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#known-issues and applying the suggested
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.3/config/master/aws-k8s-cni.yaml
didn't work either. After running the apply command, the aws-node pods would start crashing again after some time.

Finally found the cause 😅 the nodes kept restarting due to insufficient memory.
I updated the instanceType from m5.large to t3a.xlarge, since burstable instances are well suited for running build/test environments. They are also more cost-effective than general purpose instances (a minimal sketch of the change follows the links below).
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/burstable-performance-instances.html
https://aws.amazon.com/ec2/instance-types/t3/
https://aws.amazon.com/ec2/instance-types/m5/
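
For illustration, here is a minimal sketch of what the instance type switch amounts to when building a cluster config in Go; the nodegroup name and surrounding setup are assumptions, only the m5.large to t3a.xlarge change reflects this PR:

// Illustrative sketch only, not the exact diff in this PR.
// Assumes: import api "github.com/weaveworks/eksctl/pkg/apis/eksctl.io/v1alpha5"
clusterConfig := api.NewClusterConfig()
clusterConfig.ManagedNodeGroups = []*api.ManagedNodeGroup{
	{
		NodeGroupBase: &api.NodeGroupBase{
			Name:         "managed-ng-1", // hypothetical nodegroup name
			InstanceType: "t3a.xlarge",   // previously "m5.large"
		},
	},
}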

One last note on performance: with the restructuring and cleanup changes in this PR, the execution time for the managed suite went down from 2h 30m 53s to 1h 46m 47s
🤑 ⚡

The rest of the commits are package updates, an update of the vpc-cni plugin version to 1.11.3 as suggested here, and updates to auto-generated files produced by running make unit-test locally.

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@Himangini added the skip-release-notes label (causes PR not to show in release notes) on Sep 14, 2022
@Himangini self-assigned this on Sep 14, 2022
@TiberiuGC (Collaborator) commented:

Outstanding work debugging this! ✨

Comment on lines -164 to -292
			},
		},
		{
			NodeGroupBase: &api.NodeGroupBase{
				Name:       ubuntuNodegroup,
				VolumeSize: aws.Int(25),
				AMIFamily:  "Ubuntu2004",
			},
		},
	}

	cmd := params.EksctlCreateCmd.
		WithArgs(
			"nodegroup",
			"--config-file", "-",
			"--verbose", "4",
		).
		WithoutArg("--region", params.Region).
		WithStdin(clusterutils.Reader(clusterConfig))

	Expect(cmd).To(RunSuccessfully())

	tests.AssertNodeVolumes(params.KubeconfigPath, params.Region, bottlerocketNodegroup, "/dev/sdb")

	By("correctly configuring the bottlerocket nodegroup")
	kubeTest, err := kube.NewTest(params.KubeconfigPath)
	Expect(err).NotTo(HaveOccurred())

	nodeList := kubeTest.ListNodes(metav1.ListOptions{
		LabelSelector: fmt.Sprintf("%s=%s", "eks.amazonaws.com/nodegroup", bottlerocketNodegroup),
	})
	Expect(nodeList.Items).NotTo(BeEmpty())
	for _, node := range nodeList.Items {
		Expect(node.Status.NodeInfo.OSImage).To(ContainSubstring("Bottlerocket"))
		Expect(node.Labels["kubernetes.io/hostname"]).To(Equal("custom-bottlerocket-host"))
	}
	kubeTest.Close()
})

It("should have created an EKS cluster and 4 CloudFormation stacks", func() {
	awsSession := NewConfig(params.Region)

	Expect(awsSession).To(HaveExistingCluster(params.ClusterName, string(ekstypes.ClusterStatusActive), params.Version))

	Expect(awsSession).To(HaveExistingStack(fmt.Sprintf("eksctl-%s-cluster", params.ClusterName)))
	Expect(awsSession).To(HaveExistingStack(fmt.Sprintf("eksctl-%s-nodegroup-%s", params.ClusterName, initialAl2Nodegroup)))
	Expect(awsSession).To(HaveExistingStack(fmt.Sprintf("eksctl-%s-nodegroup-%s", params.ClusterName, bottlerocketNodegroup)))
	Expect(awsSession).To(HaveExistingStack(fmt.Sprintf("eksctl-%s-nodegroup-%s", params.ClusterName, ubuntuNodegroup)))
})

It("should have created a valid kubectl config file", func() {
	config, err := clientcmd.LoadFromFile(params.KubeconfigPath)
	Expect(err).ShouldNot(HaveOccurred())

	err = clientcmd.ConfirmUsable(*config, "")
	Expect(err).ShouldNot(HaveOccurred())

	Expect(config.CurrentContext).To(ContainSubstring("eksctl"))
	Expect(config.CurrentContext).To(ContainSubstring(params.ClusterName))
	Expect(config.CurrentContext).To(ContainSubstring(params.Region))
})

Context("and listing clusters", func() {
	It("should return the previously created cluster", func() {
		cmd := params.EksctlGetCmd.WithArgs("clusters", "--all-regions")
		Expect(cmd).To(RunSuccessfullyWithOutputString(ContainSubstring(params.ClusterName)))
	})
})

Context("and checking the nodegroup health", func() {
	It("should return healthy", func() {
		checkNg := func(ngName string) {
			cmd := params.EksctlUtilsCmd.WithArgs(
				"nodegroup-health",
				"--cluster", params.ClusterName,
				"--name", ngName,
			)

			Expect(cmd).To(RunSuccessfullyWithOutputString(ContainSubstring("active")))
		}

		checkNg(initialAl2Nodegroup)
		checkNg(bottlerocketNodegroup)
		checkNg(ubuntuNodegroup)
	})
})

Context("and scale the initial nodegroup", func() {
	It("should not return an error", func() {
		cmd := params.EksctlScaleNodeGroupCmd.WithArgs(
			"--cluster", params.ClusterName,
			"--nodes-min", "2",
			"--nodes", "3",
			"--nodes-max", "4",
			"--name", initialAl2Nodegroup,
		)
		Expect(cmd).To(RunSuccessfully())
	})
})

A collaborator asked:

Do the tests fail if this code isn't moved around?

@Himangini (author) replied:

No. I moved it around to speed up the tests mostly.

@Himangini merged commit 3ba94c1 into eksctl-io:main on Sep 16, 2022