
Fix managed suite failures #5696

Merged: 5 commits merged into eksctl-io:main on Sep 16, 2022

Conversation

@Himangini (Collaborator) commented on Sep 14, 2022

Description

Closes https://github.com/weaveworks/eksctl-ci/issues/57, #5682, #5683, #5684, #5685, #5686

The managed suite had been failing consistently for the past two weeks. Debugging logs are below.

Failing aws-node

Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       56m                   default-scheduler  Successfully assigned kube-system/aws-node-crvvl to ip-192-168-54-117.us-west-2.compute.internal
  Normal   Pulling         56m                   kubelet            Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.11.3-eksbuild.1"
  Normal   Pulled          56m                   kubelet            Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.11.3-eksbuild.1" in 13.563162713s
  Normal   Created         56m                   kubelet            Created container aws-vpc-cni-init
  Normal   Started         56m                   kubelet            Started container aws-vpc-cni-init
  Normal   Pulling         56m                   kubelet            Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.11.3-eksbuild.1"
  Normal   Pulled          56m                   kubelet            Successfully pulled image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.11.3-eksbuild.1" in 2.992180181s
  Normal   Started         56m                   kubelet            Started container aws-node
  Normal   Created         56m                   kubelet            Created container aws-node
  Warning  Unhealthy       56m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:45:02.616Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  NodeNotReady    52m                   node-controller    Node is not ready
  Normal   SandboxChanged  51m                   kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          51m                   kubelet            Container image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.11.3-eksbuild.1" already present on machine
  Normal   Created         51m                   kubelet            Created container aws-vpc-cni-init
  Normal   Started         51m                   kubelet            Started container aws-vpc-cni-init
  Normal   Started         51m                   kubelet            Started container aws-node
  Warning  Unhealthy       51m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:00.337Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:05.407Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:10.472Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:20.332Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:30.338Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:40.340Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:50:50.336Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-09-12T15:51:00.312Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy       50m                   kubelet            Liveness probe failed: {"level":"info","ts":"2022-09-12T15:51:00.201Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Normal   Killing         49m                   kubelet            Container aws-node failed liveness probe, will be restarted
  Normal   Pulled          49m (x2 over 51m)     kubelet            Container image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.11.3-eksbuild.1" already present on machine
  Normal   Created         49m (x2 over 51m)     kubelet            Created container aws-node
  Warning  Unhealthy       6m7s (x197 over 49m)  kubelet            (combined from similar events): Readiness probe failed: {"level":"info","ts":"2022-09-12T16:34:55.376Z","caller":"/usr/local/go/src/runtime/proc.go:250","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  BackOff         66s (x133 over 42m)   kubelet            Back-off restarting failed container

Logs from aws-node

=== logs, pod=aws-node-crvvl, container=aws-node

{"level":"info","ts":"2022-09-12T16:18:00.379Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-09-12T16:18:00.380Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-09-12T16:18:00.399Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-09-12T16:18:00.401Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-09-12T16:18:02.410Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-09-12T16:18:04.417Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
....

Events from the DNS pod stuck in the ContainerCreating stage

Events:
  Type     Reason                  Age                   From     Message
  ----     ------                  ----                  ----     -------
  Warning  FailedCreatePodSandBox  103s (x341 over 74m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "21d456099493a14e2ccafec779049b7999ff955f14bc4735483841db0ee5f240": add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"

Consulting the troubleshooting guides and increasing the liveness and readiness probe timeouts didn't help; the test suite was still failing (a sketch of the kind of probe tweak tried is shown after the links below).
https://aws.amazon.com/premiumsupport/knowledge-center/eks-failed-create-pod-sandbox/
aws/amazon-vpc-cni-k8s#1847
aws/amazon-vpc-cni-k8s#1055
aws/amazon-vpc-cni-k8s#1425
https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#updating-vpc-cni-eks-add-on
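
For reference, the probe tweak tried was roughly the patch below; the timeout value here is illustrative and this change was not kept as part of the fix:

kubectl -n kube-system patch daemonset aws-node \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"aws-node","livenessProbe":{"timeoutSeconds":10},"readinessProbe":{"timeoutSeconds":10}}]}}}}'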

Following the troubleshooting guide https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#known-issues and applying the suggested
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.3/config/master/aws-k8s-cni.yaml
didn't work either. After running the apply command, the aws-node pods would start crashing again after some time.

Finally found the cause 😅 the nodes kept restarting due to insufficient memory.
I updated the instanceType from m5.large to t3a.xlarge, since burstable instances are well suited for running build/test environments. They are also more cost-effective than general purpose instances (a minimal sketch of the change follows the links below).
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/burstable-performance-instances.html
https://aws.amazon.com/ec2/instance-types/t3/
https://aws.amazon.com/ec2/instance-types/m5/
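
For illustration, here is a minimal sketch of what the instance type switch amounts to when building a cluster config in Go; the nodegroup name and surrounding setup are assumptions, only the m5.large to t3a.xlarge change reflects this PR:

// Illustrative sketch only, not the exact diff in this PR.
// Assumes: import api "github.com/weaveworks/eksctl/pkg/apis/eksctl.io/v1alpha5"
clusterConfig := api.NewClusterConfig()
clusterConfig.ManagedNodeGroups = []*api.ManagedNodeGroup{
	{
		NodeGroupBase: &api.NodeGroupBase{
			Name:         "managed-ng-1", // hypothetical nodegroup name
			InstanceType: "t3a.xlarge",   // previously "m5.large"
		},
	},
}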

One last note on performance: with the restructuring and cleanup changes in this PR, the execution time for the managed suite went down from 2h 30m 53s to 1h 46m 47s
🤑 ⚡

The rest of the commits are package updates, an update of the vpc-cni plugin version to 1.11.3 as suggested here, and updates to auto-generated files produced by running make unit-test locally.

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@Himangini added the skip-release-notes label (causes PR not to show in release notes) on Sep 14, 2022
@Himangini self-assigned this on Sep 14, 2022
@TiberiuGC (Collaborator) commented:

Outstanding work debugging this! ✨

Comment on lines -164 to -292
			},
		},
		{
			NodeGroupBase: &api.NodeGroupBase{
				Name:       ubuntuNodegroup,
				VolumeSize: aws.Int(25),
				AMIFamily:  "Ubuntu2004",
			},
		},
	}

	cmd := params.EksctlCreateCmd.
		WithArgs(
			"nodegroup",
			"--config-file", "-",
			"--verbose", "4",
		).
		WithoutArg("--region", params.Region).
		WithStdin(clusterutils.Reader(clusterConfig))

	Expect(cmd).To(RunSuccessfully())

	tests.AssertNodeVolumes(params.KubeconfigPath, params.Region, bottlerocketNodegroup, "/dev/sdb")

	By("correctly configuring the bottlerocket nodegroup")
	kubeTest, err := kube.NewTest(params.KubeconfigPath)
	Expect(err).NotTo(HaveOccurred())

	nodeList := kubeTest.ListNodes(metav1.ListOptions{
		LabelSelector: fmt.Sprintf("%s=%s", "eks.amazonaws.com/nodegroup", bottlerocketNodegroup),
	})
	Expect(nodeList.Items).NotTo(BeEmpty())
	for _, node := range nodeList.Items {
		Expect(node.Status.NodeInfo.OSImage).To(ContainSubstring("Bottlerocket"))
		Expect(node.Labels["kubernetes.io/hostname"]).To(Equal("custom-bottlerocket-host"))
	}
	kubeTest.Close()
})

It("should have created an EKS cluster and 4 CloudFormation stacks", func() {
	awsSession := NewConfig(params.Region)

	Expect(awsSession).To(HaveExistingCluster(params.ClusterName, string(ekstypes.ClusterStatusActive), params.Version))

	Expect(awsSession).To(HaveExistingStack(fmt.Sprintf("eksctl-%s-cluster", params.ClusterName)))
	Expect(awsSession).To(HaveExistingStack(fmt.Sprintf("eksctl-%s-nodegroup-%s", params.ClusterName, initialAl2Nodegroup)))
	Expect(awsSession).To(HaveExistingStack(fmt.Sprintf("eksctl-%s-nodegroup-%s", params.ClusterName, bottlerocketNodegroup)))
	Expect(awsSession).To(HaveExistingStack(fmt.Sprintf("eksctl-%s-nodegroup-%s", params.ClusterName, ubuntuNodegroup)))
})

It("should have created a valid kubectl config file", func() {
	config, err := clientcmd.LoadFromFile(params.KubeconfigPath)
	Expect(err).ShouldNot(HaveOccurred())

	err = clientcmd.ConfirmUsable(*config, "")
	Expect(err).ShouldNot(HaveOccurred())

	Expect(config.CurrentContext).To(ContainSubstring("eksctl"))
	Expect(config.CurrentContext).To(ContainSubstring(params.ClusterName))
	Expect(config.CurrentContext).To(ContainSubstring(params.Region))
})

Context("and listing clusters", func() {
	It("should return the previously created cluster", func() {
		cmd := params.EksctlGetCmd.WithArgs("clusters", "--all-regions")
		Expect(cmd).To(RunSuccessfullyWithOutputString(ContainSubstring(params.ClusterName)))
	})
})

Context("and checking the nodegroup health", func() {
	It("should return healthy", func() {
		checkNg := func(ngName string) {
			cmd := params.EksctlUtilsCmd.WithArgs(
				"nodegroup-health",
				"--cluster", params.ClusterName,
				"--name", ngName,
			)

			Expect(cmd).To(RunSuccessfullyWithOutputString(ContainSubstring("active")))
		}

		checkNg(initialAl2Nodegroup)
		checkNg(bottlerocketNodegroup)
		checkNg(ubuntuNodegroup)
	})
})

Context("and scale the initial nodegroup", func() {
	It("should not return an error", func() {
		cmd := params.EksctlScaleNodeGroupCmd.WithArgs(
			"--cluster", params.ClusterName,
			"--nodes-min", "2",
			"--nodes", "3",
			"--nodes-max", "4",
			"--name", initialAl2Nodegroup,
		)
		Expect(cmd).To(RunSuccessfully())
	})
})

A collaborator asked:

Do the tests fail if this code isn't moved around?

@Himangini (author) replied:

No. I moved it around to speed up the tests mostly.

@Himangini merged commit 3ba94c1 into eksctl-io:main on Sep 16, 2022