using amazon-vpc-cni-k8s outside eks #2839

Closed
is-it-ayush opened this issue Mar 12, 2024 · 30 comments

@is-it-ayush

is-it-ayush commented Mar 12, 2024

What happened:

Hi! I have an EC2 instance with containerd as the container runtime inside a private subnet (which has outbound internet access) in ap-south-1. I initialized a new cluster with kubeadm init on this master node, and it ran successfully. I then wanted to install amazon-vpc-cni as the network plugin for my k8s cluster. I ran kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/master/aws-k8s-cni.yaml and checked the pods with kubectl get pods -n kube-system. One of the pods created by amazon-vpc-cni-k8s, named aws-node-xxxx, throws an error when trying to initialise. I ran kubectl describe pod aws-node-xxx -n kube-system and got the following.

Failed to pull image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.16.4": failed to pull and unpack image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.16.4": failed to resolve reference "amazon-k8s-cni-init:v1.16.4": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credential

I don't understand why this fails. Is it not possible to use amazon-vpc-cni outside EKS in a self-managed cluster? I also looked through the issues here, and it seems other people have hit this before, but I was unable to resolve it myself. Here is my policy k8s_master_ecr, attached to the k8s_master role, which is connected to this master instance via an instance profile:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "K8sECR",
			"Effect": "Allow",
			"Action": [
				"ecr:GetAuthorizationToken",
				"ecr:BatchCheckLayerAvailability",
				"ecr:GetDownloadUrlForLayer",
				"ecr:GetRepositoryPolicy",
				"ecr:DescribeRepositories",
				"ecr:ListImages",
				"ecr:BatchGetImage"
			],
			"Resource": "*"
		}
	]
}

Environment:

  • Kubernetes version (use kubectl version):
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2
  • CNI Version: master branch
  • OS (e.g: cat /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (e.g. uname -a): Linux ip-x-x-x-x.ap-south-1.compute.internal 6.1.0-13-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
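
The error above shows the default manifest pulling from the us-west-2 registry, while the node is in ap-south-1. Below is a minimal sketch of rewriting the registry region before applying the manifest. It assumes the same account's registry is available in the target region (later events in this thread show images from 602401143452.dkr.ecr.ap-south-1.amazonaws.com). rewrite_ecr_region is a hypothetical helper name, and this step does not by itself address the missing credentials.

```shell
#!/bin/sh
# Hypothetical helper (not part of amazon-vpc-cni-k8s): reads a manifest on
# stdin and rewrites the region in ECR registry hostnames to the region in $1.
rewrite_ecr_region() {
  region="$1"
  # Only touch hostnames of the form <account>.dkr.ecr.<region>.amazonaws.com
  sed -E "s#(\.dkr\.ecr\.)[a-z0-9-]+(\.amazonaws\.com)#\1${region}\2#g"
}

# Usage (assumes curl and kubectl are installed and configured):
#   curl -fsSL https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/master/aws-k8s-cni.yaml \
#     | rewrite_ecr_region ap-south-1 > aws-vpc-cni.yaml
#   kubectl apply -f aws-vpc-cni.yaml
```

Keeping the rewrite in a filter makes it easy to diff the result against the upstream manifest before applying it.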
@kwohlfahrt
Contributor

We are running the AWS CNI outside of EKS. We also have the AWS credential provider installed, which allows the kubelet to use the instance credentials to pull from private ECR registries. Before Kubernetes 1.28 (I think, might be off by a version), this functionality was bundled as part of the kubelet.

@is-it-ayush
Author

is-it-ayush commented Mar 12, 2024

That's interesting @kwohlfahrt! I've never used aws-credential-provider. After reading into it, I have a few questions:

  • Should I just deploy it by applying all the files with kubectl apply -f listed here on github.com/kubernetes/cloud-provider-aws/tree/master/examples/existing-cluster/base.
  • Where do I get the binary aws-credential-provider?
  • Does it work with containerd? I tried manually placing a username and password in /etc/containerd/config.toml but it didn't work. I was able to manually pull the image with sudo ctr images pull 602401143452.dkr.ecr.ap-south-1.amazonaws.com/amazon-k8s-cni-init:v1.16.4 -u AWS:$TOKEN where TOKEN=$(aws ecr get-login-password --region ap-south-1), but that didn't fix the problem above.

@kwohlfahrt
Contributor

Should I just deploy it by applying all the files with kubectl apply -f listed here on github.com/kubernetes/cloud-provider-aws/tree/master/examples/existing-cluster/base.

AFAIK, the credential provider can't be installed by applying manifests; it must be installed on your node, since you must change the kubelet flags to use it. The binary and configuration must be placed on disk, and the kubelet's flags then have to be modified to point to the configuration and to the directory to search for the binary. This is documented on this page, which also includes an example config.

Where do I get the binary aws-credential-provider?

Pre-built binaries can be found here (source)

Does it work with containerd?

Yes, we've used it with containerd in the past, though we are using cri-o now. AFAIK, the container runtime never interacts with the credential provider directly - the credential provider is called by the kubelet, which then passes the received credentials on to your container runtime. So it shouldn't matter whether you are using containerd, crio, etc.
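
To illustrate the flow described above: the kubelet runs the plugin binary, writes a CredentialProviderRequest as JSON to its stdin, and reads the credentials back from stdout. Below is a minimal sketch of building that request by hand. credential_provider_request is a hypothetical helper name, and the get-credentials argument in the usage comment is an assumption based on the cloud-provider-aws example config.

```shell
#!/bin/sh
# Hypothetical helper: prints the JSON request the kubelet would write to the
# plugin's stdin for the image given in $1 (kubelet credential provider API v1).
credential_provider_request() {
  printf '{"apiVersion":"credentialprovider.kubelet.k8s.io/v1","kind":"CredentialProviderRequest","image":"%s"}\n' "$1"
}

# Manual check on a node (assumptions: the binary is on PATH, the instance
# role allows ecr:GetAuthorizationToken, and "get-credentials" is the
# subcommand the plugin expects):
#   credential_provider_request 602401143452.dkr.ecr.ap-south-1.amazonaws.com/amazon-k8s-cni-init:v1.16.4 \
#     | ecr-credential-provider get-credentials
```

Because the exchange happens between the kubelet and the plugin, nothing in it depends on which container runtime is installed.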

@is-it-ayush
Author

is-it-ayush commented Mar 13, 2024

Thank you so much @kwohlfahrt! I was able to follow through and resolve this and all the pods are successfully running now. These are the steps I took,

  • update cloud provider flag @ /etc/kubernetes/manifests/kube-controller-manager.yaml & /etc/kubernetes/manifests/kube-apiserver.yaml with --cloud-provider=external.
    • systemctl daemon-reload && systemctl restart kubelet.service
  • download ecr-credential-provider via curl -o ecr-credential-provider https://storage.googleapis.com/k8s-artifacts-prod/binaries/cloud-provider-aws/v1.29.0/linux/amd64/ecr-credential-provider-linux-amd64.
    • mv ecr-credential-provider /usr/bin/ecr-credential-provider
    • chmod +x /usr/bin/ecr-credential-provider
  • Create a credential config file (the kubelet flag below reads it from /home/admin/.aws/ecr-credential-config.yaml) with the following
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
  - name: ecr-credential-provider
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"
    defaultCacheDuration: "12h"
    apiVersion: credentialprovider.kubelet.k8s.io/v1
    env:
  • update kubelet start variables @ /etc/systemd/system/kubelet.service.d/aws.conf with the following.
[Service]
Environment="KUBELET_EXTRA_ARGS=--node-ip=<x.x.x.x> --node-labels=node.kubernetes.io/node= --cloud-provider=external --image-credential-provider-config=/home/admin/.aws/ecr-credential-config.yaml --image-credential-provider-bin-dir=/usr/bin"
  • systemctl daemon-reload && systemctl restart kubelet.service
  • kubectl apply -f aws-vpc-cni.yaml
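
The matchImages entry in the config above decides which image pulls the kubelet hands to the plugin. Below is a rough sketch for checking an image reference against that one pattern. image_matches_ecr is a hypothetical helper, and it is only an approximation: a shell glob lets "*" span dots, whereas the kubelet matches each "*" against a single domain label.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the registry host of the image in $1
# matches the pattern "*.dkr.ecr.*.amazonaws.com", 1 otherwise.
image_matches_ecr() {
  host="${1%%/*}"   # registry host = everything before the first "/"
  case "$host" in
    *.dkr.ecr.*.amazonaws.com) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   image_matches_ecr 602401143452.dkr.ecr.ap-south-1.amazonaws.com/amazon-k8s-cni:v1.16.4 \
#     && echo "kubelet would call the credential provider for this image"
```

Images whose host does not match, such as those under registry.k8s.io, are pulled without the plugin being invoked.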


This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.

is-it-ayush reopened this Mar 14, 2024
@is-it-ayush
Author

is-it-ayush commented Mar 14, 2024

Hey @kwohlfahrt! It seems this wasn't resolved entirely. As soon as I joined another node, I ran into trouble: the aws-node pod fails to communicate with ipam from aws-vpc-cni, but the ipam logs didn't show any errors, so I couldn't work out what's wrong. The setup hasn't changed; I only added one worker (1 master [10.0.32.163], 1 worker [10.0.32.104]). Here are a few outputs from my master node:

  • kubectl get nodes -A
admin@ip-10-0-32-163:~$ kubectl get nodes -A
NAME                                         STATUS     ROLES           AGE   VERSION
ip-10-0-32-104.ap-south-1.compute.internal   NotReady   <none>          15h   v1.29.2
ip-10-0-32-163.ap-south-1.compute.internal   Ready      control-plane   16h   v1.29.2
  • kubectl get pods -A
admin@ip-10-0-32-163:~$ kubectl get pods -A
NAMESPACE     NAME                                                                 READY   STATUS             RESTARTS        AGE
kube-system   aws-cloud-controller-manager-khnq6                                   1/1     Running            1 (72m ago)     16h
kube-system   aws-node-56hf4                                                       1/2     CrashLoopBackOff   7 (4m55s ago)   19m
kube-system   aws-node-ghvzc                                                       2/2     Running            2 (72m ago)     16h
kube-system   coredns-76f75df574-rg724                                             0/1     CrashLoopBackOff   34 (63s ago)    16h
kube-system   coredns-76f75df574-svglz                                             0/1     CrashLoopBackOff   7 (4m43s ago)   22m
kube-system   etcd-ip-10-0-32-163.ap-south-1.compute.internal                      1/1     Running            1 (72m ago)     16h
kube-system   kube-apiserver-ip-10-0-32-163.ap-south-1.compute.internal            1/1     Running            2 (72m ago)     16h
kube-system   kube-controller-manager-ip-10-0-32-163.ap-south-1.compute.internal   1/1     Running            2 (72m ago)     16h
kube-system   kube-proxy-kj778                                                     1/1     Running            1 (72m ago)     15h
kube-system   kube-proxy-xgzzf                                                     1/1     Running            1 (72m ago)     16h
kube-system   kube-scheduler-ip-10-0-32-163.ap-south-1.compute.internal            1/1     Running            1 (72m ago)     16h
  • kubectl describe pods aws-node-56hf4 -n kube-system
Events:
  Type     Reason                 Age                     From               Message
  ----     ------                 ----                    ----               -------
  Warning  MissingIAMPermissions  7m42s (x2 over 7m42s)   aws-node           Unauthorized operation: failed to call ec2:CreateTags due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role
  Warning  MissingIAMPermissions  6m8s (x2 over 6m9s)     aws-node           Unauthorized operation: failed to call ec2:CreateTags due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role
  Warning  MissingIAMPermissions  4m38s (x2 over 4m39s)   aws-node           Unauthorized operation: failed to call ec2:CreateTags due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role
  Warning  MissingIAMPermissions  3m8s (x2 over 3m9s)     aws-node           Unauthorized operation: failed to call ec2:CreateTags due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role
  Warning  MissingIAMPermissions  98s (x2 over 99s)       aws-node           Unauthorized operation: failed to call ec2:CreateTags due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role
  Warning  MissingIAMPermissions  8s (x2 over 9s)         aws-node           Unauthorized operation: failed to call ec2:CreateTags due to missing permissions. Please refer https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/iam-policy.md to attach relevant policy to IAM role
  Normal   Scheduled              7m46s                   default-scheduler  Successfully assigned kube-system/aws-node-56hf4 to ip-10-0-32-104.ap-south-1.compute.internal
  Normal   Pulled                 7m45s                   kubelet            Container image "602401143452.dkr.ecr.ap-south-1.amazonaws.com/amazon-k8s-cni-init:v1.16.4" already present on machine
  Normal   Created                7m45s                   kubelet            Created container aws-vpc-cni-init
  Normal   Started                7m45s                   kubelet            Started container aws-vpc-cni-init
  Normal   Pulled                 7m44s                   kubelet            Container image "602401143452.dkr.ecr.ap-south-1.amazonaws.com/amazon-k8s-cni:v1.16.4" already present on machine
  Normal   Started                7m44s                   kubelet            Started container aws-eks-nodeagent
  Normal   Created                7m44s                   kubelet            Created container aws-eks-nodeagent
  Normal   Pulled                 7m44s                   kubelet            Container image "602401143452.dkr.ecr.ap-south-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.8" already present on machine
  Normal   Started                7m44s                   kubelet            Started container aws-node
  Normal   Created                7m44s                   kubelet            Created container aws-node
  Warning  Unhealthy              7m38s                   kubelet            Readiness probe failed: {"level":"info","ts":"2024-03-14T05:02:54.811Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy              7m33s                   kubelet            Readiness probe failed: {"level":"info","ts":"2024-03-14T05:02:59.865Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy              7m28s                   kubelet            Readiness probe failed: {"level":"info","ts":"2024-03-14T05:03:04.915Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy              7m20s                   kubelet            Readiness probe failed: {"level":"info","ts":"2024-03-14T05:03:12.342Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy              7m10s                   kubelet            Readiness probe failed: {"level":"info","ts":"2024-03-14T05:03:22.350Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy              7m                      kubelet            Readiness probe failed: {"level":"info","ts":"2024-03-14T05:03:32.350Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy              6m50s                   kubelet            Readiness probe failed: {"level":"info","ts":"2024-03-14T05:03:42.342Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy              6m40s                   kubelet            Readiness probe failed: {"level":"info","ts":"2024-03-14T05:03:52.347Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy              6m30s                   kubelet            Readiness probe failed: {"level":"info","ts":"2024-03-14T05:04:02.344Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Normal   Killing                6m10s                   kubelet            Container aws-node failed liveness probe, will be restarted
  Warning  Unhealthy              2m40s (x43 over 6m30s)  kubelet            (combined from similar events): Readiness probe failed: {"level":"info","ts":"2024-03-14T05:07:52.354Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"
  • kubectl logs coredns-76f75df574-rg724 -n kube-system
admin@ip-10-0-32-163:~$ kubectl logs coredns-76f75df574-rg724 -n kube-system
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = 591cf328cccc12bc490481273e738df59329c62c0b729d94e8b61db9961c2fa5f046dd37f1cf888b953814040d180f52594972691cd6ff41be96639138a43908
CoreDNS-1.11.1
linux/amd64, go1.20.7, ae2bbc2
[ERROR] plugin/errors: 2 1113266275012896724.8518814352627412410. HINFO: read udp 10.0.32.235:46941->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1113266275012896724.8518814352627412410. HINFO: read udp 10.0.32.235:48624->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1113266275012896724.8518814352627412410. HINFO: read udp 10.0.32.235:35195->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1113266275012896724.8518814352627412410. HINFO: read udp 10.0.32.235:36595->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1113266275012896724.8518814352627412410. HINFO: read udp 10.0.32.235:37395->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1113266275012896724.8518814352627412410. HINFO: read udp 10.0.32.235:53769->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1113266275012896724.8518814352627412410. HINFO: read udp 10.0.32.235:39372->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1113266275012896724.8518814352627412410. HINFO: read udp 10.0.32.235:49266->10.0.0.2:53: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[870704998]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:10:50.372) (total time: 30001ms):
Trace[870704998]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (05:11:20.374)
Trace[870704998]: [30.001959325s] [30.001959325s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1121138999]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:10:50.372) (total time: 30001ms):
Trace[1121138999]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (05:11:20.374)
Trace[1121138999]: [30.001824712s] [30.001824712s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[757947080]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:10:50.373) (total time: 30001ms):
Trace[757947080]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (05:11:20.374)
Trace[757947080]: [30.001669002s] [30.001669002s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[ERROR] plugin/errors: 2 1113266275012896724.8518814352627412410. HINFO: read udp 10.0.32.235:59870->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1113266275012896724.8518814352627412410. HINFO: read udp 10.0.32.235:36793->10.0.0.2:53: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[308293075]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:21.583) (total time: 30001ms):
Trace[308293075]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (05:11:51.584)
Trace[308293075]: [30.00153721s] [30.00153721s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1924537645]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:21.772) (total time: 30001ms):
Trace[1924537645]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (05:11:51.773)
Trace[1924537645]: [30.001441343s] [30.001441343s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1601989491]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:21.892) (total time: 30000ms):
Trace[1601989491]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (05:11:51.893)
Trace[1601989491]: [30.000541411s] [30.000541411s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1839797281]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:53.729) (total time: 30002ms):
Trace[1839797281]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30002ms (05:12:23.731)
Trace[1839797281]: [30.002135986s] [30.002135986s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[2131737096]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:54.116) (total time: 30001ms):
Trace[2131737096]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (05:12:24.117)
Trace[2131737096]: [30.001094761s] [30.001094761s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[342939726]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:54.708) (total time: 30001ms):
Trace[342939726]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (05:12:24.709)
Trace[342939726]: [30.001121228s] [30.001121228s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/kubernetes: Trace[731275138]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:12:28.220) (total time: 11342ms):
Trace[731275138]: [11.342820089s] [11.342820089s] END
[INFO] plugin/kubernetes: Trace[1946198945]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:12:28.081) (total time: 11481ms):
Trace[1946198945]: [11.481121164s] [11.481121164s] END
[INFO] plugin/kubernetes: Trace[1707910341]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:12:27.480) (total time: 12082ms):
Trace[1707910341]: [12.082670995s] [12.082670995s] END
  • kubectl logs coredns-76f75df574-svglz -n kube-system
admin@ip-10-0-32-163:~$ kubectl logs coredns-76f75df574-svglz -n kube-system
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = 591cf328cccc12bc490481273e738df59329c62c0b729d94e8b61db9961c2fa5f046dd37f1cf888b953814040d180f52594972691cd6ff41be96639138a43908
CoreDNS-1.11.1
linux/amd64, go1.20.7, ae2bbc2
[ERROR] plugin/errors: 2 1600033383188009841.8067679233946884018. HINFO: read udp 10.0.32.13:39153->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1600033383188009841.8067679233946884018. HINFO: read udp 10.0.32.13:34390->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1600033383188009841.8067679233946884018. HINFO: read udp 10.0.32.13:34202->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1600033383188009841.8067679233946884018. HINFO: read udp 10.0.32.13:44007->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1600033383188009841.8067679233946884018. HINFO: read udp 10.0.32.13:40443->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1600033383188009841.8067679233946884018. HINFO: read udp 10.0.32.13:47108->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1600033383188009841.8067679233946884018. HINFO: read udp 10.0.32.13:59620->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1600033383188009841.8067679233946884018. HINFO: read udp 10.0.32.13:39071->10.0.0.2:53: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[244891391]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:24.389) (total time: 30001ms):
Trace[244891391]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (05:11:54.390)
Trace[244891391]: [30.001548794s] [30.001548794s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[106582316]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:24.389) (total time: 30002ms):
Trace[106582316]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (05:11:54.391)
Trace[106582316]: [30.00208516s] [30.00208516s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1365423089]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:24.389) (total time: 30001ms):
Trace[1365423089]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (05:11:54.390)
Trace[1365423089]: [30.001969555s] [30.001969555s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[ERROR] plugin/errors: 2 1600033383188009841.8067679233946884018. HINFO: read udp 10.0.32.13:57291->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1600033383188009841.8067679233946884018. HINFO: read udp 10.0.32.13:52147->10.0.0.2:53: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1202752718]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:55.195) (total time: 30000ms):
Trace[1202752718]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (05:12:25.196)
Trace[1202752718]: [30.000482356s] [30.000482356s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[528314086]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:55.738) (total time: 30004ms):
Trace[528314086]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30004ms (05:12:25.742)
Trace[528314086]: [30.00474037s] [30.00474037s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[401932378]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:11:55.919) (total time: 30001ms):
Trace[401932378]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (05:12:25.921)
Trace[401932378]: [30.001416591s] [30.001416591s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1029911745]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:12:27.513) (total time: 30000ms):
Trace[1029911745]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (05:12:57.514)
Trace[1029911745]: [30.000923168s] [30.000923168s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1647125159]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:12:27.996) (total time: 30003ms):
Trace[1647125159]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (05:12:57.997)
Trace[1647125159]: [30.003270334s] [30.003270334s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1397932663]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231 (14-Mar-2024 05:12:28.082) (total time: 30000ms):
Trace[1397932663]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (05:12:58.083)
Trace[1397932663]: [30.000758193s] [30.000758193s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 5s
  • /var/log/aws-routed-eni/ipamd.log:
    ipamd-log.tar.gz
  • /var/log/aws-routed-eni/plugin.log: (worker-node)
{"level":"error","ts":"2024-03-14T04:10:43.568Z","caller":"routed-eni-cni-plugin/cni.go:283","msg":"Error received from DelNetwork gRPC call for container 75d411ca04ea3ea9d079947801458b9938aaf07cbefc8803364c316d28588972: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:50051: connect: connection refused\""}

I assigned the ec2:CreateTags permission, which seemed to be missing, and recreated my entire cluster. The readiness and liveness probes still throw the same x.x.x.x:xxx -> 10.x.0.x:53 errors and coredns is unable to get ready.

@kwohlfahrt
Contributor

Hm, I'm not sure. My only suspicion is that you might be hitting #2840, which I reported the other day.

You can easily check by connecting to your node and seeing if /run/xtables.lock is a directory - it should be a file. If it is created as a directory, it causes kube-proxy to fail, which means the CNI cannot reach the API server.

You can see the linked PR in that issue for the fix (the volume needs to be defined with type: FileOrCreate), just make sure to SSH to the node and rmdir /run/xtables.lock after applying the fix.
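
The check described above could be sketched as a small POSIX shell function. This is only an illustration: the helper name xtables_lock_state is made up for this sketch, and the default path is the one mentioned in this comment.

```shell
#!/bin/sh
# Report what kind of filesystem object sits at the xtables lock path.
# xtables_lock_state is a hypothetical helper name, not part of any tool.
xtables_lock_state() {
    path="${1:-/run/xtables.lock}"
    if [ -d "$path" ]; then
        # A directory here is the broken state described for #2840.
        echo "directory"
    elif [ -f "$path" ]; then
        # A regular file is the expected state.
        echo "file"
    elif [ -e "$path" ]; then
        echo "other"
    else
        echo "missing"
    fi
}

# Usage on a node (read-only, prints one word):
#   xtables_lock_state
```

If it prints "directory", the rmdir step above applies once the volume definition has been fixed.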

@is-it-ayush
Author

is-it-ayush commented Mar 15, 2024

Thank you @kwohlfahrt! I had some missing IAM permissions, which I added to the master node. It still hasn't resolved the problem, though: coredns cannot reach its endpoints, as is apparent from the logs when running kubectl logs coredns-76f75df574-49gs5 -n kube-system. I'm not entirely sure what's causing this.

[ERROR] plugin/errors: 2 4999722014791650549.7690820414208347954. HINFO: read udp 10.0.43.148:57589->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 4999722014791650549.7690820414208347954. HINFO: read udp 10.0.43.148:38940->10.0.0.2:53: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout

@is-it-ayush
Author

is-it-ayush commented Mar 21, 2024

Update! I was unable to resolve the coredns issues with aws-vpc-cni & aws-cloud-controller-manager. There are multiple issues:

  • It seems like both of them are broken. The controller-manager fails to get the providerId from the AWS cloud for nodes, in random order, even if you set the hostname to the private IPv4 DNS name and add the correct tags. It then fails to initialise newly joined nodes, or even the master node itself, which leads to the worker nodes getting deleted and the master node being tainted as NotReady.
  • The coredns pod fails to run regardless of the first issue and there is no way to debug why. The logs collected by /opt/cni/bin/aws-cni-support.sh are not enough to debug the coredns problem.

I switched to cilium and let go of my dream to connect k8s and aws.

@orsenthil
Member

[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout

It seems the coredns pod got an IP address but wasn't able to communicate with the API server, perhaps due to missing permissions? The nodes/pods should have the ability to communicate with the API server with the necessary permissions.

Were you able to narrow down to any permission issue?
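
To help narrow this down, the CoreDNS logs in this thread show two kinds of timeout: connections to the service VIP (dial tcp 10.96.0.1:443) and queries to the upstream resolver (read udp ... ->10.0.0.2:53). A rough sketch for counting each kind follows. The function name summarize_coredns_timeouts is hypothetical, and the patterns are taken only from the log lines quoted here.

```shell
#!/bin/sh
# Count CoreDNS log lines for two kinds of timeout, reading the log on stdin.
# summarize_coredns_timeouts is a hypothetical helper name.
# Note: this counts log lines, not distinct failures; one failed list call
# shows up on several lines (INFO, Trace, ERROR).
summarize_coredns_timeouts() {
    awk '
        /dial tcp [0-9.]+:443: i\/o timeout/      { api++ }
        /read udp .*->[0-9.]+:53: i\/o timeout/   { dns++ }
        END {
            printf "api_server_timeouts=%d upstream_dns_timeouts=%d\n", api, dns
        }
    '
}

# Usage (pod name taken from the earlier comment):
#   kubectl logs coredns-76f75df574-49gs5 -n kube-system | summarize_coredns_timeouts
```

If both counters are non-zero, the pod may be unable to send any traffic off the node at all, which would point at routing rather than permissions. That is an interpretation, not something the logs prove.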

@is-it-ayush
Author

is-it-ayush commented May 1, 2024

> [INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.27.4/tools/cache/reflector.go:231: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
>
> It seems the coredns pod got an IP address but wasn't able to communicate with the API server, perhaps due to missing permissions? The nodes/pods should have the ability to communicate with the API server with the necessary permissions.
>
> Were you able to narrow down to any permission issue?

Not really! I really did all I could and scanned all of journalctl to find something. I wrote about it here & I couldn't get aws-vpc-cni working as far as I remember. I double checked permissions and instance roles but it didn't seem like they were a problem.

> It seems like both of them are broken. The controller-manager fails to get the providerId from the AWS cloud for nodes, in random order, even if you set the hostname to the private IPv4 DNS name and add the correct tags. It then fails to initialise newly joined nodes, or even the master node itself, which leads to the worker nodes getting deleted and the master node being tainted as NotReady.
>
> The coredns pod fails to run regardless of the first issue and there is no way to debug why. The logs collected by /opt/cni/bin/aws-cni-support.sh are not enough to debug the coredns problem.

@terryjix

terryjix commented May 3, 2024

I am hitting the same issue. The pod cannot communicate with any endpoint, including:

  • coredns
  • api server
  • 169.254.169.254
  • etc.

@orsenthil
Member

@terryjix - This is a question about setting up VPC CNI on a non-EKS cluster. How did you go about it?

@orsenthil
Member

Closing this due to lack of more information.

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.

@wtvamp

wtvamp commented Sep 16, 2024

This issue needs to be reopened - it seems to be a fairly ubiquitous issue when attempting to use the amazon-vpc-cni in a non-EKS environment.

I've also encountered it (coredns not able to communicate):

[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = 591cf328cccc12bc490481273e738df59329c62c0b729d94e8b61db9961c2fa5f046dd37f1cf888b953814040d180f52594972691cd6ff41be96639138a43908
CoreDNS-1.11.3
linux/amd64, go1.21.11, a6338e9
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:57241->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:42295->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:33996->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:50361->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:58932->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:35147->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:47365->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:60287->10.0.0.2:53: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[2115550610]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:24:38.357) (total time: 30000ms):
Trace[2115550610]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (19:25:08.358)
Trace[2115550610]: [30.000916518s] [30.000916518s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[935094613]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:24:38.358) (total time: 30000ms):
Trace[935094613]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (19:25:08.358)
Trace[935094613]: [30.000403807s] [30.000403807s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1423531700]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:24:38.358) (total time: 30000ms):
Trace[1423531700]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (19:25:08.359)
Trace[1423531700]: [30.000293311s] [30.000293311s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:44224->10.0.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5717391959630560116.4828385316436471351. HINFO: read udp 10.0.0.75:60914->10.0.0.2:53: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1341126722]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:25:09.591) (total time: 30000ms):
Trace[1341126722]: ---"Objects listed" error:Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (19:25:39.592)
Trace[1341126722]: [30.000759936s] [30.000759936s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://10.96.0.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1646410435]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:25:09.695) (total time: 30001ms):
Trace[1646410435]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30001ms (19:25:39.696)
Trace[1646410435]: [30.001364482s] [30.001364482s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Namespace: failed to list *v1.Namespace: Get "https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/kubernetes: Trace[1072212733]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229 (16-Sep-2024 19:25:09.753) (total time: 30000ms):
Trace[1072212733]: ---"Objects listed" error:Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout 30000ms (19:25:39.754)
Trace[1072212733]: [30.000533915s] [30.000533915s] END
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: i/o timeout

@wtvamp

wtvamp commented Sep 16, 2024

> Closing this due to lack of more information.

@orsenthil Why was this closed? It seems like there's plenty of information and repro steps?

@orsenthil orsenthil reopened this Sep 16, 2024
@orsenthil
Member

> fairly ubiquitous issue when attempting to use the amazon-vpc-cni in a non-EKS environment.

We will need to reproduce this and investigate. Re-opened.

@wtvamp

wtvamp commented Sep 16, 2024

Thanks!

I've got a cluster that reproduces this, and I'm willing to screen share or support as needed.

@terryjix

I've fixed my issue by running vpc-cni-k8s on an EKS optimized AMI. The vpc-cni-k8s plugin conflicts with ec2-net-utils: ec2-net-utils adds extra route rules, which broke pod-to-pod communication in my case. The EKS optimized AMI avoids this issue.
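
For anyone who wants to look for such extra rules on their own node, here is a rough sketch that filters the output of ip rule show. It assumes the VPC CNI uses rule priorities 512, 1024 and 1536, as described in the CNI proposal document, and that 0, 32766 and 32767 are the kernel defaults. That list is an assumption and may differ between CNI versions and features. The function name unexpected_ip_rules is made up for this sketch.

```shell
#!/bin/sh
# Print policy routing rules whose priority is neither a kernel default
# (0, 32766, 32767) nor one of the priorities assumed for the VPC CNI
# (512, 1024, 1536). Reads `ip rule show` output on stdin.
# unexpected_ip_rules is a hypothetical helper name.
unexpected_ip_rules() {
    awk -F: '
        $1 != "0" && $1 != "512" && $1 != "1024" && $1 != "1536" &&
        $1 != "32766" && $1 != "32767" { print }
    '
}

# Usage on a node:
#   ip rule show | unexpected_ip_rules
```

This does not say which tool created a rule. It only narrows down which rules to inspect, for example together with ip route show table all.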

@wtvamp

wtvamp commented Sep 16, 2024

> I've fixed my issue by running vpc-cni-k8s on an EKS optimized AMI. The vpc-cni-k8s plugin conflicts with ec2-net-utils: ec2-net-utils adds extra route rules, which broke pod-to-pod communication in my case. The EKS optimized AMI avoids this issue.

Does this work even outside EKS? I think this bug was about running outside EKS (for example, I'm running self-managed on Ubuntu AMIs with kubeadm).

@terryjix

terryjix commented Sep 16, 2024

Yes, I used kubeadm to create a Kubernetes cluster on an Amazon Linux 2 AMI and found that pods could not communicate with the outside. Some strange rules were created in the route table, overriding the rules vpc-cni created.

You can find optimized Ubuntu AMIs at https://cloud-images.ubuntu.com/aws-eks/ . Maybe they can fix your issue. You can build your self-managed Kubernetes control plane on these AMIs. The optimized AMI has disabled some services that may affect network configuration in the OS.

@wtvamp

wtvamp commented Sep 17, 2024

> Yes, I used kubeadm to create a Kubernetes cluster on an Amazon Linux 2 AMI and found that pods could not communicate with the outside. Some strange rules were created in the route table, overriding the rules vpc-cni created.
>
> You can find optimized Ubuntu AMIs at https://cloud-images.ubuntu.com/aws-eks/ . Maybe they can fix your issue. You can build your self-managed Kubernetes control plane on these AMIs. The optimized AMI has disabled some services that may affect network configuration in the OS.

It says clearly on the page: "These images are customised specifically for the EKS service, and are not intended as general OS images."

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Nov 17, 2024
@wtvamp

wtvamp commented Nov 18, 2024

This issue is not stale and is still a blocker for thousands of users across different Internet forums.

@jayanthvn
Contributor

@wtvamp -

It seems CoreDNS is unable to reach the Kubernetes API server, and I noticed you’re using an Ubuntu AMI. I have a few questions and suggestions for troubleshooting:

  1. Ubuntu Kernel Version:
    There’s a known compatibility issue with certain Ubuntu kernels. Could you check your kernel version and verify against the guidance here:
    CNI Compatibility
    Known Issues

  2. Subnet IP Assignment:
    From what I can see, CoreDNS appears to have obtained an IP from the subnet. Could you confirm this?

  3. IP Table Rules Verification:
    Have you verified that the pod routes and IP table rules are programmed correctly? If not, you can refer to the CNI Proposal for details. Alternatively, use the script from the AWS EKS troubleshooting guide:
    AWS EKS Troubleshooting Script
    Once you have the logs, feel free to share them with the triage team at k8s-awscni-triage@amazon.com.

  4. Kubernetes Service Endpoint:
    Have you checked whether NAT rules for Kubernetes service endpoints are properly programmed?

  5. Potential Interference:
    Are there any tools like ec2-net-utils running on the nodes that might interfere with the CNI’s routes?

  6. API Server Reachability:
    Can you confirm if the API server is reachable from the node? You can verify this by running a curl command to the API server’s healthz endpoint using:
      • The service VIP.
      • The API server's direct IP.
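
As a minimal sketch of the reachability check in item 6, assuming curl is installed on the node: the helper names healthz_url and probe_healthz are made up here, 10.96.0.1:443 is the service VIP seen in the CoreDNS logs above, and the direct IP is a placeholder you would need to fill in.

```shell
#!/bin/sh
# Build the healthz URL for a host and an optional port (default 443).
# Both helper names below are hypothetical.
healthz_url() {
    printf 'https://%s:%s/healthz\n' "$1" "${2:-443}"
}

# Probe one endpoint and print the HTTP status code.
# curl prints 000 when it cannot connect within the timeout.
# --insecure skips TLS verification; acceptable only for a reachability test.
probe_healthz() {
    curl --insecure --silent --output /dev/null --max-time 5 \
        --write-out '%{http_code}\n' "$(healthz_url "$1" "$2")"
}

# Usage on a node:
#   probe_healthz 10.96.0.1 443              # service VIP
#   probe_healthz <control-plane-ip> 6443    # direct IP; 6443 is the kubeadm default
```

Any HTTP status, even 401 or 403, shows that the network path works, while 000 points to a routing or NAT problem. For item 4, on a node where kube-proxy runs in iptables mode, iptables -t nat -S KUBE-SERVICES | grep 10.96.0.1 should show whether a rule for the VIP has been programmed.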

Lastly, I’d like to mention that there are upstream CI jobs running successfully based on non-EKS Amazon Linux 2 (AL2/AL23) AMIs with the AWS VPC CNI plugin, demonstrating compatibility and stability with those environments.

@jayanthvn jayanthvn removed the stale Issue or PR is stale label Nov 18, 2024
@orsenthil
Member

Hello @wtvamp ,

I can use vpc-cni on a kops cluster by following the instructions from

https://kops.sigs.k8s.io/networking/aws-vpc/

$ kops validate cluster --wait 10m
Using cluster from kubectl context: cluster3.k8s.local

Validating cluster cluster3.k8s.local

INSTANCE GROUPS
NAME				ROLE		MACHINETYPE	MIN	MAX	SUBNETS
control-plane-us-west-2a	ControlPlane	t3.medium	1	1	us-west-2a
nodes-us-west-2a		Node		t3.medium	1	1	us-west-2a
nodes-us-west-2b		Node		t3.medium	1	1	us-west-2b

NODE STATUS
NAME			ROLE		READY
i-03edbf8be98b99ef2	control-plane	True
i-054516839dd4e9e24	node		True
i-07ed1db4adfc39d01	node		True

Your cluster cluster3.k8s.local is ready
$ kubectl get pods -o wide -A |grep aws-node
kube-system   aws-node-5fxc8                                  2/2     Running   0               5m27s   172.20.8.157     i-03edbf8be98b99ef2   <none>           <none>
kube-system   aws-node-dq6px                                  2/2     Running   0               3m5s    172.20.248.87    i-054516839dd4e9e24   <none>           <none>
kube-system   aws-node-termination-handler-69cf458879-mrn62   1/1     Running   0               5m27s   172.20.8.157     i-03edbf8be98b99ef2   <none>           <none>
kube-system   aws-node-wccl4                                  2/2     Running   0               3m4s    172.20.39.144    i-07ed1db4adfc39d01   <none>           <none>
$ kubectl get pod/aws-node-5fxc8 -n kube-system -o yaml |grep -i image
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.18.1
    imagePullPolicy: IfNotPresent
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.1.1
    imagePullPolicy: IfNotPresent
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.18.1
    imagePullPolicy: IfNotPresent
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.1.1
    imageID: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent@sha256:3eaafed26b90b8447d47ae2dc8d2a112845c25a497a0df14bb70ba51bde0ade8
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.18.1
    imageID: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni@sha256:cfc81b0cf0429742eec8054c376cdc9ea30c1090f362f83e70eb6eca6e155d43
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.18.1
    imageID: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init@sha256:5ae0745bbacb58189c7237c835b63d06e5aad5dd368b038f72e201381c003454

The default kops cluster was using Ubuntu 22.04.4.

$ kubectl get nodes -o wide
NAME                  STATUS   ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
i-03edbf8be98b99ef2   Ready    control-plane   7m36s   v1.30.2   172.20.8.157    18.237.174.212   Ubuntu 22.04.4 LTS   6.5.0-1020-aws   containerd://1.7.16
i-054516839dd4e9e24   Ready    node            4m50s   v1.30.2   172.20.248.87   54.185.221.117   Ubuntu 22.04.4 LTS   6.5.0-1020-aws   containerd://1.7.16
i-07ed1db4adfc39d01   Ready    node            4m48s   v1.30.2   172.20.39.144   54.189.212.234   Ubuntu 22.04.4 LTS   6.5.0-1020-aws   containerd://1.7.16

Let me know if you need instructions for this. I will format them and paste them here.

@orsenthil
Member

orsenthil commented Nov 18, 2024

Cluster Create Instructions for Non EKS Cluster.

export NAME=
export KOPS_STATE_STORE=
export AWS_REGION=

kops create cluster \
    --name=${NAME} \
    --cloud=aws \
    --zones=${AWS_REGION}a,${AWS_REGION}b \
    --networking=amazonvpc \
    --control-plane-size=t3.medium \
    --node-size=t3.medium \
    --node-count=2 \
    --control-plane-volume-size=20 \
    --node-volume-size=20 \
    --ssh-public-key=SSH-PUBLIC-KEY \
    --dry-run \
    --output=yaml > cluster-config.yaml

kops create -f cluster-config.yaml

kops create secret --name ${NAME} sshpublickey admin -i SSH-PUBLIC-KEY

kops update cluster --name ${NAME} --yes

kops export kubecfg --admin

kops validate cluster --wait 10m
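
The three exports at the top are left blank on purpose and must be filled in before running anything else. A small guard like the following, a sketch with a made-up helper name require_vars, would catch an unset variable before kops create cluster builds zone names such as ${AWS_REGION}a from an empty value.

```shell
#!/bin/sh
# Return non-zero and name each variable that is unset or empty.
# require_vars is a hypothetical helper, not part of kops.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage before the kops commands above:
#   require_vars NAME KOPS_STATE_STORE AWS_REGION || exit 1
```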

@orsenthil
Member

Closing this as we have provided the instructions for using vpc-cni on a non eks cluster - #2839 (comment)

If there are any other specific issues, please share a step by step method to reproduce it in a different issue.

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
