
Add a better error message indicating why CNI ipamD is not starting #122

Closed
liwenwu-amazon opened this issue Jul 2, 2018 · 19 comments

@liwenwu-amazon
Contributor

Today, in the CNI DaemonSet (aws-node), whenever ipamD restarts it queries the Kubernetes API server for the Pods already running on the node. If it cannot reach the Kubernetes API server, ipamD will exit and you will see the following logs in /var/log/aws-routed-cni/ipamd.log.xxx

2018-07-02T15:00:33Z [INFO] Starting L-IPAMD 1.0.0 ...
2018-07-02T15:00:33Z [INFO] Testing communication with server

..
2018-07-02T15:00:33Z [INFO] Starting L-IPAMD 1.0.0 ...
2018-07-02T15:00:33Z [INFO] Testing communication with server

ipamD needs to print an explicit error stating that it failed because it could NOT communicate with the API server.

To verify that security groups are configured correctly between the worker node and the Kubernetes API server, you can run the following commands:

# find out kubernetes service IP
kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   24d

# verify the worker node can reach port 443 of the master
telnet 10.100.0.1 443
Trying 10.100.0.1...
Connected to 10.100.0.1.
Escape character is '^]'.
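If telnet is not installed on the node, the same reachability check can be sketched in pure bash using the /dev/tcp pseudo-device (the VIP below is the default EKS service IP from the example above; substitute your cluster's):

```shell
#!/usr/bin/env bash
# Probe TCP reachability to the API server VIP without telnet.
# A connection that would hang is cut off after 3 seconds by `timeout`,
# instead of hanging indefinitely the way telnet does.
check_api() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

check_api 10.100.0.1 443   # substitute your cluster's service VIP
```

"unreachable" here points at the same security-group problem the telnet hang does.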
@kuroneko25

I think we are also running into this issue, and maybe #104 as well. But I don't believe it's because of security groups. I suspect it has something to do with service accounts or RBAC, but right now it's extremely difficult to tell what the actual failure is, since there is nothing in ipamd.log or the container log to help us debug.

All I'm seeing is the crash loop output you mentioned above and this single line in the container log:

ERROR: logging before flag.Parse: W0615 02:36:27.560592 9 client_config.go:533] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.

which is probably expected, since you are invoking clientcmd.BuildConfigFromFlags without explicitly setting the apiserver or kubeconfig.
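For reference (a sketch, not part of the plugin): when neither flag is set, client-go falls back to the in-cluster config, which relies on the service-account files mounted into the pod and the kubernetes service environment variables. A quick way to check those prerequisites from inside the failing container:

```shell
#!/usr/bin/env bash
# Check the prerequisites of client-go's in-cluster config:
# the mounted service-account token/CA and the KUBERNETES_SERVICE_HOST env var.
check_incluster_prereqs() {
  local f
  for f in /var/run/secrets/kubernetes.io/serviceaccount/token \
           /var/run/secrets/kubernetes.io/serviceaccount/ca.crt; do
    [ -f "$f" ] && echo "present: $f" || echo "missing: $f"
  done
  echo "KUBERNETES_SERVICE_HOST=${KUBERNETES_SERVICE_HOST:-<unset>}"
}

check_incluster_prereqs
```

Missing files point at a service-account problem; files present but the API still unreachable points back at networking.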

@liwenwu-amazon
Contributor Author

@kuroneko25, is this an EKS cluster?

@kuroneko25

Yes.

@liwenwu-amazon
Contributor Author

On your worker node, are you able to reach port 443 of the master? What's the output of the following on your worker node:

telnet 10.100.0.1 443

@kuroneko25

Should I try with the real master IP or the in-cluster VIP?

@liwenwu-amazon
Contributor Author

The cluster VIP. It's in the output of:

kubectl get svc kubernetes

@kuroneko25

kuroneko25 commented Jul 3, 2018

Running this on my worker node (on the host VM not from any containers):

telnet 172.20.0.1 443
Trying 172.20.0.1...

It seems to just hang. I believe security groups are configured correctly.

@liwenwu-amazon
Contributor Author

What's the output of:

kubectl get svc kubernetes

In a default EKS cluster, the kubernetes service VIP is 10.100.0.1.

@kuroneko25

kuroneko25 commented Jul 3, 2018

kubectl get svc kubernetes
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   172.20.0.1   <none>        443/TCP   10d

@liwenwu-amazon
Contributor Author

I think you are running into a security groups issue. In other words, your worker node can NOT reach the API server, 172.20.0.1, on port 443.

@liwenwu-amazon
Contributor Author

@kuroneko25 The specific security group used for creating the EKS cluster, which is also returned from:

aws eks describe-cluster --name <your cluster>

It needs to have port 443 open in its inbound rules.
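As a rough self-check (a sketch, not an official procedure), you can scan the security group's describe output for a TCP 443 ingress rule; the `ToPort` field name below comes from the JSON that `aws ec2 describe-security-groups` emits:

```shell
#!/usr/bin/env bash
# Rough check: does a security group's inbound rule set include port 443?
# $1 is the JSON from `aws ec2 describe-security-groups --group-ids <sg> --output json`.
# Grepping JSON is crude, but it serves as a quick smoke test.
has_443_ingress() {
  grep -q '"ToPort": 443' <<<"$1"
}

# Example usage (the sg id is the one from this thread; use your cluster's):
#   sg_json=$(aws ec2 describe-security-groups --group-ids sg-ac7217dc --output json)
#   has_443_ingress "$sg_json" && echo "port 443 open" || echo "port 443 NOT open"
```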

@kuroneko25

I think we figured out the issue in our SG configuration. Thanks for the help!

@hobbsh
Contributor

hobbsh commented Jul 19, 2018

@kuroneko25 Do you mind sharing the solution? I am seeing a similar problem.

@liwenwu-amazon How is this supposed to work exactly if my nodes/pods are on 10.99.x.x and there seems to be no routing to speak of to reach 172.20.x.x?

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere             ip-172-20-0-10.us-west-2.compute.internal  /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere             ip-172-20-0-10.us-west-2.compute.internal  /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  anywhere             ip-172-20-0-1.us-west-2.compute.internal  /* default/kubernetes:https cluster IP */ tcp dpt:https

FWIW:

2018-07-18T23:32:32Z [INFO] Starting L-IPAMD 1.0.0  ...
2018-07-18T23:32:32Z [INFO] Testing communication with server
2018-07-18T23:32:32Z [INFO] Running with Kubernetes cluster version: v1.10. git version: v1.10.3. git tree state: clean. commit: 2bba0127d85d5a46ab4b778548be28623b32d0b0. platform: linux/amd64
2018-07-18T23:32:32Z [INFO] Communication with server successful
2018-07-18T23:32:32Z [INFO] Starting Pod controller
2018-07-18T23:32:32Z [DEBUG] Discovered region: us-west-2
2018-07-18T23:32:32Z [DEBUG] Found avalability zone: us-west-2a 
2018-07-18T23:32:32Z [DEBUG] Discovered the instance primary ip address: 10.99.60.199
2018-07-18T23:32:32Z [DEBUG] Found instance-id: i-019cf65b7dd4f6f6b 
2018-07-18T23:32:32Z [DEBUG] Found instance-type: m4.large 
2018-07-18T23:32:32Z [DEBUG] Found primary interface's mac address: 06:86:7a:da:32:2a
2018-07-18T23:32:32Z [DEBUG] Discovered 1 interfaces.
2018-07-18T23:32:32Z [DEBUG] Found device-number: 0 
2018-07-18T23:32:32Z [DEBUG] Found account ID: 140222353192
2018-07-18T23:32:32Z [DEBUG] Found eni: eni-e88037e3 
2018-07-18T23:32:32Z [DEBUG] Found eni eni-e88037e3 is a primary eni
2018-07-18T23:32:32Z [DEBUG] Found security-group id: sg-ac7217dc
2018-07-18T23:32:32Z [DEBUG] Found subnet-id: subnet-d49ffd9f 
2018-07-18T23:32:32Z [DEBUG] Found vpc-ipv4-cidr-block: 10.99.0.0/16 
2018-07-18T23:32:32Z [DEBUG] Total number of interfaces found: 1 
2018-07-18T23:32:32Z [DEBUG] Found eni mac address : 06:86:7a:da:32:2a
2018-07-18T23:32:32Z [DEBUG] Using device number 0 for primary eni: eni-e88037e3
2018-07-18T23:32:32Z [DEBUG] Found eni: eni-e88037e3, mac 06:86:7a:da:32:2a, device 0
2018-07-18T23:32:32Z [DEBUG] Found cidr 10.99.60.0/24 for eni 06:86:7a:da:32:2a
2018-07-18T23:32:32Z [DEBUG] Found ip addresses [10.99.60.199] on eni 06:86:7a:da:32:2a
2018-07-18T23:32:32Z [DEBUG] Discovered ENI eni-e88037e3
2018-07-18T23:32:32Z [INFO] Trying to allocate all available ip addresses on eni: eni-e88037e3
2018-07-18T23:32:32Z [INFO] Synced successfully with APIServer

@liwenwu-amazon liwenwu-amazon added this to the v1.2 milestone Aug 14, 2018
@cjbottaro

@liwenwu-amazon One of my nodes can connect to the API server fine, and the other can't. How can I debug this?

admin@ip-10-3-18-165:~$ telnet 10.3.0.1 443
Trying 10.3.0.1...
Connected to 10.3.0.1.
Escape character is '^]'.
admin@ip-10-3-18-164:~$ telnet 10.0.0.1 443
Trying 10.0.0.1...

Both nodes come from the same kops instance group, so they both have the same security group.

Thanks for the help.

@liwenwu-amazon
Contributor Author

@cjbottaro If this is the same Kubernetes cluster, they should use the same kubernetes service IP. Can you check which one is the service IP, 10.3.0.1 or 10.0.0.1?

@cjbottaro

Gah, I'm sorry... typo. They can both reach API.

@cjbottaro

@liwenwu-amazon But one of my pods can't reach the API server...

$ kubectl logs -n kube-system coredns-85fd49598-bpmpl
E0831 00:22:20.565154       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.3.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.3.0.1:443: i/o timeout
E0831 00:22:20.565365       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.3.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.3.0.1:443: i/o timeout
E0831 00:22:20.565629       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.3.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.3.0.1:443: i/o timeout

That pod is running on this node which can connect fine:

admin@ip-10-3-18-164:~$ telnet 10.3.0.1 443
Trying 10.3.0.1...
Connected to 10.3.0.1.
Escape character is '^]'.

@liwenwu-amazon
Contributor Author

@cjbottaro Are you using any HTTP_PROXY? Check its settings.
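A minimal sketch of that check: print any proxy variables set in the environment where the pod runs, since a misconfigured proxy can silently swallow traffic to the API server. No output means no proxy is configured in that environment.

```shell
#!/usr/bin/env bash
# Print any proxy-related environment variables that could intercept
# traffic to the API server.
print_proxy_env() {
  local v
  for v in HTTP_PROXY HTTPS_PROXY NO_PROXY http_proxy https_proxy no_proxy; do
    [ -n "${!v}" ] && echo "$v=${!v}"
  done
  return 0
}

print_proxy_env
```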

@cjbottaro

It was all related to this for some reason: kubernetes/kops#2189 (comment)

Once I switched to the suggested image, all networking problems went away.

Thanks!
