
Add a better error message indicating why CNI ipamD is not starting #122

Closed
liwenwu-amazon opened this issue Jul 2, 2018 · 19 comments

@liwenwu-amazon
Contributor

Today, in the CNI DaemonSet (aws-node), whenever ipamD restarts it queries the Kubernetes API server for the Pods already running on the node. If it cannot reach the Kubernetes API server, ipamD will exit and you will see the following logs in /var/log/aws-routed-cni/ipamd.log.xxx

2018-07-02T15:00:33Z [INFO] Starting L-IPAMD 1.0.0 ...
2018-07-02T15:00:33Z [INFO] Testing communication with server

..
2018-07-02T15:00:33Z [INFO] Starting L-IPAMD 1.0.0 ...
2018-07-02T15:00:33Z [INFO] Testing communication with server

ipamD needs to print an explicit error stating that it failed because it could NOT communicate with the API server.

To verify that security groups are configured correctly between the worker node and the Kubernetes API server, you can run the following commands:

# find out kubernetes service IP
kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   24d

# verify the worker node can reach port 443 of the master
telnet 10.100.0.1 443
Trying 10.100.0.1...
Connected to 10.100.0.1.
Escape character is '^]'.
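If telnet is not installed on the node, the same reachability check can be sketched in pure bash using the /dev/tcp pseudo-device (the VIP below is the default EKS service IP from the example above; substitute your cluster's):

```shell
#!/usr/bin/env bash
# Probe TCP reachability to the API server VIP without telnet.
# A connection that would hang is cut off after 3 seconds by `timeout`,
# instead of hanging indefinitely the way telnet does.
check_api() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

check_api 10.100.0.1 443   # substitute your cluster's service VIP
```

"unreachable" here points at the same security-group problem the telnet hang does.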
@kuroneko25

I think we are also running into this issue, and maybe #104 as well. But I don't believe it's because of security groups. I suspect it has something to do with service accounts or RBAC, but right now it's extremely difficult to tell what the actual failure is, since there is nothing in ipamd.log or the container log to help us debug.

All I'm seeing is the crash loop output you mentioned above and this single line in the container log:

ERROR: logging before flag.Parse: W0615 02:36:27.560592 9 client_config.go:533] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.

which is probably expected, since you are invoking clientcmd.BuildConfigFromFlags without explicitly setting the apiserver or kubeconfig.
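For reference (a sketch, not part of the plugin): when neither flag is set, client-go falls back to the in-cluster config, which relies on the service-account files mounted into the pod and the kubernetes service environment variables. A quick way to check those prerequisites from inside the failing container:

```shell
#!/usr/bin/env bash
# Check the prerequisites of client-go's in-cluster config:
# the mounted service-account token/CA and the KUBERNETES_SERVICE_HOST env var.
check_incluster_prereqs() {
  local f
  for f in /var/run/secrets/kubernetes.io/serviceaccount/token \
           /var/run/secrets/kubernetes.io/serviceaccount/ca.crt; do
    [ -f "$f" ] && echo "present: $f" || echo "missing: $f"
  done
  echo "KUBERNETES_SERVICE_HOST=${KUBERNETES_SERVICE_HOST:-<unset>}"
}

check_incluster_prereqs
```

Missing files point at a service-account problem; files present but the API still unreachable points back at networking.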

@liwenwu-amazon
Contributor Author

@kuroneko25, is this an EKS cluster?

@kuroneko25

Yes.

@liwenwu-amazon
Contributor Author

On your worker node, are you able to reach port 443 of the master? What's the output of the following on your worker node:

telnet 10.100.0.1 443

@kuroneko25

Should I try with the real master IP or the in-cluster VIP?

@liwenwu-amazon
Contributor Author

The cluster VIP. It's in the output of:

kubectl get svc kubernetes

@kuroneko25

kuroneko25 commented Jul 3, 2018

Running this on my worker node (on the host VM not from any containers):

telnet 172.20.0.1 443
Trying 172.20.0.1...

It seems to just hang. I believe security groups are configured correctly.

@liwenwu-amazon
Contributor Author

What's the output of:

kubectl get svc kubernetes

In a default EKS cluster, the kubernetes service VIP is 10.100.0.1.

@kuroneko25

kuroneko25 commented Jul 3, 2018

kubectl get svc kubernetes
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   172.20.0.1   <none>        443/TCP   10d

@liwenwu-amazon
Contributor Author

I think you are running into a security groups issue. In other words, your worker node can NOT reach the API server, 172.20.0.1, on port 443.

@liwenwu-amazon
Contributor Author

@kuroneko25 The specific security group used for creating the EKS cluster, which is also returned from:

aws eks describe-cluster --name <your cluster>

It needs to have port 443 open in its inbound rules.
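As a rough self-check (a sketch, not an official procedure), you can scan the security group's describe output for a TCP 443 ingress rule; the `ToPort` field name below comes from the JSON that `aws ec2 describe-security-groups` emits:

```shell
#!/usr/bin/env bash
# Rough check: does a security group's inbound rule set include port 443?
# $1 is the JSON from `aws ec2 describe-security-groups --group-ids <sg> --output json`.
# Grepping JSON is crude, but it serves as a quick smoke test.
has_443_ingress() {
  grep -q '"ToPort": 443' <<<"$1"
}

# Example usage (the sg id is the one from this thread; use your cluster's):
#   sg_json=$(aws ec2 describe-security-groups --group-ids sg-ac7217dc --output json)
#   has_443_ingress "$sg_json" && echo "port 443 open" || echo "port 443 NOT open"
```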

@kuroneko25

I think we figured out the issue in our SG configuration. Thanks for the help!

@hobbsh
Contributor

hobbsh commented Jul 19, 2018

@kuroneko25 Do you mind sharing the solution? I am seeing a similar problem.

@liwenwu-amazon How is this supposed to work exactly if my nodes/pods are on 10.99.x.x and there seems to be no routing to speak of to reach 172.20.x.x?

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination         
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere             ip-172-20-0-10.us-west-2.compute.internal  /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere             ip-172-20-0-10.us-west-2.compute.internal  /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  anywhere             ip-172-20-0-1.us-west-2.compute.internal  /* default/kubernetes:https cluster IP */ tcp dpt:https

FWIW:

2018-07-18T23:32:32Z [INFO] Starting L-IPAMD 1.0.0  ...
2018-07-18T23:32:32Z [INFO] Testing communication with server
2018-07-18T23:32:32Z [INFO] Running with Kubernetes cluster version: v1.10. git version: v1.10.3. git tree state: clean. commit: 2bba0127d85d5a46ab4b778548be28623b32d0b0. platform: linux/amd64
2018-07-18T23:32:32Z [INFO] Communication with server successful
2018-07-18T23:32:32Z [INFO] Starting Pod controller
2018-07-18T23:32:32Z [DEBUG] Discovered region: us-west-2
2018-07-18T23:32:32Z [DEBUG] Found avalability zone: us-west-2a 
2018-07-18T23:32:32Z [DEBUG] Discovered the instance primary ip address: 10.99.60.199
2018-07-18T23:32:32Z [DEBUG] Found instance-id: i-019cf65b7dd4f6f6b 
2018-07-18T23:32:32Z [DEBUG] Found instance-type: m4.large 
2018-07-18T23:32:32Z [DEBUG] Found primary interface's mac address: 06:86:7a:da:32:2a
2018-07-18T23:32:32Z [DEBUG] Discovered 1 interfaces.
2018-07-18T23:32:32Z [DEBUG] Found device-number: 0 
2018-07-18T23:32:32Z [DEBUG] Found account ID: 140222353192
2018-07-18T23:32:32Z [DEBUG] Found eni: eni-e88037e3 
2018-07-18T23:32:32Z [DEBUG] Found eni eni-e88037e3 is a primary eni
2018-07-18T23:32:32Z [DEBUG] Found security-group id: sg-ac7217dc
2018-07-18T23:32:32Z [DEBUG] Found subnet-id: subnet-d49ffd9f 
2018-07-18T23:32:32Z [DEBUG] Found vpc-ipv4-cidr-block: 10.99.0.0/16 
2018-07-18T23:32:32Z [DEBUG] Total number of interfaces found: 1 
2018-07-18T23:32:32Z [DEBUG] Found eni mac address : 06:86:7a:da:32:2a
2018-07-18T23:32:32Z [DEBUG] Using device number 0 for primary eni: eni-e88037e3
2018-07-18T23:32:32Z [DEBUG] Found eni: eni-e88037e3, mac 06:86:7a:da:32:2a, device 0
2018-07-18T23:32:32Z [DEBUG] Found cidr 10.99.60.0/24 for eni 06:86:7a:da:32:2a
2018-07-18T23:32:32Z [DEBUG] Found ip addresses [10.99.60.199] on eni 06:86:7a:da:32:2a
2018-07-18T23:32:32Z [DEBUG] Discovered ENI eni-e88037e3
2018-07-18T23:32:32Z [INFO] Trying to allocate all available ip addresses on eni: eni-e88037e3
2018-07-18T23:32:32Z [INFO] Synced successfully with APIServer

@liwenwu-amazon liwenwu-amazon added this to the v1.2 milestone Aug 14, 2018
@cjbottaro

@liwenwu-amazon One of my nodes can connect to the API server fine, and the other can't. How can I debug this?

admin@ip-10-3-18-165:~$ telnet 10.3.0.1 443
Trying 10.3.0.1...
Connected to 10.3.0.1.
Escape character is '^]'.
admin@ip-10-3-18-164:~$ telnet 10.0.0.1 443
Trying 10.0.0.1...

Both nodes come from the same kops instance group, so they both have the same security group.

Thanks for the help.

@liwenwu-amazon
Contributor Author

@cjbottaro If this is the same Kubernetes cluster, they should use the same kubernetes service IP. Can you check which one is the service IP, 10.3.0.1 or 10.0.0.1?

@cjbottaro

Gah, I'm sorry... typo. They can both reach API.

@cjbottaro

@liwenwu-amazon But one of my pods can't reach the API server...

$ kubectl logs -n kube-system coredns-85fd49598-bpmpl
E0831 00:22:20.565154       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.3.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.3.0.1:443: i/o timeout
E0831 00:22:20.565365       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.3.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.3.0.1:443: i/o timeout
E0831 00:22:20.565629       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.3.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.3.0.1:443: i/o timeout

That pod is running on this node which can connect fine:

admin@ip-10-3-18-164:~$ telnet 10.3.0.1 443
Trying 10.3.0.1...
Connected to 10.3.0.1.
Escape character is '^]'.

@liwenwu-amazon
Contributor Author

@cjbottaro Are you using any HTTP_PROXY? Check its settings.
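A minimal sketch of that check: print any proxy variables set in the environment where the pod runs, since a misconfigured proxy can silently swallow traffic to the API server. No output means no proxy is configured in that environment.

```shell
#!/usr/bin/env bash
# Print any proxy-related environment variables that could intercept
# traffic to the API server.
print_proxy_env() {
  local v
  for v in HTTP_PROXY HTTPS_PROXY NO_PROXY http_proxy https_proxy no_proxy; do
    [ -n "${!v}" ] && echo "$v=${!v}"
  done
  return 0
}

print_proxy_env
```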

@cjbottaro

It was all related to this for some reason: kubernetes/kops#2189 (comment)

Once I switched to the suggested image, all networking problems went away.

Thanks!
