
Sometimes a pod gets an IP assigned to the main host interface (eth0, eth1, eth2, etc.), which causes routing issues #1094

Closed
michalzxc opened this issue Jul 15, 2020 · 6 comments

2020-07-15T15:52:09.541Z [INFO]  auth.handler: authenticating
2020-07-15T15:52:39.541Z [ERROR] auth.handler: error authenticating: error="Put https://vault.bootstrap:8200/v1/auth/kubernetes/login: dial tcp: i/o timeout" backoff=1.389202839
2020-07-15T15:52:40.930Z [INFO]  auth.handler: authenticating
2020-07-15T15:53:10.931Z [ERROR] auth.handler: error authenticating: error="Put https://vault.bootstrap:8200/v1/auth/kubernetes/login: dial tcp: i/o timeout" backoff=1.822171521
2020-07-15T15:53:12.753Z [INFO]  auth.handler: authenticating
2020-07-15T15:53:42.754Z [ERROR] auth.handler: error authenticating: error="Put https://vault.bootstrap:8200/v1/auth/kubernetes/login: dial tcp: i/o timeout" backoff=2.602361107
2020-07-15T15:53:45.356Z [INFO]  auth.handler: authenticating
data                            ote-mab-internal-9fd645c6f-qbmtp                          0/1     Init:0/2           0          55m     10.64.139.211   ip-10-64-185-133.eu-west-1.compute.internal   <none>           <none>
[root@ote001spot02-i-0d36f9c609ba755b3 ~]# ip addr|grep 10.64.139.211
    inet 10.64.139.211/18 brd 10.64.191.255 scope global dynamic eth4

I am not sure whether there is supposed to be a mechanism that makes this work, but the issue can be fixed by hand by changing the host IP to one of the unused IPs attached to the host:

[root@ote001spot02-i-0d36f9c609ba755b3 ~]# ip addr del 10.64.139.211/18 dev eth4
[root@ote001spot02-i-0d36f9c609ba755b3 ~]# ip addr add 10.64.138.69/18 dev eth4
[root@ote001spot02-i-0d36f9c609ba755b3 ~]# ip route add default via 10.64.128.1 dev eth4 table 6
  1. Removed the duplicated IP (shared between the host and the pod) from the host
  2. Added one of the unused IPs to the host interface (found via curl http://localhost:61679/v1/enis | python -m json.tool; see the sketch after this list)
  3. Restored the routing (it had been deleted together with the host's main IP)
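
For reference, the lookup in step 2 can also be scripted. Below is a minimal Go sketch that queries the same ipamd introspection endpoint and pretty-prints the response; the JSON field layout differs across CNI versions, so the sketch deliberately assumes no field names:

// Sketch: dump the ipamd introspection state (equivalent to the
// curl | python -m json.tool pipeline above). No field names assumed.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:61679/v1/enis")
	if err != nil {
		log.Fatalf("querying ipamd introspection endpoint: %v", err)
	}
	defer resp.Body.Close()

	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("reading response: %v", err)
	}

	// Re-indent the JSON so assigned and unassigned addresses are easy to scan.
	var buf bytes.Buffer
	if err := json.Indent(&buf, raw, "", "  "); err != nil {
		log.Fatalf("response is not valid JSON: %v", err)
	}
	fmt.Println(buf.String())
}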

And that fixed it:

2020-07-15T15:53:12.753Z [INFO]  auth.handler: authenticating
2020-07-15T15:53:42.754Z [ERROR] auth.handler: error authenticating: error="Put https://vault.bootstrap:8200/v1/auth/kubernetes/login: dial tcp: i/o timeout" backoff=2.602361107
2020-07-15T15:53:45.356Z [INFO]  auth.handler: authenticating
2020-07-15T15:54:15.357Z [ERROR] auth.handler: error authenticating: error="Put https://vault.bootstrap:8200/v1/auth/kubernetes/login: dial tcp: i/o timeout" backoff=2.872125217
2020-07-15T15:54:18.229Z [INFO]  auth.handler: authenticating
2020-07-15T15:54:48.230Z [ERROR] auth.handler: error authenticating: error="Put https://vault.bootstrap:8200/v1/auth/kubernetes/login: dial tcp: i/o timeout" backoff=1.852514035
2020-07-15T15:54:50.083Z [INFO]  auth.handler: authenticating
2020-07-15T15:55:00.123Z [INFO]  auth.handler: authentication successful, sending token to sinks
2020-07-15T15:55:00.123Z [INFO]  auth.handler: starting renewal process
2020-07-15T15:55:00.124Z [INFO]  sink.file: token written: path=/vault/.vault-token
2020-07-15T15:55:00.124Z [INFO]  sink.server: sink server stopped
2020-07-15T15:55:00.124Z [INFO]  sinks finished, exiting

michalzxc changed the title from "Sometimes a pod gets an IP assigned to the main host interface (eth0, eth1, eth2, etc.)" to "Sometimes a pod gets an IP assigned to the main host interface (eth0, eth1, eth2, etc.), which causes routing issues" Jul 15, 2020
mogren (Contributor) commented Jul 15, 2020

Hi @michalzxc,

What do you mean by "pod is getting IP assigned to main host interface"? The first IP on every ENI is ignored by the CNI. Pods with hostNetworking: true will get a port mapped on the node IP, which is the same as the first IP of eth0. Was this the case here?

If you could run sudo /opt/cni/bin/aws-cni-support.sh on the worker node where you saw the issue and share the output with me, I could check the logs.

michalzxc (Author) commented Jul 16, 2020

The node I sent the logs from was a spot instance and it is gone now; I will upload the logs when I see this happening again - I have seen it about 3 times this week already.
It wasn't a hostNetworking: true pod; the pod got the IP of the ENI (eth4 - 10.64.139.211).

michalzxc (Author) commented

@mogren Got it

sys-maintenance                 sys-maintenance-ipa-host-cleaner-1594929600-czsrb         0/1     Init:0/2           0          16h     10.64.40.127    ip-10-64-19-153.eu-west-1.compute.internal    <none>           <none>

This pod (not hostNetworking: true) got the IP of the eth0 interface:

[root@ote001spot03-i-09a17344ff3b19d02 ~]# ip addr|grep 10.64.40.127
    inet 10.64.40.127/18 brd 10.64.63.255 scope global dynamic eth0

aws-cni-support.sh result:
https://drive.google.com/file/d/1puqH8XKd-8uSU6wZW7w3EG4QxDUPLnyY/view?usp=sharing

mogren (Contributor) commented Jul 17, 2020

@michalzxc Thanks a lot! Will take a look at the logs today.

SaranBalaji90 (Contributor) commented Aug 28, 2020

As per the attached logs, we have three ENIs on this instance:
Primary ENI: eni-08fee179d6f3ecfd8 - Primary IP 10.64.19.153
Secondary ENIs:
eni-08252f83ef3b0d34f (deviceNumber 2) - Primary IP 10.64.40.127
eni-0bf15482663df5aec (deviceNumber 3) - Primary IP 10.64.12.167

As per the logs, the ENI and its IPs were provisioned, but completing the network setup failed. The ENI had nevertheless already been added to the datastore; the only thing missing was adding its primary IP to the datastore.

This line was not executed: https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.7.1/pkg/ipamd/ipamd.go#L781

{"level":"debug","ts":"2020-07-16T10:19:51.887Z","caller":"retry/retry.go:69","msg":"Not able to set route 0.0.0.0/0 via 10.64.0.1 table 2"}
.
.
.
{"level":"error","ts":"2020-07-16T10:19:56.314Z","caller":"ipamd/ipamd.go:677","msg":"Failed to increase pool size: failed to set up ENI eni-08252f83ef3b0d34f network: setupENINetwork: unable to replace route entry 0.0.0.0: network is unreachable"}

Now, when ipamd found that the IP pool needed to be increased, it added all the IPs to the datastore and started allocating them, which also included the primary IP of the ENI.

{"level":"debug","ts":"2020-07-16T10:20:01.316Z","caller":"ipamd/ipamd.go:514","msg":"Starting to increase IP pool size"}

{"level":"debug","ts":"2020-07-16T10:21:21.375Z","caller":"ipamd/ipamd.go:508","msg":"Reconciling ENI/IP pool info because time since last 1m2.529627689s <= 1m0s"}
.
.
.
{"level":"debug","ts":"2020-07-16T10:21:21.426Z","caller":"awsutils/awsutils.go:463","msg":"Found IP addresses [10.64.40.127 10.64.37.112 10.64.20.161 10.64.62.83 10.64.39.244 10.64.5.87 10.64.48.105 10.64.46.10 10.64.34.141 10.64.6.222] on ENI 06:53:f7:50:a5:f0"}
{"level":"debug","ts":"2020-07-16T10:21:21.426Z","caller":"ipamd/ipamd.go:508","msg":"Reconcile existing ENI eni-08fee179d6f3ecfd8 IP pool"}
{"level":"debug","ts":"2020-07-16T10:21:21.426Z","caller":"ipamd/ipamd.go:997","msg":"Reconcile and skip primary IP 10.64.19.153 on ENI eni-08fee179d6f3ecfd8"}
{"level":"debug","ts":"2020-07-16T10:21:21.426Z","caller":"ipamd/ipamd.go:508","msg":"Reconcile existing ENI eni-0bf15482663df5aec IP pool"}
{"level":"debug","ts":"2020-07-16T10:21:21.426Z","caller":"ipamd/ipamd.go:997","msg":"Reconcile and skip primary IP 10.64.12.167 on ENI eni-0bf15482663df5aec"}
{"level":"debug","ts":"2020-07-16T10:21:21.426Z","caller":"ipamd/ipamd.go:508","msg":"Reconcile existing ENI eni-08252f83ef3b0d34f IP pool"}
{"level":"info","ts":"2020-07-16T10:21:21.426Z","caller":"ipamd/ipamd.go:1075","msg":"Added ENI(eni-08252f83ef3b0d34f)'s IP 10.64.40.127 to datastore"}
{"level":"debug","ts":"2020-07-16T10:21:21.426Z","caller":"ipamd/ipamd.go:508","msg":"Successfully Reconciled ENI/IP pool"}
{"level":"debug","ts":"2020-07-16T10:21:21.426Z","caller":"ipamd/ipamd.go:508","msg":"IP Address Pool stats: total: 28, assigned: 19"}

Possible solutions:

Ideally, we shouldn't add the ENI to the datastore until its network is completely set up.
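
A hypothetical sketch of that ordering (again, all names are invented for illustration, not the actual ipamd API): the datastore learns about an ENI and its primary IP only after the host network setup has fully succeeded, so a half-configured ENI can never leak its primary IP into the pod pool:

// Hypothetical sketch of the suggested fix ordering.
package main

import (
	"errors"
	"fmt"
)

// dataStore maps ENI IDs to their primary IPs, so the primary IP is always
// known before any of the ENI's IPs can become assignable.
type dataStore struct{ primaryIPs map[string]string }

// setupENINetwork stands in for the real routing/rule setup; here it fails
// the same way the logs above show.
func setupENINetwork(eniID string) error {
	return errors.New("unable to replace route entry 0.0.0.0: network is unreachable")
}

func attachENI(ds *dataStore, eniID, primaryIP string) error {
	if err := setupENINetwork(eniID); err != nil {
		// The datastore never learned about this ENI, so a later
		// reconcile has nothing half-configured to backfill.
		return fmt.Errorf("setup ENI %s network: %w", eniID, err)
	}
	ds.primaryIPs[eniID] = primaryIP // recorded only on full success
	return nil
}

func main() {
	ds := &dataStore{primaryIPs: map[string]string{}}
	if err := attachENI(ds, "eni-08252f83ef3b0d34f", "10.64.40.127"); err != nil {
		fmt.Println("attach failed, will retry:", err)
	}
}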

mogren (Contributor) commented Sep 2, 2020

This has been resolved in #1177 and will be included in the next release.

mogren closed this as completed Sep 2, 2020