NodePort Connectivity Issue #231

Closed
nickdgriffin opened this issue Nov 14, 2018 · 13 comments
Labels
calico Calico integration issue

Comments


nickdgriffin commented Nov 14, 2018

Hello,

We are experiencing an issue on 1.2.1 that is functionally identical to #75: certain pods are not accessible via their NodePort from remote hosts, and a tcpdump shows a SYN/SYN-ACK exchange at the start followed by TCP retransmissions. As the mentioned ticket is about rp_filter, here are the values collected by the support bundler:

/proc/sys/net/ipv4/conf/all/rp_filter = 1
/proc/sys/net/ipv4/conf/default/rp_filter = 1
/proc/sys/net/ipv4/conf/eth0/rp_filter = 2
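
For reference, a quick way to collect these values on a node (a minimal sketch; interface names will vary per host):

# Print rp_filter for every interface (0 = off, 1 = strict, 2 = loose)
for f in /proc/sys/net/ipv4/conf/*/rp_filter; do
  echo "$f = $(cat "$f")"
done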

We have this popping up across different clusters (although they are all identical in terms of setup) after pods are created (Nginx in this case, but it may be happening with others), and the remedy is to delete the pods until the NodePort functions correctly.

I can send the support bundle and packet traces by email, and anything else that would help in identifying the cause of this and what we can do about it as it is quite problematic for us.

Thanks,
Nick


nickdgriffin commented Nov 15, 2018

Having disabled rp_filter across all interfaces (confirmed by enabling log_martians and seeing no messages in /var/log/messages), the issue still occurs. Before changing the rp_filter settings, enabling log_martians did produce messages for the secondary interfaces (which are set to loose by default), but that doesn't seem to be related to this problem.
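
Roughly what that looks like on a node (a sketch; eth0 stands in for each interface present on the host):

# Relax reverse-path filtering and log any martian packets
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.default.rp_filter=0
sysctl -w net.ipv4.conf.eth0.rp_filter=0
sysctl -w net.ipv4.conf.all.log_martians=1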

I can confirm that it only seems to occur with pods that have IPs on secondary interfaces. I can easily reproduce this by killing pods until they land on eth0 (where the NodePort works), then killing them again so they land on another interface (where it fails).

nickdgriffin (Author) commented

Cracked it.

So, the problem stems from the fact that the secondary interfaces that are added still have the source/destination check enabled, which must result in the ENI dropping the return packets from the pod. This can be proven by disabling the check on the ENI that the pod's IP is allocated on, after which connections succeed.

I will be submitting a PR to disable the check when ENIs are allocated.
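
For anyone hitting this in the meantime, the manual workaround is roughly the following (a sketch; the ENI ID is a placeholder, and it has to be repeated for every secondary ENI and again whenever ENIs are re-created):

# Disable the source/destination check on the ENI holding the pod's IP
aws ec2 modify-network-interface-attribute \
  --network-interface-id eni-0123456789abcdef0 \
  --no-source-dest-check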

nickdgriffin (Author) commented

Worth mentioning that this issue was brought out by setting WARM_ENI_TARGET to 20, as has been advised in numerous places to prevent scheduling issues on hosts that have run out of IPs. That meant all nodes had the maximum number of ENIs attached, which increased the chance that the CNI picked an IP associated with a secondary interface instead of the primary.
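
For context, that setting is an env var on the CNI DaemonSet, e.g. (a sketch; assumes the standard aws-node DaemonSet in kube-system):

# Pre-allocate warm ENIs; 20 effectively means "as many as the instance type allows"
kubectl -n kube-system set env daemonset/aws-node WARM_ENI_TARGET=20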


ikatson commented Nov 30, 2018

Have the same problem with 1.3.

The first packet hits eth0 (in my case 100.122.192.148) on the NodePort and gets DNAT'ed to the respective container on eth1 (in my case the container IP is 100.122.199.244, which is on eth1).
The container's reply gets routed back over eth1 because of a routing rule like "from 100.122.199.244 lookup 2".

I was trying to figure out why the reverse of the DNAT is not applied before the routing decision, but couldn't; that's probably just how Linux works. If the routing rule saw the un-DNAT'ed IP 100.122.192.148, the reply would be returned over eth0, the same way it came in.
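
To see the rule in question on the node (a sketch; the IP and table number mirror the example above and will differ per pod):

# The per-pod policy rule that sends replies out via the secondary ENI's table
ip rule show | grep 100.122.199.244
# e.g. "1536: from 100.122.199.244 lookup 2"
ip route show table 2
# table 2 routes via eth1, so the reply leaves eth1 instead of eth0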

Will try that PR to alter the ENIs, @nickdgriffin - thanks for it :)


ikatson commented Nov 30, 2018

Oh, actually, it sounds like it's more of a bug in how amazon-vpc-cni-k8s works with Calico.
I noticed that this specific problem was fixed in #75.

However, looking at the mangle table, it looks like the rules added by #75 are never reached, because Calico intercepts the packets and ACCEPTs them before the "CONNMARK restore" rule is triggered.

Here's what my rules in the mangle table look like:

Chain PREROUTING (policy ACCEPT 76 packets, 5405 bytes)
 pkts bytes target     prot opt in     out     source               destination
 6482   16M cali-PREROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:6gwbT8clXdHdC1b1 */
16982 6853K CONNMARK   all  --  eth0   *       0.0.0.0/0            0.0.0.0/0            /* AWS, primary ENI */ ADDRTYPE match dst-type LOCAL limit-in CONNMARK or 0x80
 7046   19M CONNMARK   all  --  eni+   *       0.0.0.0/0            0.0.0.0/0            /* AWS, primary ENI */ CONNMARK restore mask 0x80
Chain cali-PREROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
7921K   21G ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:6BJqBjBC7crtA-7- */ ctstate RELATED,ESTABLISHED
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:KX7AGNd6rMcDUai6 */ mark match 0x10000/0x10000
 207K   18M ACCEPT     all  --  eni+   *       0.0.0.0/0            0.0.0.0/0            /* cali:CSdpoDToedBYIZRl */
94920 6900K cali-from-host-endpoint  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:grcMAjdqFPVoXgMC */
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:cEp_17JkV0nk977D */ /* Host endpoint policy accepted packet. */ mark match 0x10000/0x10000

I looked at the TRACE for the return packet, and it gets terminated at cali-PREROUTING:1, so it never reaches the AWS rules.
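
For reference, this is roughly how the return packet can be traced (a sketch; the pod IP mirrors the example earlier in the thread, and depending on the iptables backend the trace appears in the kernel log or via xtables-monitor --trace):

# Trace return packets from the pod as they enter from the eni+ veth interfaces
iptables -t raw -A PREROUTING -i eni+ -s 100.122.199.244 -j TRACE
# ...inspect the trace, then remove the rule again
iptables -t raw -D PREROUTING -i eni+ -s 100.122.199.244 -j TRACE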


nickdgriffin commented Nov 30, 2018

@ikatson you can change the adapters in the AWS console or via the CLI, but you have to do it each time they are replaced - that's all I've done to fix the issue I was having.


ikatson commented Nov 30, 2018

@nickdgriffin yeah, that's how I tried your fix without actually compiling it - I just changed the ENIs in place. It does work! However, I think it's probably better to fix the root cause, so that return packets are routed through the same interface they came in on. That specific problem was fixed in #75, but in my setup at least, Calico network policy prevents that fix from working.

I should note, by the way, that I have AWS_VPC_K8S_CNI_EXTERNALSNAT=true. I had problems related to martian packets and rp_filter like you described when I had AWS_VPC_K8S_CNI_EXTERNALSNAT=false.
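
For reference, that flag is also just an env var on the CNI DaemonSet (a sketch; assumes the standard aws-node DaemonSet in kube-system):

# Check and/or set external SNAT on the CNI
kubectl -n kube-system set env daemonset/aws-node --list | grep EXTERNALSNAT
kubectl -n kube-system set env daemonset/aws-node AWS_VPC_K8S_CNI_EXTERNALSNAT=true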

jwalters-gpsw commented

We seem to be experiencing the same problem (we also have Calico), but I don't understand the workaround and whether it works with Calico (and with the default setting of AWS_VPC_K8S_CNI_EXTERNALSNAT). Can someone summarize?


ikatson commented Dec 8, 2018

@jwalters-gpsw I was able to fix (not just work around!) the Calico issue by setting this value in the Calico environment variables (it needs to be set on both calico-node and typha):

- name: FELIX_IPTABLESMANGLEALLOWACTION
  value: Return

The default value for this is "Accept", so by default Calico accepts the established packets and they stop traversing the mangle table.

Need to file a PR for that.
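
In practice this can be applied without editing the manifests by hand, e.g. (a sketch; resource names assume the stock Calico install in kube-system):

# Felix reads FELIX_-prefixed env vars; apply to both calico-node and typha
kubectl -n kube-system set env daemonset/calico-node FELIX_IPTABLESMANGLEALLOWACTION=Return
kubectl -n kube-system set env deployment/calico-typha FELIX_IPTABLESMANGLEALLOWACTION=Return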


greenboxal commented Jan 9, 2019

I'm having a similar problem here. A TCP ELB (LoadBalancer service) is getting a lot of retransmissions, duplicated packets, etc. This causes the health check to fail randomly and actual connections to drop, even established ones.

We're running 1.2.1 and we're not using Calico or anything else on the networking stack. The cluster was created with kops.

Do you think we're talking about the same issue here? I tried disabling the src/dst check on the secondary ENI, but nothing changed.

Let me know if I can collect any data in order to help.

Update:
I was running some packet captures and found this out:

Given a node A, with primary private IP address ABC.
Given a node B, with primary private IP address DEF.
Given a pod X, with IP address XYZ (associated with a secondary ENI through amazon-vpc-cni-k8s), running on node A.
Given a LoadBalancer service, that points to pod X and similar pods.
Given a NodePort created automatically by K8S so the LoadBalancer works.
Given an ELB, created by K8S, pointing to all instances on the created NodePort.

  1. The ELB tries to reach any instance in the cluster.
  2. The ELB picks node B and sends traffic to the NodePort.
  3. The NodePort is implemented as a DNAT rule, which forwards traffic to IP XYZ (pod X, node A).
  4. Node A receives the TCP segment on eth1 and processes the packet.
  5. Node A sends a reply, which leaves through eth0 with source IP ABC.

I don't think this is right: if the packet came in on eth1, it should return through eth1, right?

Should I create a new issue? Is it related?
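
For anyone trying to reproduce this, the asymmetry should be visible with two captures on node A (a sketch; XYZ and ABC stand for the placeholder addresses in the list above):

# The DNAT'ed segment from node B should arrive on eth1 (pod IP XYZ lives on the secondary ENI)
tcpdump -ni eth1 host XYZ
# A reply showing up here with source IP ABC confirms it is leaving via the wrong interface
# (ABC is the node's primary IP, so narrow the filter by port if this is too noisy)
tcpdump -ni eth0 host ABC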

nickdgriffin (Author) commented

It sounds similar in terms of behaviour, but if you aren't using Calico for network policies I don't think it can be the same - plus my issue was specifically sorted out by changing the src/dest check, and once it's in a release I'll be testing the fix in #263 too.

tabern modified the milestone: v1.5 (Mar 5, 2019)
tabern added the calico Calico integration issue label (Mar 5, 2019)

tustvold commented Mar 7, 2019

Not sure whether to open this as a separate issue, but the rule added to the routing policy database is subtly wrong, in that it doesn't account for additional bits that might be set in the fwmark - for example by Calico.

The rule is currently

from all fwmark 0x80 lookup main

When it should probably be

from all fwmark 0x80/0x80 lookup main
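
On an affected node the theory can be checked in place (a sketch; add a pref matching the original rule's priority from `ip rule show` if ordering matters):

# Match only the 0x80 bit rather than requiring the whole mark to equal 0x80
ip rule del from all fwmark 0x80 lookup main
ip rule add from all fwmark 0x80/0x80 lookup main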

Update

In fact, on closer inspection v1.3.0 works correctly, but master is currently broken. I guess that's what I get for running the bleeding edge, but I needed the ENIConfig changes...

Current released version - https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.3/pkg/networkutils/network.go#L220

Master - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/networkutils/network.go#L244

Happy to submit a PR to fix this.


tylux commented Apr 4, 2019

Cracked it.

So, the problem stems from the fact that the secondary interfaces that are added still have the source/destination check enabled which must result in the ENI dropping the return packets from the pod. This can be proven by disabling the check on the ENI that the pod has an IP allocated on, and connections succeed.

I will be submitting a PR to disable the check when ENIs are allocated.

This helped with the issue I was getting through my ELBs. I would get random high response times to my services when using load balancer protocol TCP/SSL (for websockets); as soon as I disabled the source/dest check on ALL ENIs, the problem went away.

I would switch to NLBs, but I'm waiting for K8s to support attaching certs to NLBs.

mogren closed this as completed Sep 11, 2019