
Kubernetes service IP / container DNAT broken in 1262.0.0 #1743

Closed
bison opened this issue Jan 2, 2017 · 11 comments

@bison

bison commented Jan 2, 2017

Issue Report

Under Kubernetes, pod-to-pod communication via a service IP within a
single node is broken in the latest CoreOS alpha (1262.0.0). Downgrading
to the previous alpha resolves the issue. The issue is not specific to
Kubernetes, however.

Bug

CoreOS Version

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1262.0.0
VERSION_ID=1262.0.0
BUILD_ID=2016-12-14-2334
PRETTY_NAME="CoreOS 1262.0.0 (Ladybug)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

Observed on both bare metal and Vagrant + VirtualBox locally.

Expected Behavior

In Kubernetes, I can define a service which targets a set of pods and
makes them reachable via a virtual IP. With kube-proxy running in
iptables mode, Kubernetes will configure NAT rules to redirect traffic
destined for that virtual IP to the individual pods. That should work
for traffic originating from any node in the cluster.
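
For reference, a minimal sketch of the kind of DNAT redirection described
above (this is not the actual chain structure kube-proxy builds, and the
pod IP 172.17.0.2 is an assumption):

# Locally generated traffic (host -> service IP) is matched in OUTPUT;
# traffic from containers arriving via the bridge is matched in PREROUTING.
sudo iptables -t nat -A PREROUTING -d 10.3.0.100/32 -p tcp --dport 8080 \
  -j DNAT --to-destination 172.17.0.2:8080
sudo iptables -t nat -A OUTPUT -d 10.3.0.100/32 -p tcp --dport 8080 \
  -j DNAT --to-destination 172.17.0.2:8080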

Actual Behavior

With a cluster running on the latest alpha (1262.0.0), pods cannot be
reached via their service IP when the traffic originates from another
pod on the same node as the destination.

The following does work on the same node:

  • Host to service IP
  • Pods running in the host net namespace to service IP

Reproduction Steps

This isn't actually specific to Kubernetes. I have an alpha-nat
branch of coreos-vagrant that will start a VM with a Docker container
and iptables rules similar to what Kubernetes uses, along with trace
rules for debugging.

Check out that branch and run either ./start-broken.sh or
./start-working.sh, then vagrant ssh into the VM and run the
following:

# Works from host
core@core-01 ~ $ curl http://10.3.0.100:8080
CLIENT VALUES:
client_address=10.0.2.15
...


# Fails from another container on broken version
core@core-01 ~ $ docker run --rm busybox wget -O- -T5 http://10.3.0.100:8080
Connecting to 10.3.0.100:8080 (10.3.0.100:8080)
wget: download timed out
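
The branch also installs trace rules; equivalent rules can be added and
watched by hand roughly like this (the exact match is an assumption, and
the TRACE target / kernel log support needs to be available):

# Trace packets destined for the service IP as they traverse the netfilter
# chains; the output shows up in the kernel log.
sudo iptables -t raw -A PREROUTING -d 10.3.0.100 -p tcp --dport 8080 -j TRACE
sudo iptables -t raw -A OUTPUT -d 10.3.0.100 -p tcp --dport 8080 -j TRACE
sudo journalctl -kf | grep 'TRACE:'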

You could also use cloud-config in the user-data file on that branch
on any other platform.

Another option is to launch a Kubernetes cluster with a single
schedulable node and start a pod with an accompanying service. Other
pods on the same node will not be able to communicate using the
service IP. I've been using this for testing.
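
A rough sketch of that kind of setup using kubectl (the image and names
here are illustrative assumptions, not exact manifests):

# Run an echo server, expose it behind a service IP, then hit that IP from
# another pod on the same (single schedulable) node.
kubectl run echoheaders --image=gcr.io/google_containers/echoserver:1.4 --port=8080
kubectl expose deployment echoheaders --port=8080
kubectl get svc echoheaders        # note the cluster (service) IP
kubectl run probe --rm -it --restart=Never --image=busybox -- \
  wget -O- -T5 http://<service-ip>:8080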

Other Information

Starting the same echoheaders container under rkt with the default
ptp networking and configuring similar NAT rules seems to work as
expected from other containers, so this might only be happening when
attaching containers to a bridge.

coreos/coreos-overlay#2300 landed in 1262.0.0. I tried not marking
the interfaces as unmanaged, using overrides in /etc/systemd/network/, but
it didn't seem to help.

@martynd

martynd commented Jan 3, 2017

I encountered the same issue doing a straight upgrade from 1185.2.0 to 1262.0.0.

Everything worked perfectly except the aforementioned service connectivity issues from within containers (host machine to service, other machine to service, direct IP from a container, etc. all worked).

Rolling back to 1185.2.0 worked after deleting /var/lib/docker/network/files/local-kv.db.

@bison
Author

bison commented Jan 4, 2017

If I didn't screw up the git bisect, I think this was introduced in torvalds/linux@e3b37f1. That patch seems to have caused a few issues which have since been fixed. The tip of master, 4.10.0-rc2-0f64df3, is working as expected for me.
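
For context, a kernel bisect for this kind of regression looks roughly like
the following (the good/bad points are placeholders for the kernels in the
working and broken images, not the exact tags used):

# Bisect the kernel between a known-good and known-bad build, re-running the
# container -> service IP repro on each candidate kernel.
git bisect start
git bisect bad  v4.9     # placeholder: kernel from the broken image
git bisect good v4.8     # placeholder: kernel from the working image
# build and boot the suggested commit, run the repro, then mark it:
git bisect good          # or: git bisect bad
# repeat until git reports the first bad commit (here: e3b37f1)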

@crawford
Contributor

crawford commented Jan 5, 2017

Should be fixed by coreos/coreos-overlay#2353.

@crawford
Contributor

This is still present in 4.9.3.

@Raffo

Raffo commented Feb 20, 2017

It looks like we are still experiencing this issue with kernel 4.9.9 on CoreOS alpha 1325.0.0.

@crawford
Contributor

/cc @bgilbert

crawford reopened this Feb 20, 2017
@bgilbert
Contributor

@Raffo If you run

sudo iptables -P FORWARD ACCEPT

on the host, does that fix the issue for you?
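
If it helps, the current policy can be checked first:

sudo iptables -S FORWARD | head -1    # prints e.g. "-P FORWARD ACCEPT" or "-P FORWARD DROP"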

@Raffo

Raffo commented Feb 21, 2017

I didn't try that, and I still can, but it looks unrelated. We solved the issue with a configuration change: removing the "--iptables=false" flag from the Docker daemon settings "fixed" the problems. The effect we were seeing before was a missing NAT on the response coming from the pod running on the same host.
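
A couple of quick checks related to that flag (commands only; where the flag
is configured is setup-specific):

pgrep -a dockerd                  # shows the full dockerd command line, including --iptables=false if set
sudo iptables -t nat -S DOCKER    # with --iptables left at its default, Docker maintains this chain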

@bgilbert
Contributor

@Raffo --iptables=false shouldn't be necessary, especially if you're using a network plugin such as kubenet or CNI. It sounds as though what you're seeing is not the kernel bug reported in this issue, so I'll close. If you believe you're seeing incorrect behavior from Container Linux, please open a new issue.

@Raffo

Raffo commented Feb 21, 2017

I agree on closing here.
To clarify, we removed --iptables=false, which means it's true by default. With the setting to false, the networking is broken. I suspect the issue includes also flannel, in which repository do you think I should open it?

@bgilbert
Contributor

@Raffo Whenever you're unsure where an issue belongs, create it here.
