This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

No networking in some docker containers when spawned at high rate #1638

Closed
rawouter opened this issue Oct 31, 2016 · 4 comments · Fixed by coreos/coreos-overlay#2300

Comments

@rawouter

Issue Report

Bug

When spawning Docker containers at a high rate in bridge network mode, some containers end up with no network connectivity. Coming from moby/moby#27808, where I was asked to report the bug against CoreOS.

CoreOS Version

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1192.2.0
VERSION_ID=1192.2.0
BUILD_ID=2016-10-21-0026
PRETTY_NAME="CoreOS 1192.2.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

VM running on VMware

Steps to reproduce the issue:

  1. Create a network with a /16 address space, in bridge mode
  2. Spawn containers at a high rate and try to access the network (ping)

I can reproduce the issue with the following script; some container images work better but still fail at some point. Also, I cannot reproduce the issue in host network mode.

for num in {1..300}
do
        docker run --network taskers --rm  ubuntu:14.04 sh -c "ping -c 1 173.36.21.105; arp -n; ifconfig" | tee output_$num.txt &
done
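
As a rough check (not part of the run above), the failing runs can be counted from the saved outputs, assuming the failures log "Destination Host Unreachable" as in the outputs below:

# List the output files containing the unreachable error and count them
grep -l "Destination Host Unreachable" output_*.txt | wc -l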

Describe the results you received:

About 1% of the containers fail with no networking.
Here are some command outputs taken from the script above:

PING 173.36.21.105 (173.36.21.105) 56(84) bytes of data.
From 173.77.2.18 icmp_seq=1 Destination Host Unreachable

--- 173.36.21.105 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

Address                  HWtype  HWaddress           Flags Mask            Iface
173.77.0.1                       (incomplete)                              eth0

eth0      Link encap:Ethernet  HWaddr 02:42:ad:4d:02:12
          inet addr:173.77.2.18  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:adff:fe4d:212/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:5 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:438 (438.0 B)  TX bytes:1072 (1.0 KB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:112 (112.0 B)  TX bytes:112 (112.0 B)


Describe the results you expected:

No network failure: ping should go through and ARP should resolve.

Additional information you deem important (e.g. issue happens only occasionally):

The issue only occurs for about 1% of the pods when the system is under load, spawning and deleting lots of containers.
It doesn't occur in host network mode.
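
For reference, a host-network variant of the same test looks roughly like this (a sketch, not the exact command used):

docker run --network host --rm ubuntu:14.04 sh -c "ping -c 1 173.36.21.105"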

Output of docker version:

$ docker version
Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   7a86f89
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   7a86f89
 Built:
 OS/Arch:      linux/amd64

Output of docker info:

$ docker info
Containers: 284
 Running: 113
 Paused: 0
 Stopped: 171
Images: 30
Server Version: 1.12.1
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: selinux
Kernel Version: 4.7.3-coreos-r1
Operating System: CoreOS 1192.2.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.43 GiB
Name: pink-node-01
ID: D6TN:CC4O:MEWN:5IY4:ZOIO:OGKZ:AWPG:ZN5B:7EM6:J3JM:U6G2:6E3M
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8
@dm0-

dm0- commented Oct 31, 2016

I've run this a few times to ping the gateway with the default bridge network driver, and there have been no ping failures.

for ((i=0; i<200; i++)) ; do (docker run --rm busybox ping -c 1 172.17.0.1 &> fail.$i && rm -f fail.$i) & done
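
Here a successful ping deletes its fail.$i file, so any leftover files mark the failed runs. They can be counted afterwards with something like:

ls fail.* 2>/dev/null | wc -l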

How did you set up your custom network? Can you confirm whether the failed containers are attached to the bridge? (For example, ip link should show master docker0 on the veth interface.)
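
One way to check that from the host, as a rough sketch (using the "taskers" network name from the reproduction script; the actual bridge name on the host may differ):

# Start a throwaway container on the custom network
CID=$(docker run -d --network taskers ubuntu:14.04 sleep 300)
# eth0's iflink inside the container is the ifindex of the host-side veth peer
PEER=$(docker exec "$CID" cat /sys/class/net/eth0/iflink)
# The matching host interface should show "master <bridge>" if it is attached
ip link | grep "^$PEER: "
docker rm -f "$CID"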

@rawouter
Author

rawouter commented Oct 31, 2016

It happens that the node I used for the repro had just rebooted, and I couldn't reproduce the issue easily. I had to re-run the script several times (maybe 5000 container runs) before I saw the first failures.

Nevertheless, here is the ip link output from a container that failed:

PING 173.36.21.105 (173.36.21.105) 56(84) bytes of data.
From 173.77.4.23 icmp_seq=1 Destination Host Unreachable

--- 173.36.21.105 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms


1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
8274: eth0@if8275: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ad:4d:04:17 brd ff:ff:ff:ff:ff:ff

Here is one that worked:

PING 173.36.21.105 (173.36.21.105) 56(84) bytes of data.
64 bytes from 173.36.21.105: icmp_seq=1 ttl=61 time=3.71 ms

--- 173.36.21.105 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.716/3.716/3.716/0.000 ms


8192: eth0@if8193: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ad:4d:03:ed brd ff:ff:ff:ff:ff:ff
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

@dm0-

dm0- commented Nov 2, 2016

Okay, I eventually managed to reproduce this with a fresh CoreOS system. However, when using an image built with a fix I've already proposed to systemd, I could not reproduce the issue (after nearly 20,000 containers spawned). We are waiting on upstream to decide on the configuration option to use in systemd/systemd#4228, but we can backport it to fix this issue when a decision is made.
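
Assuming the underlying problem is systemd-networkd touching the host-side veth interfaces (my reading of the proposed fix, not something spelled out here), a rough way to sanity-check a fixed image is to confirm that networkd reports those interfaces as unmanaged:

networkctl list | grep veth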

@crawford crawford added this to the CoreOS Alpha 1263.0.0 milestone Dec 2, 2016
@dm0-

dm0- commented Dec 3, 2016

The proposed option was merged in upstream systemd, and we are going to backport it to our current systemd versions at coreos/systemd#73.
