This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

1235.8.0 update has intermittent Docker bridge network #1785

Closed
rochacon opened this issue Feb 1, 2017 · 11 comments

Comments

@rochacon

rochacon commented Feb 1, 2017

Issue Report

After upgrading to 1235.8.0, ~10% of the spawned Docker containers don't have a working network (bridge mode).

Bug

CoreOS Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1235.8.0
VERSION_ID=1235.8.0
BUILD_ID=2017-01-31-0800
PRETTY_NAME="Container Linux by CoreOS 1235.8.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Expected Behavior

All containers should have a working bridge network.

Actual Behavior

~10% of Docker containers boot with a non-working bridge network (--net host works 100% of the time).

Reproduction Steps

  1. Boot a new instance, possibly without any cloud-config.
  2. Run the following loop and watch the number of success and fail outputs (a tallying variant is sketched below):
    while true; do docker run --rm alpine wget google.com 2>/dev/null && echo success || echo fail ; done
  3. On 1235.6.0 the same one-liner does not raise any network errors.
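For convenience, a bounded variant of the loop in step 2 that tallies the results at the end (the iteration count of 50 is arbitrary):

    for i in $(seq 1 50); do
      docker run --rm alpine wget google.com 2>/dev/null && echo success || echo fail
    done | sort | uniq -c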

Other Information

I turned Docker debug mode on, but I did not find any info that would clarify the issue.

Environment

What hardware/cloud provider/hypervisor is being used to run CoreOS?

I was able to reproduce this on AWS instances (m4.large and t2.small) with the HVM AMI.
I was also able to reproduce it on a DigitalOcean droplet.

Other Information

Please point out any other info I could extract from the instance.

Sample outputs:

# DigitalOcean
core@coreos-2gb-nyc3-01 ~ $ while true; do docker run --rm alpine wget google.com 2>/dev/null && echo success || echo fail ; done | tee status
# ... truncated output ...

core@coreos-2gb-nyc3-01 ~ $ cat status | sort | uniq -c
      8 fail
    134 success
# AWS
core@ip-10-162-12-77 ~ $ while true; do docker run --rm alpine wget google.com 2>/dev/null && echo success || echo fail ; done | tee status
# ... truncated output ...

core@ip-10-162-12-77 ~ $ cat status | sort | uniq -c
     19 fail
     41 success

I've replaced those machines several times; all of them were affected.

@kpettijohn

I can confirm this is happening on one of my CoreOS hosts right after receiving the 1235.8.0 update.
Running CoreOS with KVM/QEMU.

Here is the output from ip addr on a working container.

133: vethc1f8b37@if132: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 9e:21:a3:54:79:5e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9c21:a3ff:fe54:795e/64 scope link
       valid_lft forever preferred_lft forever

Here is the output from a container with failed networking.

169: veth9a1b35b@if168: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 4e:7a:93:e8:65:43 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::4c7a:93ff:fee8:6543/64 scope link
       valid_lft forever preferred_lft forever

As you can see, it's missing master docker0.
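A quick way to list which veth interfaces are actually enslaved to the bridge (the veth of a broken container should be absent from this list) is something like:

    # interfaces currently attached to the docker0 bridge
    ip -o link show master docker0
    # or, equivalently
    bridge link show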

Here is output from ip monitor when starting a container that ends up with failed networking.

226: vethf530543@NONE: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop master docker0 state DOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop master docker0 state DOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
dev vethe17d70e lladdr f2:76:58:fa:5f:f3 PERMANENT
dev vethe17d70e lladdr f2:76:58:fa:5f:f3 PERMANENT
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 master docker0 state DOWN
    link/ether f2:76:58:fa:5f:f3
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue master docker0 state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 master docker0 state LOWERLAYERDOWN
    link/ether f2:76:58:fa:5f:f3
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue master docker0 state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 master docker0 state LOWERLAYERDOWN
    link/ether f2:76:58:fa:5f:f3
Deleted 227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 master docker0 state LOWERLAYERDOWN
    link/ether f2:76:58:fa:5f:f3
delete dev vethe17d70e lladdr f2:76:58:fa:5f:f3 PERMANENT
delete dev vethe17d70e lladdr f2:76:58:fa:5f:f3 PERMANENT
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue master docker0 state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
226: vethf530543@vethe17d70e: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
ff00::/8 dev vethf530543  table local  metric 256
fe80::/64 dev vethf530543  proto kernel  metric 256
226: vethf530543@vethe17d70e: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
226: vethf530543@vethe17d70e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
ff00::/8 dev vethe17d70e  table local  metric 256
fe80::/64 dev vethe17d70e  proto kernel  metric 256
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
226: vethf530543@vethe17d70e: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
delete ff02::16 dev vethf530543 lladdr 33:33:00:00:00:16 NOARP
Deleted fe80::/64 dev vethf530543  proto kernel  metric 256
Deleted ff00::/8 dev vethf530543  table local  metric 256
Deleted 226: vethf530543    inet6 fe80::801b:68ff:fe9b:ce8b/64 scope link tentative
       valid_lft forever preferred_lft forever
Deleted 226: vethf530543@vethe17d70e: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@if226: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@if226: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e    inet6 fe80::f076:58ff:fefa:5ff3/64 scope link
       valid_lft forever preferred_lft forever

Let me know if you would like any additional information.

@crawford
Contributor

crawford commented Feb 1, 2017

We have identified the issue (#1638 resurfaced) and have paused the automatic updates for now.

@dm0-

dm0- commented Feb 2, 2017

Systems should begin updating to the fixed 1235.9.0 over the next day or so.

@dm0- closed this as completed Feb 2, 2017
@rochacon
Author

rochacon commented Feb 2, 2017

I've just launched a new node with 1235.9.0 and can confirm the issue is gone. Thank you @dm0- for the quick response and fix.

@pctj101

pctj101 commented Feb 6, 2017

This was a relatively "big" bug, knocking several of my servers offline. I didn't see any mention of 1235.8.0 having network issues for containers on the blog, website, or releases page.

Is there any place we "broadcast" such priority/P0 problems when they arise? It could be helpful as a "first check".

@crawford
Contributor

crawford commented Feb 7, 2017

We sent out messages to both the dev and user mailing lists and added notes to both the 1313.0.0 and 1312.0.0 release notes. I guess we could have put out a blog post.

@dm0-

dm0- commented Feb 7, 2017

Those announcements were for verity in alpha. This bug is about Docker in stable. I believe this was announced in the release notes but not on the mailing lists.

@crawford
Contributor

crawford commented Feb 7, 2017

Ugh, sorry. Listen to @dm0-.

@remh

remh commented Feb 13, 2017

The issue is still present on the beta channel:

cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1298.3.0
VERSION_ID=1298.3.0
BUILD_ID=2017-02-02-0148
PRETTY_NAME="Container Linux by CoreOS 1298.3.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
while true; do docker run --rm alpine wget google.com 2>/dev/null && echo success || echo fail ; done
success
fail
fail
fail
fail
success
fail
fail
fail
success
success
success
success
fail
success
success
fail
success

@dm0-

dm0- commented Feb 13, 2017

@remh Start a container and leave it running with e.g. docker run --rm -it busybox ash -l, then run networkctl status $veth for its veth interfaces in another terminal. What is the matching .network file shown in the output? It should say 90-docker-veth.network in beta; otherwise, you have custom networkd configuration files causing the issue.
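Roughly, something like this (vethf530543 is just an example name taken from the output above; use whatever networkctl list shows for your container):

    # terminal 1: keep a test container running
    docker run --rm -it busybox ash -l

    # terminal 2: find the host-side veth and see which .network file matched it
    networkctl list | grep veth
    networkctl status vethf530543    # on beta this should reference 90-docker-veth.network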

@remh

remh commented Feb 13, 2017

Thanks, that was indeed the issue.
Not sure how I got into that state, though. I updated my cloud-config to match the docker0 interface and that fixed the issue.
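For anyone else who runs into this: the general shape of the fix is to make sure any custom networkd unit delivered via cloud-config has a [Match] section narrow enough that it cannot grab the Docker veth interfaces. A rough sketch (eth0 and the unit name are just examples, not my exact configuration):

    #cloud-config
    coreos:
      units:
        - name: 00-eth0.network
          runtime: true
          content: |
            [Match]
            Name=eth0

            [Network]
            DHCP=yes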

Thanks again!
