This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

1235.8.0 update has intermittent Docker bridge network #1785

Closed
rochacon opened this issue Feb 1, 2017 · 11 comments

Comments

@rochacon

rochacon commented Feb 1, 2017

Issue Report

After upgrading to 1235.8.0, ~10% of the spawned Docker containers don't have a working network (bridge mode).

Bug

CoreOS Version

$ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1235.8.0
VERSION_ID=1235.8.0
BUILD_ID=2017-01-31-0800
PRETTY_NAME="Container Linux by CoreOS 1235.8.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Expected Behavior

All containers should have a working bridge network.

Actual Behavior

~10% of Docker containers boot with a non-working bridge network (--net host works 100% of the time).

Reproduction Steps

  1. Boot a new instance, possibly without any cloud-config.
  2. Run the following loop and watch the number of success and fail outputs (a tallying variant is sketched below):
    while true; do docker run --rm alpine wget google.com 2>/dev/null && echo success || echo fail ; done
  3. On 1235.6.0 the same one-liner does not raise any network errors.
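For convenience, a bounded variant of the loop in step 2 that tallies the results at the end (the iteration count of 50 is arbitrary):

    for i in $(seq 1 50); do
      docker run --rm alpine wget google.com 2>/dev/null && echo success || echo fail
    done | sort | uniq -c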

Other Information

I turned Docker debug mode on, but I did not find any info that would clarify the issue.

Environment

What hardware/cloud provider/hypervisor is being used to run CoreOS?

I was able to reproduce this on AWS instances (m4.large and t2.small) with the HVM AMI.
I was also able to reproduce it on a DigitalOcean droplet.

Other Information

Please point out any other info I could extract from the instance.

Sample outputs:

# DigitalOcean
core@coreos-2gb-nyc3-01 ~ $ while true; do docker run --rm alpine wget google.com 2>/dev/null && echo success || echo fail ; done | tee status
# ... truncated output ...

core@coreos-2gb-nyc3-01 ~ $ cat status | sort | uniq -c
      8 fail
    134 success
# AWS
core@ip-10-162-12-77 ~ $ while true; do docker run --rm alpine wget google.com 2>/dev/null && echo success || echo fail ; done | tee status
# ... truncated output ...

core@ip-10-162-12-77 ~ $ cat status | sort | uniq -c
     19 fail
     41 success

I've replaced those machines several times; all of them were affected.

@kpettijohn

I can confirm this is happening on one of my CoreOS hosts right after receiving the 1235.8.0 update.
Running CoreOS with KVM/QEMU.

Here is the output from ip addr on a working container.

133: vethc1f8b37@if132: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 9e:21:a3:54:79:5e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::9c21:a3ff:fe54:795e/64 scope link
       valid_lft forever preferred_lft forever

Here is the output from a container with failed networking.

169: veth9a1b35b@if168: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 4e:7a:93:e8:65:43 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::4c7a:93ff:fee8:6543/64 scope link
       valid_lft forever preferred_lft forever

As you can see, it's missing master docker0.
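A quick way to list which veth interfaces are actually enslaved to the bridge (the veth of a broken container should be absent from this list) is something like:

    # interfaces currently attached to the docker0 bridge
    ip -o link show master docker0
    # or, equivalently
    bridge link show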

Here is output from ip monitor when starting a container that ends up with failed networking.

226: vethf530543@NONE: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop master docker0 state DOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop master docker0 state DOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
dev vethe17d70e lladdr f2:76:58:fa:5f:f3 PERMANENT
dev vethe17d70e lladdr f2:76:58:fa:5f:f3 PERMANENT
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 master docker0 state DOWN
    link/ether f2:76:58:fa:5f:f3
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue master docker0 state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 master docker0 state LOWERLAYERDOWN
    link/ether f2:76:58:fa:5f:f3
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue master docker0 state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 master docker0 state LOWERLAYERDOWN
    link/ether f2:76:58:fa:5f:f3
Deleted 227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 master docker0 state LOWERLAYERDOWN
    link/ether f2:76:58:fa:5f:f3
delete dev vethe17d70e lladdr f2:76:58:fa:5f:f3 PERMANENT
delete dev vethe17d70e lladdr f2:76:58:fa:5f:f3 PERMANENT
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue master docker0 state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@vethf530543: <NO-CARRIER,BROADCAST,MULTICAST,UP,M-DOWN> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
226: vethf530543@vethe17d70e: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
ff00::/8 dev vethf530543  table local  metric 256
fe80::/64 dev vethf530543  proto kernel  metric 256
226: vethf530543@vethe17d70e: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
226: vethf530543@vethe17d70e: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
ff00::/8 dev vethe17d70e  table local  metric 256
fe80::/64 dev vethe17d70e  proto kernel  metric 256
227: vethe17d70e@vethf530543: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
226: vethf530543@vethe17d70e: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
delete ff02::16 dev vethf530543 lladdr 33:33:00:00:00:16 NOARP
Deleted fe80::/64 dev vethf530543  proto kernel  metric 256
Deleted ff00::/8 dev vethf530543  table local  metric 256
Deleted 226: vethf530543    inet6 fe80::801b:68ff:fe9b:ce8b/64 scope link tentative
       valid_lft forever preferred_lft forever
Deleted 226: vethf530543@vethe17d70e: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
    link/ether 82:1b:68:9b:ce:8b brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@if226: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e@if226: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether f2:76:58:fa:5f:f3 brd ff:ff:ff:ff:ff:ff
227: vethe17d70e    inet6 fe80::f076:58ff:fefa:5ff3/64 scope link
       valid_lft forever preferred_lft forever

Let me know if you would like any additional information.

@crawford
Contributor

crawford commented Feb 1, 2017

We have identified the issue (#1638 resurfaced) and have paused the automatic updates for now.

@dm0-

dm0- commented Feb 2, 2017

Systems should begin updating to the fixed 1235.9.0 over the next day or so.

@dm0- closed this as completed Feb 2, 2017
@rochacon
Author

rochacon commented Feb 2, 2017

I've just launched a new node with 1235.9.0 and can confirm the issue is gone. Thank you @dm0- for the quick response and fix.

@pctj101

pctj101 commented Feb 6, 2017

This was a relatively "big" bug, knocking several of my servers offline. I didn't see any mention of 1235.8.0 having network issues for containers on the blog, website, or releases page.

Is there any place we "broadcast" such priority/P0 problems when they arise? It could be helpful as a "first check".

@crawford
Contributor

crawford commented Feb 7, 2017

We sent out messages to both the dev and user mailing lists and added notes to both the 1313.0.0 and 1312.0.0 release notes. I guess we could have put out a blog post.

@dm0-

dm0- commented Feb 7, 2017

Those announcements were for verity in alpha. This bug is about Docker in stable. I believe this was announced in the release notes but not on the mailing lists.

@crawford
Contributor

crawford commented Feb 7, 2017

Ugh, sorry. Listen to @dm0-.

@remh

remh commented Feb 13, 2017

The issue is still present on the beta channel:

cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1298.3.0
VERSION_ID=1298.3.0
BUILD_ID=2017-02-02-0148
PRETTY_NAME="Container Linux by CoreOS 1298.3.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
while true; do docker run --rm alpine wget google.com 2>/dev/null && echo success || echo fail ; done
success
fail
fail
fail
fail
success
fail
fail
fail
success
success
success
success
fail
success
success
fail
success

@dm0-

dm0- commented Feb 13, 2017

@remh Start a container and leave it running with e.g. docker run --rm -it busybox ash -l, then run networkctl status $veth for its veth interfaces in another terminal. What is the matching .network file shown in the output? It should say 90-docker-veth.network in beta; otherwise, you have custom networkd configuration files causing the issue.
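Roughly, something like this (vethf530543 is just an example name taken from the output above; use whatever networkctl list shows for your container):

    # terminal 1: keep a test container running
    docker run --rm -it busybox ash -l

    # terminal 2: find the host-side veth and see which .network file matched it
    networkctl list | grep veth
    networkctl status vethf530543    # on beta this should reference 90-docker-veth.network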

@remh

remh commented Feb 13, 2017

Thanks, that was indeed the issue.
Not sure how I got into that state, though. I updated my cloud-config to match the docker0 interface and that fixed the issue.
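For anyone else who runs into this: the general shape of the fix is to make sure any custom networkd unit delivered via cloud-config has a [Match] section narrow enough that it cannot grab the Docker veth interfaces. A rough sketch (eth0 and the unit name are just examples, not my exact configuration):

    #cloud-config
    coreos:
      units:
        - name: 00-eth0.network
          runtime: true
          content: |
            [Match]
            Name=eth0

            [Network]
            DHCP=yes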

Thanks again!
