
Network connectivity is quite flaky with some parallel outgoing connections #561

Closed
fingon opened this issue Jan 17, 2022 · 23 comments

@fingon

fingon commented Jan 17, 2022

lima 0.8.1, default networking setup:

When something that makes a bunch of connections per second is running in another window, fetching e.g. www.google.com fails or is quite slow. In practice, lots of apps report "network unreachable". It is not a matter of bandwidth (the connections do not transfer much). Switching DNS to useHostResolver: false did not help.

mstenber@lima-f34 ~>curl https://www.google.com -o ,x
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15246    0 15246    0     0    737      0 --:--:--  0:00:20 --:--:--  3168

mstenber@lima-f34 ~>curl https://www.google.com -o ,x
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15179    0 15179    0     0  26583      0 --:--:-- --:--:-- --:--:-- 26583
^ note ~instant fetch without connections going on

mstenber@lima-f34 ~>curl https://www.google.com -o ,x
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15187    0 15187    0     0    955      0 --:--:--  0:00:15 --:--:--  3342
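
For reference, one hypothetical way to generate this kind of parallel-connection load in the other window (a sketch only, not the exact workload from this report; example.com is a placeholder target):

# open a short-lived outgoing TCP connection roughly every 100 ms
while true; do
  curl -s -o /dev/null https://example.com &
  sleep 0.1
done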
@afbjorklund
Member

afbjorklund commented Jan 17, 2022

You can try whether using VDE helps:

https://github.com/lima-vm/vde_vmnet

It seems that slirp is worse on Mac.

@fingon
Author

fingon commented Jan 17, 2022

I had similar (though not identical) issues with podman machine too, so I suppose it might even be a qemu issue of some kind. I'll try vmnet; it looked a bit unnecessary (all I really need is container -> outside-world connectivity), but if it works better..

@afbjorklund
Member

afbjorklund commented Jan 17, 2022

Hmm, podman machine uses gvproxy.

But maybe it still uses slirp for internet DNS?

@fingon
Author

fingon commented Jan 17, 2022

gvproxy seems to perform much worse in terms of speed (that's why I switched to lima to start with); the weird part is that it is most likely not (only) DNS, since the result seems to be "network unreachable" from TCP connect().

@fingon
Author

fingon commented Jan 17, 2022

It seems to work better with vde. Initially it didn't, though, as the slirp route was still the preferred default route:

mstenber@lima-f34 ~>ip route
default via 192.168.5.2 dev eth0 proto dhcp metric 100 
default via 192.168.105.1 dev lima0 proto dhcp metric 101 
192.168.5.0/24 dev eth0 proto kernel scope link src 192.168.5.15 metric 100 
192.168.105.0/24 dev lima0 proto kernel scope link src 192.168.105.2 metric 101 

I needed to run the following:

mstenber@lima-f34 ~>sudo ip route delete default via 192.168.5.2

The configuration I had was:

networks:
  - lima: shared
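
An alternative sketch that avoids deleting the slirp route on every boot: add a default route via the vmnet gateway with a lower metric (the metric value 50 is arbitrary; this assumes iproute2 in the guest):

# prefer the lima0 (vmnet) gateway by adding a lower-metric default route
sudo ip route replace default via 192.168.105.1 dev lima0 metric 50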

@afbjorklund
Member

afbjorklund commented Jan 17, 2022

Maybe VDE should be packaged for Linux as well? https://wiki.alienbase.nl/doku.php?id=slackware:vde

There is the root requirement, but perhaps it can be narrowed down to a couple of sudo or suid invocations.

@AkihiroSuda
Member

Maybe VDE should be packaged for Linux as well ?

No, because vde_vmnet doesn't work on Linux.

For Linux we should support TAP with qemu-ifup/qemu-ifdown.
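
For illustration, a minimal qemu-ifup sketch (QEMU invokes the script with the TAP interface name as $1; the bridge name br0 is a placeholder for a pre-configured host bridge):

#!/bin/sh
# bring up the TAP device QEMU created and attach it to a host bridge
ip link set "$1" up
ip link set "$1" master br0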

@afbjorklund
Member

afbjorklund commented Jan 18, 2022

I meant VDE (not vde_vmnet), but I think it uses the same TUN/TAP underneath.

But it would be more for performance, not as much for stability.

@jandubois
Member

Much to my surprise VDE is quite slow; in my testing it was 7 times slower than the slirp user-mode networking in qemu. And that was after fixing a bug in the vde_bridge code; before that fix it was almost 350 times slower. Timings are in virtualsquare/vde-2#35.

So it might be better to do TUN/TAP directly without going through VDE, but that is just a guess until we benchmark it...

@afbjorklund
Member

afbjorklund commented Jan 18, 2022

I'm trying to remember what libvirt uses; I just know it gets two eth interfaces.

That in turn is mostly legacy from the VirtualBox setup: NAT + Host-Only.


EDIT:

-netdev tap,fd=35,id=hostnet0,vhost=on,vhostfd=36
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:a4:40:74,bus=pci.0,addr=0x2
-netdev tap,fd=37,id=hostnet1,vhost=on,vhostfd=38
-device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:2e:70:73,bus=pci.0,addr=0x3
qemu-syst 379204 libvirt-qemu   35u      CHR             10,200       0t120     140 /dev/net/tun
qemu-syst 379204 libvirt-qemu   36u      CHR             10,238         0t0     479 /dev/vhost-net
qemu-syst 379204 libvirt-qemu   37u      CHR             10,200       0t120     140 /dev/net/tun
qemu-syst 379204 libvirt-qemu   38u      CHR             10,238         0t0     479 /dev/vhost-net

Note: libvirt requires root, so it can set up all kinds of virtual bridges on the host.

@fingon
Author

fingon commented Jan 18, 2022

A very non-scientific benchmark, but yeah, the VDE performance is pretty bad (most recent master built for the vde* components):

slirp:

Fedora-Workstation-Live-aarch64-35-1.2.iso    100% 1831MB 343.3MB/s   00:05    

vde:

Fedora-Workstation-Live-aarch64-35-1.2.iso    100% 1831MB  31.6MB/s   00:57    

(Both of these are to the host machine, so no real network was harmed during the test)

@abiosoft

abiosoft commented Feb 13, 2022

I only get the desired behaviour on Lima with nerdctl running at user level, even though the host has the same issue.

Every other permutation of the following attempts has also reproduced the behaviour:

  • nerdctl at system level
  • docker
  • host-resolver on/off
  • Alpine and Ubuntu as distros
  • with/without vde_vmnet

I do notice a difference in nameservers between user-level and system-level nerdctl.

$ nerdctl run --rm -it alpine -- cat /etc/resolv.conf
nameserver 10.0.2.3
nameserver 10.0.2.3

# nerdctl run --rm -it alpine -- cat /etc/resolv.conf
nameserver 192.168.105.1
nameserver 192.168.5.3

@jandubois
Member

I do notice a difference in nameservers between user-level and system-level nerdctl.

$ nerdctl run --rm -it alpine -- cat /etc/resolv.conf
nameserver 10.0.2.3
nameserver 10.0.2.3

# nerdctl run --rm -it alpine -- cat /etc/resolv.conf
nameserver 192.168.105.1
nameserver 192.168.5.3

I don't actually know how containerd configures DNS (inside k3s it uses coredns; not sure how it works for user mode containerd).

The entries for system mode containerd look suspicious: 192.168.105.1 looks like a nameserver on the local network. How did that make it into the VM? Did you add it via dns in lima.yaml, or did you disable the host resolver?

192.168.5.3 is either a qemu-forwarded nameserver from /etc/resolv.conf on the host, or the lima host resolver (we override the qemu one using iptables).

Either way, you (we) should not have both nameservers in there, as they are not equivalent. Multiple nameservers in /etc/resolv.conf are supported for high availability: if one server is unreachable, the resolver will try another one. But they are all supposed to be interchangeable, all resolving all names the same way. There is no fallback from one nameserver to the other: if the first one responds with "domain not found", then that is the final answer.
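
To illustrate with a hypothetical /etc/resolv.conf (comment lines added for explanation):

# the resolver only moves to the second server on timeout or
# unreachability, never when the first answers "domain not found"
nameserver 192.168.5.3
nameserver 192.168.105.1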

FWIW, I only get the single expected nameserver when using system mode containerd:

jan@lima-default:~$ sudo nerdctl run --rm -it alpine -- cat /etc/resolv.conf
nameserver 192.168.5.3

@abiosoft

FWIW, I only get the single expected nameserver when using system mode containerd:

jan@lima-default:~$ sudo nerdctl run --rm -it alpine -- cat /etc/resolv.conf
nameserver 192.168.5.3

Sorry for the confusion; that is with the vmnet shared network enabled. Without it, the result is the same as yours.

@jandubois
Member

jandubois commented Feb 13, 2022

I see, thanks! That will be a potential source of problems:

jan@lima-default:~$ sudo nerdctl run --rm -it alpine -- cat /etc/resolv.conf
search home
nameserver 192.168.5.3
nameserver 192.168.105.1
jan@lima-default:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (eth0)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.5.3
       DNS Servers: 192.168.5.3

Link 3 (lima0)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.105.1
       DNS Servers: 192.168.105.1
        DNS Domain: home

Link 4 (nerdctl0)
Current Scopes: none
     Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported

It does pick up a search domain from the host, but the simplistic /etc/resolv.conf mechanism doesn't support split-DNS, so all requests should really just go to 127.0.0.53, and the other nameserver should not be listed as a fallback.
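
One way to check which link and server systemd-resolved actually routes a given query to (a sketch; host.home is a placeholder name):

resolvectl query host.home
# the output notes which link's DNS server supplied the answer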

The /etc/resolv.conf in the VM is correct though:

jan@lima-default:~$ cat /etc/resolv.conf
[...]
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
[...]
nameserver 127.0.0.53
options edns0 trust-ad
search home

Again, I don't know how nerdctl and/or containerd configure DNS, and why it picks up the incorrect second nameserver.

But you said you get the flaky networking even without the additional vde_vmnet network (and DNS), so it should be immaterial for this issue.

@scbizu

scbizu commented Feb 21, 2022

I do not know whether I should add to this issue, but in my case it has nothing to do with the DNS resolver; even with a plain IP:port, the network issue still exists.

In my case, I use lima nerdctl compose up to start 5 services at the same time, and these 5 services all depend on an outside registration-and-discovery service (this is my outgoing traffic, reached via IP:port) to start. The weird thing is that when we start the 5 services at the same time, they cannot start due to disconnections or timeouts with the outside registration service; if we reduce it to 1 or 2 services, they start only occasionally, after several retries, and even curl responds with a timeout.

After installing vde, the timeout issue persists. But when I use lima nerdctl run to start the containers one by one, they start with no problems. It seems the issue only appears when I switch the stack to lima nerdctl compose, which makes all the network requests at once. Maybe I can do some network monitoring to track the packet sizes at startup.
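
One way to capture that (a sketch, assuming the guest interface is eth0):

# capture guest-side traffic during compose startup for later inspection
sudo tcpdump -i eth0 -w compose-start.pcap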

@MlsDmitry

I used raw qemu before and tested several networking options. I found that vde is really slow compared to user-mode networking. I thought we could attach two interfaces to the VM: the user-mode device for outgoing traffic and a vde one for connections between VMs (if needed).

@abiosoft

I used raw qemu before and tested several networking options. I found that vde is really slow compared to user-mode networking. I thought we could attach two interfaces to the VM: the user-mode device for outgoing traffic and a vde one for connections between VMs (if needed).

I tried out PTP mode and it seems to be more stable in situations like #561 (comment). The download speed is okay, and there is also a report of better DNS performance (when used for DNS).

But there are still two main issues:

  • the upload speed is 90% slower.
  • it appears to be incompatible with VPN connections; there are multiple reports of unsuccessful outgoing requests when a VPN connection is active.

It looks like the safest bet is still your recommendation, i.e. to limit vde to providing a reachable address for the VM and retain user-mode networking for normal traffic.

@saamalik

Hi all - is this the root cause for the Slirp issues: https://gitlab.freedesktop.org/slirp/libslirp/-/issues/35 ?

@abiosoft

Hi all - is this the root cause for the Slirp issues: https://gitlab.freedesktop.org/slirp/libslirp/-/issues/35 ?

That looks like it.

@AkihiroSuda
Member

Hi all - is this the root cause for the Slirp issues: https://gitlab.freedesktop.org/slirp/libslirp/-/issues/35 ?

Seems already fixed in libslirp v4.5.0 (18 May, 2021): https://gitlab.freedesktop.org/slirp/libslirp/-/merge_requests/73

@abiosoft

Seems already fixed in libslirp v4.5.0 (18 May, 2021): https://gitlab.freedesktop.org/slirp/libslirp/-/merge_requests/73

I saw that; however, the issue described there is nearly identical to the current one as well.

@fingon
Author

fingon commented Jan 18, 2023

This particular problem was fixed, but a follow-up problem surfaced in 0.12 (see #1285). Closing this one, though.

@fingon fingon closed this as completed Jan 18, 2023