mptcp vs tcp performance over long fat networks #437

Closed
daire-byrne opened this issue Sep 5, 2023 · 20 comments

@daire-byrne

Hi,

I have a question about tcp vs mptcp performance. I have read previous issues on this subject like #307 (#332 & #345) and I have a rudimentary understanding of the problems of HOL blocking and the extra buffer requirements (window sizes) for mptcp.

Even so, could someone perhaps have a peek at some of the performance comparisons I have been doing and confirm whether the performance I am seeing makes sense within the current known mptcp limitations?

My setup is as described in #430, where we have serverA and serverB with three connection pairs between them - one to initiate connections and act as the backup, and two others which use different ISPs to route between them.

# serverA
ip mptcp endpoint flush
ip mptcp endpoint add 10.11.20.251 id 1 backup dev ens192
ip mptcp endpoint add 10.11.21.251 id 2 signal dev ens224
ip mptcp endpoint add 10.11.22.251 id 3 signal dev ens256
ip mptcp limits set subflow 4 add_addr_accepted 4

ip route add 10.15.21.251 via 10.11.21.1 dev ens224
ip route add 10.15.22.251 via 10.11.22.1 dev ens256

# serverB
ip mptcp endpoint flush
ip mptcp endpoint add 10.15.20.251 id 1 backup dev ens192
ip mptcp endpoint add 10.15.21.251 id 2 signal dev ens224
ip mptcp endpoint add 10.15.22.251 id 3 signal dev ens256
ip mptcp limits set subflow 4 add_addr_accepted 4

ip route add 10.11.21.251 via 10.15.21.1 dev ens224
ip route add 10.11.22.251 via 10.15.22.1 dev ens256
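
For reference, the resulting path-manager state can be double checked on each host with:

ip mptcp endpoint show
ip mptcp limits show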

Unlike in #307, these bulk transfer pairs run over long fat networks (multi-gigabit) with ~281ms of latency each (London - Sydney). So in order to achieve good bulk transfer speeds we crank up the buffers accordingly. Our relevant sysctl settings at both ends:

net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq

# high latency bulk transfers should use bigger window sizes
net.ipv4.tcp_rmem = 8192 262144 536870912
net.ipv4.tcp_wmem = 4096 16384 536870912
net.ipv4.tcp_adv_win_scale = -2
net.ipv4.tcp_notsent_lowat = 131072
    
# override the default memory to tcp_mem ratio
# host with 96G RAM = 64G max TCP mem (4 times the default)
net.ipv4.tcp_mem = 8388608 12582912 16777216

net.core.netdev_max_backlog = 30000
net.core.somaxconn = 16384
net.ipv4.tcp_max_syn_backlog = 512
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1

So we should be able to scale to a max window size of 128MB per connection all being well.
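
As a rough sanity check of that 128MB figure against the bandwidth-delay product of each path (back-of-the-envelope numbers only, using the nominal ~2.5gbit and ~7gbit path capacities mentioned further down):

# tcp_adv_win_scale = -2 => advertised window = buffer >> 2
echo $(( 536870912 / 4 ))                 # 134217728 bytes = 128MB max window
# bandwidth-delay product at ~281ms RTT:
echo $(( 2500000000 * 281 / 1000 / 8 ))   # ~88MB in flight on the ~2.5gbit path
echo $(( 7000000000 * 281 / 1000 / 8 ))   # ~246MB in flight on the ~7gbit path
# so 128MB comfortably covers the slower path but caps the faster one at
# roughly 128MB * 8 / 0.281s ~= 3.8gbit, which is why the 256MB window helps it.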

Now, if we run single stream iperf3 tests simultaneously across both our paired connections (ens224 & ens256) using just TCP, we get an idea of the sustained performance possible:

serverA # iperf3 -t30 -O5 -c serverB-ens224 -P1 -p 5201 &; iperf3 -t30 -O5 -c serverB-ens256 -P1 -p 5202
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  6.43 GBytes  1.84 Gbits/sec  1174             sender
[  5]   0.00-30.28  sec  6.47 GBytes  1.84 Gbits/sec                  receiver
[  5]   0.00-30.00  sec  11.7 GBytes  3.34 Gbits/sec  11989             sender
[  5]   0.00-30.28  sec  11.8 GBytes  3.34 Gbits/sec                  receiver

serverA # ss -neimMtp | grep iperf -A1 (taken while running iperfs)
tcp   ESTAB      0      75733812   10.11.21.251:54412 10.15.21.251:5201  users:(("iperf3",pid=2703303,fd=5)) timer:(on,039ms,0) uid:4078 ino:46766217 sk:407d cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t394758,tb536870912,f716,w77364532,o0,bl0,d0) ts sack bbr wscale:14,14 rto:485 rtt:284.421/0.383 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:128303 ssthresh:57468 bytes_sent:4426057769 bytes_retrans:750086 bytes_acked:4349640400 segs_out:3193417 segs_in:308754 data_segs_out:3193415 bbr:(bw:2235754024bps,mrtt:281,pacing_gain:1.25,cwnd_gain:2.25) send 5001823578bps lastrcv:22539 lastack:1 pacing_rate 2766745608bps delivery_rate 1906787184bps delivered:3143847 busy:22257ms rwnd_limited:2136ms(9.6%) unacked:54594 retrans:0/542 sacked:5567 reord_seen:3 rcv_space:14480 rcv_ssthresh:64088 notsent:66528 minrtt:281 snd_wnd:134217728 
 
tcp   ESTAB      0      133388042  10.11.22.251:40972 10.15.22.251:5202  users:(("iperf3",pid=2703304,fd=5)) timer:(on,481ms,0) uid:4078 ino:46760619 sk:4081 cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t64652,tb536870912,f2646,w136645576,o0,bl4480,d0) ts sack bbr wscale:14,14 rto:481 rtt:280.653/0.08 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:225343 bytes_sent:7937358673 bytes_retrans:7506576 bytes_acked:7796553548 segs_out:5726876 segs_in:654005 data_segs_out:5726874 bbr:(bw:3827935824bps,mrtt:280.379,pacing_gain:1.25,cwnd_gain:2.25) send 8902820152bps lastrcv:22545 pacing_rate 4737070584bps delivery_rate 3827201936bps delivered:5625283 app_limited busy:22262ms rwnd_limited:4601ms(20.7%) unacked:96176 retrans:0/5416 reordering:28 reord_seen:16 rcv_space:14480 rcv_ssthresh:64088 notsent:68702 minrtt:280 snd_wnd:134201344

So around 1.84 + 3.34 = 5.68gbit of aggregate performance. The max possible with the ens224 pair is ~2.5gbit and for the ens256 pair it's ~7gbit. It is possible to get closer to the latter if we again double the buffer to 256MB, but more on that later.

Now, let's compare with a single stream of mptcp:

serverA # mptcpize run iperf3 -t30 -O4 -c serverB-ens192 -P1
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  5.96 GBytes  1.71 Gbits/sec    0             sender
[  5]   0.00-30.28  sec  5.76 GBytes  1.63 Gbits/sec                  receiver
 
serverA # ss -neimMtp | grep iperf -A1 (taken while running iperf)
tcp   ESTAB      0      30943542  10.11.21.251:59509 10.15.21.251:5201  users:(("iperf3",pid=2715990,fd=5)) timer:(on,482ms,0) uid:4078 ino:46849656 sk:803e cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t64910,tb536870912,f2378,w31528630,o0,bl0,d0) ts sack bbr wscale:14,14 rto:482 rtt:281.799/0.016 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:49416 ssthresh:49109 bytes_sent:4011580334 bytes_retrans:3094942 bytes_acked:3977615799 segs_out:2961097 segs_in:1341139 data_segs_out:2961095 bbr:(bw:1933569768bps,mrtt:281,pacing_gain:1.25,cwnd_gain:2.25) send 1944380952bps lastsnd:1 lastrcv:32027 pacing_rate 2392792584bps delivery_rate 1933569736bps delivered:2936299 app_limited busy:31744ms sndbuf_limited:2933ms(9.2%) unacked:22786 retrans:0/2278 dsack_dups:267 rcv_space:14480 rcv_ssthresh:64088 notsent:71224 minrtt:281 snd_wnd:134234112  
 
tcp   ESTAB      0      10827200  10.11.22.251:53597 10.15.22.251:5201  users:(("iperf3",pid=2715990,fd=5)) timer:(on,481ms,0) uid:4078 ino:46849656 sk:8035 cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb334402560,f4032,w11087936,o0,bl0,d0) ts sack bbr wscale:14,14 rto:481 rtt:280.786/0.077 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:17858 ssthresh:20935 bytes_sent:1965662024 bytes_retrans:4581046 bytes_acked:1950386133 segs_out:1456033 segs_in:734665 data_segs_out:1456031 bbr:(bw:828479776bps,mrtt:280.391,pacing_gain:1,cwnd_gain:2) send 705197211bps lastsnd:1 lastrcv:31746 pacing_rate 820194976bps delivery_rate 828384952bps delivered:1444985 app_limited busy:31463ms sndbuf_limited:12946ms(41.1%) unacked:7969 retrans:0/3369 dsack_dups:291 rcv_space:14480 rcv_ssthresh:64088 notsent:132354 minrtt:280 snd_wnd:134234112   
 
mptcp ESTAB      0      0         10.11.20.251:47380 10.15.20.251:5201  users:(("iperf3",pid=2715990,fd=5)) uid:4078 ino:46849656 sk:8040 cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb536870912,f3472,w358859376,o0,bl0,d0) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:64d47401 write_seq:f9eb350931218a1f snd_una:f9eb35091bc47717 rcv_nxt:ffd0460b61af4c61

The standout differences on the TCP subflows are the higher retrans, the lower cwnd and the appearance of the sndbuf_limited (41%) field. It is also worth noting that there is very little latency difference between the two paths (280 vs 281ms).

My question is, does this seem like one of the known issues with the current mptcp (packet scheduler)? I might read this as saying we are being HOL blocked and/or re-transmitting such that iperf is delayed and is then not filling the send buffer?

Now, in production we have many streams of mptcp active at the same time and they do a much better job of filling the pipes, but I am also interested in this worst-case single-stream performance difference.

It is also worth noting that further increasing the buffer and max window size to 256MB improves the fatter ens256 path up to 6.5 gbit for a single stream, but does little to improve the mptcp case. We just don't seem to be TCP buffer/window limited at that point.

Anyway, this is all very far down the list of important and great things you all do, so feel free to mark it as such - it's mainly to satisfy my own curiosity and interest.

I should also say I have never tried the out-of-tree mptcp implementation, but I am led to believe it has more scheduling options which better close the performance gap between aggregated individual TCP streams and MPTCP's conjoined subflows.

@daire-byrne (Author)

And the output of nstat before/after the mptcp iperf test (in case it's of interest):

#kernel
IpInReceives                    1932604            0.0
IpInDelivers                    1932604            0.0
IpOutRequests                   159951             0.0
IpFragOKs                       64                 0.0
IpFragCreates                   128                0.0
IcmpInMsgs                      759                0.0
IcmpInEchos                     360                0.0
IcmpInEchoReps                  399                0.0
IcmpOutMsgs                     758                0.0
IcmpOutEchos                    398                0.0
IcmpOutEchoReps                 360                0.0
IcmpMsgInType0                  399                0.0
IcmpMsgInType8                  360                0.0
IcmpMsgOutType0                 360                0.0
IcmpMsgOutType8                 398                0.0
TcpActiveOpens                  313                0.0
TcpPassiveOpens                 2                  0.0
TcpEstabResets                  241                0.0
TcpInSegs                       1931831            0.0
TcpOutSegs                      4337183            0.0
TcpRetransSegs                  7317               0.0
TcpOutRsts                      242                0.0
UdpInDatagrams                  4                  0.0
UdpOutDatagrams                 502                0.0
UdpIgnoredMulti                 10                 0.0
Ip6OutRequests                  3                  0.0
Ip6OutMcastPkts                 3                  0.0
Ip6OutOctets                    160                0.0
Ip6OutMcastOctets               160                0.0
Icmp6OutMsgs                    3                  0.0
Icmp6OutRouterSolicits          3                  0.0
Icmp6OutType133                 3                  0.0
TcpExtTW                        7                  0.0
TcpExtDelayedACKs               333                0.0
TcpExtTCPHPHits                 24185              0.0
TcpExtTCPPureAcks               1852896            0.0
TcpExtTCPHPAcks                 24298              0.0
TcpExtTCPSackRecovery           26                 0.0
TcpExtTCPFastRetrans            7316               0.0
TcpExtTCPLossProbes             2                  0.0
TcpExtTCPLossProbeRecovery      1                  0.0
TcpExtTCPBacklogCoalesce        8772               0.0
TcpExtTCPDSACKRecv              671                0.0
TcpExtTCPAbortOnData            238                0.0
TcpExtTCPDSACKIgnoredOld        182                0.0
TcpExtTCPSackShifted            7315               0.0
TcpExtTCPSackMerged             14120              0.0
TcpExtTCPSackShiftFallback      94787              0.0
TcpExtTCPRcvCoalesce            180                0.0
TcpExtTCPAutoCorking            42564              0.0
TcpExtTCPOrigDataSent           4331525            0.0
TcpExtTCPKeepAlive              13                 0.0
TcpExtTCPDelivered              4289040            0.0
TcpExtTCPDSACKRecvSegs          671                0.0
IpExtInBcastPkts                10                 0.0
IpExtInOctets                   126034726          0.0
IpExtOutOctets                  5852695699         0.0
IpExtInBcastOctets              2863               0.0
IpExtInNoECTPkts                1932604            0.0
MPTcpExtMPCapableSYNTX          2                  0.0
MPTcpExtMPCapableSYNACKRX       2                  0.0
MPTcpExtMPTCPRetrans            2                  0.0
MPTcpExtMPJoinSynAckRx          4                  0.0
MPTcpExtDuplicateData           3                  0.0
MPTcpExtAddAddr                 4                  0.0
MPTcpExtEchoAdd                 4                  0.0
MPTcpExtMPPrioTx                2                  0.0
MPTcpExtMPPrioRx                2                  0.0
MPTcpExtMPFastcloseRx           3                  0.0
MPTcpExtMPRstRx                 3                  0.0
MPTcpExtSndWndShared            23                 0.0
MPTcpExtRcvWndShared            13                 0.0

@pabeni commented Sep 6, 2023

Thanks for the detailed report!

The standout differences on the TCP subflows are the higher retrans, the lower cwnd and the appearance of the sndbuf_limited (41%) field. It is also worth noting that there is very little latency difference between the two paths (280 vs 281ms).

As a wild guess, I think there are 2 main root causes:

  • the mptcp packet scheduler takes sub-optimal decisions when there are multiple subflows and one of them is significantly worse than the others (as noted in issue TCP performance better than MPTCP #307)
  • there is some bug in subflow autotuning so that the one on top of 10.11.22.251/ens256 gets a smaller sendbuf, leading to a (much) lower perceived available b/w and triggering the above.

One dirty hack to validate the above could possibly be setting both tcp_wmem and tcp_rmem to:
8192 536870912 536870912

WARNING!!! only doable if <max num simult tcp connections> is less than <tcp_memory_pressure threshold> / <tcp memory per socket> (should be 96 with the above settings)

The above actually triggers some follow-up questions:

  • how many simultaneous tcp connections/mptcp subflows are there in the system?
  • can you observe TCP memory pressure?
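
For reference, the "96" in the warning above works out roughly as follows (assuming 4KiB pages, since tcp_mem is counted in pages):

echo $(( 12582912 * 4096 ))               # tcp_mem pressure threshold ~= 48GiB in bytes
echo $(( 12582912 * 4096 / 536870912 ))   # divided by the 512MiB per-socket max = 96 sockets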

@pabeni commented Sep 6, 2023

there is some bug in subflow autotuning so that the one on top of 10.11.22.251/ens256 gets a smaller sendbuf, leading to a (much) lower perceived available b/w and triggering the above.

It looks like we have at least a functional problem there: the mptcp socket wmem is limited to max(subflows wmem), but in this scenario (and probably in any scenario where the tcp subflow is potentially sndbuf limited) we actually need the mptcp wmem to be sum(subflows wmem).
I think this point can't be easily addressed with an existing sysctl knob - unless you have a huge amount of memory available and can set the per-socket TCP limits to more than double what the bandwidth-delay formula requires - and then you have to compare the results with plain TCP using half of that value as a limit.

@pabeni commented Sep 8, 2023

addendum: experimenting in a simulated env (multiple veths + netem) with a similar configuration, it looks like using the bbr congestion control badly affects mptcp performance, e.g., with a single subflow:

# ip netns exec src mptcpize run taskset 0x2 iperf3 -c 192.168.2.3 -t 1000
Connecting to host 192.168.2.3, port 5201
[  5] local 192.168.2.2 port 33082 connected to 192.168.2.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   256 KBytes  2.09 Mbits/sec    0   31.1 KBytes       
[  5]   1.00-2.00   sec  1.12 MBytes  9.43 Mbits/sec    0    164 KBytes       
[  5]   2.00-3.00   sec  6.50 MBytes  54.5 Mbits/sec    0   1.03 MBytes       
[  5]   3.00-4.00   sec  42.5 MBytes   357 Mbits/sec    0   7.07 MBytes       
[  5]   4.00-5.00   sec   289 MBytes  2.42 Gbits/sec    0   48.4 MBytes       
[  5]   5.00-6.00   sec   262 MBytes  2.20 Gbits/sec    0    265 MBytes       
[  5]   6.00-7.00   sec   256 MBytes  2.15 Gbits/sec    0    219 MBytes       
[  5]   7.00-8.00   sec   261 MBytes  2.19 Gbits/sec    0    219 MBytes       
[  5]   8.00-9.00   sec   286 MBytes  2.40 Gbits/sec    0    219 MBytes       
[  5]   9.00-10.00  sec   335 MBytes  2.81 Gbits/sec    0    250 MBytes       
[  5]  10.00-11.00  sec   320 MBytes  2.68 Gbits/sec    0    250 MBytes       
[  5]  11.00-12.00  sec   330 MBytes  2.77 Gbits/sec    0    250 MBytes       
[  5]  12.00-13.00  sec   326 MBytes  2.74 Gbits/sec    0    249 MBytes
# so far (almost) so good, this is in line with plain TCP perf. At this point BBR enters the PROBE_RTT state, cutting the b/w for a short period. With plain TCP it recovers within the same second; with mptcp:
[  5]  13.00-14.00  sec   230 MBytes  1.93 Gbits/sec    0   5.66 KBytes       
[  5]  14.00-15.00  sec  1.25 MBytes  10.5 Mbits/sec    0    245 MBytes       
[  5]  15.00-16.00  sec  2.50 MBytes  21.0 Mbits/sec    0    243 MBytes       
[  5]  16.00-17.00  sec  5.00 MBytes  42.0 Mbits/sec    0    242 MBytes       
[  5]  17.00-18.00  sec  6.25 MBytes  52.4 Mbits/sec    0   5.82 MBytes       
[  5]  18.00-19.00  sec  1.25 MBytes  10.5 Mbits/sec    0   5.95 MBytes       
[  5]  19.00-20.00  sec  3.75 MBytes  31.5 Mbits/sec    0   5.41 MBytes       
[  5]  20.00-21.00  sec  5.00 MBytes  41.9 Mbits/sec    0   5.46 MBytes

@daire-byrne (Author)

Thanks for considering it @pabeni.

I tried with "8192 536870912 536870912" but there was no noticeable improvement in performance for the mptcp case.

When I tried with "8192 536870912 1073741824" (256M max window), there was a slight increase from 1.6gbit -> 2.25gbit for mptcp. But then the ens256 route using just TCP also increased from ~3.4gbit -> 5.6gbit, so mptcp is still falling behind relatively speaking.

I have not observed the BBR degradation you are seeing but then (confusingly) I am also using BBRv3 which is scheduled to replace the current BBR implementation in the kernel (patches from: https://gitlab.com/xanmod/linux-patches/-/tree/master/linux-6.4.y-xanmod/net/tcp/bbr3).

In our environment, using "real world" multi-gigabit internet connections, cubic congestion control just can't get above around 500mbit for a single stream of TCP over 280ms. With BBRv3, we can sustain ~6gbit with the same 256M window.

I think we are already fine with throwing lots of TCP memory at the problem of long fat network transfers so that is not really a limitation for us.

A back of the envelope calculation suggests that the absolute maximum number of mptcp subflows we would have open at once would be less than 200. Add another 100 for various things like NFS connections (nconnect=8) and we probably max out at 300 connections per server (worst case). That would equate to around 72G of TCP memory if we were using the default net.ipv4.tcp_adv_win_scale.

We already use 96G servers and set the max TCP mem to 64G and I have never seen the memory used go above a few gigabytes (according to /proc/net/sockstat) for any period of time. But this is also likely because we are not able to fully open the windows with the mptcp connections atm.

@pabeni commented Sep 11, 2023

net.ipv4.tcp_notsent_lowat = 131072

In my veth-based testbed, the above sysctl is the root cause of the very bad performance I'm observing for MPTCP with the BBR congestion algorithm.

Removing that setting also removed the huge b/w degradation I was observing after ~15 seconds, and it also makes the initial ramp-up to full b/w faster. Why do you have that setting in your setup? Could you please experiment with it off, i.e. the default tcp_notsent_lowat (4294967295)?
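
(i.e. something along these lines on both hosts:)

sysctl -w net.ipv4.tcp_notsent_lowat=4294967295   # back to the default, effectively disabling it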

Note: the earlier comment wrt mptcp wmem autotuning is still relevant; addressing that will require some patches and more testing.

@matttbe added the bug label Sep 12, 2023
@daire-byrne (Author)

net.ipv4.tcp_notsent_lowat = 131072

Ah right, so I think that came from some testing I was doing at some point with a cloudflare "tcp collapse" patch based on this blog post:

https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency

I clearly forgot to remove it.... sorry! I actually just quickly reapplied that patch and tested with "net.ipv4.tcp_collapse_max_bytes=6291456" but the results were as before. It does seem like BBR does not like a low tcp_notsent_lowat... But I still only saw degradation on mptcp flows with a low tcp_notsent_lowat, not standalone tcp flows.

While I was not able to exactly replicate the extreme BBR performance degradation you see, I set tcp_notsent_lowat=4294967295 (maximum) anyway, and now I do indeed see improvement in the mptcp case:

serverA # mptcpize run iperf3 -t30 -O4 -c serverB-ens192 -P1
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  10.6 GBytes  3.04 Gbits/sec    0             sender
[  5]   0.00-30.28  sec  10.7 GBytes  3.05 Gbits/sec                  receiver

serverA # ss -neimMtp | grep iperf -A1 (taken while running iperf)
tcp   ESTAB      0      90787220  10.11.22.251:54001 10.15.22.251:5201  users:(("iperf3",pid=2702075,fd=5)) timer:(on,484ms,0) uid:4078 ino:46853834 sk:700c cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t63668,tb536870912,f1900,w93018260,o0,bl0,d0) ts sack bbr wscale:14,14 rto:484 rtt:283.345/0.035 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:74089 bytes_sent:3818087706 bytes_retrans:7200370 bytes_acked:3724429511 segs_out:2827221 segs_in:1271745 data_segs_out:2827219 bbr:(bw:2785266832bps,mrtt:281.923,pacing_gain:1.25,cwnd_gain:2.25) send 2899288260bps lastrcv:20480 pacing_rate 3446767704bps delivery_rate 2411538872bps delivered:2758157 app_limited busy:20188ms sndbuf_limited:7758ms(38.4%) unacked:63869 retrans:0/5304 dsack_dups:110 reordering:155 rcv_space:14480 rcv_ssthresh:64088 notsent:4329394 minrtt:281.92 snd_wnd:134217728

tcp   ESTAB      0      41451204  10.11.21.251:58911 10.15.21.251:5201  users:(("iperf3",pid=2702075,fd=5)) timer:(on,481ms,0) uid:4078 ino:46853834 sk:7008 cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb536870912,f2876,w42816708,o0,bl0,d0) ts sack bbr wscale:14,14 rto:482 rtt:281.917/0.067 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:32283 ssthresh:53917 bytes_sent:2818257690 bytes_retrans:7531988 bytes_acked:2769274499 segs_out:2097743 segs_in:952028 data_segs_out:2097741 bbr:(bw:1495882600bps,mrtt:281,pacing_gain:1,cwnd_gain:2) send 1269713795bps lastsnd:2 lastrcv:20763 lastack:2 pacing_rate 1480923768bps delivery_rate 1268401072bps delivered:2061291 busy:20480ms sndbuf_limited:5269ms(25.7%) unacked:30978 retrans:0/5561 dsack_dups:88 reordering:41 rcv_space:14480 rcv_ssthresh:64088 minrtt:281 snd_wnd:134217728

It's still about half of the ~5.68gbit aggregate I can get across the two paths with plain TCP, but it's almost twice what I was getting before. Which is probably about right if the TCP window size (128M) is split between the two paths? I still see quite high sndbuf_limited on each of the tcp connections though.
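
As a rough check on that guess, a single 128MB window at ~281ms RTT can carry at most about:

echo $(( 134217728 * 8 * 1000 / 281 ))   # ~3.8e9 bit/s if one window's worth of data is in flight

so ~3.0 gbit across the two subflows combined is in the right ballpark if they are effectively sharing one window's worth of send buffer.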

Then I retested your suggestion of tcp_[rw]mem="8192 536870912 536870912" and again I now see slightly better performance. In particular, it starts off bouncing around a bit for the first 4 seconds, but then the iperf settles down to a steady ~3.5gbit (up from 3.0gbit with default auto tuning).

tcp   ESTAB      0      51378060  10.11.21.251:55915 10.15.21.251:5201  users:(("iperf3",pid=2759403,fd=5)) timer:(on,201ms,0) uid:4078 ino:47232326 sk:500b cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t64910,tb536870912,f1908,w53168268,o0,bl0,d0) ts sack bbr wscale:14,14 rto:482 rtt:281.813/0.012 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:119597 ssthresh:57654 bytes_sent:4237409342 bytes_retrans:1102374 bytes_acked:4185309345 segs_out:3154095 segs_in:1521338 data_segs_out:3154093 bbr:(bw:2280577336bps,mrtt:281,pacing_gain:1,cwnd_gain:2) send 4705572617bps lastrcv:25233 pacing_rate 2257771560bps delivery_rate 1738169928bps delivered:3115301 app_limited busy:24951ms sndbuf_limited:6234ms(25.0%) unacked:38048 retrans:0/817 dsack_dups:72 reordering:41 rcv_space:14480 rcv_ssthresh:2990080 notsent:380436 minrtt:281 snd_wnd:134234112                       

tcp   ESTAB      0      82856052  10.11.22.251:40427 10.15.22.251:5201  users:(("iperf3",pid=2759403,fd=5)) timer:(on,200ms,0) uid:4078 ino:47232326 sk:500c cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t64910,tb536870912,f2956,w84956276,o0,bl0,d0) ts sack bbr wscale:14,14 rto:481 rtt:280.737/0.013 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:73619 bytes_sent:5428860602 bytes_retrans:11510628 bytes_acked:5334993217 segs_out:4019003 segs_in:1927606 data_segs_out:4019001 bbr:(bw:2915298176bps,mrtt:280.366,pacing_gain:1.25,cwnd_gain:2.25) send 2907659026bps lastrcv:24953 pacing_rate 3607681496bps delivery_rate 2422134192bps delivered:3949794 app_limited busy:24669ms rwnd_limited:189ms(0.8%) sndbuf_limited:8847ms(35.9%) unacked:60890 retrans:0/8504 dsack_dups:186 reordering:155 rcv_space:14480 rcv_ssthresh:2990080 notsent:499294 minrtt:280 snd_wnd:134234112

If I again double the max window (net.ipv4.tcp_adv_win_scale = -1), I can now get 4.5gbit with mptcp which I was not able to do before.
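
(That is, keeping the 512MB tcp_[rw]mem max and just relaxing the window-to-buffer ratio:)

sysctl -w net.ipv4.tcp_adv_win_scale=-1   # window = buffer >> 1 = 256MB of the 512MB buffer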

So it seems like 80% of my woes were self-inflicted (tcp_notsent_lowat) and 20% is down to the mptcp wmem autotuning.

I really need to stop reading cloudflare blog posts... thanks for walking me through it.

@pabeni commented Sep 12, 2023

net.ipv4.tcp_notsent_lowat = 131072

It does seem like BBR does not like a low tcp_notsent_lowat... But I still only saw degradation on mptcp flows with a low tcp_notsent_lowat, not standalone tcp flows.

I think the problem can be addressed at the mptcp impl. level: basically we need to ignore the subflow-level tcp_notsent_lowat and use it only at the msk/mptcp socket level. The subflow-level tcp_notsent_lowat easily fools the mptcp scheduler into taking suboptimal decisions and is not really directly related to the latency observed by the sender.

It's still about half of the ~5.68gbit aggregate I can get across the two paths with plain TCP, but it's almost twice what I was getting before. Which is probably about right if the TCP window size (128M) is split between the two paths? I still see quite high sndbuf_limited on each of the tcp connections though.

The wmem is accounted both on the egress subflow and the mptcp socket: to be able to fill 2 subflows, the mptcp socket would need a sndbuf equal to the sum of the subflows' sndbufs.

I have a hacky patch for that which I hope to share soon.

I really need to stop reading cloudflare blog posts...

AFAIK cloudflare has quite a lot of cool stuff there; the problem is that mptcp is a bit 'special' ;)

Thanks for testing!

@pabeni commented Sep 15, 2023

I have a hacky patch for that which I hope to share soon.

I posted a few patches on the devel ML which should:

  • avoid the scheduler being fooled by net.ipv4.tcp_notsent_lowat setting
  • better auto-tune the mptcp-level sndbuf

Overall you should see some nice improvement while running your test-case with the settings described above (no need to drop tcp_notsent_lowat, no need to increase the tcp wmem further).

The patches are available here:

https://lore.kernel.org/mptcp/cover.1694710538.git.pabeni@redhat.com/T/#t

and/or via pw:

https://patchwork.kernel.org/project/mptcp/list/?series=784264

as usual, any feedback is more than welcome!

@daire-byrne (Author)

Thanks, but is there another dependent patch series I need for v6.5 too? While this patch applies, I get compile errors.

I removed all the mptcp connection hang fix patches so only this patch series (mptcp-misc-improvement.patch v3) was applied and I get:

net/mptcp/protocol.c: In function 'mptcp_schedule_work':
net/mptcp/protocol.c:894:31: error: 'ssk' undeclared (first use in this function); did you mean 'sk'?
  __mptcp_propagate_sndbuf(sk, ssk);
                               ^~~
                               sk
net/mptcp/protocol.c:894:31: note: each undeclared identifier is reported only once for each function it appears in
make[4]: *** [scripts/Makefile.build:243: net/mptcp/protocol.o] Error 1

Perhaps I need a later version of the "mptcp: fix another close hang-up" (v5) or "mptcp: fix missing close transition" (v2). I'll try those....

@pabeni commented Sep 18, 2023

Thanks, but is there another dependent patch series I need for v6.5 too? While this patch applies, I get compile errors.

I removed all the mptcp connection hang fix patches so only this patch series (mptcp-misc-improvement.patch v3) was applied and I get:

net/mptcp/protocol.c: In function 'mptcp_schedule_work':
net/mptcp/protocol.c:894:31: error: 'ssk' undeclared (first use in this function); did you mean 'sk'?
  __mptcp_propagate_sndbuf(sk, ssk);
                               ^~~
                               sk
net/mptcp/protocol.c:894:31: note: each undeclared identifier is reported only once for each function it appears in
make[4]: *** [scripts/Makefile.build:243: net/mptcp/protocol.o] Error 1

The above looks like a bad conflict resolution while applying the patches; 'mptcp_schedule_work' must not be touched by the last series. __mptcp_propagate_sndbuf() should be invoked only by mptcp_sk_clone_init().

Off the top of my head I don't think the series needs any additional prerequisites.

I suggest double checking the diff on top of v6.5 vs the original patch (vimdiff could help)

@daire-byrne (Author) commented Sep 18, 2023

Yeah, I had fuzz=2 set in the build script from a previous hacking effort, which is why the patch applied at all.

So the failing hunks against v6.5.3 are:

--- net/mptcp/protocol.h
+++ net/mptcp/protocol.h
@@ -123,6 +123,7 @@
 #define MPTCP_RETRANSMIT	4
 #define MPTCP_FLUSH_JOIN_LIST	5
 #define MPTCP_CONNECTED		6
+#define MPTCP_SYNC_SNDBUF	7
 
 struct mptcp_skb_cb {
 	u64 map_seq;

because v6.5.3 still has "#define MPTCP_RESET_SCHEDULER 7".

Then this:

--- net/mptcp/protocol.c
+++ net/mptcp/protocol.c
@@ -891,6 +891,7 @@ static bool __mptcp_finish_join(struct mptcp_sock *msk, struct sock *ssk)
 	mptcp_sockopt_sync_locked(msk, ssk);
 	mptcp_subflow_joined(msk, ssk);
 	mptcp_stop_tout_timer(sk);
+	__mptcp_propagate_sndbuf(sk, ssk);
 	return true;
 }
 
@@ -2437,6 +2439,7 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 		WRITE_ONCE(msk->first, NULL);
 
 out:
+	__mptcp_sync_sndbuf(sk);
 	if (need_push)
 		__mptcp_push_pending(sk, 0);
 
@@ -3393,6 +3396,8 @@ static void mptcp_release_cb(struct sock *sk)
 			__mptcp_set_connected(sk);
 		if (__test_and_clear_bit(MPTCP_ERROR_REPORT, &msk->cb_flags))
 			__mptcp_error_report(sk);
+		if (__test_and_clear_bit(MPTCP_SYNC_SNDBUF, &msk->cb_flags))
+			__mptcp_sync_sndbuf(sk);
 	}
 
 	__mptcp_update_rmem(sk);

I can probably fix this one up but I'm not sure what to do about the MPTCP_RESET_SCHEDULER bit.

I guess this comes from the BPF scheduler work.

https://lore.kernel.org/all/20230821-upstream-net-next-20230818-v1-0-0c860fb256a8@kernel.org/

@pabeni commented Sep 18, 2023

So the failing hunks against v6.5.3 are:

--- net/mptcp/protocol.h
+++ net/mptcp/protocol.h
@@ -123,6 +123,7 @@
 #define MPTCP_RETRANSMIT	4
 #define MPTCP_FLUSH_JOIN_LIST	5
 #define MPTCP_CONNECTED		6
+#define MPTCP_SYNC_SNDBUF	7
 
 struct mptcp_skb_cb {
 	u64 map_seq;

because v6.5.3 still has "#define MPTCP_RESET_SCHEDULER 7".

Instead of the failing hunk, add:

#define MPTCP_SYNC_SNDBUF	8

It's sufficient to avoid a conflict on the bit allocations.

@daire-byrne (Author)

Thanks, I massaged the patch and got it to apply without any patch fuzz. However, I clearly made some mistake somewhere because now the host doesn't last long before it starts spewing errors and hangs.

Sep 18 16:32:47 lontransfer1 kernel: ------------[ cut here ]------------
Sep 18 16:32:47 lontransfer1 kernel: IPv4: Attempt to release TCP socket in state 1 000000001df7bf9f
Sep 18 16:32:47 lontransfer1 kernel: refcount_t: addition on 0; use-after-free.
Sep 18 16:32:47 lontransfer1 kernel: WARNING: CPU: 5 PID: 0 at lib/refcount.c:25 refcount_warn_saturate+0x7d/0x110

I have little doubt I made some error somewhere. I'll have another crack at it tomorrow with a coffee in hand.

@daire-byrne (Author)

I stepped through the various patches as carefully as I could, but there is clearly some instability being introduced by this series. Or it conflicts with the previous patches I have.

I might have to wait until things settle down and they are all included in v6.6 or v6.7 to properly test them again. In the meantime, we are using tcp_[rw]mem="8192 536870912 536870912" and getting good performance.

@pabeni commented Sep 20, 2023

I stepped through the various patches as carefully as I could, but there is clearly some instability being introduced by this series. Or it conflicts with the previous patches I have.

Indeed, I reproduced a problem similar to the one reported above. AFAICS the root cause is actually pre-existing, but the new patches make it much easier to trigger (before it was very hard to hit - so hard that syzkaller never triggered it).

I shared a new version of the series, including a fix for the problem I observed:

https://lore.kernel.org/mptcp/cover.1695200723.git.pabeni@redhat.com/T/#t
https://patchwork.kernel.org/project/mptcp/list/?series=785871

@daire-byrne (Author) commented Sep 22, 2023

Yep, I think that latest series has done the trick - thanks!

With the previous series I was able to crash one of the hosts within a couple of minutes of running iperf tests.

I have been running tests for over 20 minutes now without incident.

I am also seeing good performance without needing to set initial wmem high - auto tuning seems to be doing the right thing now. The sndbuf_limited is a lot lower on each path:

tcp   ESTAB      0      66956406  10.11.21.251:59559 10.15.21.251:5201  users:(("iperf3",pid=64059,fd=5)) timer:(on,493ms,0) uid:4078 ino:421507 sk:16 cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb536870912,f2058,w68089846,o0,bl15232,d0) ts sack bbr wscale:14,14 rto:492 rtt:291.889/0.048 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:117213 ssthresh:49407 bytes_sent:6626779662 bytes_retrans:1147084 bytes_acked:6613335349 segs_out:4912221 segs_in:2327744 data_segs_out:4912219 bbr:(bw:2296176472bps,mrtt:281,pacing_gain:1,cwnd_gain:2) send 4452575273bps lastrcv:35663 pacing_rate 2273214704bps delivery_rate 1877437480bps delivered:4902299 busy:35382ms sndbuf_limited:3102ms(8.8%) unacked:9157 retrans:0/849 dsack_dups:85 rcv_space:14480 rcv_ssthresh:129624 notsent:54422190 minrtt:281 snd_wnd:268484608  
tcp   ESTAB      0      21643212  10.11.22.251:34797 10.15.22.251:5201  users:(("iperf3",pid=64059,fd=5)) timer:(on,483ms,0) uid:4078 ino:421507 sk:e cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb536870912,f2184,w162297720,o0,bl0,d0) ts sack bbr wscale:14,14 rto:486 rtt:285.383/0.761 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:105099 bytes_sent:10644110266 bytes_retrans:49149948 bytes_acked:10573253093 segs_out:7856628 segs_in:2368015 data_segs_out:7856626 bbr:(bw:4662165400bps,mrtt:282,pacing_gain:1.25,cwnd_gain:2.25) send 4083416714bps lastrcv:35379 lastack:6 pacing_rate 5769429680bps delivery_rate 4542901904bps delivered:7805479 busy:35087ms sndbuf_limited:3390ms(9.7%) unacked:16020 retrans:0/36200 dsack_dups:1072 rcv_space:14480 rcv_ssthresh:129624 notsent:147866546 minrtt:282 snd_wnd:168460990

@daire-byrne (Author)

Just to follow up - the production hosts have been stable and performance is consistently good.

Many thanks!

@pabeni self-assigned this Nov 2, 2023
@pabeni commented Nov 2, 2023

All the required changes are merged upstream:

commit 8005184
Author: Paolo Abeni pabeni@redhat.com
Date: Mon Oct 23 13:44:42 2023 -0700

mptcp: refactor sndbuf auto-tuning

@pabeni closed this as completed Nov 2, 2023