
connections hanging after a long transfer #427

Closed
daire-byrne opened this issue Aug 4, 2023 · 24 comments

@daire-byrne

daire-byrne commented Aug 4, 2023

Sorry, this might be a bit lacking in detail until I can dig into it a bit more, but I thought I should open an issue now in case there are any known problems or anyone can help guide my investigation.

We recently upgraded our kernel from v6.2.8 to v6.4.7 and we are running lots of mptcp rsync/rsyncd transfers between hosts. I actually had an earlier issue #295 which was addressed in v6.1 and this workload has been stable since then (using mptcpize/LD_PRELOAD).

But since updating to v6.4.7, we are seeing some unexplained "hangs" in the running transfers such that they never complete. We can start new mptcp connections (either more rsync or iperf) and those seem to work fine, but the running ones either stall or become really slow to progress.

If I kill the rsync transfers and restart them, they perform well again until the transfer hosts start having problems and the transfer tasks start to slow again (a few days later).

I have also seen this dumped from time to time, but its appearance does not seem to line up with the transfer hangs or slowness:

[ 2665.967025] ------------[ cut here ]------------
[ 2665.967025] WARNING: CPU: 4 PID: 181766 at net/mptcp/protocol.c:1299 mptcp_sendmsg_frag+0x87d/0x8d0
[ 2665.967028] Modules linked in: rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) cfg80211(E) act_police(E) cls_u32(E) sch_ingress(E) sch_drr(E) sch_tbf(E) sch_prio(E) nfsv3(E) nfs(E) fscache(E) netfs(E) rfkill(E) vsock_loopback(E) vmw_vsock_virtio_transport_common(E) vmw_vsock_vmci_transport(E) vsock(E) intel_rapl_common(E) crct10dif_pclmul(E) crc32_pclmul(E) polyval_clmulni(E) polyval_generic(E) ghash_clmulni_intel(E) sha512_ssse3(E) vmw_balloon(E) aesni_intel(E) crypto_simd(E) cryptd(E) joydev(E) input_leds(E) vmw_vmci(E) i2c_piix4(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sch_fq(E) tcp_bbr2(E) sunrpc(E) binfmt_misc(E) xfs(E) libcrc32c(E) vmwgfx(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) sd_mod(E) t10_pi(E) syscopyarea(E) crc64_rocksoft(E) sysfillrect(E) ata_generic(E) crc64(E) sysimgblt(E) pata_acpi(E) sg(E) ata_piix(E) crc32c_intel(E) drm(E) libata(E) serio_raw(E) vmxnet3(E) vmw_pvscsi(E) fuse(E)
[ 2665.967056] CPU: 4 PID: 181766 Comm: kworker/4:0 Tainted: G        W   E      6.4.7-1.dneg.x86_64 #1
[ 2665.967057] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 2665.967058] Workqueue: events mptcp_worker
[ 2665.967060] RIP: 0010:mptcp_sendmsg_frag+0x87d/0x8d0
[ 2665.967062] Code: f5 e9 ea f9 ff ff 48 83 7a 28 00 75 dc eb d6 0f 0b 44 88 45 b8 e9 46 fa ff ff 44 89 fe 48 89 df e8 a8 d8 ff ff e9 bf fb ff ff <0f> 0b e9 05 fb ff ff 8d 71 ff 48 8b 7d c0 48 63 f6 48 83 c6 03 48
[ 2665.967063] RSP: 0018:ffffa59b0080fcd0 EFLAGS: 00010202
[ 2665.967064] RAX: 00000000caa75226 RBX: ffff8f61cc1f7200 RCX: 0000000000000001
[ 2665.967065] RDX: 0000000000000000 RSI: 00000000fffff8c7 RDI: ffff8f707f730000
[ 2665.967065] RBP: ffffa59b0080fd38 R08: 0000000000000001 R09: 0000000000000001
[ 2665.967066] R10: 0000000000004ac6 R11: ffff8f60755083d0 R12: ffff8f6166972e40
[ 2665.967066] R13: 0000000000000001 R14: ffff8f63449ae000 R15: 0000000000000001
[ 2665.967067] FS:  0000000000000000(0000) GS:ffff8f781df00000(0000) knlGS:0000000000000000
[ 2665.967068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2665.967069] CR2: 00007fa5cba1b340 CR3: 000000010a2bc000 CR4: 0000000000350ee0
[ 2665.967077] Call Trace:
[ 2665.967078]  <TASK>
[ 2665.967078]  ? show_regs+0x62/0x70
[ 2665.967080]  ? __warn+0x89/0x150
[ 2665.967081]  ? mptcp_sendmsg_frag+0x87d/0x8d0
[ 2665.967083]  ? report_bug+0xfb/0x1e0
[ 2665.967084]  ? handle_bug+0x48/0x80
[ 2665.967085]  ? exc_invalid_op+0x18/0x70
[ 2665.967086]  ? asm_exc_invalid_op+0x1b/0x20
[ 2665.967089]  ? mptcp_sendmsg_frag+0x87d/0x8d0
[ 2665.967090]  __mptcp_push_pending+0xc8/0x210
[ 2665.967092]  mptcp_release_cb+0x206/0x330
[ 2665.967094]  ? _raw_spin_unlock_bh+0x1d/0x30
[ 2665.967096]  release_sock+0x48/0xa0
[ 2665.967097]  mptcp_worker+0xcc/0x400
[ 2665.967099]  ? __schedule+0x303/0x920
[ 2665.967100]  process_one_work+0x1a4/0x3c0
[ 2665.967102]  worker_thread+0x37/0x380
[ 2665.967104]  ? __pfx_worker_thread+0x10/0x10
[ 2665.967106]  kthread+0xf8/0x130
[ 2665.967107]  ? __pfx_kthread+0x10/0x10
[ 2665.967108]  ret_from_fork+0x2c/0x50
[ 2665.967111]  </TASK>
[ 2665.967111] ---[ end trace 0000000000000000 ]---

I do not recall ever seeing these on v6.2.8. But they are sporadic and only occur once every couple of days so may not be all that important.

All of this could be some other regression in v6.3 or v6.4 unrelated to mptcp, but such network oddities are usually noticed and addressed much more quickly. I just suspect that it has more to do with our niche mptcp+rsync setup (having already had a previous issue).

I will try a v6.3 kernel too and see if I can replicate with that. Sorry for the lack of detail at this stage...

@pabeni

pabeni commented Aug 4, 2023

Thank you for the report. Did you upgrade all your hosts to 6.4.7, or do you have a mixed environment comprising older kernels? Is the above the only splat you see in dmesg?

Off the top of my head, I do not recall observing a similar problem. Could you please provide more details (a collection sketch follows after the list), e.g.:

  • nstat output
  • ss -neimMt output
  • free output
  • network topology
  • somewhat more detailed description of the involved workload
  • possibly a pcap capture of the hung/slow connections
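
A minimal collection sketch along those lines, assuming the stalled transfer goes through rsyncd on port 873 as described in this report (the output filenames and the port filter are just illustrative placeholders):

# snapshot counters, socket state and memory on both ends while a transfer is stuck
nstat > nstat.$(hostname).txt
ss -neimMt > ss.$(hostname).txt
free > free.$(hostname).txt
# capture traffic for the suspect service port on all interfaces (873 = rsyncd here)
tcpdump -i any -s 100 -w hang.$(hostname).pcap port 873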

@pabeni

pabeni commented Aug 4, 2023

[ 2665.967056] CPU: 4 PID: 181766 Comm: kworker/4:0 Tainted: G W E 6.4.7-1.dneg.x86_64 #1

The above means the running kernel has loaded unsigned module[s], possibly out-of-tree (OoT) ones. Could you reproduce the issue without such taint? Just to exclude that the problem is in some unknown module/source.

@daire-byrne
Author

Hmm, it is true I have applied a patch to add bbr2 congestion control to the kernel (and am using it), but I was also using that on v6.2.8 without issue.

I am definitely not ruling out that this is unrelated to mptcp and is something else... I will step through all the patches and try to replicate.

All hosts doing the mptcp transfers have been updated to v6.4.7.

I will include all the info you suggest next time I get a hung example.

Thanks!

@daire-byrne
Author

So I found some hung examples - it takes a while for them to occur and then they start to build up until the system is mostly full of hung rsync commands.

The workload is basically running rsync on a client and connecting to rsyncd on the remote server (like in #295). So something like this on the client "mptcpize run rsync -av serverB::/files/ /tmp" where rsyncd is running on serverB with the mptcp LD_PRELOAD.

I looked at the hung tasks and an strace just says they are stuck in a select timeout - both the client rsync and the remote rsync spawned by rsyncd on the server. I actually have a 20-minute timeout on the rsync client/server that is supposed to kill the process if it makes no progress, but they are both in a permanently hung state long past that. So rsync thinks there is enough network communication going on not to kill them, but they definitely aren't sending IO.

The client reports:

client # sudo ss -neimMt | grep 59662 -A1
tcp   ESTAB 0         0                 10.21.20.251:59662         10.29.20.251:873   ino:46698881 sk:a cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t0,tb110592,f0,w0,o0,bl0,d0) ts sack bbr2 wscale:14,14 rto:268 rtt:67.053/0.045 ato:40 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:30 bytes_sent:15336 bytes_acked:15337 bytes_received:26 segs_out:91 segs_in:87 data_segs_out:83 data_segs_in:2 bbr:(bw:1578880bps,mrtt:66.882,pacing_gain:2.88672,cwnd_gain:2.88672) send 4960852bps lastsnd:215977 lastrcv:30025008 lastack:215910 pacing_rate 4733184bps delivery_rate 1578824bps delivered:84 app_limited busy:4815ms rcv_space:14480 rcv_ssthresh:64088 minrtt:66.882 snd_wnd:65536 tcp-ulp-mptcp flags:MmBbec token:0000(id:0)/61f42ba5(id:0) seq:5afdd892e3700969 sfseq:f ssnoff:1eab1667 maplen:c                                                                                                                                                                                                                                                            
--
mptcp ESTAB 0         0                 10.21.20.251:59662         10.29.20.251:873   ino:46698881 sk:a4 cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t0,tb110592,f0,w0,o0,bl0,d398) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:61f42ba5 write_seq:c944b7e3b5fc71a9 snd_una:c944b7e3b5fc71a9 rcv_nxt:5afdd89b62b47195

client # free                                                                                                                                                                                                                                                                          [Fri, Aug 04 - 18:01:18]
              total        used        free      shared  buff/cache   available
Mem:       98584272     8701728     4497244      943844    85385300    80853148
Swap:             0           0           0

And the server reports:

server # sudo ss -neimMt | grep 59662 -A1
tcp   ESTAB    0         0               10.29.20.251:873           10.21.20.251:59662 timer:(keepalive,96min,0) ino:70060345 sk:69 cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb82944,f0,w0,o0,bl0,d0) ts sack bbr2 wscale:14,14 rto:267 rtt:66.972/18.853 ato:40 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:12 bytes_sent:26 bytes_acked:26 bytes_received:15336 segs_out:86 segs_in:91 data_segs_out:2 data_segs_in:83 bbr:(bw:165880bps,mrtt:66.904,pacing_gain:2.88672,cwnd_gain:2.88672) send 1986741bps lastsnd:30281624 lastrcv:472530 lastack:472530 pacing_rate 4729408bps delivery_rate 165728bps delivered:3 app_limited busy:134ms rcv_rtt:448 rcv_space:14600 rcv_ssthresh:64076 minrtt:66.904 snd_wnd:134217728 tcp-ulp-mptcp flags:MBbec token:0000(id:0)/1b304e79(id:0) seq:c944b7e3b5fc71a5 sfseq:3be5 ssnoff:453983f0 maplen:4       
mptcp ESTAB    0         0               10.29.20.251:873           10.21.20.251:59662 ino:70060345 sk:8f cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb157289472,f204,w104877876,o0,bl0,d0) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:1b304e79 write_seq:5afdd89b68f2cca1 snd_una:5afdd89b62b47195 rcv_nxt:c944b7e3b5fc71a9  

server # free
              total        used        free      shared  buff/cache   available
Mem:       98584272     8701728     4497244      943844    85385300    80853148
Swap:             0           0           0

This particular transfer was running fine for around 5 mins before it stalled/hung. If I kill and restart the transfer it goes through fine on the second attempt.

It's hard to debug these hung tasks as they are running on a production system and there are constant streams of new transfers starting and stopping - mostly successfully.

I tried to tcpdump just this hung process by client port across all the mptcp interfaces but I couldn't see any packets at all. I'm not sure if there is a better way to trace a single process, or if mptcp has actually opened different ports for this connection?

The topology is pretty simple (#285): I just have one interface set as a backup, which is what I use to make the initial connection, and then two signal endpoints at each end that traverse two different ISPs to reach each other and carry all the bulk transfers:

client # sudo ip mptcp endpoint show
10.21.20.251 id 1 backup dev ens192 
10.21.21.251 id 2 signal dev ens224 
10.21.22.251 id 3 signal dev ens256

server # sudo ip mptcp endpoint show
10.29.20.251 id 1 backup dev ens192 
10.29.21.251 id 2 signal dev ens224 
10.29.22.251 id 3 signal dev ens256 

Where the interface pairs can only route to each other - ens192 <-> ens192, ens224 <-> ens224, ens256 <-> ens256.
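
For reference, that kind of interface pairing is usually enforced with per-source policy routing, roughly like the sketch below; the gateway addresses and table numbers are purely illustrative assumptions, not taken from this setup:

# client side, one routing table per signal endpoint (gateways/tables are made up)
ip rule add from 10.21.21.251 table 101
ip route add default via 10.21.21.1 dev ens224 table 101
ip rule add from 10.21.22.251 table 102
ip route add default via 10.21.22.1 dev ens256 table 102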

I think I'll switch to the v6.3.13 kernel just to add another data point and see how it compares to v6.2.8 and this v6.4.7.

@pabeni

pabeni commented Aug 5, 2023

Thank you very much for the added info.

client # sudo ss -neimMt | grep 59662 -A1
tcp   ESTAB 0         0                 10.21.20.251:59662         10.29.20.251:873   ino:46698881 sk:a cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t0,tb110592,f0,w0,o0,bl0,d0) ts sack bbr2 wscale:14,14 rto:268 rtt:67.053/0.045 ato:40 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:30 bytes_sent:15336 bytes_acked:15337 bytes_received:26 segs_out:91 segs_in:87 data_segs_out:83 data_segs_in:2 bbr:(bw:1578880bps,mrtt:66.882,pacing_gain:2.88672,cwnd_gain:2.88672) send 4960852bps lastsnd:215977 lastrcv:30025008 lastack:215910 pacing_rate 4733184bps delivery_rate 1578824bps delivered:84 app_limited busy:4815ms rcv_space:14480 rcv_ssthresh:64088 minrtt:66.882 snd_wnd:65536 tcp-ulp-mptcp flags:MmBbec token:0000(id:0)/61f42ba5(id:0) seq:5afdd892e3700969 sfseq:f ssnoff:1eab1667 maplen:c                                                                                                                                                                                                                                                            
--
mptcp ESTAB 0         0                 10.21.20.251:59662         10.29.20.251:873   ino:46698881 sk:a4 cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t0,tb110592,f0,w0,o0,bl0,d398) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:61f42ba5 write_seq:c944b7e3b5fc71a9 snd_una:c944b7e3b5fc71a9 rcv_nxt:5afdd89b62b47195

client # free                                                                                                                                                                                                                                                                          [Fri, Aug 04 - 18:01:18]
              total        used        free      shared  buff/cache   available
Mem:       98584272     8701728     4497244      943844    85385300    80853148
Swap:             0           0           0

And the server reports:

server # sudo ss -neimMt | grep 59662 -A1
tcp   ESTAB    0         0               10.29.20.251:873           10.21.20.251:59662 timer:(keepalive,96min,0) ino:70060345 sk:69 cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb82944,f0,w0,o0,bl0,d0) ts sack bbr2 wscale:14,14 rto:267 rtt:66.972/18.853 ato:40 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:12 bytes_sent:26 bytes_acked:26 bytes_received:15336 segs_out:86 segs_in:91 data_segs_out:2 data_segs_in:83 bbr:(bw:165880bps,mrtt:66.904,pacing_gain:2.88672,cwnd_gain:2.88672) send 1986741bps lastsnd:30281624 lastrcv:472530 lastack:472530 pacing_rate 4729408bps delivery_rate 165728bps delivered:3 app_limited busy:134ms rcv_rtt:448 rcv_space:14600 rcv_ssthresh:64076 minrtt:66.904 snd_wnd:134217728 tcp-ulp-mptcp flags:MBbec token:0000(id:0)/1b304e79(id:0) seq:c944b7e3b5fc71a5 sfseq:3be5 ssnoff:453983f0 maplen:4       
mptcp ESTAB    0         0               10.29.20.251:873           10.21.20.251:59662 ino:70060345 sk:8f cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb157289472,f204,w104877876,o0,bl0,d0) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:1b304e79 write_seq:5afdd89b68f2cca1 snd_una:5afdd89b62b47195 rcv_nxt:c944b7e3b5fc71a9  

server # free
              total        used        free      shared  buff/cache   available
Mem:       98584272     8701728     4497244      943844    85385300    80853148
Swap:             0           0           0

I suspected the issue could be related to fallback sockets, as the only/biggest functional changes in the relevant kernel versions are commit b7535cf and commit 81c1d02, but the above info should exclude that, as the mptcp socket is not a fallen-back one.

To have a better picture, it would be great to have:

  • the full output of the nstat command (on both sides)
  • the ss info for all the involved subflows. Filtering by client port is not enough because different subflows of the same mptcp socket will have different source ports. Instead you can filter by token.
    • first get the relevant mptcp token:
      token=$(ss -neimMO | sed -e 's/.*token:\([1234567890abcdef]*\).*/\1/')
    • then get the info for all the subflows and the msk socket:
      ss -neimMOt |grep $token

Could you please provide both the above info? Having the full tuple for all the involved subflows could possibly allow you to capture some relevant packets...
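
For example, once the subflow tuples are known from the ss output, a capture along these lines should see the traffic; the filter below is only a sketch that reuses the service port and server-side addresses already shown in this report:

tcpdump -i any -s 100 -w stalled.pcap \
  'port 873 and (host 10.29.20.251 or host 10.29.21.251 or host 10.29.22.251)'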

In the provided dump we can see that the mptcp socket has a suspiciously large write queue (104877876 bytes), almost completely unacked (104749836 bytes). Unfortunately the only subflow included in the dump is a backup one, so it can't really give a clue.

BTW, my responses over the next 2 weeks will be delayed due to force majeure, but others may chime in.

@pabeni

pabeni commented Aug 5, 2023

bulk transfers:

client # sudo ip mptcp endpoint show
10.21.20.251 id 1 backup dev ens192 
10.21.21.251 id 2 signal dev ens224 
10.21.22.251 id 3 signal dev ens256

[...]

This part of the configuration is possibly/likely wrong/unintended. You should use signal only on the server side, unless you additionally specify a port. With a signal endpoint you are asking the other end (in this case the server) to connect back to the specified address (on the client side), where (usually) nobody is listening for/accepting incoming connections (unless you also add a 'port', in which case the kernel will do the listen/accept part for the user-space program).

In your setup, reasonably, the server will end up creating a good deal of TCP connections towards the client, which will be left dangling in a half-open state, adding entropy to the system.

The:
10.21.20.251 id 1 backup dev ens192
endpoint is instead meaningful, as it tells the mptcp protocol to use subflows created using that source address as backup ones.

You need just that line of configuration on the client side.

@daire-byrne
Author

Yeah, so even though I have written client/server here, they are in fact both, in the sense that we initiate transfers equally from each. So they can both start client rsyncs to each other at any time.

This seemed to be the only configuration that worked (i.e. reliably uses both bulk endpoints) for that scenario?

I'm going to try to set up some reproducible looping test to trigger the hangs reliably, rather than wait for production workloads to exhibit it. Hopefully that will help with gathering the extra debug info you suggest.

The large read/write queues are likely just because we are using very large window sizes (128MB) as these servers are separated by a long WAN distance (150ms+). The TCP memory of the systems has been tweaked to run this setup.
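
For reference, windows in that range normally imply sysctl limits along these lines; the values below are illustrative assumptions sized for ~128MB buffers, not the exact settings used on these hosts:

# allow per-socket buffers up to ~128MB (min/default/max for the tcp_* entries)
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 131072 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"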

@matttbe
Member

matttbe commented Aug 7, 2023

The:
10.21.20.251 id 1 backup dev ens192
endpoint is instead meaningful, as it tell the mptcp protocol to use subflows created using that source address as backup ones.

But still, I think you need to add the subflow flag as well on the client side and signal on the server side (except if these endpoints will only ever be used by the initial subflow).
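
A sketch of the asymmetric layout being suggested, reusing the addresses and interface names already shown in this report; only the bulk endpoints are listed (the backup endpoint stays as it is), and for a symmetric setup both roles would be needed on each host:

# on the host that initiates the connection: subflow endpoints
ip mptcp endpoint add 10.21.21.251 dev ens224 id 2 subflow
ip mptcp endpoint add 10.21.22.251 dev ens256 id 3 subflow
# on the host that accepts the connection: signal endpoints
ip mptcp endpoint add 10.29.21.251 dev ens224 id 2 signal
ip mptcp endpoint add 10.29.22.251 dev ens256 id 3 signal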

@daire-byrne
Author

So it's not possible to have a "static" configuration between two symmetric hosts where either might want to initiate transfers between them at any time? We would have to keep switching the endpoints between signal and subflow depending on which host was going to be server and which was client for that particular transfer?

Even then, I expect there would be race conditions when multiple transfers can happen simultaneously in either direction.

I should say that we have been using this configuration (both hosts using "signal" endpoints) with success for a while now and we achieve our desired behaviour - transfers use both endpoints (backed by different ISPs) regardless of the direction and which host initiates the connection. And if one path goes down (ISP), the mptcp continues to use the remaining path as expected. We only use the "backup" interface to initiate the connections as it is a stable route that never goes down (unlike the "bulk" signal/subflow endpoints which can).

Certainly, our endpoints are static in the sense that we always expect to use the same two, no more and no less. And they can only route to each other in pairs (so we can't, and don't want to, do "fullmesh").

In reality, it's even more complicated in that we have many "transfer" hosts in many different WAN locations that can all initiate transfers from each other.

My attempts to reproduce the issues with v6.4.7 have so far failed in a test bench setup so I fear the full production loads are required to trigger it - which makes everything harder to debug and isolate.

@daire-byrne
Author

Okay, I think I caught another hanging example on our production servers using the v6.3.13 kernel this time. I am going to continue to use "client" to denote the host where the connection was initiated from (rsync).

This time I used the token identifier to list all the relevant entries in the output of ss:

client # sudo ss -neimMt | grep -B1 2c72e3d 
tcp   ESTAB    0        0                 10.29.22.251:44495         10.21.22.251:873   ino:4407013 sk:403b cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t0,tb14155776,f0,w0,o0,bl0,d302) ts sack bbr2 wscale:14,14 rto:278 rtt:75.953/37.976 ato:80 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:2048 bytes_acked:1 bytes_received:6039931593 segs_out:2152784 segs_in:4555228 data_segs_in:4555099 bbr:(bw:8bps,mrtt:75.953) send 298977315bps lastsnd:14695556 lastrcv:14653060 lastack:14653060 pacing_rate 854431680bps delivered:1 app_limited reordering:300 rcv_rtt:75.157 rcv_space:14480 rcv_ssthresh:134217728 minrtt:75.953 rcv_ooopack:318520 snd_wnd:65536 tcp-ulp-mptcp flags:Jjec token:e2a38ed6(id:3)/2c72e3d(id:3) seq:4ce61d747df856c8 sfseq:680208cc ssnoff:c93e559 maplen:1fe                                                                                                                                                                                                                                                                                                        
--
tcp   ESTAB    0        0                 10.29.21.251:48147         10.21.21.251:873   ino:4407013 sk:501a cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t0,tb14155776,f0,w0,o0,bl0,d433) ts sack bbr2 wscale:14,14 rto:270 rtt:67.403/33.701 ato:80 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:2048 bytes_acked:1 bytes_received:4261890377 segs_out:1611818 segs_in:3211879 data_segs_in:3211750 bbr:(bw:8bps,mrtt:67.403) send 336902274bps lastsnd:14695631 lastrcv:14653447 lastack:14653447 pacing_rate 962815336bps delivered:1 app_limited reordering:227 rcv_rtt:67.019 rcv_space:14480 rcv_ssthresh:134217728 minrtt:67.403 rcv_ooopack:39910 snd_wnd:65536 tcp-ulp-mptcp flags:Jjec token:e2a38ed6(id:2)/2c72e3d(id:2) seq:4ce61d747df83100 sfseq:fe072382 ssnoff:85cfe8c4 maplen:25c8                                                                                                                                                                                                                                                                                                       
--
tcp   ESTAB    0        0                 10.29.20.251:53752         10.21.20.251:873   ino:4407013 sk:506b cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t0,tb69120,f0,w0,o0,bl0,d77) ts sack bbr2 wscale:14,14 rto:269 rtt:68.583/2.945 ato:40 mss:1386 pmtu:1500 rcvmss:1358 advmss:1448 cwnd:20 bytes_sent:1462 bytes_acked:1463 bytes_received:873527 segs_out:768 segs_in:770 data_segs_out:38 data_segs_in:657 bbr:(bw:165880bps,mrtt:66.935,pacing_gain:2.88672,cwnd_gain:2.88672) send 3233454bps lastsnd:289163 lastrcv:10858911 lastack:289096 pacing_rate 4718080bps delivery_rate 165648bps delivered:39 app_limited busy:2856ms rcv_rtt:69.725 rcv_space:14480 rcv_ssthresh:1951020 minrtt:66.935 rcv_ooopack:9 snd_wnd:49152 tcp-ulp-mptcp flags:MmBbec token:0000(id:0)/2c72e3d(id:0) seq:4ce61d747e012e30 sfseq:cd460 ssnoff:e7245501 maplen:7fd8                                                                                                                                                                                                                                            
--
mptcp ESTAB    0        0                 10.29.20.251:53752         10.21.20.251:873   ino:4407013 sk:10 cgroup:unreachable:1 <->
	 skmem:(r0,rb536870912,t0,tb16384,f0,w0,o0,bl0,d0) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:2c72e3d write_seq:48d7a82ce774d01b snd_una:48d7a82ce774d01b rcv_nxt:4ce61d747e03ad68
server # sudo ss -neimMt | grep -B1 e2a38ed6  
tcp   ESTAB    0        154240            10.21.21.251:873           10.29.21.251:48147 timer:(persist,41sec,0) ino:2795602 sk:6 cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb243122688,f1920,w157824,o0,bl0,d0) ts sack bbr2 wscale:14,14 rto:273 backoff:15 rtt:72.277/10.499 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:24159 ssthresh:11714 bytes_sent:4265799595 bytes_retrans:3909218 bytes_acked:4261890377 segs_out:3214461 segs_in:1611570 data_segs_out:3214333 bbr:(bw:1942571832bps,mrtt:66.838,pacing_gain:1.25,cwnd_gain:2) send 3706227320bps lastsnd:14729111 lastrcv:14771227 lastack:78810 pacing_rate 2403932648bps delivery_rate 1856292424bps delivered:3211665 app_limited busy:14771068ms rwnd_limited:14729002ms(99.7%) sndbuf_limited:308ms(0.0%) retrans:0/2891 dsack_dups:222 reordering:129 rcv_space:14600 rcv_ssthresh:64076 notsent:154240 minrtt:66 snd_wnd:4294967295 tcp-ulp-mptcp flags:Jec token:0000(id:2)/e2a38ed6(id:2) seq:48d7a82ce774ca78 sfseq:0 ssnoff:5dd0d394 maplen:0                                                                                     
--
tcp   ESTAB    0        0                 10.21.20.251:873           10.29.20.251:53752 timer:(keepalive,113min,0) ino:2795602 sk:3d cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb1893888,f0,w0,o0,bl0,d0) ts sack bbr2 wscale:14,14 rto:267 rtt:66.965/0.083 ato:40 mss:1386 pmtu:1500 rcvmss:1009 advmss:1448 cwnd:274 bytes_sent:874885 bytes_retrans:1358 bytes_acked:873527 bytes_received:1462 segs_out:770 segs_in:768 data_segs_out:658 data_segs_in:38 bbr:(bw:15714128bps,mrtt:66.888,pacing_gain:2.88672,cwnd_gain:2.88672) send 45368655bps lastsnd:10934571 lastrcv:364757 lastack:364757 pacing_rate 44908648bps delivery_rate 15713664bps delivered:658 app_limited busy:2543526ms rwnd_limited:2541932ms(99.9%) sndbuf_limited:168ms(0.0%) retrans:0/1 rcv_rtt:67 rcv_space:14600 rcv_ssthresh:64076 minrtt:66.888 snd_wnd:134217728 tcp-ulp-mptcp flags:MBbec token:0000(id:0)/e2a38ed6(id:0) seq:48d7a82ce774d017 sfseq:5b3 ssnoff:de9360b3 maplen:4                                                                                                                                       
--
tcp   ESTAB    0        131186            10.21.22.251:873           10.29.22.251:44495 timer:(persist,41sec,0) ino:2795602 sk:1053 cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb103292928,f2702,w136562,o0,bl0,d0) ts sack bbr2 wscale:14,14 rto:280 backoff:15 rtt:79.749/8.882 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:14944 ssthresh:14917 bytes_sent:6043267815 bytes_retrans:3336222 bytes_acked:6039931593 segs_out:4557537 segs_in:2144864 data_segs_out:4557409 bbr:(bw:2207244944bps,mrtt:74.915,pacing_gain:1.25,cwnd_gain:2) send 2077757364bps lastsnd:14728731 lastrcv:14771152 lastack:78806 pacing_rate 2731465616bps delivery_rate 2202249200bps delivered:4555100 app_limited busy:14770324ms rwnd_limited:14728372ms(99.7%) sndbuf_limited:379ms(0.0%) retrans:0/2485 dsack_dups:175 reordering:300 rcv_space:14600 rcv_ssthresh:64076 notsent:131186 minrtt:74.915 snd_wnd:4294967295 tcp-ulp-mptcp flags:Jec token:0000(id:3)/e2a38ed6(id:3) seq:48d7a82ce774cadb sfseq:0 ssnoff:3c20a83a maplen:0                                                                               
--
mptcp ESTAB    0        0                 10.21.20.251:873           10.29.20.251:53752 ino:2795602 sk:10ae cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb243122688,f2160,w162088848,o0,bl0,d0) subflows:2 add_addr_signal:2 add_addr_accepted:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:e2a38ed6 write_seq:4ce61d7487a9f000 snd_una:4ce61d747e03ad68 rcv_nxt:48d7a82ce774d01b                                                                                                                  

The output of nstat is constantly changing depending on the other transfers happening, so these just represent a brief snapshot:

client  # nstat
#kernel
IpInReceives                    1306147            0.0
IpInDelivers                    1306146            0.0
IpOutRequests                   551760             0.0
IpFragOKs                       1                  0.0
IpFragCreates                   2                  0.0
IcmpInMsgs                      46                 0.0
IcmpInEchos                     35                 0.0
IcmpInEchoReps                  11                 0.0
IcmpOutMsgs                     44                 0.0
IcmpOutEchos                    9                  0.0
IcmpOutEchoReps                 35                 0.0
IcmpMsgInType0                  11                 0.0
IcmpMsgInType8                  35                 0.0
IcmpMsgOutType0                 35                 0.0
IcmpMsgOutType8                 9                  0.0
TcpActiveOpens                  22                 0.0
TcpEstabResets                  4                  0.0
TcpInSegs                       1306146            0.0
TcpOutSegs                      1792043            0.0
TcpRetransSegs                  2172               0.0
TcpOutRsts                      7                  0.0
UdpOutDatagrams                 1783               0.0
TcpExtTW                        16                 0.0
TcpExtDelayedACKs               120                0.0
TcpExtDelayedACKLocked          3                  0.0
TcpExtDelayedACKLost            1061               0.0
TcpExtTCPHPHits                 23975              0.0
TcpExtTCPPureAcks               209818             0.0
TcpExtTCPHPAcks                 12086              0.0
TcpExtTCPSackRecovery           21                 0.0
TcpExtTCPLostRetransmit         207                0.0
TcpExtTCPFastRetrans            2131               0.0
TcpExtTCPLossProbes             135                0.0
TcpExtTCPBacklogCoalesce        60628              0.0
TcpExtTCPDSACKOldSent           1061               0.0
TcpExtTCPDSACKRecv              23                 0.0
TcpExtTCPAbortOnData            4                  0.0
TcpExtTCPDSACKIgnoredNoUndo     13                 0.0
TcpExtTCPSackShifted            5668               0.0
TcpExtTCPSackMerged             3520               0.0
TcpExtTCPSackShiftFallback      63                 0.0
TcpExtTCPRetransFail            18                 0.0
TcpExtTCPRcvCoalesce            171588             0.0
TcpExtTCPOFOQueue               131472             0.0
TcpExtTCPAutoCorking            5981               0.0
TcpExtTCPOrigDataSent           1314598            0.0
TcpExtTCPKeepAlive              8                  0.0
TcpExtTCPDelivered              1319145            0.0
TcpExtTCPAckCompressed          122904             0.0
TcpExtTCPDSACKRecvSegs          23                 0.0
IpExtInOctets                   1920903195         0.0
IpExtOutOctets                  1885051786         0.0
IpExtInNoECTPkts                1550941            0.0
MPTcpExtMPCapableSYNTX          1                  0.0
MPTcpExtMPCapableSYNACKRX       1                  0.0
MPTcpExtMPTCPRetrans            5                  0.0
MPTcpExtMPJoinSynAckRx          3                  0.0
MPTcpExtOFOQueueTail            632809             0.0
MPTcpExtOFOQueue                638090             0.0
MPTcpExtOFOMerge                595788             0.0
MPTcpExtDuplicateData           7                  0.0
MPTcpExtAddAddr                 2                  0.0
MPTcpExtEchoAdd                 3                  0.0
MPTcpExtMPPrioTx                1                  0.0
MPTcpExtMPPrioRx                1                  0.0
MPTcpExtSndWndShared            6                  0.0
MPTcpExtRcvWndShared            310980             0.0
MPTcpExtRcvWndConflictUpdate    1                  0.0
MPTcpExtRcvWndConflict          9                  0.0

server # sudo nstat
#kernel
IpInReceives                    649817             0.0
IpInDelivers                    649818             0.0
IpOutRequests                   377233             0.0
IcmpInMsgs                      17                 0.0
IcmpInEchoReps                  17                 0.0
IcmpOutMsgs                     18                 0.0
IcmpOutEchos                    18                 0.0
IcmpMsgInType0                  17                 0.0
IcmpMsgOutType8                 18                 0.0
TcpActiveOpens                  12                 0.0
TcpEstabResets                  4                  0.0
TcpInSegs                       649778             0.0
TcpOutSegs                      695418             0.0
TcpRetransSegs                  1                  0.0
TcpOutRsts                      8                  0.0
UdpInDatagrams                  6                  0.0
UdpOutDatagrams                 32                 0.0
UdpIgnoredMulti                 1                  0.0
TcpExtDelayedACKs               60                 0.0
TcpExtDelayedACKLost            244                0.0
TcpExtTCPHPHits                 4638               0.0
TcpExtTCPPureAcks               1781               0.0
TcpExtTCPHPAcks                 7832               0.0
TcpExtTCPLossProbes             1                  0.0
TcpExtTCPBacklogCoalesce        2301               0.0
TcpExtTCPDSACKOldSent           244                0.0
TcpExtTCPDSACKRecv              1                  0.0
TcpExtTCPAbortOnData            4                  0.0
TcpExtTCPDSACKIgnoredNoUndo     1                  0.0
TcpExtTCPRcvCoalesce            48099              0.0
TcpExtTCPOFOQueue               51673              0.0
TcpExtTCPAutoCorking            773                0.0
TcpExtTCPOrigDataSent           332654             0.0
TcpExtTCPDelivered              332667             0.0
TcpExtTCPAckCompressed          20869              0.0
TcpExtTCPDSACKRecvSegs          1                  0.0
IpExtInBcastPkts                2                  0.0
IpExtInOctets                   903823440          0.0
IpExtOutOctets                  499497617          0.0
IpExtInBcastOctets              713                0.0
IpExtInNoECTPkts                649918             0.0
MPTcpExtMPCapableSYNTX          2                  0.0
MPTcpExtMPCapableSYNACKRX       2                  0.0
MPTcpExtMPJoinSynAckRx          4                  0.0
MPTcpExtOFOQueueTail            372102             0.0
MPTcpExtOFOQueue                374674             0.0
MPTcpExtOFOMerge                344948             0.0
MPTcpExtDuplicateData           4                  0.0
MPTcpExtAddAddr                 4                  0.0
MPTcpExtEchoAdd                 4                  0.0
MPTcpExtMPPrioTx                2                  0.0
MPTcpExtMPPrioRx                2                  0.0
MPTcpExtSndWndShared            7                  0.0
MPTcpExtRcvWndShared            179542             0.0
MPTcpExtRcvWndConflict          41                 0.0

The rsync command never seems to time out and I can see the long keepalive/persist reported on the server side connections.

With all the reported client ports on each interface/endpoint above, I tried to tcpdump and capture on both client and server but saw nothing.

I'm not sure if it's relevant, but I can see the temporary rsync file that was being copied at the time of the hang on disk - obviously it is also "stuck".

Again, if I kill the stuck rsync and re-run it, it goes through fine without hanging.

It could very well be that our transfers have hung like this before, but they would at least always time out (rsync --timeout=60) and be automatically retried. But now (v6.3+ ?), it seems the connections are in some state such that rsync never times out or closes the connection.

@matttbe matttbe added the bug label Aug 9, 2023
@daire-byrne
Author

Okay, this example of a hanging transfer is somewhat different to my previous example and may have more in common with #295. I also can't say with complete confidence that this wasn't happening in v6.2.8 but I would have thought I'd have noticed it before - the thing is, I am only now looking very closely at every hung rsync process.

So for this one the remote (server) rsyncd process dies for some reason (likely a momentary storage issue):

2023/08/10 23:33:59 [3311591] [sender] io timeout after 901 seconds -- exiting
2023/08/10 23:33:59 [3311591] sent 7346976492 bytes  received 540439 bytes  total size 12574925000
2023/08/10 23:33:59 [3311591] rsync error: timeout in data send/receive (code 30) at io.c(197) [sender=3.2.3]

But the corresponding client rsync does not seem to get the hint and hangs around indefinitely:

client # sudo ss -neimMtp | grep aeab9cf3 -B1
mptcp ESTAB 190592 0                 10.29.20.251:43430         10.25.20.251:873   users:(("rsync",pid=1750778,fd=4),("rsync",pid=1750753,fd=4)) ino:74059654 sk:7247 cgroup:unreachable:1 <->
	 skmem:(r190592,rb536870912,t0,tb14155776,f3464,w632,o0,bl0,d0) add_addr_signal:2 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 remote_key token:aeab9cf3 write_seq:e9ac43dae6a3d01 snd_una:e9ac43dae6a3b09 rcv_nxt:5976acb55351092b 

So it looks like the two expected TCP connections have gone but the mptcp connection is still open? And this in turn is holding open the rsync process such that it never quits or times out.

Like I said, it's different to the hanging of open connections in my last example, but this one is no less disruptive.

@matttbe
Member

matttbe commented Aug 11, 2023

Thank you for the new tests!

Just to avoid any misunderstanding: this bug looks important and a fix is needed. Any new details are important, it's just a shame there is no simple and short way to reproduce it (using our export branch) but that's how it is :)
There might just be a bit of delay in our replies simply because of the holiday period impacting us directly or indirectly. Sorry for that, I hope you are patient :)

@daire-byrne
Author

Not a problem. I appreciate the work that goes into mptcp and I'm happy to help find and squash bugs - helping in the only way I know how! :)

I have gone back to using the v6.2.8 kernel for now just to double check that there really were no "hanging" rsync transfers using that version.

And in case it wasn't clear, the previous two cases I reported were with v6.3 so the title of this bug (v6.4) is probably misleading now.

@daire-byrne
Author

Okay, so the "second" issue/example I gave where a client rsync never dies despite the server hanging up, and the mptcp connection shows in "ss" despite there being no corresponding active tcp subflows - is also present in v6.2.8.

So it looks like that issue has always been there; I just never noticed it before (it's low frequency) until looking more closely.

I have yet to see anything like the first example (keepalive/persist reported on tcp connections) with v6.2.8 so maybe that's the main "regression" or change in behaviour here.

I'll see if I can start bisecting the kernel to narrow it down, but it's likely going to be slow going.

@daire-byrne
Author

For the long-standing issue (present even on v6.2.8), where a client rsync will not quit and the mptcp connection still shows up in ss despite the TCP connections being gone - I have another example that seems to trigger it quite frequently.

If the rsync scans a few files to build the file list but then doesn't actually transfer anything, making the connection very short, we tend to see some hanging rsync client commands. Maybe it's related to opening/closing lots of short connections in quick succession?

For example the server log might show:

2023/08/17 18:37:33 [1969621] rsync on root//blah/*.exr from transfer1.dneg.com (10.29.20.251)
2023/08/17 18:37:34 [1969621] building file list
2023/08/17 18:37:35 [1969621] sent 118 bytes  received 25 bytes  total size 136542

But on the client the corresponding rsync processes might get stuck on connect:

# sudo strace -f -p 3884616
strace: Process 3884616 attached
connect(3, {sa_family=AF_INET, sin_port=htons(873), sin_addr=inet_addr("10.27.20.251")}, 16

ss -neimMtp | grep 3884616 -A1
mptcp SYN-SENT 0      0                 10.29.20.251:36888         10.27.20.251:873   users:(("rsync",pid=3884616,fd=3)) ino:55530463 sk:8c cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb16384,f0,w0,o0,bl0,d0) add_addr_signal:1 subflows_max:4 add_addr_signal_max:2 add_addr_accepted_max:4 token:520fd230 

I am still trying to work through kernel bisects and reproduce the other (bigger) issue. It's proving hard to reproduce.

@pabeni

pabeni commented Aug 21, 2023

Regarding the 'first' issue:

server # sudo ss -neimMt | grep -B1 e2a38ed6  
tcp   ESTAB    0        154240            10.21.21.251:873           10.29.21.251:48147 timer:(persist,41sec,0) ino:2795602 sk:6 cgroup:unreachable:1 <->
	 skmem:(r0,rb262144,t0,tb243122688,f1920,w157824,o0,bl0,d0) ts sack bbr2 wscale:14,14 rto:273 backoff:15 rtt:72.277/10.499 mss:1386 pmtu:1500 rcvmss:536 advmss:1448 cwnd:24159 ssthresh:11714 bytes_sent:4265799595 bytes_retrans:3909218 bytes_acked:4261890377 segs_out:3214461 segs_in:1611570 data_segs_out:3214333 bbr:(bw:1942571832bps,mrtt:66.838,pacing_gain:1.25,cwnd_gain:2) send 3706227320bps lastsnd:14729111 lastrcv:14771227 lastack:78810 pacing_rate 2403932648bps delivery_rate 1856292424bps delivered:3211665 app_limited busy:14771068ms rwnd_limited:14729002ms(99.7%) sndbuf_limited:308ms(0.0%) retrans:0/2891 dsack_dups:222 reordering:129 rcv_space:14600 rcv_ssthresh:64076 notsent:154240 minrtt:66 snd_wnd:4294967295 tcp-ulp-mptcp flags:Jec token:0000(id:2)/e2a38ed6(id:2) seq:48d7a82ce774ca78 sfseq:0 ssnoff:5dd0d394 maplen:0

here the rtx queue is not empty (bytes acked < bytes_sent) -> no mptcp-level rtx.

the snd_wnd has an unexpected value: 4294967295 == 0xffffffff, which means that tcp-level rtx can't happen:

https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_output.c#L3176

(avail_wnd will be negative)

A zero-window probe could happen, but the window is not zero ;)

Why is snd_wnd so large?

An unusual mptcp-level MIB is non-zero:

MPTcpExtRcvWndConflictUpdate    1                  0.0

that means:

https://elixir.bootlin.com/linux/latest/source/net/mptcp/options.c#L1257

The latter code looks buggy: it jumps to the 'raise_win' label, which will use the rcv_wnd_old value to update the tcp-level window, but in that code path that variable is outdated; the current value of the mptcp-level receive window is carried by the rcv_wnd variable.

TL;DR: @daire-byrne: could you please try the following patch? That should at least address the issue described above.

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index c254accb14de..295ed37a489c 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -1269,12 +1269,12 @@ static void mptcp_set_rwin(struct tcp_sock *tp, struct tcphdr *th)
 
 			if (rcv_wnd == rcv_wnd_old)
 				break;
-			if (before64(rcv_wnd_new, rcv_wnd)) {
+			rcv_wnd_old = rcv_wnd;
+			if (before64(rcv_wnd_new, rcv_wnd_old)) {
 				MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_RCVWNDCONFLICTUPDATE);
 				goto raise_win;
 			}
 			MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_RCVWNDCONFLICT);
-			rcv_wnd_old = rcv_wnd;
 		}
 		return;
 	}

Side Notes:

  • the patch could be whitespace-damaged in the copy & paste process; if so, you could try applying it manually (see the sketch below).
  • the buggy code has been there since 5.19; no idea why it triggers only in more recent kernels - assuming the above discussion is correct. Possibly some unrelated timing change (process scheduler, NIC driver, fs, etc...)
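
One way to cope with a whitespace-damaged copy, assuming the diff above has been saved to a file (the filename is just a placeholder):

# inside the kernel source tree
git apply --ignore-whitespace mptcp-rwin-fix.diff
# or, with plain patch(1):
patch -p1 -l < mptcp-rwin-fix.diff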

@pabeni

pabeni commented Aug 22, 2023

For the long standing issue (v6.2.8), where a client rsync will not quit, the mptcp connection still reports in ss despite the TCP connections missing - I have another example that seems to trigger it quite frequently.

I think this could be caused by:

https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_input.c#L4345

On an incoming TCP reset (possibly carrying an MPTCP fast-close) the TCP stack closes the TCP socket (via tcp_done()) before propagating the socket error (via sk_error_report()). In turn, the MPTCP code could remove the tcp subflow from the subflow list (in response to tcp_done()) before trying to propagate the socket error in response to sk_error_report().

As the mptcp-level socket is closed as an effect of the error propagation, and the latter requires the errored subflow to still be present in the subflow list at __mptcp_error_report() time, all of the above could lead to the reported issue.

@daire-byrne: I just shared a couple of patches on the ML trying to address the above (that is, only the "2nd" issue tracked here).

https://patchwork.kernel.org/project/mptcp/list/?series=778115

Could you please give them a spin in your testbed? (they are on top of the current export branch, but hopefully should apply cleanly to the current Linus tree)
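
A sketch of one way to test them, assuming the two patches are saved locally as an mbox file (the branch name, mbox filename and the v6.4.11 base are placeholders):

git checkout -b mptcp-issue427-test v6.4.11
git am mptcp-error-report-fixes.mbox
# then build, install and boot the patched kernel as usual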

Side note: it would probably be better to track the 2 issues described here separately.

@daire-byrne
Author

Thank you for these patches!

I have applied both cleanly to v6.4.11 and all our transfer hosts globally (9 machines) are now running with it.

Because these hosts run thousands of rsync transfers per day but only a handful get stuck, it may take me a few days to sift through any remaining hangs. I have been waiting for them to build up naturally (easier to spot) and then debugging each one every day.

Apologies that this ticket ended up mixing issues, but it was only when there was a noticeable increase in hanging rsync commands with v6.4 that I looked more closely and realised there was more than one issue at play, and that one had always been there (but at very low frequency).

On the subject of why the rcv_wnd bug might have become more noticeable on v6.3/v6.4 - I'm not entirely sure. I did however notice that the ratio of softirq CPU to packet count changed (for the better - less CPU) with v6.4 with the vmxnet3 driver we are using, which might suggest some big change in related code?

I have also just double checked any patches (other than bbr v2/3) that I was carrying and it looks like I forgot to remove this one in my v6.4 testing (slated for inclusion in v6.5):

https://github.com/cloudflare/linux/blob/master/patches/0020-Add-a-sysctl-to-allow-TCP-window-shrinking-in-order-.patch

I did not have that in my v6.3 kernel testing though and I am unsure if that could have interacted with the rcv_wnd code you patched? Apologies, I should have realised I had applied that patch for testing when opening this ticket. I am not running with it anymore (but will test adding it again at a later point).

Anyway, I will report back in a few days - thanks again!

@matttbe
Member

matttbe commented Aug 22, 2023

Thank you for all the tests.

I have applied both cleanly to v6.4.11 and all our transfer hosts globally (9 machines) are now running with it.

Just to be sure, do you mean you ran the tests with the two series or only the two patches of the series Paolo sent on the ML (for the 2nd issue)?

If I'm not mistaken, the first patch modifying mptcp_set_rwin() is for the first issue and it is still valid (but it needs to be validated).

(Do not hesitate to create a dedicated issue to avoid confusion ;-) )

@pabeni

pabeni commented Aug 22, 2023

On the subject of why the rcv_wnd bug might have become more noticeable on v6.3/v6.4 - I'm not entirely sure. I did however notice that the ratio of softirq CPU to packet count changed (for the better - less cpu) with v6.4 with the vmxnet3 driver we are using which might suggest some big change in some related code?

git log suggests this change:

commit 3bced313b9a5a237c347e0f079c8c2fe4b3935aa
Author: Ronak Doshi <doshir@vmware.com>
Date:   Thu Mar 23 13:07:21 2023 -0700

    vmxnet3: use gro callback when UPT is enabled

which entered 6.3 and could have changed significantly the timing of ingress TCP packet - before that change GRO was basically disabled on some (most???) scenarios.

@daire-byrne
Author

Yes, sorry, I meant I have applied the patches for both "issue 1" (rcv_wnd) and "issue 2" (close transition) at the same time.

I *think* I am already seeing a positive impact on the long-standing close-transition one - but I'll know more in a day or two.

The rcv_wnd/mptcp_set_rwin one seemed harder to come across, but once it happened on a server it felt like more then happened more quickly after that.

But again, in the early days of debugging this, I was not collecting as much useful debug info as I am now.

@pabeni

pabeni commented Aug 22, 2023

Should some rsync connections still get stuck even with the patched kernel, please try to collect the same info already mentioned in the past (ss output, nstat output, pcap capture).

@matttbe
Member

matttbe commented Aug 23, 2023

FYI, I just created a new ticket (#429) for the second issue (MPTCP connections not transitioning to the Close state).

For the initial issue, the patch proposed by @pabeni is visible above.

@matttbe matttbe changed the title from "possible mptcp regression in v6.4 kernel?" to "connections hanging after a long transfer" Aug 23, 2023
@daire-byrne
Author

I have not seen a connection hang due to this bug since applying the patch. However, this one took a while to trigger, and then, seemingly, once it happened on a host it was more likely to happen again.

I will report back in a week or so.

matttbe pushed a commit that referenced this issue Aug 29, 2023
In case multiple subflows race to update the mptcp-level receive
window, the subflow losing the race should use the window value
provided by the "winning" subflow to update it's own tcp-level
rcv_wnd.

To such goal, the current code bogusly uses the mptcp-level rcv_wnd
value as observed before the update attempt. On unlucky circumstances
that may lead to TCP-level window shrinkage, and stall the other end.

Address the issue feeding to the rcv wnd update the correct value.

Fixes: f3589be ("mptcp: never shrink offered window")
Closes: #427
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
jenkins-tessares pushed a commit that referenced this issue Sep 1, 2023
In case multiple subflows race to update the mptcp-level receive
window, the subflow losing the race should use the window value
provided by the "winning" subflow to update it's own tcp-level
rcv_wnd.

To such goal, the current code bogusly uses the mptcp-level rcv_wnd
value as observed before the update attempt. On unlucky circumstances
that may lead to TCP-level window shrinkage, and stall the other end.

Address the issue feeding to the rcv wnd update the correct value.

Fixes: f3589be ("mptcp: never shrink offered window")
Closes: #427
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
jenkins-tessares pushed a commit that referenced this issue Sep 5, 2023
In case multiple subflows race to update the mptcp-level receive
window, the subflow losing the race should use the window value
provided by the "winning" subflow to update it's own tcp-level
rcv_wnd.

To such goal, the current code bogusly uses the mptcp-level rcv_wnd
value as observed before the update attempt. On unlucky circumstances
that may lead to TCP-level window shrinkage, and stall the other end.

Address the issue feeding to the rcv wnd update the correct value.

Fixes: f3589be ("mptcp: never shrink offered window")
Closes: #427
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
jenkins-tessares pushed a commit that referenced this issue Sep 5, 2023
In case multiple subflows race to update the mptcp-level receive
window, the subflow losing the race should use the window value
provided by the "winning" subflow to update it's own tcp-level
rcv_wnd.

To such goal, the current code bogusly uses the mptcp-level rcv_wnd
value as observed before the update attempt. On unlucky circumstances
that may lead to TCP-level window shrinkage, and stall the other end.

Address the issue feeding to the rcv wnd update the correct value.

Fixes: f3589be ("mptcp: never shrink offered window")
Closes: #427
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
mj22226 pushed a commit to mj22226/linux that referenced this issue Oct 4, 2023
commit 6bec041 upstream.

In case multiple subflows race to update the mptcp-level receive
window, the subflow losing the race should use the window value
provided by the "winning" subflow to update it's own tcp-level
rcv_wnd.

To such goal, the current code bogusly uses the mptcp-level rcv_wnd
value as observed before the update attempt. On unlucky circumstances
that may lead to TCP-level window shrinkage, and stall the other end.

Address the issue feeding to the rcv wnd update the correct value.

Fixes: f3589be ("mptcp: never shrink offered window")
Cc: stable@vger.kernel.org
Closes: multipath-tcp/mptcp_net-next#427
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>