mptcp vs tcp performance over long fat networks #437
And the output of nstat before/after the mptcp iperf test (in case it's of interest):
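Roughly how such a before/after delta can be captured; the server address is a placeholder and the mptcpize wrapper (from mptcpd) is just one way of forcing the test over MPTCP:

```sh
# Update nstat's history without printing, so the next call shows only the
# counters that changed during the test.
nstat -n

# Run the MPTCP iperf test (mptcpize forces plain TCP sockets to MPTCP).
mptcpize run iperf3 -c 192.0.2.10 -t 30

# Print the TCP/MPTCP counters that incremented while the test ran.
nstat | grep -i tcp
```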
Thanks for the detailed report!
As a wild guess, I think there are 2 main root causes:
One dirty hack to validate the above could possibly be setting both tcp_wmem and tcp_rmem to "8192 536870912 536870912". WARNING!!! only doable if the hosts can afford that much TCP memory. The above actually triggers some follow-up questions:
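As a concrete sketch of that hack (the 512MB values are the ones discussed below; whether they are affordable depends entirely on how many sockets the hosts carry):

```sh
# min / default / max in bytes; setting default == max means every socket
# starts at the full 512MB instead of relying on autotuning to grow into it.
sysctl -w net.ipv4.tcp_wmem="8192 536870912 536870912"
sysctl -w net.ipv4.tcp_rmem="8192 536870912 536870912"
```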
It looks like we have at least a functional problem there: mptcp socket wmem is limited to max(subflows wmem), but in this scenario (and probably in any scenario where the tcp subflow is potentially sndbuf limited) we actually need mptcp wmem to be sum(subflows wmem).
addendum: experimenting in a simulated env (multiple veths + netem) with a similar configuration, it looks like using the BBR congestion control badly affects MPTCP performance, e.g., with a single subflow:
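For reference, a rough sketch of that kind of simulated setup with a single veth pair between two namespaces; the names, addresses and the 140ms per-direction netem delay are illustrative:

```sh
ip netns add lfn-a
ip netns add lfn-b
ip link add vethA type veth peer name vethB
ip link set vethA netns lfn-a
ip link set vethB netns lfn-b
ip -n lfn-a addr add 10.0.0.1/24 dev vethA
ip -n lfn-b addr add 10.0.0.2/24 dev vethB
ip -n lfn-a link set vethA up
ip -n lfn-b link set vethB up
# Add the delay on both ends to emulate a long fat path (~280ms RTT); bump the
# netem queue limit so it does not itself become the bottleneck at high BDP.
ip netns exec lfn-a tc qdisc add dev vethA root netem delay 140ms limit 50000
ip netns exec lfn-b tc qdisc add dev vethB root netem delay 140ms limit 50000
```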
Thanks for considering it @pabeni. I tried with "8192 536870912 536870912" but there was no noticeable improvement in performance for the mptcp case. When I tried with "8192 536870912 1073741824" (256M max window), there was a slight increase from 1.6gbit -> 2.25gbit for mptcp. But then the ens256 route using just TCP also increased from ~3.4gbit -> 5.6gbit, so mptcp is still falling behind relatively speaking.

I have not observed the BBR degradation you are seeing, but then (confusingly) I am also using BBRv3, which is scheduled to replace the current BBR implementation in the kernel (patches from: https://gitlab.com/xanmod/linux-patches/-/tree/master/linux-6.4.y-xanmod/net/tcp/bbr3). In our environment, using "real world" multi-gigabit internet connections, cubic congestion control just can't get above around 500mbit for a single stream of TCP over 280ms. With BBRv3, we can sustain ~6gbit with the same 256M window.

I think we are already fine with throwing lots of TCP memory at the problem of long fat network transfers, so that is not really a limitation for us. A back-of-the-envelope calculation suggests that the absolute maximum number of mptcp subflows we would have open at once would be less than 200. Add another 100 for various things like NFS connections (nconnect=8) and we probably max out at 300 connections per server (worst case), which would equate to around 72G of TCP memory if we were using the default net.ipv4.tcp_adv_win_scale. We already use 96G servers and set the max TCP mem to 64G, and I have never seen the memory used go above a few gigabytes (according to /proc/net/sockstat) for any period of time. But this is also likely because we are not able to fully open the windows with the mptcp connections atm.
In my veth-based testbed, the above sysctl is the root cause of the very bad performance I'm observing for MPTCP with the BBR cong algo. Removing that setting also removed the huge b/w degradation I was observing after 15s, and also makes the initial ramp-up to full b/w faster. Why do you have that setting in your setup? Could you please experiment with that setting off/default tcp_notsent_lowat (4294967295)? Note: the old comment wrt mptcp wmem autotuning is still relevant; addressing that will require some patch and more testing.
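For completeness, checking and restoring that sysctl to its default looks like this (4294967295 is the kernel default, i.e. effectively "no limit"):

```sh
sysctl net.ipv4.tcp_notsent_lowat                  # inspect the current value
sysctl -w net.ipv4.tcp_notsent_lowat=4294967295    # restore the default (unlimited)
```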
Ah right, so I think that came from some testing I was doing at some point of a cloudflare "tcp collapse" patch based on this blog: https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-latency I clearly forgot to remove it.... sorry! I actually just quickly reapplied that patch and tested with "net.ipv4.tcp_collapse_max_bytes=6291456" but the results were as before.

It does seem like BBR does not like a low tcp_notsent_lowat... But I still only saw degradation on mptcp flows with a low tcp_notsent_lowat, not standalone tcp flows. While I was not able to exactly replicate the extreme BBR performance degradation you see, I set tcp_notsent_lowat=4294967295 (maximum) anyway, and now I do indeed see improvement in the mptcp case:
It's still half of the ~5.68gbit aggregate I can get with plain TCP across both paths, but it's almost twice what I was getting before. Which is probably about right if the TCP window size (128M) is split between the two paths? I still see quite high sndbuf_limited on each of the tcp connections though. Then I retested your suggestion of tcp_[rw]mem="8192 536870912 536870912" and again I now see slightly better performance. In particular, it starts off bouncing around a bit for the first 4 seconds, but then the iperf settles down to a steady ~3.5gbit (up from 3.0gbit with default auto tuning).
If I again double the max window (net.ipv4.tcp_adv_win_scale = -1), I can now get 4.5gbit with mptcp, which I was not able to do before. So it seems like 80% of my woes were self-inflicted (tcp_notsent_lowat) and 20% is the mptcp wmem autotuning. I really need to stop reading cloudflare blog posts... thanks for walking me through it.
I think the problem can be addressed at the mptcp impl. level: basically we need to ignore the subflow-level tcp_notsent_lowat and use it only at the msk/mptcp socket level. The subflow-level tcp_notsent_lowat easily fools the mptcp scheduler into taking suboptimal decisions and is not really directly related to the latency observed by the sender.
The wmem is accounted both on the egress subflow and the mptcp socket: to be able to fill 2 subflows, the mptcp socket would need a sndbuf equal to the sum of the subflows' sndbuf. I have a hacky patch for that which I hope to share somewhat soon.
AFAIK cloudflare has quite a lot of cool stuff there; the problem is that mptcp is a bit 'special' ;) Thanks for testing!
I posted on the devel ML a few patches which should address the two points above: ignoring tcp_notsent_lowat at the subflow level and letting the mptcp-level sndbuf grow to the sum of the subflows' sndbuf.
Overall you should see some nice improvement while running your test case with the settings described above (no need to drop tcp_notsent_lowat, no need to further increase the tcp wmem). The patches are available here: https://lore.kernel.org/mptcp/cover.1694710538.git.pabeni@redhat.com/T/#t and/or via pw: https://patchwork.kernel.org/project/mptcp/list/?series=784264 As usual, any feedback is more than welcome!
Updated version at: https://patchwork.kernel.org/project/mptcp/list/?series=784615
Thanks, but is there another dependent patch series I need for v6.5 too? While this patch applies, I get compile errors. I removed all the mptcp connection hang fix patches so only this patch series (mptcp-misc-improvement.patch v3) was applied and I get:
Perhaps I need a later version of the "mptcp: fix another close hang-up" (v5) or "mptcp: fix missing close transition" (v2). I'll try those....
The above looks like a bad conflict resolution while applying the patches; 'mptcp_schedule_work' must not be touched by the last series. __mptcp_propagate_sndbuf() should be invoked only by mptcp_sk_clone_init(). Off the top of my head I don't think the series needs additional pre-reqs. I suggest double checking the diff on top of v6.5 vs the original patch.
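One way to do that double check, assuming the b4 tool is available (applying the patchwork mbox with a plain git am -3 works just as well); the output file name is arbitrary:

```sh
# On top of a v6.5 checkout: grab the series from lore as an mbox and apply it
# with a 3-way merge, so conflicts are reported instead of being hidden by fuzz.
b4 am -o - 'https://lore.kernel.org/mptcp/cover.1694710538.git.pabeni@redhat.com/' > series.mbx
git am -3 series.mbx
```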
Yeah, I had fuzz=2 set in the build script from a previous hacking effort, which is why the patch applied at all. So the failing hunks against v6.5.3 are:
because v6.5.3 still has "#define MPTCP_RESET_SCHEDULER 7". Then this:
I can probably fix this one up but I'm not sure what to do about the MPTCP_RESET_SCHEDULER bit. I guess this comes from the BPF scheduler work. https://lore.kernel.org/all/20230821-upstream-net-next-20230818-v1-0-0c860fb256a8@kernel.org/
Instead of the failing hunk, add:
It's sufficient to avoid conflict on the bit allocations.
Thanks, I massaged the patch and got it to apply without any patch fuzz. However, I clearly made some mistake somewhere because now the host doesn't last long before it starts spewing errors and hangs.
I have little doubt I made some error somewhere. I'll have another crack at it tomorrow with a coffee in hand.
I stepped through the various patches as carefully as I could, but there is clearly some instability being introduced by this series, or it conflicts with the previous patches I have. I might have to wait until things settle down and they are all included in v6.6 or v6.7 to properly test them again. In the meantime, we are using tcp_[rw]mem="8192 536870912 536870912" and getting good performance.
Indeed I reproduced a problem similar to the one reported above. AFAICS the root cause is actually pre-existing, but the new patches make it much easier to trigger (while before it was very hard to hit - so hard that syzkaller never triggered it). I shared a new version of the series, comprising a fix for the problem I observed: https://lore.kernel.org/mptcp/cover.1695200723.git.pabeni@redhat.com/T/#t
Yep, I think that latest series has done the trick - thanks! With the previous series I was able to crash one of the hosts within a couple of minutes of running iperf tests. I have been running tests for over 20 minutes now without incident. I am also seeing good performance without needing to set initial wmem high - auto tuning seems to be doing the right thing now. The sndbuf_limited is a lot lower on each path:
Just to follow up - the production hosts have been stable and performance is consistently good. Many thanks!
All the required changes have been merged upstream: commit 8005184
Hi,
I have a question about tcp vs mptcp performance. I have read previous issues on this subject like #307 (#332 & #345) and I have a rudimentary understanding of the problems of HOL blocking and the extra buffer requirements (window sizes) for mptcp.
Even so, could someone perhaps have a peek at some of the performance comparisons I have been doing and verify whether the performance I am seeing makes sense within the currently known mptcp limitations?
My setup is as described in (#430) where we have serverA and serverB with three connection pairs between them - one to initiate connections and be the backup, and the other two which use different ISPs to route between them.
Unlike in #307, these bulk transfer pairs are long fat networks (multi-gbit) with ~281ms of latency each (London - Sydney). So in order to achieve good bulk transfer speeds we crank up the buffers accordingly. Our relevant sysctl settings at both ends:
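A sketch of this kind of tuning, sized for the 128MB max window mentioned next; all values here are illustrative rather than the exact production settings:

```sh
sysctl -w net.core.rmem_max=536870912
sysctl -w net.core.wmem_max=536870912
sysctl -w net.ipv4.tcp_rmem="4096 131072 536870912"
sysctl -w net.ipv4.tcp_wmem="4096 16384 536870912"
sysctl -w net.ipv4.tcp_adv_win_scale=-2         # window = buffer/4, so ~128MB from a 512MB buffer
sysctl -w net.ipv4.tcp_congestion_control=bbr   # these hosts run BBRv3 patches (assumed to register as "bbr")
sysctl -w net.mptcp.enabled=1
```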
So we should be able to scale to a max window size of 128MB per connection, all being well (at 281ms of RTT, a 128MB window caps a single connection at roughly 128MB / 0.281s ≈ 3.8gbit).
Now, if we run single stream iperf3 tests simultaneously across both our paired connections (ens225 & ens256) using just TCP, we get an idea of the sustained performance possible:
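Something along these lines, with one iperf3 server per path and each client bound to the local address that routes over the intended interface (addresses and ports are placeholders, and the per-path routing is assumed to follow the bound source address):

```sh
# Two simultaneous single-stream TCP tests, one per ISP path.
iperf3 -c 198.51.100.10 -p 5201 -B 10.1.1.2 -t 30 &   # first path
iperf3 -c 203.0.113.10  -p 5202 -B 10.2.2.2 -t 30 &   # second path
wait
```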
So around 1.84 + 3.34 = 5.68gbit of aggregate performance. The max possible with the ens224 pair is ~2.5gbit and for the ens256 pair it's ~7gbit. It is possible to get closer to the latter if we again double the buffer to 256MB, but more on that later.
Now, let's compare with a single stream of mptcp:
The standout differences on the TCP subflows are the higher retrans, lower cwnd and the appearance of the sndbuf_limited (41%) field. It is also worth noting that there is very little latency difference between the two paths (280 vs 281ms).
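For anyone reproducing this, those per-subflow fields (retrans, cwnd, sndbuf_limited) come from ss's extended TCP info; something like the following shows them live during a test, plus the MPTCP-level sockets (5201 is just iperf3's default port):

```sh
# -t TCP, -i internal TCP info (cwnd, retrans, busy/rwnd/sndbuf_limited),
# -n numeric, -m socket memory; filter on the iperf3 port.
watch -n 1 "ss -tinm 'sport = :5201 or dport = :5201'"

# List the MPTCP (msk) sockets themselves.
ss -Mni
```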
My question is, does this seem like one of the known issues with the current mptcp (packet scheduler)? I might read this as saying we are being HOL blocked and/or re-transmitting such that iperf is delayed and is then not filling the send buffer?
Now, in production we have many streams of mptcp active at the same time and they do a much better job of filling the pipes but I am also interested in this worst case single stream performance difference too.
It is also worth noting that increasing the buffer and max window size further, to 256MB, improves the fatter ens256 performance up to 6.5gbit for a single stream, but does little to improve the mptcp case. We just don't seem to be TCP buffer/window limited by that point.
Anyway, this is all very far down the list of important and great things you do, so feel free to mark it as such - it's mainly to satisfy my own curiosity and interest.
I should also say I have never tried the out-of-tree mptcp implementation, but I am led to believe it has more scheduling options which better close the performance gap between aggregated individual TCP streams and MPTCP's conjoined subflows.