
Potentially high memory usage via qdisc if fq-codel is too slow in dropping? #1306

Closed
T-X opened this issue Jan 13, 2018 · 8 comments
Labels: 0. type: bug · 0. type: enhancement

Comments

@T-X (Contributor) commented Jan 13, 2018

Disclaimer: Tests to verify the assumption still needed.

From reading through the fq-codel code in net/sched/sch_fq_codel.c, I'm starting to get the impression that the fq-codel memory limit (at least for mesh-vpn and eth0; wlanX/phyX uses a different implementation) is not a hard constraint, but rather something it merely tries to adhere to.

That is, I'm worried that under heavy load (ref. #1304) with many incoming packets (fastd can be fairly busy if some IP scanner or NS/ARP spammer is active), the CPU might fail to keep up with enforcing the memory limit. Even with the 64-packet batch drop patch.

sch->limit in fq_codel_init() is set to 10*1024 packets, which in the worst case might lead to 10240 * 1500 B ≈ 15 MB of memory usage. Actually, even 2x 15 MB: once for the fq-codel queue on eth0 and once for mesh-vpn. (The FQ implementation for ath9k seems a bit different to me; it looked like the memory limit is enforced more strictly there.)
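To make the worst case concrete, here is a minimal sketch of the arithmetic, assuming the default limit of 10*1024 packets and full-MTU (1500-byte) packets; the numbers are illustrative, not measured:

```shell
# Worst-case backlog memory of one fq_codel qdisc, assuming the
# default sch->limit of 10*1024 packets and 1500-byte (full MTU)
# packets. These values are the kernel defaults discussed above,
# not measurements from a device.
LIMIT=$((10 * 1024))           # default packet limit
PKT_SIZE=1500                  # worst case: MTU-sized packets
TOTAL=$((LIMIT * PKT_SIZE))
echo "$TOTAL bytes = $((TOTAL / 1000000)) MB per qdisc"
# Two such qdiscs (eth0 + mesh-vpn) double this figure.
```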

Ways to potentially test this theory:

  • Backport the memory-usage/limit patch for iproute2/tc and monitor from userspace whether we frequently overshoot the memory limit on busy networks and/or on devices with a slower CPU.
  • Try whether increasing the batch drop count from 64 to 128 packets makes a difference regarding the load and OOM issues described in High load on some devices after v2017.1.x update #1243.
  • Try whether lowering the qdisc limit from 10*1024 to maybe 1024 makes a difference regarding the load and OOM issues described in High load on some devices after v2017.1.x update #1243.
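For these tests, something along the following lines could be used once the backported tc is available; the interface names and the values (a limit of 1024 packets, a drop_batch of 128) are just the illustrative numbers from the list above:

```shell
# Inspect per-qdisc statistics; with the backported iproute2 patch,
# a non-zero memory_usage shows up in this output:
tc -s qdisc show dev eth0
tc -s qdisc show dev mesh-vpn

# For a test run, lower the packet limit from the default 10240 to
# 1024 and raise the batch drop count from 64 to 128 packets:
tc qdisc replace dev eth0 root fq_codel limit 1024 drop_batch 128
```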

PS: Some more debug output added to net/sched/sch_fq_codel.c seems to indicate that the flow arrays have a reasonable size, even though the interesting approach of trying kmalloc() first and then retrying with vmalloc() is used. Those arrays require only 49152+4096 bytes, times two (once for mesh-vpn, once for eth0). Also, some assertions added to the code have not detected any integer over- or underflows of the memory_usage variable for me so far.
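The flow-array figures can be reproduced with a quick back-of-the-envelope check. The 48-byte per-flow structure and the 4-byte backlog entry are assumptions inferred from the 49152+4096 figure above (with the default flows_cnt of 1024), not sizes I verified against the struct layout:

```shell
# Reproducing the 49152+4096 bytes figure for the fq_codel flow
# arrays, assuming flows_cnt = 1024, a 48-byte per-flow structure
# (assumed, not verified) and a 4-byte (u32) backlog counter per flow.
FLOWS=1024
echo "flows array:    $((FLOWS * 48)) bytes"
echo "backlogs array: $((FLOWS * 4)) bytes"
```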

@rotanid added the "0. type: bug" and "0. type: enhancement" labels on Jan 14, 2018
@christf (Member) commented Jan 21, 2018

Do we actually need fq_codel on anything other than the physical interfaces?
Also: your (much appreciated) research seems to concentrate on fq_codel. Can we show that the OOM actually does not occur on a recent gluon build without fq_codel?

@rotanid (Member) commented Jan 21, 2018

> Can we show that the OOM actually does not occur on a recent gluon build without fq_codel?

Didn't "we" do that by setting the bytes for fq_codel to "200", after which there was no OOM anymore?

But nonetheless, I don't think T-X is focusing on a single thing, and I also don't think there is one single thing responsible for everything; rather, a set of issues.

@T-X (Contributor, Author) commented Jan 22, 2018

Yeah, that's the tricky part about it. If something no longer occurs after disabling/enabling X, that does not mean X is the root cause. It could mean that we just slashed off the tip of the iceberg.

Keep in mind that fq-codel != fq-codel here. The generic fq-codel of the Linux kernel's qdisc framework is used for eth0 and mesh-vpn right now; you can spot that via "tc", for instance. ath9k, however, does its own thing again: it has its own, downsized fq-codel implementation.

The observation I'm referring to in this ticket is related to the qdisc implementation, not the ath9k one. I haven't taken a closer look at the ath9k implementation yet, but from skimming that code, the memory-limit enforcement seemed stricter than in the qdisc variant.

I backported the tc patch in the debug-oom2 branch if anyone wants to play with it. However, the tc patch is a bit confusing: upstream implemented it in a way that the "memory_usage" attribute is only shown in tc if more than 0 bytes are used.

And I'm still looking into adding some configurable mdelay()s in front of the fq-codel enqueuing/dequeuing, to see what a configurable load/delay would do to the memory usage and to verify or dismiss the hypothesis from the ticket description. (But if anyone would like to play with the debug-oom2 branch, tc and a traffic generator like mausezahn already, then go for it!)

@T-X (Contributor, Author) commented Jan 22, 2018

@christf:

Do we actually need fq_codel on any other than the physical interfaces?

Good question. Some years ago, with tinc, I was told that "ifconfig txqueuelen" would make no difference for tap/tun interfaces. In theory, the kernel could probably use some queueing to batch packets for fewer kernel-userspace context switches and thereby improve performance. But the kernel's tap implementation probably still does not do that.

Someone should verify that replacing fq-codel with a noqueue policy on mesh-vpn has no performance penalties.
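A quick way to try this would be something like the following; the interface name mesh-vpn is the one used in the discussion above, and reverting simply means putting fq_codel back:

```shell
# Replace the fq_codel qdisc on the VPN interface with noqueue for
# a test run (and verify the change took effect):
tc qdisc replace dev mesh-vpn root noqueue
tc qdisc show dev mesh-vpn

# Revert to fq_codel afterwards:
tc qdisc replace dev mesh-vpn root fq_codel
```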

@flobeier commented:
@christf @T-X

As discussed in #gluon: I ran a few iperf3 tests in both directions and could see no difference between fq_codel and noqueue on mesh-vpn. Ping times didn't increase and no packets were dropped.

@rotanid (Member) commented Feb 17, 2018

@T-X what's the plan here now, do you still need any tests?

@rotanid (Member) commented Jun 5, 2018

In the meantime we had a commit which lowers the fq_codel values:
46c5eff "gluon-core: reduce mac80211 fq_codel memory limit to 256KB on devices with 32MB RAM"

@rotanid (Member) commented May 29, 2019

@T-X how to proceed? No reaction for 16 months ;-)
Maybe @NeoRaider has something to say about the results @H4ndl3 had?

@mweinelt closed this as completed Mar 3, 2020