
Potentially high memory usage via qdisc if fq-codel is too slow in dropping? #1306

Closed
T-X opened this issue Jan 13, 2018 · 8 comments
Labels: 0. type: bug · 0. type: enhancement

Comments

@T-X (Contributor) commented Jan 13, 2018

Disclaimer: Tests to verify the assumption still needed.

From reading through the fq-codel code in net/sched/sch_fq_codel.c, I'm starting to get the impression that the fq-codel memory limit (at least for mesh-vpn and eth0; wlanX/phyX uses a different implementation) is not a hard constraint, but rather something it merely tries to adhere to.

That is, I'm worried that under heavy load (ref. #1304) with many incoming packets (fastd can be fairly busy if some IP scanner or NS/ARP spammer is active), the CPU might fail to keep up with enforcing the memory limit. Even with the 64-packet batch drop patch.

sch->limit in fq_codel_init() is set to 10*1024 packets, which in the worst case might lead to 10240 * 1500 B ≈ 15 MB of memory usage. Actually, even 2x 15 MB: once for the fq-codel queue on eth0 and once for mesh-vpn. (The FQ implementation for ath9k seems a bit different to me; it looked like the memory limit is enforced more strictly there.)
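To make the worst case concrete, here is a minimal sketch of the arithmetic, assuming the default limit of 10*1024 packets and full-MTU (1500-byte) packets; the numbers are illustrative, not measured:

```shell
# Worst-case backlog memory of one fq_codel qdisc, assuming the
# default sch->limit of 10*1024 packets and 1500-byte (full MTU)
# packets. These values are the kernel defaults discussed above,
# not measurements from a device.
LIMIT=$((10 * 1024))           # default packet limit
PKT_SIZE=1500                  # worst case: MTU-sized packets
TOTAL=$((LIMIT * PKT_SIZE))
echo "$TOTAL bytes = $((TOTAL / 1000000)) MB per qdisc"
# Two such qdiscs (eth0 + mesh-vpn) double this figure.
```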

Ways to potentially test this theory:

  • Backport the memory-usage/limit patch for iproute2/tc and monitor from userspace whether we frequently overshoot the memory limit on busy networks and/or on devices with a slower CPU.
  • Try whether increasing the batch drop count from 64 to 128 packets makes a difference regarding the load and OOM issues described in High load on some devices after v2017.1.x update #1243.
  • Try whether lowering the qdisc limit from 10*1024 to maybe 1024 makes a difference regarding the load and OOM issues described in High load on some devices after v2017.1.x update #1243.
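For these tests, something along the following lines could be used once the backported tc is available; the interface names and the values (a limit of 1024 packets, a drop_batch of 128) are just the illustrative numbers from the list above:

```shell
# Inspect per-qdisc statistics; with the backported iproute2 patch,
# a non-zero memory_usage shows up in this output:
tc -s qdisc show dev eth0
tc -s qdisc show dev mesh-vpn

# For a test run, lower the packet limit from the default 10240 to
# 1024 and raise the batch drop count from 64 to 128 packets:
tc qdisc replace dev eth0 root fq_codel limit 1024 drop_batch 128
```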

PS: Some more debug output added to net/sched/sch_fq_codel.c seems to indicate that the flow arrays have a reasonable size, even though the interesting approach of trying kmalloc() first and then retrying with vmalloc() is used. Those arrays require only 49152+4096 bytes, times two (once for mesh-vpn, once for eth0). Also, some assertions added to the code have not detected any integer over- or underflows of the memory_usage variable for me so far.
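The flow-array figures can be reproduced with a quick back-of-the-envelope check. The 48-byte per-flow structure and the 4-byte backlog entry are assumptions inferred from the 49152+4096 figure above (with the default flows_cnt of 1024), not sizes I verified against the struct layout:

```shell
# Reproducing the 49152+4096 bytes figure for the fq_codel flow
# arrays, assuming flows_cnt = 1024, a 48-byte per-flow structure
# (assumed, not verified) and a 4-byte (u32) backlog counter per flow.
FLOWS=1024
echo "flows array:    $((FLOWS * 48)) bytes"
echo "backlogs array: $((FLOWS * 4)) bytes"
```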

@rotanid added the "0. type: bug" and "0. type: enhancement" labels on Jan 14, 2018
@christf (Member) commented Jan 21, 2018

Do we actually need fq_codel on anything other than the physical interfaces?
Also: your (much appreciated) research seems to concentrate on fq_codel. Can we show that the OOM actually does not occur on a recent gluon build without fq_codel?

@rotanid (Member) commented Jan 21, 2018

> Can we show that the OOM actually does not occur on a recent gluon build without fq_codel?

Didn't "we" do that by setting the bytes for fq_codel to "200", after which there was no OOM anymore?

But nonetheless, I don't think T-X is focusing on a single thing, and I also don't think there is one single thing responsible for everything; rather, a set of issues.

@T-X (Contributor, Author) commented Jan 22, 2018

Yeah, that's the tricky part about it. If something no longer occurs after disabling/enabling X, that does not mean X is the root cause. It could mean that we just slashed off the tip of the iceberg.

Keep in mind that fq-codel != fq-codel here. The generic fq-codel of the Linux kernel's qdisc framework is used for eth0 and mesh-vpn right now; you can spot that via "tc", for instance. ath9k, however, does its own thing again: it has its own, downsized fq-codel implementation.

The observation I'm referring to in this ticket is related to the qdisc implementation, not the ath9k one. I haven't taken a closer look at the ath9k implementation yet, but from skimming that code, the memory-limit enforcement seemed stricter than in the qdisc variant.

I backported the tc patch in the debug-oom2 branch if anyone wants to play with it. However, the tc patch is a bit confusing: upstream implemented it in a way that the "memory_usage" attribute is only shown in tc if more than 0 bytes are used.

And I'm still looking into adding some configurable mdelay()s in front of the fq-codel enqueuing/dequeuing, to see what a configurable load/delay would do to the memory usage and to verify or dismiss the hypothesis from the ticket description. (But if anyone would like to play with the debug-oom2 branch, tc and a traffic generator like mausezahn already, then go for it!)

@T-X (Contributor, Author) commented Jan 22, 2018

@christf:

Do we actually need fq_codel on any other than the physical interfaces?

Good question. Some years ago, with tinc, I was told that "ifconfig txqueuelen" would make no difference for tap/tun interfaces. In theory, the kernel could probably use some queueing to batch packets for fewer kernel-userspace context switches and thereby improve performance. But the kernel's tap implementation probably still does not do that.

Someone should verify that replacing fq-codel with a noqueue policy on mesh-vpn has no performance penalties.
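A quick way to try this would be something like the following; the interface name mesh-vpn is the one used in the discussion above, and reverting simply means putting fq_codel back:

```shell
# Replace the fq_codel qdisc on the VPN interface with noqueue for
# a test run (and verify the change took effect):
tc qdisc replace dev mesh-vpn root noqueue
tc qdisc show dev mesh-vpn

# Revert to fq_codel afterwards:
tc qdisc replace dev mesh-vpn root fq_codel
```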

@flobeier commented:
@christf @T-X

As discussed in #gluon: I ran a few iperf3 tests in both directions and could see no difference between fq_codel and noqueue on mesh-vpn. Ping times didn't increase and no packets were dropped.

@rotanid (Member) commented Feb 17, 2018

@T-X what's the plan here now, do you still need any tests?

@rotanid (Member) commented Jun 5, 2018

In the meantime we had a commit which lowers the fq_codel values:
46c5eff "gluon-core: reduce mac80211 fq_codel memory limit to 256KB on devices with 32MB RAM"

@rotanid (Member) commented May 29, 2019

@T-X how to proceed? No reaction for 16 months ;-)
Maybe @NeoRaider has something to say about the results @H4ndl3 had?

@mweinelt closed this as completed Mar 3, 2020