-
Notifications
You must be signed in to change notification settings - Fork 326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Potentially high memory usage via qdisc if fq-codel is too slow in dropping? #1306
Comments
Do we actually need fq_codel on any other than the physical interfaces? |
didn't "we" do that by setting the bytes for fq_codel to "200" and there was no OOM anymore? but nonetheless, i don't think T_X is focussing on a single thing and i also don't think there is one thing that is responsible for everything, rather a set of issues. |
Yeah, that's the tricky part about it. If it's not occuring by disabling/enabling X anymore does not mean that this is the root cause. Could mean that we just slashed the tip of the iceberg. Keep in mind that fq-codel is != fq-codel here. The generic fq-codel in the Linux kernel in the qdisc framework is used for eth0 and mesh-vpn right now. You can spot that via "tc" for instance. ath9k however, does it's own thing again, it has its own, downsized fq-codel implementation. The observation I'm referring to in this ticket is related to the qdisc implementation, not the ath9k one. I haven't had a too closer look at the ath9k implementation yet. But from skimming that code, the memory-limit enforcement seemed more strict than in the qdisc variant. I backported the tc patch in the debug-oom2 branch if anyone wants to play with that. However the tc patch is a bit confusing. Upstream implemented it in a way that "memory_usage" attribute is only shown in tc if >0 bytes are used. And I'm still looking into adding some configurable mdelay()s in front of the fq-codel enqueuing/dequeuing. To have a look at what some configurable load/delay would do the memory usage. To verify or dismiss the hypothesis from the ticket description. (but if anyone would like to play with debug-oom2 branc, tc and a traffic generator like mausezahn for instance already, then go for it!) |
Good question. Some years ago, with tinc I was told that "ifconfig txqueuelen" would do no difference for tap/tun interfaces. In theory, the kernel could probably use some queuing to batch packets for less kernel-userspace context switches and by that improving performance. But the kernel's tap implementation probably still does not do that. Someone should verify that replacing fq-codel with a noqueue policy on mesh-vpn has no performance penalties. |
@T-X what's the plan here now, do you still need any tests? |
we had a commit in the meantime which lowers fq codel values: |
@T-X how to proceed? no reaction for 16 months ;-) |
Disclaimer: Tests to verify the assumption still needed.
From reading through the fq-codel code in net/sched/sch_fq_codel.c I'm starting to have the impression that the fq-codel memory limit (at least for mesh-vpn and eth0, wlanX/phyX uses a different implementation) is not a hard constraint but more something it tries to adhere to.
That is, I'm worried that under heavy load (ref. #1304) and many packets incoming (fastd can be fairly busy if some IP scanner / NS/ARP spammer is active) that the CPU might fail to catch up with the memory limit. Even with the 64 packets batch drop patch.
sch->limit in fq_codel_init() is set to 101024 packets. Which in the worst case might lead to 101024*1500B = ~15MB of memory usage? Actually, even 2x 15MB, once for the fq-codel queue for eth0 and once for mesh-vpn. (the FQ implementation for ath9k seems a bit different to me, it looked like the memory-limit is more strictly ensured there)
Ways to potentially test this theory:
PS: Some more added debug output to net/sched/sch_fq_codel.c seems to indicate that the flow arrays have a reasonable size, even though the interesting way of kmalloc and then retrying vmalloc() is used. Those arrays require 49152+4096 times two (once for mesh-vpn, once for eth0) bytes only. Also, some added assertions in the code did not detect any integer over- or underflows for the memory_usage variable for me so far.
The text was updated successfully, but these errors were encountered: