System instability above specific point #753
Comments
Which version of batman-adv do you use?
We use the batman-adv-15 package from Gluon, so it is batman-adv 2015.1.
Freifunk München also had problems like yours in 2015 with a similar node count, so they split their network into three segments. AFAIK they didn't do such a detailed analysis as you did, and I was told today that the problems are starting to appear again because of the growth of the segments.
Thanks for all the work.
Hey @rotanid, thanks for this information. It makes it even more important to find the bug. The problem is that splitting the network is only a workaround for an important problem, so I think we should find and solve this bug. @Adorfer, I know your point, but please don't tell me this every time; this is not how new solutions are found. We are trying to find a solution for problems instead of running from one workaround to the next. We would also like to experiment with the technology, hit such limitations and find new ways to handle them. And yes, we all know about the limitations of a batman-adv L2 network; this is the reason why a lot of people are experimenting with new solutions. But those need their time to become good and stable. So please let us discuss the problem here without your focus on small community networks. And hey, it looks like a software bug, and such bugs can be fixed.
Even if the workaround helps a lot... a solution would be great!
That sounds plausible to me. We've seen similar behaviour in the Regio Aachen network when we reached about 900 nodes with 3000 clients. We thought that the mesh table just got too big for the little routers. A remarkable point was that a strong offloader in front of these little devices protected them, maybe because the table got simpler. This got us to around 1,100 nodes with 3,500 clients, but network performance was dropping. (A few months ago we finally split our network into many small networks using multiple fastd instances attached to different batman devices: one firmware, multiple whitelists for fastd.)
By the way, we are using gluon-mesh-batman-adv-14.
Would it be possible to get a dump from dmesg or /proc/vmstat once the issues start to occur on a node? A /proc/slabinfo would be great too, but that doesn't seem to be available on Gluon images by default. Finally, just to verify that it's a memory issue, the output of /sys/kernel/debug/crashlog from a node that just crashed and rebooted would be great as well. @nazco: Thanks for the very thorough, analytical report! By the way, one more thing which, next to batman-adv and the IP neighbor caches, needs memory relative to the number of clients is the bridge. Its forwarding database (fdb for short) keeps a list of MACs behind ports, too. Speaking of the bridge, I noticed that the bridge kernel code does not use kmalloc() for its fdb entries, but kmem_cache_*() function calls. Maybe we are having similar issues as we had with the debugfs output until they added the fallback from kmalloc() to vmalloc(), namely very fragmented RAM. It could be interesting whether kmem_cache_alloc() might not just speed up memory allocation but also help keep the RAM less fragmented (if that's the issue here). Regarding the VLANs, that indeed sounds odd. I queried ordex, the guy behind the TT and its VLAN support, on IRC. Btw., I just checked in a VM with one isolated node, and with or without VLANs I'm seeing a weird, additional local TT entry with VID 0 which has the MAC address of bat0. Do you have VID 0 entries without the P flag? How many have VID 0, and how many VID 1 exactly?
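For reference, a rough way to gather those counts on a node; the awk field assumes batctl prints the VID as the third column ("*", MAC, VID), so adjust it if your batctl version formats the table differently:

```sh
# histogram of global TT entries per VLAN ID
batctl tg | awk '$1 == "*" { count[$3]++ } END { for (vid in count) print "VID " vid ": " count[vid] }'
# MACs the bridge has learned towards bat0 (its forwarding database)
bridge fdb show brport bat0 | wc -l
```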
Hey, thanks for the reply. I'll try to get this info for you.
And one more thing which would be interesting: running wirerrd to see whether something weird happens on the network when the load is high. Currently I'm running this for Freifunk Hamburg and Freifunk Lübeck, and it's usually one of the first places we look when something behaves oddly.
Regarding the process table, I'm currently wondering about two things:
The process table in my initial post only shows the diffs to the 2015.2 firmware of Hamburg.
Yes, the device is running 2016.1.3.
@T-X, I think these numbers are virtual memory (as that is the number shown by ps). The new respondd still uses about 2 MB of virtual memory, as it uses dlopen a lot (and at least uClibc will use a lot of virtual memory per dlopened object). @nazco, if the numbers you compared are virtual memory, they are meaningless, as virtual memory is often never actually allocated. AFAIK, the VmRSS number in /proc/<pid>/status is the one that reflects actually allocated memory. I don't think any of the processes make much difference; the most important change from Barrier Breaker to Chaos Calmer is the newer kernel. I think the new kernel might work a bit worse under memory pressure, although it's hard to tell for sure.
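A quick way to see both numbers for respondd on a node (plain busybox shell; the process name is taken from the comment above):

```sh
for pid in $(pidof respondd); do
    grep -E 'VmSize|VmRSS' "/proc/$pid/status"   # virtual size vs. resident memory
done
```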
Unfortunately, kmem_cache_alloc() isn't really documented in the kernel, so we are unsure whether it'd help in any way with this problem. From looking at other parts of the kernel, it seems common to use dedicated caches for larger numbers of frequently changing objects. Would anyone be willing to give this patch a try to see whether it makes any difference? https://lists.open-mesh.org/pipermail/b.a.t.m.a.n/2016-May/015368.html
Btw., regarding the fiddling with the neighbor tables in the first post: @ohrensessel and I noticed yesterday, after applying #674 and #688, that the multicast snooping takes place before any additions to the bridge forwarding database. For instance, "$ bridge fdb show" had no more entries towards bat0, while before it had one MAC entry for nearly every client in the mesh. For the IP neighbor tables it should be similar. @ohrensessel wanted to test and observe further whether having these two patches makes any difference for the load peaks at Freifunk Hamburg in the evening. He'll probably report back later.
How could that be an explanation for the strict limit of 3000 entries in the transglobal table?
@bitboy0: There is no strict limit for the global translation table. There is just a limit for the local translation table of a node (= the number of clients a node can serve; ~120 with batman-adv 2013.4, 16x that much with a recent version of batman-adv / since fragmentation v2). The reports so far, backed by the observation that only 32 MB devices are affected, seem to point to a simple out-of-memory problem on such devices (though I'm still waiting for a /sys/kernel/debug/crashlog or dmesg output from someone to confirm). When a device starts to get low on memory, the Linux kernel memory allocator will have more and more trouble serving requests and might even need to move objects around to get consecutive, spare memory areas available again, resulting in high load first and at some point even a reboot. In my x86/amd64 VMs with many kernel debugging options enabled, a global TT entry allocated about 200 bytes. Sven has mentioned a raw size of 48 bytes on OpenWrt ar71xx, which will probably be aligned to 64 bytes. So 4000 entries times 64 bytes would result in about 250 KB of RAM usage, which doesn't seem like much. Of course, if the RAM is already full with the kernel and userspace programs, then even a few additional hundred KB in the afternoon/evening through the batman-adv global TT, the bridge forwarding database or the IP neighbor tables might be the straw that breaks the camel's back.
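Just to make that arithmetic explicit (the per-entry sizes are the figures quoted above, not measurements from an affected device):

```sh
# 4000 global TT entries at an assumed 64 bytes each (48 bytes raw, aligned to 64)
ENTRIES=4000
BYTES_PER_ENTRY=64
echo "$(( ENTRIES * BYTES_PER_ENTRY / 1024 )) KB"   # prints: 250 KB
```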
Regarding that hypothesis, it might further be interesting, whether:
@T-X From the uptime of my little WA701ND I can see that devices are affected immediately (<10 min after restart). I added a serial connection to some nodes and will try to get the remaining logs before a crash.
Thanks @Nurtic-Vibe! What do you mean by "better behaviour" exactly? No more out-of-memory crashes, or just less often? And what do you mean by "this is only a workaround"? If a high static memory footprint (85% was mentioned in the initial post) were the issue, then reducing that would be a valid fix, wouldn't it? Btw., it's probably not that well known, but OpenWRT has a great feature to preserve crashlogs over a non-power-cycled reboot. After a crash & reboot you should have a new file in /sys/kernel/debug/crashlog. So it would be great if anyone, even without serial-console access, could have a look at that after a crash.
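So after a crash and a reboot without a power cycle, something like this (plain busybox shell) saves the log before a second reboot discards it:

```sh
# copy the preserved crash log (if any) to tmpfs with a timestamped name
[ -e /sys/kernel/debug/crashlog ] && \
    cat /sys/kernel/debug/crashlog > "/tmp/crashlog-$(date +%s).txt"
```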
And one more question @nazco: for the 85% you mentioned, what does the graph show if you are running the same node just like that but cut the uplink? What is the memory footprint without this node seeing the rest of the network? It'd be interesting to test whether it stays relatively high even without any other mesh participants. That could back or dismiss the too-much-static-memory-usage theory.
@T-X With haveged disabled we get OOM crashes less often, but they still occur regularly.
Looking at @nazco's broken.log.txt again: @NeoRaider, do you know whether the vmalloc fallback for debugfs access has made its way into Gluon yet? It seems that batman-adv-visd accesses the global translation table via debugfs first, to translate the alfred server's MAC address into an originator address, which then results in yet another debugfs originator table lookup to check the TQ and determine the best alfred server. To inform others (@NeoRaider found this issue a while ago): without the vmalloc fallback, accessing a debugfs file requires a large consecutive memory area to be allocated. The allocation is done naively: first try x bytes, and if that turns out to be insufficient while copying, double the size and copy again. That could explain why a certain threshold of global TT entries might cause a jump in load. If all of that were the case, then it'd be a mixture of high static memory usage and many small, scattered allocations in the remaining memory, which makes trouble for the large, consecutive allocation needed for debugfs access.
@T-X, the vmalloc patch has been included since Gluon v2016.1.
@T-X One of my nodes just crashed, but there is no such file as /sys/kernel/debug/crashlog.
@T-X No, the bigger nodes do have the same RAM-eating problem; because they have more RAM they just don't care. But the bug itself is the same. With "strict limit" I don't mean that there is a visible limit like a maximum table size, but the problems occur once this specific number of entries is in the list. Maybe better to say: the bug is only visible once the TG table has about 3000 entries. And the problem starts immediately on all nodes in the network at the same time once the "limit" is reached. Some nodes can't even get back to work; they crash again and again directly after each reboot.
@T-X By "better behaviour" I mean: the additional space gained by disabling and stopping haveged gives slightly more room for allocations. Because the sysctl changes prevent the kernel from writing dirtied blocks with high priority, it can simply handle the lack of memory more smoothly. This doesn't stop the problem, but the kernel can cope longer before the OOM killer triggers a panic.
@nazco: Hm, okay, thanks. And you didn't power-cycle the device, right? Then maybe crashlog is unreliable in some OOM cases :(. Keep looking out for it though :). Btw., you can easily check whether your OpenWRT image supports crashlog by triggering a crash through "echo c > /proc/sysrq-trigger". The device should then reboot and there should be a new file in /sys/kernel/debug/crashlog (until you reboot again or power-cycle it). I also just tried simply doing a "dd if=/dev/urandom of=/tmp/foo.bin", and after a few seconds the NanoStationM2 with a Freifunk Hamburg image rebooted here. Then I had a nice out-of-memory trace in /sys/kernel/debug/crashlog. Here's the crash before any uplink connectivity: crashlog-841-no-uplink.txt. Though the userspace programs do not seem to show any suspiciously high memory usage, at least at that point in time (taken between 19:00 and 20:00).
Interesting: for a Freifunk Hamburg node with currently 3370 clients (batctl tg | wc -l), the byte count is currently 259407 (batctl tg | wc -c), which is very close to 2^18. Not sure whether that is still a relevant number with the vmalloc patch for debugfs.
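For illustration, assuming the debugfs read buffer really starts small and doubles until the whole table fits (the 4 KiB starting size is an assumption), a table of that size ends up needing one consecutive 256 KiB (2^18 byte) allocation:

```sh
TABLE_BYTES=259407   # the "batctl tg | wc -c" value from above
BUF=4096             # assumed initial buffer size
while [ "$BUF" -lt "$TABLE_BYTES" ]; do BUF=$((BUF * 2)); done
echo "needs one consecutive allocation of $BUF bytes ($((BUF / 1024)) KiB)"
```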
@bitboy: Various changes were made during these four weeks to reduce memory usage on the kernel side, which will hopefully trickle into Gluon soon:
Until this lands in a Gluon release, @bitboy0, would it be possible for you to give a recent batman-adv/batctl/alfred master branch and #780 a try and report back your new limits? PS: Also, I'm still a little suspicious of the new FQ-CoDel. That's one more change that came with the more recent Gluon versions, and FQ-CoDel is about queueing, which means it is about memory. Maybe it needs more memory in order to achieve its incredible performance/latency improvements. (There seems to be a /proc/sys/net/core/default_qdisc, but I'm not sure right now what its value was prior to fq_codel.)
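A read-only way to check which qdisc is the default and what is actually attached on a node:

```sh
cat /proc/sys/net/core/default_qdisc   # default queueing discipline
tc qdisc show                          # qdiscs currently attached to the interfaces
```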
And there is still this ticket on OpenWrt: https://dev.openwrt.org/ticket/22349 Can someone with an affected device try the patch mentioned there, "fq_codel: add batch ability to fq_codel_drop()" that is? It also looks like it is possible to play with FQ-CoDel parameters via tc (e.g. the "flows" and "limit" parameters): https://lists.openwrt.org/pipermail/openwrt-devel/2016-May/041445.html
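For anyone experimenting, a hypothetical example of shrinking the fq_codel queue via tc; the interface name and the values are placeholders, not recommendations from this thread:

```sh
tc qdisc replace dev eth0 root fq_codel limit 1024 flows 128
tc -s qdisc show dev eth0   # verify the new parameters and watch the drop statistics
```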
@T-X I have never compiled Gluon myself until now. I will try my very best, and thanks for the information!
Haven't looked much into fq_codel yet. In Freiburg we also have this issue (with much lower node (330+) and client (900+) counts, but a complexly connected network of 10 supernodes), which resulted in this script. Edit: which is basically the same patch as FFRN used (couldn't find it before).
Hey, do you have further information about the characteristics you observe that led you to the conclusion that it could be the same bug?
Not really, I would love to. The fact is that for a long time we have observed a rise in reboots (up to several times a day) on the 841 (or similarly weak devices), while other nodes (same weak class of device) seem unaffected, then some days later are affected, and then not again... Edit: one minor thing, we have a test in one mesh cloud with a higher mcast rate; there the routers reboot very often, and the local router density is high. We don't have detailed RAM or load usage over time, just the observation that there is nothing out of the ordinary and a few minutes later a node is rebooting again. We have a rather complex backbone with many bridged interfaces on the supernodes, resulting in big originator tables on the supernodes (since nodes can be reached equally well from all bridged supernodes)... this should (so I think) have no effect on the routers, where there is nothing like that. Now I want to test this on some routers around the city and see whether they reach an uptime of several days or not. Edit 2: the script does not help at all. Some of our group think it could be an issue with the network and a bunch of unaligned memory accesses (there are plenty on the routers... but apart from having a vague idea of it, this is beyond my C/assembler knowledge)... watch this number rise into the millions.
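In case it helps others reproduce the unaligned-access observation, a minimal watch loop; the debugfs path exists on MIPS kernels with debugfs enabled, so treat it as an assumption on other targets:

```sh
while true; do
    cat /sys/kernel/debug/mips/unaligned_instructions
    sleep 10
done
```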
To the comment from #753 (comment): My gluon 2016.1.6-based repository now has:
Interested people can just try https://github.com/FreifunkVogtland/gluon/tree/v2016.1.6-1 when they think that the memory usage caused by debugfs is the culprit behind this problem.
@T-X, the fq_codel stuff can really take a lot of memory. We should check out the following patches for 2016.2 to reduce the impact with the new wifi driver:
These things were used to fix OOM problems in a test I did with Toke (using a 32 MB device and 30 clients). Maybe reducing the limit for the qdisc might also be a possibility worth testing, because this is the part which is already in 2016.1.x. For example, right now LEDE is using only 4 MB per qdisc:
(The output was generated with the include/linux/pkt_sched.h part of https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/patch/?id=31ce6e010195d049ec3f8415e03d2951f494bf1d + https://patchwork.ozlabs.org/patch/628682/raw/ applied to iproute2.) But OpenWrt CC doesn't yet have this memory_limit implementation, because it was first introduced in 95b58430abe7 ("fq_codel: add memory limitation per queue"). So backporting the patches from LEDE (033-fq_codel-add-memory-limitation-per-queue.patch + 660-fq_codel_defaults.patch) could also be a good idea.
I think it would be better not to use fq_codel as long as we're on CC; there are too many fixes that would need to be backported... I've avoided backporting https://git.lede-project.org/?p=source.git;a=commitdiff;h=c4bfb119d85bcd5faf569f9cc83628ba19f58a1f , so fq shouldn't be effective for mac80211 anyway; I have no idea though if this shortcut in
OK, I didn't check whether fq_codel was active for the mac80211 queueing with 2016.2, so we can forget the point about the wifi driver and its internal queueing. But just for clarification: fq_codel is still used as the default queueing discipline on OpenWrt CC (and thus most likely also by Gluon 2016.1.x/2016.2.x). So the patches 033-fq_codel-add-memory-limitation-per-queue.patch + 660-fq_codel_defaults.patch may still be interesting for OpenWrt CC (2016.1.x and 2016.2.x) to reduce the chance that the normal qdiscs take up too much memory.
There are existing patches at "https://github.com/Freifunk-Rhein-Neckar/ffrn-packages/tree/master/ffrn-lowmem-patches" which help nodes with small RAM stay stable. While discussing whether to include the patches in our FFS firmware, we are wondering why they are not included in the official Gluon code base, given the good experience at FFRN. Are there specific reasons?
@FFS-Roland the simplest reason might be: no one created a pull request to include them.
Well, we developed them as a workaround for some problems we see in our (big) network, so we are not aware of side effects these options might have in other setups. This is the main reason why we haven't created a PR so far. If you see good results in Stuttgart too, then I think we can talk about a regular PR with these patches.
Meanwhile we have been testing a patched Gluon 2016.2.1 on the WR841N and found some side effects of the sysctl modifications: nodes (not clients) cannot be accessed reliably via IPv6, and CPU load rises. Therefore we will not use the complete patch in our build, only the haveged-related part.
I'm surprised the neighbor table garbage collection in that patch set helps at all, because nodes should not have to manage so many neighbor entries anyway. My node currently has 25 entries for IPv6 and 3 for IPv4, in a mesh with 650 nodes and 1000 clients.
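For comparison, the counts above can be reproduced on any node with:

```sh
ip -4 neigh show | wc -l   # IPv4 neighbor entries
ip -6 neigh show | wc -l   # IPv6 neighbor entries
```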
Our latest discussions in Stuttgart tend towards not using the patch at all, because disabling haveged would limit the entropy on the nodes significantly. So we will not profit from gaining 1 MB of RAM, but with our reduced subnet sizes we expect not to run into trouble.
@Nurtic-Vibe reported on IRC that even FFRN doesn't use the lowmem-pkg anymore, as it doesn't help much.
Closing in favor of #1243, although this issue also describes problems, some of which are already solved.
As I reported on IRC, we at FFRN are observing higher load and frequent reboots on nearly 30% of our nodes. This happens once we reach a specific number of nodes and clients in our network. We have been debugging this issue for over a month now and think we have narrowed down the possible sources.
Let's start with our observations. The first time we became really aware of this problem was when we reached 1500 clients in our network, spread over nearly 700 nodes. This happened around the first of April this year. But after analyzing the problem, we think it started even earlier, with some "random" reboots we had been investigating as well.
The first thing we observed was that the majority of affected nodes are the small TL-WR841 devices. This does not mean that bigger nodes like a TL-WR1043 are not affected; the problem just doesn't have a big enough impact on them. Interestingly, not all of these nodes are affected: only about 30% show all the characteristics of the problem, while all other nodes run without any interruptions.
On an affected node we can see the following: once we reach more than 3000 entries in the translation global (TG) table the problems start, and once the number falls below this mark the problems are mostly gone. Such a node shows an increased average load of around 0.45 to 0.9, compared with 0.2 to 0.25 on an unaffected node. On the problematic nodes the load also starts to peak to values of 2-4 while we are above the mark, and sometimes, every few hours or every few minutes, the node reboots.
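A minimal sketch of how this can be checked by hand on a node; the 3000 mark is just the empirical threshold from these observations, not a hard limit in batman-adv:

```sh
TG_COUNT=$(batctl tg | wc -l)         # global translation table entries (incl. header lines)
LOAD=$(cut -d' ' -f1 /proc/loadavg)   # 1-minute load average
echo "TG entries: $TG_COUNT, load: $LOAD"
[ "$TG_COUNT" -gt 3000 ] && echo "above the problematic mark"
```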
Another interesting observation is that affected nodes get a lot more free RAM when the problems start: RAM usage decreases from the healthy default of around 85% (on an 841) to 75%-80%.
On a TL-1043 it looks like this:
On a TL-841v9 it looks like this:
At no time could we see a single process causing problems or using more RAM than usual; only the load and the system CPU utilization showed that something was going wrong. So we thought the problem has to be in the kernel, or in combination with the RAM.
So we started to debug the problem. First we tried to locate a pattern in our statistics to limit the number of possible sources. We tested a lot of other ideas, but all with nearly no effect, so the most promising lead was the TG table. But it's not the number of entries itself, because we can't find any limit near this number in the sources, and some other things speak against that too. So the problem has to be in the processing of the entries, or somewhere else.
After that we found out that something in combination with the TG table, we think the writing of the table to tmpfs, was causing new page allocations. These page allocations couldn't be satisfied by the available RAM, so parts of the page cache were dropped. This cache holds the frequently running scripts of the system, so afterwards the system has to start rereading them from flash. And here the first problem starts: the system rereads the disk without end. I've attached a log file for an affected and an unaffected node.
notbroken.log.txt
broken.log.txt
If this continues for a while and we then try to write the TG table again, we may run into the vm.dirty_ratio limit, which blocks the I/O of all processes, making everything even worse.
So to solve this problem we started tuning a lot of sysctl parameters. Here is a list of all the additional options we currently set.
Here we save some RAM by using smaller neighbour tables; we increased the min_free_kbytes value to have a bigger buffer against allocation problems; we lowered dirty_background_ratio so the system starts writing dirty data in the background earlier (this is no problem because we write to a ramdisk); we raised dirty_ratio to prevent a complete I/O lock; and the dirty_expire_centisecs setting means we write back data only when we reach the background ratio, not after a time limit, to prevent useless writes.
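As an illustration only, a sketch of this kind of sysctl tuning; the concrete values belong in the list mentioned above, so every number below is an assumption, not the actual configuration:

```sh
# shrink the IPv4/IPv6 neighbour tables (assumed thresholds)
sysctl -w net.ipv4.neigh.default.gc_thresh1=128
sysctl -w net.ipv4.neigh.default.gc_thresh2=256
sysctl -w net.ipv4.neigh.default.gc_thresh3=512
sysctl -w net.ipv6.neigh.default.gc_thresh1=128
sysctl -w net.ipv6.neigh.default.gc_thresh2=256
sysctl -w net.ipv6.neigh.default.gc_thresh3=512
# keep a bigger reserve for allocations under pressure (assumed value)
sysctl -w vm.min_free_kbytes=2048
# start background writeback earlier, raise the hard ratio to avoid a full I/O stall
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=60
# effectively disable age-based writeback so data is only flushed via the background ratio
sysctl -w vm.dirty_expire_centisecs=360000
```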
With these changes we could increase performance; we have decreased the load of affected nodes even below the average value of unaffected nodes. So maybe some of these options are also relevant independent of this issue. To get even more free RAM, some people tried disabling haveged; this also makes nodes more stable, because there is more free RAM.
Then we saw that the community in Hamburg has a bigger network with a lot more clients and nodes but doesn't seem affected by the problem. So we analysed the site.conf and found out that the mesh VLAN feature we are using (Hamburg is not) causes double entries for every node: one with VID -1 and one with VID 0. This isn't great.
Then we flashed a test node with the firmware from Hamburg. The first difference is that our firmware is based on Gluon 2016.1.3 while the one from Hamburg is version 2015.1.2, so there are quite a few differences.
But briefly back to the TG table: the TG table in Hamburg was around 3700 entries long without the problem occurring. So the problem must be something that changed between the versions. As the 2016 versions are based on OpenWrt CC instead of BB like 2015, this could be a lot of things, but we think it's not something in the OpenWrt base system; it has to be something more Freifunk-specific.
So we looked again at the process list of a node with our firmware and one with the firmware from Hamburg.
Here we found the following differences (first value for FFRN, second for FFHH):
This means an increased RAM usage of over 2 MB with the newer firmware. But this can't be the only source either, because we have nodes without the problem.
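A sketch of how such a per-process comparison can be reproduced; it collects resident memory (VmRSS), which, as noted elsewhere in this thread, is more meaningful than the virtual size shown by ps:

```sh
for pid in /proc/[0-9]*; do
    name=$(cat "$pid/comm" 2>/dev/null)
    rss=$(awk '/VmRSS/ { print $2 }' "$pid/status" 2>/dev/null)
    [ -n "$rss" ] && echo "$name $rss kB"
done | sort
# run this on both firmwares and diff the two outputs
```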
Then we started thinking about what we had, and also started writing some documentation of the work for the community. Here we got the idea that the sudden decrease in RAM usage could be caused by the OOM killer, and the RAM graph also showed some characteristics of a memory leak. But again, this can't be the only problem. So we thought a little further and now believe it's a combination of a memory leak and memory corruption causing the endless rereading of the flash storage. With all this information we think that the only service that is really close to the problem is batman-adv-visdata, so this would be the first place to dig deeper. But here we reach a limit in resources and knowledge about the system, and we hope to find someone who can help us find a solution for this problem.
We know that this is a lot of information, and probably a lot of information is still missing. Please ask if you need anything.
You can find a German version, including the discussions, here: https://forum.ffrn.de/t/workaround-841n-load-neustart-problem/1167/29?u=ben