significantly increased management traffic after upgrade to Gluon v2017.1.x #1446
It is currently unknown why this happens. I have two main suspects.
The effect can only be seen when a significant portion of the nodes is updated from 2016.2.x to 2017.1.x.

Regarding the first point (batman-adv): Freifunk Vogtland increased its batman-adv version during the update from 2017.0.1+norebroadcast to 2018.1+maint+norebroadcast patches. Rotanid's update changed batman-adv from 2016.2.x+maint patches+norebroadcast to 2017.2+maint patches. I would therefore guess that the culprit (if batman-adv is the reason) has to be searched for between batman-adv v2017.0.1..v2017.2.

Maybe somebody is able to reproduce the problem and can install a modified Gluon 2017.1.8 version on a lot of nodes in their network. It would be good if this person could then check whether returning to batman-adv 2016.2.x (from Gluon 2016.2.x) reduces the mgmt overhead again. The three patches to revert to the older batman-adv version can be found in the branch https://github.com/FreifunkVogtland/gluon/commits/batadv/2016.2.x
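A sketch of how those reverts could be pulled into a local Gluon tree (the remote name `ffv` is arbitrary; the three commits need to be picked by hand after inspecting the branch):

```sh
# Fetch the FreifunkVogtland branch that reverts batman-adv to 2016.2.x.
git remote add ffv https://github.com/FreifunkVogtland/gluon.git
git fetch ffv batadv/2016.2.x

# Inspect the revert commits, then cherry-pick the three of them
# onto the local Gluon 2017.1.x branch.
git log --oneline FETCH_HEAD
```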
FYI, my "spikes" are back (like before the upgrade) - the management traffic remains at the higher level.
After updating almost all nodes (over 500) to Gluon v2018.1.x with batman-adv 2018.1, mgmt_tx increased by ~10 kbit/s and mgmt_rx by ~40 kbit/s, so it increased again, but not nearly as much as last time.
@ecsv Have there been any new insights regarding this issue?
At least I have received no new info, and I also cannot provide any.
@T-X Is it possible that this is related to the removal of the no_rebroadcast hack from our batman-adv?
@NeoRaider: Hm, no, the no_rebroadcast patch should only have touched layer 2 broadcast frames, but not mgmt traffic (OGMs). The no_rebroadcast patch was only applied in "batadv_send_outstanding_bcast_packet()". Similarly, the new, automatic approach in batman-adv is only applied to layer 2 broadcast frames and to BATMAN_V OGMs, but not to BATMAN_IV ones.

@awlx: Urgh, yeah, that graph looks ugly... Is that a capture from a Freifunk router or from a gateway? Would it be possible for you to provide a capture from the VPN interface, filtered by OGMs? (With Wireshark or tshark you should be able to filter by the batadv packet type.)
Also, do you have a link to the site/domain configs used before and after the update, @awlx?
Hi @T-X, we as ffmuc had the following configs:

In the hope of reducing that traffic we then disabled ULA and IBSS, but the benefits of v2019.0.2 were quite minimal to not observable.

Concerning the graph @awlx posted: it was from our Grafana, showing the average mgmt traffic of all nodes as reported via respondd. So it only became observable for us as the release got rolled out to more nodes...

Concerning a capture: I will try to do a capture on the VPN interface. Do you have a link or a simple command I can use to create it?
Hi @krombel, you can filter just for the mgmt traffic (== OGMs, originator messages) with a tcpdump capture filter that matches the batman-adv ethertype and the BATMAN_IV OGM packet type. An equivalent display filter is available in tshark and Wireshark via their batadv dissector.
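A minimal sketch of such a command, assuming the batman-adv ethertype 0x4305, OGM packet type 0x00 as the first byte after an untagged 14-byte Ethernet header, and `mesh-vpn` as a placeholder interface name:

```sh
# Capture only BATMAN_IV OGMs on the VPN interface:
#   ether proto 0x4305 -> batman-adv ethertype
#   ether[14] = 0x00   -> batadv packet type (BATADV_IV_OGM)
tcpdump -i mesh-vpn -evn 'ether proto 0x4305 and ether[14] = 0x00'
```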
Hm, weird. Extra wifi interfaces in the new firmware could have been an explanation, as the "mgmt" count is the sum of originator messages sent and received over all interfaces used by batman-adv. The mgmt_tx counter for instance is increased here, but the caller, batadv_iv_ogm_send_to_if(), is called for each outgoing interface.

Also, keeping in mind that we send OGMs three times on a wifi interface (to compensate for potential wifi packet loss) vs. once on any other interface (including VPN interfaces), this could have roughly matched your total increase: 3+1=4 / 3+3+1=7 packets (single radio / dual radio) before, and 3+3+1=7 / 3+3+3+3+1=13 packets (single radio / dual radio) after (disregarding a few other factors like OGM packet aggregation and rebroadcast suppression, which depend on the topology).

Have you updated all your nodes to your site v2019.0.2 by now? Could you check whether the Gluon upgrade scripts have successfully removed the IBSS interface with your v2019.0.2 upgrade?
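A quick way to check the latter directly on a node (a sketch; `ibss0` and `mesh0` are the usual Gluon interface names, but they may differ per site):

```sh
# Interfaces currently attached to batman-adv; an ibss0 entry here would
# mean the old IBSS interface survived the upgrade.
batctl if

# All wifi interfaces and their modes (IBSS vs. 802.11s mesh point).
iw dev
```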
I did some simple two-node tests in virtual machines. At least in this very simple setup there was no difference in OGMs between batman-adv v2017.0.1, v2018.1 and the current master branch: https://gist.github.com/T-X/90cda122ae30ddd5b860a6df0987fc77
I think that's what I would also tend towards. Gluon v2016.2.x was pre-FQ-CoDel and Gluon 2017 introduced FQ-CoDel, right? That would probably have made a difference. Also, at some point the airtime fairness patch was added (to ath9k?) in OpenWrt (which version? And which Gluon version?).

Also note that we should have reduced layer 2 broadcast overhead with gluon-ebtables-limit-arp, the IGMP/MLD segmentation and the batman-adv broadcast avoidance patches, which would lead to fewer pesky, small broadcast packets and therefore fewer opportunities for wifi packets, including OGMs, to collide.

Ideally, to conclude that the increased mgmt packets are caused by increased wifi reliability: does anyone have TQ values from before and after their update in their database? An overall average (and/or median) before and after the update would be interesting for comparison.
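For a rough spot check on a single node (only a sketch, not a substitute for database statistics; batctl's originator table format differs between versions, so the parsing may need adjusting), the TQ column of `batctl o` could be averaged like this:

```sh
# Average the TQ values (the numbers in parentheses) over all originators.
batctl o | grep -oE '\( *[0-9]+\)' | tr -d '() ' \
  | awk '{ sum += $1; n++ } END { if (n) printf "avg TQ %.1f over %d originators\n", sum/n, n }'
```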
OK, we have now had another community reporting increased OGM/mgmt traffic: Freifunk Kiel, with a Gluon v2018.1.4 updated from batman-adv-legacy (compat 14) to batman-adv (compat 15). They have observed a 4x mgmt traffic increase.

The interesting thing is that with this update they did not update the Gluon version, they just switched the BATMAN variant. Looking at pcap dumps, we observed that for one thing the average OGM size has about doubled due to the added TVLVs. The other 2x factor currently seems to point to a too fast OGM interval: the interval is configured to 5 seconds, but in practice OGMs seem to be transmitted at roughly 2-3 second intervals. We have picked two random nodes, which both showed the same behavior. Does anyone else observe something similar?

I'll see whether I can write something to measure this in more detail and to make an overall statistic. So far, from looking at the code, I can't find anything weird yet, nor any specific changes related to the BATMAN_IV OGM scheduler between compat 14 and compat 15.
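As a starting point, here is a rough sketch for quantifying the OGM overhead seen on an interface (same ethertype/packet-type assumptions as in the capture filter above, `mesh-vpn` is a placeholder name, and note that OGM aggregation means a single frame can carry several OGMs):

```sh
# Capture BATMAN_IV OGM frames on the VPN interface for 60 seconds...
timeout 60 tcpdump -i mesh-vpn -w /tmp/ogm.pcap 'ether proto 0x4305 and ether[14] = 0x00'

# ...then report frame count, average frame length and frame rate.
tcpdump -e -nn -r /tmp/ogm.pcap 2>/dev/null \
  | awk '{ for (i = 1; i <= NF; i++) if ($i == "length") { n++; sum += $(i+1) } }
         END { if (n) printf "%d frames, avg %.1f bytes, %.2f frames/s\n", n, sum/n, n/60 }'
```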
And sorry, that was actually wrong: we do the 3x broadcast for broadcast data packets only, not for OGMs.
Please test the changes proposed here (there are changes for all kinds of versions and OpenWrt/LEDE releases): https://www.open-mesh.org/issues/380#note-4
And for the record, the issue was introduced by batman-adv v2016.3.
Thanks to everyone for the incredibly helpful feedback (graphs, statistics and even Lua evaluation scripts for Wireshark, kudos to @sargon for the latter, ...) and thanks to all these amazing people who are keeping a watchful eye on how our mesh networks behave. You guys and gals are amazing :-).
Looking forward to a backport of that patch!
The patch is now part of Gluon master and, comparing a few graphs above, I think it fixes the issue. If for any reason people think otherwise, please comment and we can reopen. We should backport this and do a v2018.2.2 release soon.
We are also seeing significantly reduced (to less than 60%) management traffic in our 600-node mesh. Thanks a lot to everyone involved in finding this!
Awesome! Btw, do any communities have load statistics before and after this fix? Ideally I'd be interested in the overall load average + median before and after, together with the number of nodes in the domain, and averages/medians filtered for 32MB flash devices. (I want to know whether the number of "background packets" has a noticeable impact on a node's overall load.)
@T-X here you go, I think you can figure out when the update happened ;-)
Although I don't have it in as much detail as you requested, my comment contained a screenshot of the load graph.
Is the load back to the value of Gluon 2016.2.x now?
I'm not sure how this could be reproduced in a comparable way.
Why should this be impossible? There are some communities still running 2016.2.x; they could monitor the change while updating to 2018.2.2.
"While not changing anything else" - I doubt this will apply. And waiting for someone else to upgrade while still running v2016.2.x, which didn't get any security updates for a long time... is not really responsible.
On 2019-07-27 02:28, Andreas Ziegler wrote:
> Why should this be impossible? There are some communities still running 2016.2.x; they could monitor the change while updating to 2018.2.2.
> "While not changing anything else" - I doubt this will apply.
> And waiting for someone else to upgrade while still running v2016.2.x, which didn't get any security updates for a long time... is not really responsible.

Poor node owners...
To have a cleaner environment, it should be possible to just deploy hundreds of VMs in an isolated v2016.2.x setup and the same for v2018.2.x.

Regards,
Tarek
Hm... without any clients? And does someone already have automation for this?
c07326c batman-adv: Fix duplicated OGMs on NETDEV_UP fixes freifunk-gluon#1446 (cherry picked from commit 9e00ecd)
This week we upgraded around 75% of our nodes to Gluon v2017.1.8, while about 20% were using this version already.
Before, the nodes were running Gluon v2016.2.7.
(We still have ~20 nodes running Gluon v2016.1.x - I have no control over those, no autoupdater and no SSH.)
In total, we have about 540 online nodes.
Starting with the upgrade, the mgmt_tx of a single idle node without clients increased from 153 kbit/s to 254 kbit/s (by 66%) and the mgmt_rx from 57 kbit/s to 90 kbit/s (by 58%).
See the attached graph, which shows the increased traffic with lowered spikes over all nodes, as well as the single-node graph. The "all nodes" graph is capped at 1000 kbit/s to be able to see something useful; the spikes would make that impossible without the cap.
@ecsv also reported similar data from his community.