
significantly increased management traffic after upgrade to Gluon v2017.1.x #1446

Closed
rotanid opened this issue Jun 24, 2018 · 33 comments
Labels: 0. type: bug (This is a bug)

@rotanid
Member

rotanid commented Jun 24, 2018

This week we upgraded around 75% of our nodes to Gluon v2017.1.8, while about 20% were already using that version.
Before, the nodes were running Gluon v2016.2.7.
(We still have ~20 nodes running Gluon v2016.1.x - I have no control over those, no autoupdate and no SSH.)
In total, we have about 540 online nodes.

Starting with the upgrade, the mgmt_tx of a single idle node without clients increased from 153 kbit/s to 254 kbit/s (by 66%) and the mgmt_rx from 57 kbit/s to 90 kbit/s (by 58%).

See the attached graphs: one shows the increased traffic (with lowered spikes) over all nodes, the other a single node. The "all nodes" graph is capped at 1000 kbit/s to keep it readable; the spikes would make that impossible without the cap.

@ecsv also reported similar data from his community.

2018-06-24_15-39_gluon_network_mgmt_traffic_increase
2018-06-24_15-41_gluon_node_v201718_traffic_rates
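For reference, per-node rates like the ones above can be approximated directly on a node by sampling the batman-adv byte counters twice. This is only a sketch and assumes the mgmt_*_bytes statistics are exposed via ethtool -S bat0, as on the Gluon/batman-adv versions discussed in this issue:

# Rough mgmt_tx rate estimate: sample the byte counter twice, 60 seconds apart.
A=$(ethtool -S bat0 | awk '/mgmt_tx_bytes/ { print $2 }')
sleep 60
B=$(ethtool -S bat0 | awk '/mgmt_tx_bytes/ { print $2 }')
echo "mgmt_tx: $(( (B - A) * 8 / 60 / 1000 )) kbit/s"   # use mgmt_rx_bytes for the rx rate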

@rotanid rotanid changed the title significantly increased managment traffic in segment after upgrade to Gluon v2017.1.x significantly increased managment traffic after upgrade to Gluon v2017.1.x Jun 24, 2018
@ecsv
Contributor

ecsv commented Jun 26, 2018

It is currently unknown why this happens. I have two main suspects:

  • batman-adv sends more OGMs for an unknown reason
  • the changes in the wireless layer increased the reliability of broadcasts (which are used to transport OGMs)

The effect can only be seen when a significant portion of the nodes is updated from 2016.2.x to 2017.1.x.

Regarding the first (batman-adv) point: Freifunk Vogtland's update raised batman-adv from 2017.0.1 + norebroadcast patches to 2018.1 + maint + norebroadcast patches. Rotanid's update changed batman-adv from 2016.2.x + maint patches + norebroadcast to 2017.2 + maint patches. I would therefore guess that, if batman-adv is the reason, the culprit has to be searched for between batman-adv v2017.0.1..v2017.2.


Maybe somebody is able to reproduce the problem and can install a modified Gluon 2017.1.8 on a lot of nodes in their network. It would be good if that person could then check whether returning to batman-adv 2016.2.x (from Gluon 2016.2.x) reduces the mgmt overhead again. The three patches needed to revert to the older batman-adv version can be found in the branch https://github.com/FreifunkVogtland/gluon/commits/batadv/2016.2.x
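If somebody wants to try this, a minimal sketch for pulling those revert patches into a local Gluon checkout could look like the following (the branch name is taken from the link above; that the three revert commits are the newest three on that branch is an assumption and should be verified with git log first):

# Fetch the FreifunkVogtland branch containing the batman-adv revert patches.
git remote add ffv https://github.com/FreifunkVogtland/gluon.git
git fetch ffv
# Inspect the newest commits and verify these are the three revert patches.
git log --oneline -3 ffv/batadv/2016.2.x
# Apply them on top of the local Gluon 2017.1.x tree.
git cherry-pick ffv/batadv/2016.2.x~3..ffv/batadv/2016.2.x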

@rotanid
Member Author

rotanid commented Jul 1, 2018

FYI, my "spikes" are back (like before the upgrade) - the management traffic remains at the higher level

@lephisto

lephisto commented Jul 15, 2018

I think this also applies to v2018.1. We did our segmentation with the v2018.1 rollout: half the nodes, but increased management traffic.

Update:

image

This describes it quite well. It is a node with nearly no client traffic. The originator interval has not changed (5 s).

@rotanid rotanid changed the title significantly increased managment traffic after upgrade to Gluon v2017.1.x significantly increased management traffic after upgrade to Gluon v2017.1.x Aug 7, 2018
@rotanid rotanid added the 0. type: bug (This is a bug) label Nov 2, 2018
@rotanid
Member Author

rotanid commented Nov 2, 2018

After updating almost all nodes (over 500) to Gluon v2018.1.x with batman-adv 2018.1, mgmt_tx increased by ~10 kbit/s and mgmt_rx by ~40 kbit/s; so it increased again, but not nearly as much as last time.

@neocturne
Member

@ecsv Have there been any new insights regarding this issue?

@ecsv
Contributor

ecsv commented Dec 7, 2018

At least I have received no info, and I also cannot provide any.

@awlx
Member

awlx commented Apr 8, 2019

We (ffmuc) also see a heavy increase in mgmt traffic after updating from Gluon v2016.2.7 to Gluon v2018.2.1.

We have about 1500 nodes and our mgmt traffic nearly doubled.

mgmt-traffic

@neocturne
Member

@T-X Is it possible that this is related to the removal of the no_rebroadcast hack from our batman-adv?

@T-X
Contributor

T-X commented Apr 28, 2019

@NeoRaider: Hm, no, the no_rebroadcast patch should only have touched layer 2 broadcast frames, not mgmt traffic (OGMs). The no_rebroadcast patch was only applied in batadv_send_outstanding_bcast_packet().

Similarly, the new, automatic approach in batman-adv only applies to layer 2 broadcast frames and to BATMAN_V OGMs, but not to BATMAN_IV ones.

@awlx: Urgh, yeah, that graph looks ugly... Is that a capture from a Freifunk router or from a gateway? Would it be possible for you to provide a capture from the VPN interface, filtered by OGMs (with wireshark or tshark you should be able to filter by batadv packet type)?

@T-X
Contributor

T-X commented Apr 28, 2019

Also, do you have a link to the site/domain config used before and after the update, @awlx?

@krombel

krombel commented Apr 28, 2019

Hi @T-X, we as ffmuc had the following configs:

In the hope of reducing that traffic we then disabled ULA and IBSS, but the benefits of v2019.0.2 were minimal to not observable.

Concerning the graph @awlx posted: it is from our Grafana and shows the average mgmt traffic reported by all nodes, collected via respondd. So the increase only became observable for us as the release got rolled out to more nodes...

Concerning a capture: I will try to do one on the VPN interface. Do you have a link or a simple command I can use to create it?

@T-X
Contributor

T-X commented Apr 29, 2019

Hi @krombel,

You can filter just for the mgmt (== OGMs, originator messages) with this command:

$ tcpdump -i mesh-vpn0 'ether proto 0x4305 and ether[14] = 0x00 and ether[15] = 0x0f' -w /tmp/mgmt.cap

The capture filter here means:

  • ether proto 0x4305: the batman-adv Ethernet frame type
  • ether[14] = 0x00: the batman-adv packet type for OGMs (v1), i.e. the first byte after the Ethernet frame header
  • ether[15] = 0x0f: the batman-adv compatibility version number (15, used since 2014), i.e. the second byte after the Ethernet frame header

(An equivalent display filter for tshark or Wireshark would be eth.type == 0x4305 && batadv.batman.packet_type == 0x00 && batadv.iv_ogm.version == 15 in case you might be more familiar with those tools.)
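For example, the resulting capture file could then be inspected like this (just a sketch; /tmp/mgmt.cap is the file written by the tcpdump command above, and the field names follow the Wireshark batadv dissector naming used above - adjust them if your Wireshark version differs):

# List relative timestamp, originator and sequence number of every captured OGM.
tshark -r /tmp/mgmt.cap \
       -Y 'eth.type == 0x4305 && batadv.batman.packet_type == 0x00 && batadv.iv_ogm.version == 15' \
       -T fields -e frame.time_relative -e batadv.iv_ogm.orig -e batadv.iv_ogm.seqno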

@T-X
Contributor

T-X commented Apr 29, 2019

In the hope of reducing that traffic we then disabled ULA and IBSS, but the benefits of v2019.0.2 were minimal to not observable.

Hm, weird. Extra wifi interfaces in the new firmware could have been an explanation, as the "mgmt" count is the sum of originator messages sent and received over all interfaces used by batman-adv. The mgmt_tx counter, for instance, is increased per transmitted OGM, and batadv_iv_ogm_send_to_if() is called once for each outgoing interface.
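As a quick sanity check of both factors one can look directly at a node; this is only a sketch, assuming batctl is installed and that the mgmt_* statistics are exposed via ethtool on the batman-adv version in use:

# List the hard interfaces batman-adv currently uses (OGMs are sent on each of them).
batctl if
# Show the raw management counters behind the mgmt_tx/mgmt_rx graphs.
ethtool -S bat0 | grep mgmt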

Also, keeping in mind that we send OGMs three times on a wifi interface (to compensate for potential wifi packet loss) vs. once on any other interface (including VPN interfaces), this could have roughly matched your total increase: 3+1=4 / 3+3+1=7 packets (single radio / dual radio) before, and 3+3+1=7 / 3+3+3+3+1=13 packets (single radio / dual radio) after (disregarding a few other factors like OGM packet aggregation and rebroadcast suppression, which depend on the topology).

Have you updated all your nodes to your site v2019.0.2 by now? Could you check whether the Gluon upgrade scripts have successfully removed the IBSS interface with your v2019.0.2 upgrade?

@T-X
Contributor

T-X commented Apr 29, 2019

@ecsv

It is currently unknown why this happens. I have two main suspects:

  • batman-adv sends more OGMs for an unknown reason

I did some simple two-node tests in virtual machines. At least in this most simple setup there was no difference in OGMs between batman-adv v2017.0.1, v2018.1 and the current master branch:

https://gist.github.com/T-X/90cda122ae30ddd5b860a6df0987fc77

  • the changes in the wireless layer increased the reliability of broadcasts (which are used to transport OGMs)

I think that is also the explanation I would tend towards. Gluon v2016.2.x was pre-FQ-Codel and Gluon 2017 introduced FQ-Codel, right? That would probably have made a difference. Also, at some point the airtime fairness patch was added (to ath9k?) in OpenWrt (which version? And which Gluon version?).

Also note that we should have reduced the layer 2 broadcast overhead with gluon-ebtables-limit-arp, the IGMP/MLD segmentation and the batman-adv broadcast avoidance patches, which should lead to fewer pesky, small broadcast packets and therefore fewer opportunities for wifi packets, including OGMs, to collide.

Ideally, to conclude that the increased mgmt traffic is caused by increased wifi reliability: any chance anyone has TQ values from before and after their update in their database? An overall average (and/or median) before and after the update would be interesting for comparison.
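For a rough TQ snapshot on an individual node, the originator table can be dumped; a sketch, assuming batctl is available (on older batman-adv/Gluon versions the same table is readable via debugfs):

# Dump the originator table including the TQ value towards each originator.
batctl o
# Alternative on older versions, via debugfs:
cat /sys/kernel/debug/batman_adv/bat0/originators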

@T-X
Contributor

T-X commented May 31, 2019

Ok, we have now had another community reporting an increased OGM/mgmt traffic:

Freifunk Kiel, on Gluon v2018.1.4, updated from batman-adv-legacy (compat 14) to batman-adv (compat 15) and observed a 4x mgmt traffic increase:

https://grafana.freifunk.in-kiel.de/d/000000003/nodeinfo?orgId=1&panelId=3&width=1200&height=600&from=1558188586085&to=1558433114387&var-node=704f57455064&fullscreen

The interesting thing is that with this update they did not update the Gluon version, just switched the BATMAN variant.

Looking at pcap dumps, we observed for one thing that the average OGM size has roughly doubled due to the added TVLVs.

The other 2x factor currently seems to point to a too-fast OGM interval. The interval is configured to 5 seconds, but in practice OGMs seem to be transmitted roughly every 2-3 seconds. We picked two random nodes, which both showed this behavior. Does anyone else observe something similar?

I'll see whether I can write something to measure this in more detail and produce an overall statistic.
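In the meantime, here is a rough sketch of the idea, reusing the capture and display filter from above: extract the timestamps of OGMs from a single originator and average the gaps. The MAC address is a placeholder; note that in a meshed capture the same OGM is also rebroadcast by neighbours, so either capture on a point-to-point link or additionally filter on eth.src to avoid counting duplicates:

# Average inter-arrival time of OGMs originated by one node (placeholder MAC).
tshark -r /tmp/mgmt.cap \
       -Y 'batadv.iv_ogm.orig == 00:11:22:33:44:55' \
       -T fields -e frame.time_epoch |
  awk 'NR > 1 { sum += $1 - prev; n++ } { prev = $1 }
       END { if (n) printf "average OGM interval: %.2f s\n", sum / n }'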

So far, from looking at the code, I can't find anything weird yet, nor any specific changes related to the BATMAN IV OGM scheduler between compat 14 and compat 15.

@T-X
Contributor

T-X commented May 31, 2019

Also, keeping in mind that we send OGMs three times on a wifi interface (to compensate for potential wifi packet loss)

And sorry, that was actually wrong: we do the 3x transmission for broadcast data packets only, not for OGMs.

@ecsv
Contributor

ecsv commented Jun 2, 2019

Please test the changes proposed here (changes for all kinds of batman-adv and OpenWrt/LEDE versions): https://www.open-mesh.org/issues/380#note-4

@T-X
Contributor

T-X commented Jun 2, 2019

And for the record, the issue was introduced by batman-adv v2016.3.

  • Gluon v2016.2.7: batman-adv v2016.2 => unaffected
  • Gluon v2017.1 (or greater): batman-adv v2017.1 (and greater) => affected

Thanks to everyone for the incredibly helpful feedback (graphs, statistics and even Lua evaluation scripts for Wireshark - kudos to @sargon for the latter) and thanks to all these amazing people who keep a watchful eye on how our mesh networks behave. You guys and gals are amazing :-).

@ecsv
Contributor

ecsv commented Jun 2, 2019

I have now installed it here in one domain; attached is an image with the state of 3 randomly chosen nodes before the patch (left part of the graph) and after the patch (smaller right part). The gap in between is the time when the nodes were updating.

2019-06-02_mgmt-before-patch_after-patch

@mweinelt
Contributor

mweinelt commented Jun 2, 2019

Looking forward to a backport of that patch!

@mweinelt
Contributor

mweinelt commented Jun 2, 2019

The patch is now part of Gluon master, and comparing a few graphs above, I think that fixes it.

If for any reason people think otherwise please comment and we can reopen.

We should backport this and do a v2018.2.2 release soon.

@ecsv
Contributor

ecsv commented Jun 5, 2019

Here is the graph for a server which is connected to all domains (so you see a combined graph of everything). 88% of the 451 nodes were updated to the new firmware:

2019-06-05_ffv-all-domains

The mgmt (down/up) went from 284.2/210.4 kbps to 153.1/111.5 kbps.

mweinelt added a commit that referenced this issue Jun 8, 2019
c07326c batman-adv: Fix duplicated OGMs on NETDEV_UP

fixes #1446

(cherry picked from commit 9e00ecd)
@rotanid
Member Author

rotanid commented Jul 19, 2019

After having had ~70 out of ~650 online nodes running the fix for a few weeks, we have finally deployed it to the stable branch.
The load average over all nodes peaked at 0.42 before the rollout and has been at 0.36 over the last two days, although the overall load looks more like it dropped by 30-40%.
2019-07-19_23-43-26_load_avg
The average mesh mgmt traffic is down from 292/364 to 162/201 kbit/s in our network (note: the previous value was measured with around 10% of the nodes already fixed).
2019-07-19_23-47-53_mesh_traffic

@RalfJung
Contributor

We are also seeing significantly reduced management traffic (down to less than 60% of the previous level) in our 600-node mesh. Thanks a lot to everyone involved in finding this!

@T-X
Contributor

T-X commented Jul 20, 2019

Awesome! Btw, do any communities have load statistics before and after this fix?

Ideally I'd be interested in the overall load average + median before and after, together with the number of nodes in the domain, and in averages/medians filtered for 32MB flash devices.

(I want to know whether the number of "background packets" has a noticeable impact on a node's overall load.)

@sargon
Contributor

sargon commented Jul 20, 2019

@T-X here you go, I think you can figure out when the update had happened ;-)
https://grafana.freifunk.in-kiel.de/d/yA3Quidmk/node-overview?orgId=1&from=1561766040369&to=1563337377334

@rotanid
Member Author

rotanid commented Jul 21, 2019

Awesome! Btw, do any communities have load statistics before and after this fix?

Although I don't have it as detailed as you requested, my comment above contained a screenshot of the load graph.

@rubo77
Contributor

rubo77 commented Jul 25, 2019

Is the load back to the value of Gluon 2016.2.x now?

@rotanid
Member Author

rotanid commented Jul 25, 2019

Is the load back to the value of Gluon 2016.2.x now?

I'm not sure how this could be reproduced in a comparable way.
You would need a larger mesh with hundreds of nodes and test it with v2018.2.2 and v2016.2.x while not changing anything else; this is basically impossible.

@rubo77
Contributor

rubo77 commented Jul 25, 2019

Why should this be impossible? There are some communities still running 2016.2.x, and they could monitor the change while updating to 2018.2.2.

@rotanid
Member Author

rotanid commented Jul 27, 2019

Why should this be impossible? There are some communities still running 2016.2.x, and they could monitor the change while updating to 2018.2.2.

"while not changing anything else" - i doubt this will apply.

And waiting for someone else to upgrade while they are still running v2016.2.x, which hasn't received any security updates for a long time... is not really responsible.
Poor node owners...

@2tata
Contributor

2tata commented Jul 27, 2019 via email

@rotanid
Member Author

rotanid commented Jul 27, 2019

To have a cleaner environment, it should be possible to just deploy hundreds of VMs in an isolated v2016.2.x setup and the same for v2018.2.x. vg Tarek

Hm... without any clients? And does someone already have automation for this?

joerg-d pushed a commit to ffggrz/gluon that referenced this issue Sep 17, 2019
c07326c batman-adv: Fix duplicated OGMs on NETDEV_UP

fixes freifunk-gluon#1446

(cherry picked from commit 9e00ecd)