
significantly increased management traffic after upgrade to Gluon v2017.1.x #1446

Closed
rotanid opened this issue Jun 24, 2018 · 33 comments
Labels: 0. type: bug (This is a bug)

@rotanid
Member

rotanid commented Jun 24, 2018

This week we upgraded around 75% of our nodes to Gluon v2017.1.8, while about 20% were already using that version.
Before, the nodes were running Gluon v2016.2.7.
(We still have ~20 nodes running Gluon v2016.1.x - I have no control over those, no autoupdate and no SSH.)
In total, we have about 540 online nodes.

Starting with the upgrade, the mgmt_tx of a single idle node without clients increased from 153 kbit/s to 254 kbit/s (by 66%) and the mgmt_rx from 57 kbit/s to 90 kbit/s (by 58%).

See the attached graphs: one shows the increased traffic (with lowered spikes) over all nodes, the other a single node. The "all nodes" graph is capped at 1000 kbit/s to keep it readable; the spikes would make that impossible without the cap.

@ecsv also reported similar data from his community.

2018-06-24_15-39_gluon_network_mgmt_traffic_increase
2018-06-24_15-41_gluon_node_v201718_traffic_rates
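For reference, per-node rates like the ones above can be approximated directly on a node by sampling the batman-adv byte counters twice. This is only a sketch and assumes the mgmt_*_bytes statistics are exposed via ethtool -S bat0, as on the Gluon/batman-adv versions discussed in this issue:

# Rough mgmt_tx rate estimate: sample the byte counter twice, 60 seconds apart.
A=$(ethtool -S bat0 | awk '/mgmt_tx_bytes/ { print $2 }')
sleep 60
B=$(ethtool -S bat0 | awk '/mgmt_tx_bytes/ { print $2 }')
echo "mgmt_tx: $(( (B - A) * 8 / 60 / 1000 )) kbit/s"   # use mgmt_rx_bytes for the rx rate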

@rotanid rotanid changed the title significantly increased managment traffic in segment after upgrade to Gluon v2017.1.x significantly increased managment traffic after upgrade to Gluon v2017.1.x Jun 24, 2018
@ecsv
Contributor

ecsv commented Jun 26, 2018

It is currently unknown why this happens. I have two main suspects:

  • batman-adv sends more OGMs for an unknown reason
  • the changes in the wireless layer increased the reliability of broadcasts (which are used to transport OGMs)

The effect can only be seen when a significant portion of the nodes is updated from 2016.2.x to 2017.1.x.

Regarding the first (batman-adv) point: Freifunk Vogtland's update raised batman-adv from 2017.0.1 + norebroadcast patches to 2018.1 + maint + norebroadcast patches. Rotanid's update changed batman-adv from 2016.2.x + maint patches + norebroadcast to 2017.2 + maint patches. I would therefore guess that, if batman-adv is the reason, the culprit has to be searched for between batman-adv v2017.0.1..v2017.2.


Maybe somebody is able to reproduce the problem and can install a modified Gluon 2017.1.8 on a lot of nodes in their network. It would be good if that person could then check whether returning to batman-adv 2016.2.x (from Gluon 2016.2.x) reduces the mgmt overhead again. The three patches needed to revert to the older batman-adv version can be found in the branch https://github.com/FreifunkVogtland/gluon/commits/batadv/2016.2.x
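If somebody wants to try this, a minimal sketch for pulling those revert patches into a local Gluon checkout could look like the following (the branch name is taken from the link above; that the three revert commits are the newest three on that branch is an assumption and should be verified with git log first):

# Fetch the FreifunkVogtland branch containing the batman-adv revert patches.
git remote add ffv https://github.com/FreifunkVogtland/gluon.git
git fetch ffv
# Inspect the newest commits and verify these are the three revert patches.
git log --oneline -3 ffv/batadv/2016.2.x
# Apply them on top of the local Gluon 2017.1.x tree.
git cherry-pick ffv/batadv/2016.2.x~3..ffv/batadv/2016.2.x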

@rotanid
Member Author

rotanid commented Jul 1, 2018

FYI, my "spikes" are back (like before the upgrade) - the management traffic remains at the higher level

@lephisto

lephisto commented Jul 15, 2018

I think this also applies to v2018.1. We did our segmentation with the v2018.1 rollout: half the nodes, but increased management traffic.

Update:

image

This describes it quite well. It is a node with nearly no client traffic. The originator interval has not changed (5 s).

@rotanid rotanid changed the title significantly increased managment traffic after upgrade to Gluon v2017.1.x significantly increased management traffic after upgrade to Gluon v2017.1.x Aug 7, 2018
@rotanid rotanid added the 0. type: bug (This is a bug) label Nov 2, 2018
@rotanid
Member Author

rotanid commented Nov 2, 2018

After updating almost all nodes (over 500) to Gluon v2018.1.x with batman-adv 2018.1, mgmt_tx increased by ~10 kbit/s and mgmt_rx by ~40 kbit/s; so it increased again, but not nearly as much as last time.

@neocturne
Member

@ecsv Have there been any new insights regarding this issue?

@ecsv
Contributor

ecsv commented Dec 7, 2018

At least I have received no info, and I also cannot provide any.

@awlx
Member

awlx commented Apr 8, 2019

We (ffmuc) also see a heavy increase in mgmt traffic after updating from Gluon v2016.2.7 to Gluon v2018.2.1.

We have about 1500 nodes and our mgmt traffic nearly doubled.

mgmt-traffic

@neocturne
Member

@T-X Is it possible that this is related to the removal of the no_rebroadcast hack from our batman-adv?

@T-X
Contributor

T-X commented Apr 28, 2019

@NeoRaider: Hm, no, the no_rebroadcast patch should only have touched layer 2 broadcast frames, not mgmt traffic (OGMs). The no_rebroadcast patch was only applied in batadv_send_outstanding_bcast_packet().

Similarly, the new, automatic approach in batman-adv only applies to layer 2 broadcast frames and to BATMAN_V OGMs, but not to BATMAN_IV ones.

@awlx: Urgh, yeah, that graph looks ugly... Is that a capture from a Freifunk router or from a gateway? Would it be possible for you to provide a capture from the VPN interface, filtered by OGMs (with wireshark or tshark you should be able to filter by batadv packet type)?

@T-X
Contributor

T-X commented Apr 28, 2019

Also, do you have a link to the site/domain config used before and after the update, @awlx?

@krombel

krombel commented Apr 28, 2019

Hi @T-X, we as ffmuc had the following configs:

In the hope of reducing that traffic we then disabled ULA and IBSS, but the benefits of v2019.0.2 were minimal to not observable.

Concerning the graph @awlx posted: it is from our Grafana and shows the average mgmt traffic reported by all nodes, collected via respondd. So the increase only became observable for us as the release got rolled out to more nodes...

Concerning a capture: I will try to do one on the VPN interface. Do you have a link or a simple command I can use to create it?

@T-X
Contributor

T-X commented Apr 29, 2019

Hi @krombel,

You can filter just for the mgmt (== OGMs, originator messages) with this command:

$ tcpdump -i mesh-vpn0 'ether proto 0x4305 and ether[14] = 0x00 and ether[15] = 0x0f' -w /tmp/mgmt.cap

The capture filter here means:

  • ether proto 0x4305: the batman-adv Ethernet frame type
  • ether[14] = 0x00: the batman-adv packet type for OGMs (v1), i.e. the first byte after the Ethernet frame header
  • ether[15] = 0x0f: the batman-adv compatibility version number (15, used since 2014), i.e. the second byte after the Ethernet frame header

(An equivalent display filter for tshark or Wireshark would be eth.type == 0x4305 && batadv.batman.packet_type == 0x00 && batadv.iv_ogm.version == 15 in case you might be more familiar with those tools.)
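For example, the resulting capture file could then be inspected like this (just a sketch; /tmp/mgmt.cap is the file written by the tcpdump command above, and the field names follow the Wireshark batadv dissector naming used above - adjust them if your Wireshark version differs):

# List relative timestamp, originator and sequence number of every captured OGM.
tshark -r /tmp/mgmt.cap \
       -Y 'eth.type == 0x4305 && batadv.batman.packet_type == 0x00 && batadv.iv_ogm.version == 15' \
       -T fields -e frame.time_relative -e batadv.iv_ogm.orig -e batadv.iv_ogm.seqno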

@T-X
Contributor

T-X commented Apr 29, 2019

In the hope of reducing that traffic we then disabled ULA and IBSS, but the benefits of v2019.0.2 were minimal to not observable.

Hm, weird. Extra wifi interfaces in the new firmware could have been an explanation, as the "mgmt" count is the sum of originator messages sent and received over all interfaces used by batman-adv. The mgmt_tx counter, for instance, is increased per transmitted OGM, and batadv_iv_ogm_send_to_if() is called once for each outgoing interface.
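As a quick sanity check of both factors one can look directly at a node; this is only a sketch, assuming batctl is installed and that the mgmt_* statistics are exposed via ethtool on the batman-adv version in use:

# List the hard interfaces batman-adv currently uses (OGMs are sent on each of them).
batctl if
# Show the raw management counters behind the mgmt_tx/mgmt_rx graphs.
ethtool -S bat0 | grep mgmt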

Also, keeping in mind that we send OGMs three times on a wifi interface (to compensate for potential wifi packet loss) vs. once on any other interface (including VPN interfaces), this could have roughly matched your total increase: 3+1=4 / 3+3+1=7 packets (single radio / dual radio) before, and 3+3+1=7 / 3+3+3+3+1=13 packets (single radio / dual radio) after (disregarding a few other factors like OGM packet aggregation and rebroadcast suppression, which depend on the topology).

Have you updated all your nodes to your site v2019.0.2 by now? Could you check whether the Gluon upgrade scripts have successfully removed the IBSS interface with your v2019.0.2 upgrade?

@T-X
Contributor

T-X commented Apr 29, 2019

@ecsv

It is currently unknown why this happens. I have two main suspects:

  • batman-adv sends more OGMs for an unknown reason

I did some simple two-node tests in virtual machines. At least in this most simple setup there was no difference in OGMs between batman-adv v2017.0.1, v2018.1 and the current master branch:

https://gist.github.com/T-X/90cda122ae30ddd5b860a6df0987fc77

  • the changes in the wireless layer increased the reliability of broadcasts (which are used to transport OGMs)

I think that is also the explanation I would tend towards. Gluon v2016.2.x was pre-FQ-Codel and Gluon 2017 introduced FQ-Codel, right? That would probably have made a difference. Also, at some point the airtime fairness patch was added (to ath9k?) in OpenWrt (which version? And which Gluon version?).

Also note that we should have reduced the layer 2 broadcast overhead with gluon-ebtables-limit-arp, the IGMP/MLD segmentation and the batman-adv broadcast avoidance patches, which should lead to fewer pesky, small broadcast packets and therefore fewer opportunities for wifi packets, including OGMs, to collide.

Ideally, to conclude that the increased mgmt traffic is caused by increased wifi reliability: any chance anyone has TQ values from before and after their update in their database? An overall average (and/or median) before and after the update would be interesting for comparison.
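For a rough TQ snapshot on an individual node, the originator table can be dumped; a sketch, assuming batctl is available (on older batman-adv/Gluon versions the same table is readable via debugfs):

# Dump the originator table including the TQ value towards each originator.
batctl o
# Alternative on older versions, via debugfs:
cat /sys/kernel/debug/batman_adv/bat0/originators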

@T-X
Contributor

T-X commented May 31, 2019

Ok, we have now had another community reporting an increased OGM/mgmt traffic:

Freifunk Kiel, on Gluon v2018.1.4, updated from batman-adv-legacy (compat 14) to batman-adv (compat 15) and observed a 4x mgmt traffic increase:

https://grafana.freifunk.in-kiel.de/d/000000003/nodeinfo?orgId=1&panelId=3&width=1200&height=600&from=1558188586085&to=1558433114387&var-node=704f57455064&fullscreen

The interesting thing is that with this update they did not update the Gluon version, just switched the BATMAN variant.

Looking at pcap dumps, we observed for one thing that the average OGM size has roughly doubled due to the added TVLVs.

The other 2x factor currently seems to point to a too-fast OGM interval. The interval is configured to 5 seconds, but in practice OGMs seem to be transmitted roughly every 2-3 seconds. We picked two random nodes, which both showed this behavior. Does anyone else observe something similar?

I'll see whether I can write something to measure this in more detail and produce an overall statistic.
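In the meantime, here is a rough sketch of the idea, reusing the capture and display filter from above: extract the timestamps of OGMs from a single originator and average the gaps. The MAC address is a placeholder; note that in a meshed capture the same OGM is also rebroadcast by neighbours, so either capture on a point-to-point link or additionally filter on eth.src to avoid counting duplicates:

# Average inter-arrival time of OGMs originated by one node (placeholder MAC).
tshark -r /tmp/mgmt.cap \
       -Y 'batadv.iv_ogm.orig == 00:11:22:33:44:55' \
       -T fields -e frame.time_epoch |
  awk 'NR > 1 { sum += $1 - prev; n++ } { prev = $1 }
       END { if (n) printf "average OGM interval: %.2f s\n", sum / n }'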

So far, from looking at the code, I can't find anything weird yet, nor any specific changes related to the BATMAN IV OGM scheduler between compat 14 and compat 15.

@T-X
Contributor

T-X commented May 31, 2019

Also, keeping in mind that we send OGMs three times on a wifi interface (to compensate for potential wifi packet loss)

And sorry, that was actually wrong: we do the 3x transmission for broadcast data packets only, not for OGMs.

@ecsv
Contributor

ecsv commented Jun 2, 2019

Please test the changes proposed here (changes for all kinds of batman-adv and OpenWrt/LEDE versions): https://www.open-mesh.org/issues/380#note-4

@T-X
Contributor

T-X commented Jun 2, 2019

And for the record, the issue was introduced by batman-adv v2016.3.

  • Gluon v2016.2.7: batman-adv v2016.2 => unaffected
  • Gluon v2017.1 (or greater): batman-adv v2017.1 (and greater) => affected

Thanks to everyone for the incredibly helpful feedback (graphs, statistics and even Lua evaluation scripts for Wireshark - kudos to @sargon for the latter) and thanks to all these amazing people who keep a watchful eye on how our mesh networks behave. You guys and gals are amazing :-).

@ecsv
Contributor

ecsv commented Jun 2, 2019

I have now installed it here in one domain; attached is an image with the state of 3 randomly chosen nodes before the patch (left part of the graph) and after the patch (smaller right part). The gap in between is the time when the nodes were updating.

2019-06-02_mgmt-before-patch_after-patch

@mweinelt
Contributor

mweinelt commented Jun 2, 2019

Looking forward to a backport of that patch!

@mweinelt
Contributor

mweinelt commented Jun 2, 2019

The patch is now part of Gluon master, and comparing a few graphs above, I think that fixes it.

If for any reason people think otherwise please comment and we can reopen.

We should backport this and do a v2018.2.2 release soon.

@ecsv
Contributor

ecsv commented Jun 5, 2019

Here is the graph for a server which is connected to all domains (so you see a combined graph of everything). 88% of the 451 nodes were updated to the new firmware:

2019-06-05_ffv-all-domains

The mgmt (down/up) went from 284.2/210.4 kbps to 153.1/111.5 kbps.

mweinelt added a commit that referenced this issue Jun 8, 2019
c07326c batman-adv: Fix duplicated OGMs on NETDEV_UP

fixes #1446

(cherry picked from commit 9e00ecd)
@rotanid
Member Author

rotanid commented Jul 19, 2019

After having had ~70 out of ~650 online nodes running the fix for a few weeks, we have finally deployed it to the stable branch.
The load average over all nodes peaked at 0.42 before the rollout and has been at 0.36 over the last two days, although the overall load looks more like it dropped by 30-40%.
2019-07-19_23-43-26_load_avg
The average mesh mgmt traffic is down from 292/364 to 162/201 kbit/s in our network (note: the previous value was measured with around 10% of the nodes already fixed).
2019-07-19_23-47-53_mesh_traffic

@RalfJung
Contributor

We are also seeing significantly reduced management traffic (down to less than 60% of the previous level) in our 600-node mesh. Thanks a lot to everyone involved in finding this!

@T-X
Contributor

T-X commented Jul 20, 2019

Awesome! Btw, do any communities have load statistics before and after this fix?

Ideally I'd be interested in the overall load average + median before and after, together with the number of nodes in the domain, and in averages/medians filtered for 32MB flash devices.

(I want to know whether the number of "background packets" has a noticeable impact on a node's overall load.)

@sargon
Contributor

sargon commented Jul 20, 2019

@T-X here you go, I think you can figure out when the update had happened ;-)
https://grafana.freifunk.in-kiel.de/d/yA3Quidmk/node-overview?orgId=1&from=1561766040369&to=1563337377334

@rotanid
Member Author

rotanid commented Jul 21, 2019

Awesome! Btw, do any communities have load statistics before and after this fix?

Although I don't have it as detailed as you requested, my comment above contained a screenshot of the load graph.

@rubo77
Contributor

rubo77 commented Jul 25, 2019

Is the load back to the value of Gluon 2016.2.x now?

@rotanid
Member Author

rotanid commented Jul 25, 2019

Is the load back to the value of Gluon 2016.2.x now?

I'm not sure how this could be reproduced in a comparable way.
You would need a larger mesh with hundreds of nodes and test it with v2018.2.2 and v2016.2.x while not changing anything else; this is basically impossible.

@rubo77
Contributor

rubo77 commented Jul 25, 2019

Why should this be impossible? There are some communities still running 2016.2.x, and they could monitor the change while updating to 2018.2.2.

@rotanid
Member Author

rotanid commented Jul 27, 2019

Why should this be impossible? There are some communities still running 2016.2.x, and they could monitor the change while updating to 2018.2.2.

"while not changing anything else" - i doubt this will apply.

And waiting for someone else to upgrade while they are still running v2016.2.x, which hasn't received any security updates for a long time... is not really responsible.
Poor node owners...

@2tata
Contributor

2tata commented Jul 27, 2019 via email

@rotanid
Member Author

rotanid commented Jul 27, 2019

To have a cleaner environment, it should be possible to just deploy hundreds of VMs in an isolated v2016.2.x setup and the same for v2018.2.x. vg Tarek

Hm... without any clients? And does someone already have automation for this?

joerg-d pushed a commit to ffggrz/gluon that referenced this issue Sep 17, 2019
c07326c batman-adv: Fix duplicated OGMs on NETDEV_UP

fixes freifunk-gluon#1446

(cherry picked from commit 9e00ecd)