gluon-ebtables-limit-arp: a package for ARP rate-limiting #1113
Conversation
Thank you for the package. I noticed that the Lua scripts are in files/ instead of luasrc/, which means they will not be processed by the minifier. Is that intentional? I suggest moving them to luasrc/.
adding gluon ebtables support from T-X freifunk-gluon/gluon#1113
@christf: The main reason was basically that I forgot to add the minifier in the end :-). A minor reason for keeping it un-minified would be to have an example in this ebtables updater and to allow other communities to easily try different parameters for the rate limiting, or to extend it. These were values which seemed to make sense to me from experience with nodes at Freifunk Lübeck.
Btw., one way to extend it I was thinking about was to add another ebtables worker which periodically looks at an iw station dump for the AP interface or "batctl tl" and adds one ebtables rate-limiting rule per local MAC address, but with stricter limits. That way persistent scanners, like "smart" printers etc., which usually keep their MAC address, would be moved out of the general per-node limits completely. I wanted to start simple with this first PR though :-). But maybe someone would like to play with extensions like that already and would rather not dig up the original, non-minified version; for them, a non-minified version could be a helpful starting point. But I can also just add the minifier call :D (give me a second).
This looks very interesting, especially the batman-adv exclusion. 50 ARPs looks a bit high; do you expect such a high value after a boot or network split, or why did you choose such a high number?
I've tested this package. If I arp-scan a subnet, the first ~50 packets get through, then the rate drops to 1/second, as expected. But the rate limiter resets after 1 minute. So you'll get a burst of 50 packets every minute, with the rate dropping back to 1/s between those bursts. Is this expected? I thought that as long as the input rate is too high, the rate limiter stays active until the rate drops below the limit.
At least the ebtables docs don't say anything about a reset of the burst limit while the 1/sec limit is constantly being hit:
Yeah, that's what I read, but my test showed this happening every 60 seconds. :)
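For context, the --limit/--limit-burst semantics the ebtables manpage describes are those of a token bucket. A minimal C sketch (simplified units; this is not the kernel's ebt_limit code) illustrates why a correctly working limiter would refill credit gradually instead of resetting once a minute:

```c
/* Token-bucket sketch of ebtables' --limit/--limit-burst semantics
 * (hypothetical helper, not the kernel implementation): the bucket
 * holds up to `burst` tokens, refills at `rate` tokens per second,
 * and a packet passes only if a whole token is available. */
struct bucket {
	double tokens;   /* current credit */
	double rate;     /* tokens added per second */
	double burst;    /* bucket capacity */
	double last;     /* timestamp of last update, in seconds */
};

static int bucket_allow(struct bucket *b, double now)
{
	/* refill credit for the time elapsed, capped at the burst size */
	b->tokens += (now - b->last) * b->rate;
	if (b->tokens > b->burst)
		b->tokens = b->burst;
	b->last = now;

	if (b->tokens >= 1.0) {
		b->tokens -= 1.0;
		return 1; /* packet passes */
	}
	return 0; /* packet dropped */
}
```

With rate 1/s and burst 50, a scan drains the 50-token credit and subsequent packets pass at one per second; the full burst only comes back after roughly 50 quiet seconds, never on a fixed one-minute schedule, so a reset every 60 seconds points at something outside the limiter itself.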
Btw., has anyone tested this in a live network? While we wait for the core review, this would be another TODO, I guess.
@rotanid Please specify "this". Does that refer to the 60-second behaviour Ruben is talking about, or do you mean the package in the PR?
@rotanid I've changed this package slightly, reducing outgoing ARPs to a static 2/second without the burst option. This is working fine on a dev version, but needs to be tested at larger scale.
@Adorfer the package in general.
@rotanid My best guess is that on those low-memory devices the burst option just resets once a minute or so to reduce the load on the machine, since otherwise all dropped packets would have to be tracked by the filter too. Or the documentation is just not really clear on the actual behavior. In most cases a "2 ARPs per node" filter should be sufficient, since it would only be exceeded if many devices roamed from one node to another while exchanging data with many other clients on the network. So unless there are many users doing conference calls with many unicast connections while roaming, the threshold shouldn't be reached. I watched our networks for a while: 8-10 ARPs/second is an average number in networks with >400 clients, and most of them are issued by the gateway searching for old devices. Many clients tend to send ARPs as a burst of 10 packets in a row (seen on old Android devices from HTC), as well as many ARPs in a row with a small gap when the round trip cannot be completed anyway. So I doubt a hard limit of 2/sec would break any useful use case. :)
Hm, that sounds really low though :D What about a lower burst, so that a reset per minute isn't bad?
With the 50x burst I was thinking about peer-to-peer applications. I think the Transmission BitTorrent client, for instance, allows 50 simultaneous connections by default. So the idea was to stay conservative once you start your P2P application and avoid potential, spurious hiccups. While it is currently probably very unlikely that you'd have 50 peers from within the local mesh rather than from somewhere else on the Internet, I was opting for a conservative approach, as spurious behavior/issues are always a pain to debug later.
@T-X But in this scenario you don't need a lot of ARPs either: peer finding is done via multicast, which should result in no ARPs, and as long as some traffic flows between the peers, a lookup while roaming is not necessary. The main issue with this filter approach are roamings which result in a >60 seconds connection loss; in that case the ARP cache might be invalid while the neighbor IPs still look valid to the BitTorrent client. Here the restart would be limited by the ARP filter, but not by much, since 50 connections could still be established in 25 seconds. There might be an additional long-term problem with all ARP filtering: if one client sends a lot of ARPs, all other clients might be unable to get an ARP through the filter. We might want to fix this with a filter like 2/sec per hashtable(mac) plus a general limit of 4/sec per node, or similar.
That's a totally different layer. After peer discovery IPv4 still needs to resolve the discovered IP addresses to MAC addresses via ARP.
I fail to see how. Upon roaming, a device keeps its IP address(es) and connections. A device won't reissue ARP Requests on roaming, for instance.
Right. See my comment in #1113 (comment)
Live for several weeks now, testing since 26 May(?). Currently the "20/s" limit is triggering (which is, as discussed above, probably still far too high). This totally mitigated the issues discussed in https://forum.freifunk.net/t/ffdus-bitte-keine-portscans-auf-16/14469
What is the scenario: a UnifiAP-Pro-LR rebooting, 50 clients connecting instantly.
I'm not saying "it's more like you suggested", but perhaps, at least for the "per node limit", something like "wait for at least 5 minutes of runtime" could be necessary in extremely large setups... or at least a UCI setting for x86 offloaders which feed large radio infrastructures.
That's a totally different layer. After peer discovery, IPv4 still needs to resolve the discovered IP addresses to MAC addresses via ARP.
Well, the computer which receives the peer-discovery packet should (IIRC) learn the MAC/IP relation from the sending MAC/IP of the multicast packet, shouldn't it?
The main issue with this filter-approach are roamings which result in a >60 seconds connection loss [...]
I fail to see how. Upon roaming, a device keeps its IP address(es) and connections. A device won't reissue ARP Requests on roaming, for instance.
If a roaming takes too long, e.g. reconnecting to the next AP for a minute, the ARP cache entry might be marked as stale and an ARP might be issued again to validate the relation between IP and MAC address, depending on the OS.
I've seen my (Linux) notebook issue an ARP for the default gateway every now and then in very low-traffic situations; in this case batman-adv responds from its ARP cache.
If I get something wrong here, feel free to correct me, I'm not that deep into this topic.
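The "2/sec per hashtable(mac) plus a per-node cap" idea floated earlier could be sketched roughly like this (all names, limits, and the fixed-size hash are illustrative assumptions, not code from this PR):

```c
#include <string.h>
#include <stdint.h>

/* Hypothetical sketch: a fixed-size hash of per-client counters plus one
 * node-wide counter, both reset once per second. Limits and sizes are
 * illustrative only. */
#define CLIENT_LIMIT 2   /* ARPs per second per client MAC */
#define NODE_LIMIT   4   /* ARPs per second per node, in total */
#define HASH_SLOTS   64

struct arp_limiter {
	unsigned client_count[HASH_SLOTS];
	unsigned node_count;
	long window;             /* which one-second window the counts belong to */
};

static unsigned mac_hash(const uint8_t mac[6])
{
	unsigned h = 0;
	for (int i = 0; i < 6; i++)
		h = h * 31 + mac[i];
	return h % HASH_SLOTS;
}

/* Returns 1 if the ARP packet may pass, 0 if it should be dropped. */
static int arp_allow(struct arp_limiter *l, const uint8_t mac[6], long now)
{
	if (now != l->window) {  /* new one-second window: reset all counters */
		memset(l->client_count, 0, sizeof(l->client_count));
		l->node_count = 0;
		l->window = now;
	}
	unsigned slot = mac_hash(mac);
	if (l->client_count[slot] >= CLIENT_LIMIT || l->node_count >= NODE_LIMIT)
		return 0;
	l->client_count[slot]++;
	l->node_count++;
	return 1;
}
```

Two caveats of this naive version: hash collisions make unrelated clients share a counter, and fixed one-second windows allow short back-to-back bursts at window boundaries, which is exactly what ebtables' token-bucket limit avoids.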
Hm, you're right, I can reproduce the issue with mausezahn. This is what the Wireshark IO Graph shows me: I'll try to play with some other burst settings later today/tomorrow. Also, I'll try to understand the short but weird ebtables netfilter ARP limit code; maybe there is a bug in there. @RubenKelevra, how did you deactivate the burst option? When I set just a 1/s limit and do not use the burst option, it still shows up with burst=5 in "ebtables -L" for me.
@T-X I've just used 4/s without the burst option and tested it briefly, but didn't look at the ebtables output; the similarity of 4 to 5 should explain why I didn't notice it. But if the burst value is the same as the limit and the timeframe is seconds, it should act as if there were no burst value at all, shouldn't it? What do you think about a qdisc-based approach with a tag? This should give us the ability to push it through a fair-queuing scheduler with packet pacing like CAKE, and we wouldn't block different clients if one client is flooding.
I found out what the issue is: "ebtables --atomic-save" does not save the state of the limiter. So it is actually the ebtables-worker run via micrond every minute that resets the burst limit back to 50... I'll look into / ask around whether preserving this state is somehow possible. If someone would like to have a look at alternatives in the meantime, like qdisc, please do! (Although I'm wondering whether there is something simpler than cake/qdisc. We do not need fair bandwidth distribution; simply limiting by counting instead of queueing should technically be sufficient and less memory-hungry.)
The whole atomic-save code is something I don't like about this PR. I would rather like to see a small daemon (written in C?) keeping track of all the MAC addresses; that way, we never need to clear the whole table, but can just remove entries explicitly.
So, basically, removing all this "--atomic-save/commit" and setenv in local function worker() in /usr/sbin/ebtables-worker seems to fix the limit reset. However, I'm a little worried that adding/removing several rules dynamically per minute instead of this one atomic commit might have a noticeable impact on performance (and it would currently cause a hiccup due to the flush-and-rebuild approach if it were done on the live table, for sure). Edit: @NeoRaider, saw your reply just now, after writing this. Hm, yes, that might work too, and is probably easier than patching ebtables' atomic commands to save the limit state. I'll look into that, thanks!
Is there progress? The idea of this package is really promising :)
Hi @rotanid. Thanks for the interest, and sorry for my silence on this for some time. I have implemented a daemon in C. But even with these fine-grained rule updates to ebtables, without the full-table ebtables atomic commits, it seemed that the limiters were reset somehow. It looks like I will need to take a deeper look at the ebtables code again. I hope to be able to do this next week. Cheers
"--atomic-save/commit" seems to be unrelated to the limit reset issue. Any "ebtables -F/-I/-A/-D" seems to trigger the reset :-(. I'll try to find out whether it is possible to detect a reinitialization and keep the old state from within ebt_limit_mt_check() in the Linux kernel code... Regarding an implementation within batman-adv: I would like to avoid opening that can of worms, at least for now. The concern is not the overhead for maintaining the state (and yes, the batman-adv translation table uses a hash table and has an entry for each client MAC address), but that introducing various filters here and there in batman-adv, without an overall architectural concept, might introduce non-negligible performance penalties for the real data.
Force-pushed b4f72c3 to 3723665
Phew, I think I found the issue in the kernel. I hope this two-line fix does the trick. At least it seems to work according to my Wireshark captures. Further changes:
Just tested the patch "gluon-ebtables: Enable concurrent ebtables updates" on x86_64 and noticed that it doesn't create any rules during boot (though the chains are created):
But they are created when I restart /etc/init.d/gluon-ebtables. It might be important to know that I've added the patch to v2017.1.4. The device was booted using
One of the problems seems to be that the locking implementation is... weird. Please add the patch https://git.netfilter.org/ebtables/commit/?id=6a826591878db3fa9e2a94b87a3d5edd8e0fc442 to fix it. The patch for 2017.1.4 can be found in FreifunkVogtland@918d3c7 But this also doesn't fix the actual problem which I see: my chains are still empty. EDIT: The other problem is another bug in gluon-radv-filterd
Force-pushed 3723665 to 77e45b5
Changelog:
Regarding yesterday's discussion of changing "batctl dc -H -n" to a query via /sys/kernel/debug/, to avoid spawning a new process: I'd need to change "batctl tl -H -n" to use netlink too, then. I think it'd make more sense to leave it as is for now and replace both batctl calls with netlink once batman-adv has netlink support for the DAT cache.
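For illustration, extracting the client MAC from one line of "batctl tl -H -n" output could look like the following sketch. The field layout is an assumption based on the outputs quoted in this thread, and parse_tl_mac is a hypothetical helper, not the daemon's actual code:

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Pull the first MAC-address token out of one line of batctl
 * translation-table output, e.g.
 *   " * 64:70:02:d1:62:22 [.P...]   0.000"        (legacy batctl)
 *   "ac:86:74:00:ec:00   -1 [.P....]   0.000 ..." (batctl 2017.x)
 * Returns 0 on success, -1 if no MAC was found. */
static int parse_tl_mac(const char *line, char mac[18])
{
	const char *p = line;

	/* skip leading markers such as " * " in legacy output */
	while (*p && !isxdigit((unsigned char)*p))
		p++;

	/* a MAC is 17 chars of hex digits and colons: aa:bb:cc:dd:ee:ff */
	if (sscanf(p, "%17[0-9a-fA-F:]", mac) != 1)
		return -1;
	if (strlen(mac) != 17)
		return -1;
	return 0;
}
```

A real parser would still want to validate against the batctl version in use, since (as discussed below) the header behaviour of "-H" differs between batctl releases.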
	return;
}

// Throw away a header line
Why are you throwing a header line away when you call batctl without header lines (-H)?
I see this output with batman-adv-legacy / batctl v2013.4:
root@freifunk647002d16222:~# batctl -v
batctl 2013.4.0 [batman-adv: 2013.4.0]
root@freifunk647002d16222:~# batctl tl
Locally retrieved addresses (from bat0) announced via TT (TTVN: 2 CRC: 0x17b1):
Client Flags Last seen
* 64:70:02:d1:62:22 [.P...] 0.000
root@freifunk647002d16222:~# batctl tl -H
Client Flags Last seen
* 64:70:02:d1:62:22 [.P...] 0.000
root@freifunk647002d16222:~#
So that looks like a bug in that batctl version, hm. Could someone tell me the output they see with a more recent compat15 batman-adv/batctl?
Oh, funny, that was actually a patch from me fixing it some years ago:
https://git.open-mesh.org/batctl.git/commit/4aa06a7c2935158a2ff5ffea2995b011a62e229e
I'll backport it for batctl-legacy.
PS: And yes, current batctl does not need that extra skipped line, just confirmed it on some device. Thanks for reporting!
root@PL-See68-DG:~# batctl -v
batctl 2017.2 [batman-adv: 2017.2]
root@PL-See68-DG:~# batctl tl
[B.A.T.M.A.N. adv 2017.2, MainIF/MAC: primary0/0e:79:de:08:8b:d3 (bat0/ac:86:74:00:ec:00 BATMAN_IV), TTVN: 215]
Client VID Flags Last seen (CRC )
ac:86:74:00:ec:00 -1 [.P....] 0.000 (0x8ace0d4d)
f4:09:d8:2b:51:78 -1 [....W.] 167.190 (0x8ace0d4d)
ac:86:74:00:ec:00 0 [.P....] 0.000 (0x55a68955)
root@PL-See68-DG:~# batctl tl -H
ac:86:74:00:ec:00 -1 [.P....] 0.000 (0x8ace0d4d)
f4:09:d8:2b:51:78 -1 [....W.] 170.620 (0x8ace0d4d)
ac:86:74:00:ec:00 0 [.P....] 0.000 (0x55a68955)
SECTION:=gluon
CATEGORY:=Gluon
TITLE:=Ebtables limiter for ARP packets
DEPENDS:=+gluon-core +gluon-ebtables
You call batctl but don't depend on it?
Good point!
Adding batctl as a dependency directly was a little tricky though. For compat15 we do have a batctl package. However, for batman-adv-legacy/compat14, batctl is included within the gluon-mesh-batman-adv-14 package.
Therefore I now used gluon-mesh-batman-adv as a dependency, which will be satisfied by gluon-mesh-batman-adv-14 or -15, which in turn provide batctl or pull in the batctl dependency, respectively.
while (!feof(fp)) {
	pline = fgets(line, sizeof(line), fp);
	if (!pline) {
		fprintf(stderr, "%i: Error: fgets() failed\n", clock);
hexa also mentioned that this line (and the dat one) print (to syslog) a lot of
Mon Jan 15 20:26:10 2018 daemon.err gluon-arp-limiter[1126]: 3184: Error: fgets() failed
Mon Jan 15 20:26:10 2018 daemon.err gluon-arp-limiter[1126]: 3184: Error: fgets() failed
Oh, that's verbose indeed. I think I assumed that the "!feof(fp)" would catch that. I'll look into why that's not the case.
Ok, got it, Stackoverflow to the rescue...
https://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong
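The pattern that answer recommends looks like this (a sketch, not the daemon's actual code). The point is that feof() only becomes true *after* a read has already failed, so a `while (!feof(fp))` loop runs one extra iteration in which fgets() returns NULL, producing exactly the spurious "fgets() failed" line per invocation:

```c
#include <stdio.h>

/* Read all lines from fp, testing the return value of fgets() itself
 * instead of looping on !feof(). Returns the number of lines read,
 * or -1 on a real I/O error (as opposed to plain end-of-file). */
static int read_lines(FILE *fp)
{
	char line[256];
	int n = 0;

	while (fgets(line, sizeof(line), fp)) {
		/* ... process one line here ... */
		n++;
	}

	if (ferror(fp))   /* distinguishes an I/O error from EOF */
		return -1;

	return n;
}
```

With this shape, EOF ends the loop silently, and the error message is only printed when ferror() reports a genuine failure.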
@T-X thank you for your work, this is awesome. Even the first version of this package addresses a real issue in current Freifunk networks. Having this non-ideal solution available in Gluon while it gets discussed and refined could have been a real benefit for multiple communities during the last couple of months. Why do we need to distinguish between nodes and clients here, instead of just keeping it simple?
This is needed for the Gluon ARP limiter to work without hiccups in traffic. Link: https://patchwork.ozlabs.org/patch/841210/ Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Hi @christf, thanks for your comment! The reason why I implemented both a per-client and a per-node limit is the following: Generally, I expect the majority of ARP spam to come from a specific MAC address; that's what we have observed so far. For that I have implemented the stricter, tighter per-client limit. After cleaning up the code I noticed that I had generalized things so much that it was trivial to implement both a per-client and a per-node limit. The per-node limit is there in case an ARP spammer changes its source MAC with every packet, which is what I could imagine an intentionally malicious host doing to disrupt the network. With this, such a malicious host should find it harder to disrupt anything other than the local node via ARP spamming. Also, the per-node limit throttles ARP for new clients which have not been given a per-client limit yet (this daemon only adds/removes/updates per-client limits every 30 seconds). (Hope this answers your question?)
This package adds filters to limit the amount of ARP Requests devices are allowed to send into the mesh. The limits are 6 packets per minute per client device, by MAC address, and 1 per second per node in total. A burst of up to 50 ARP Requests is allowed until the rate-limiting takes effect (see --limit-burst in the ebtables manpage). Furthermore, ARP Requests with a target IP already present in the batman-adv DAT Cache are excluded from the rate-limiting, both regarding counting and filtering, as batman-adv will respond locally with no burden for the mesh. Therefore, this limiter should not affect popular target IPs, like gateways. However it should mitigate the problem of curious people or smart devices scanning the whole IP range. Which could create a significant amount of overhead for all participants so far. Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Force-pushed 77e45b5 to e6457ac
Changelog:
For batman-adv-legacy/compat14 networks: Make sure you have this new fix for batctl-legacy applied.
We've used this package to successfully combat ARP spam caused by a so-called "P2P Filesharing" Android app. We've also used this package at a crowded venue, where we had lots of uplink capacity. Neither caused any issue; the limits seem just fine to me and helped protect our network. 10/10, would install again.
Merging, as discussed in today's developer meeting. Thank you for your work!
@T-X What's the state of upstreaming the kernel patch?
I wasn't able to reproduce the issue in VMs yet. It seemed that on recent kernels with a Debian kernel config, the ebtables limiting rules use the nft_limit kernel module instead of the ebt_limit.c code. ebt_limit_init() is called, but neither ebt_limit_mt() nor ebt_limit_mt_check() were called; instead, nft_limit_eval() was. You can even see in lsmod that the usage count of nft_limit goes up by one for each ebtables limit rule inserted. And the nft_limit code seems to keep the limit state just fine on table changes. I wasn't able to figure out what makes the kernel choose between these two different limiting code variants.
Eventually I found out that I was not using the ebtables binary I thought I was on Debian. On a recent Debian, /usr/sbin/ebtables can link to /usr/sbin/xtables-nft-multi, which seems to be the new, preferred bridge-filtering way, and it uses nft_limit instead of ebt_limit. I resubmitted my patch for the legacy ebtables, rebased and retested with fixes for newer kernels (>= 4.11): https://patchwork.ozlabs.org/patch/1009914/ And got a negative response. The suggestion I got in the link above was to switch from ebtables-legacy to this ebtables-nft. I guess that's something that would need to be done upstream in OpenWrt (or ebtables-tiny would need to be replaced with an ebtables-nft-tiny variant).