-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PPing - Improve map cleanup #34
PPing - Improve map cleanup #34
Conversation
Now that PR#32 has been merged I will rebase this and probably do some minor touchups. Once I'm done with that I will remove the draf-status. |
b12cf97
to
4ae060e
Compare
I have now rebased this PR on the current master, done a little cleanup based on comments from previous PRs, and also changed the a couple of things I thought was a bit clumsy in my initial PR. For reference, the state of this PR before these changes can be found on this branch (which has not been rebased). The first major change from this old version is that I used to send the debug information from the BPF iter programs to the userspace process through an additional map that I would read right after I had triggered the BPF iter programs that updated them in userspace. I now instead treat the debug information as a new type and simply push them through the perf-buffer like all other evens, which requires much fewer changes in the userspace code. The second major change from the old version is related to adding RTT-based timeouts to the timestamp entries. In the old version I added the current RTT to every timestamp entry to avoid doing a flow lookup for each timestamp entry in the BPF iter program. However, this adds additional information stored for each timestamp entry, and as the normal ePPing program does a flow lookup for every (valid) packet anyways, the overhead of doing a lookup for a few thousand entries every cleanup period (by default 1 second) is likely negligible. So I changed my mind on this and decided to leave the timestamp entries unchanged and simply do the lookup in the BPF-iter program instead. |
4ae060e
to
9821231
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a few nits on this one :)
pping/pping.c
Outdated
} | ||
linfo.map.map_fd = bpf_object__find_map_fd_by_name(obj, map_name); | ||
if (linfo.map.map_fd < 0) | ||
return ERR_PTR(linfo.map.map_fd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're making a lot of assumptions about the inner workings of libbpf_get_error() here (and below). This is bound to break (its semantics are already changing upstream), so I don't think this is a good idea (and it also happens to be a bit ugly).
I'd suggest you change this function to return an int error message, and add a struct bpf_link **link
parameter to put the output in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have changed as you suggested to simply returning an int error message and instead pass out the link reference through a parameter. Still use libbpf_get_error()
to catch errors for functions that return pointers (as I assume that's what the function is for), but no longer attempt to return error codes masked as pointers myself.
While fixing I also noticed another bug which I fixed. The if (linfo.map.map_fd < 0)
check would always fail as linfo.map.map_fd
is a __u32 and thus cannot be negative. Therefore I now first save the result from bpf_object__find_map_fd_by_name()
in an int which I use for the error check, before assigning it linfo
. Alternatively I guess I could also just cast linfo.map.map_fd
for the check, but then I have to assume that ints are always 32 bits.
9821231
to
54cf69f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better, but there's a bug; and spotted another issue while looking at the code again...
pping/pping.c
Outdated
return (__u64)t.tv_sec * NS_PER_SECOND + (__u64)t.tv_nsec; | ||
} | ||
*link = bpf_program__attach_iter(prog, &iter_opts); | ||
err = libbpf_get_error(link); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the wrong pointer (link
, not *link
). But rather than fixing that, it's better to use a local stack variable to store the return from bpf_program__attach_iter()
and only write to *link
after you've done the error check. That way the caller can be sure that nothing will be written to the pointer passed in unless the operation succeeds, which is standard practice for APIs like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good idea. Have changed this now.
pping/pping.c
Outdated
fprintf(stderr, | ||
"Failed attaching cleanup program to flow map: %s\n", | ||
get_libbpf_strerror(err)); | ||
return err; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This leaks the link created above...
In fact I don't think you ever destroy the links properly? There should be some calls to bpf_link__destroy()
somewhere on shutdown...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is indeed the case. I haven't really fully grasped what these links even are and guess I pretty much just assumed they would automatically be destroyed and cleaned up on program exit together with all the other bpf objects (programs, maps etc. that aren't pinned).
To fix this I will need to restructure the code a bit to expose the links outside of this function so that I can access them later.
However, while thinking about this I just noticed what I think should be a much bigger error in the same function (setup_periodical_map_cleaning()
). I allocate the arguments (including these links) which will be used in a different thread on the stack... So the arguments my cleanup function (periodic_map_cleanup()
) use in a separate thread may be overwritten by whatever other functions I call in main later. Either I'm just not thinking straight right now, or it's a miracle that this hasn't caused all sorts of funky errors already.
I should probably also handle my cleanup thread on shutdown, so I don't just pull out these links underneath it and hope it never has time to notice before the entire program terminates. So will go ahead and add a separate commit at the beginning of this PR that address these issues, then I can clean up the links in a sensible manner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So with other words, it's not really ready for re-review yet as I have not properly addressed your point on cleaning up the links. But before I do that I want this issue with the arguments for separate thread being on the stack of a non-main function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sounds good. Let me know when you want me to take another look...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function (setup_periodical_map_cleaning) should now properly destroy the links if fails. Furthermore, the links should be destroyed in the final teardown of the program (as long as they've been set up that is).
54cf69f
to
d081abc
Compare
Allocate the clean_args on the stack of the main function rather than the stack of setup_periodical_map_cleaning. The clean_args is used by periodical_map_cleanup from a different thread, so allocating them on stack for setup_periodical_map_cleaning which goes out of scope directly after the thread is created opens up for errors where later function calls may overwrite the arguments, causing unpredictable behavior. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Explicitly stop the thread which performs periodical map cleanup on shutdown, before attempting to free up any resources it might use. For the cleanup order to make more sense, also setup the perf-buffer before setting up the periodical map cleaning, so that the thread can be stopped as the first part of the cleanup. This also matches better with the next couple of commits where map cleaning debug information will be pushed through the perf-buffer. Finally, move the addition of the signalhandler earlier in the code to eliminate a window where it was possible to terminate the program without relevant cleanup code having a chance to run. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
d081abc
to
7bf5316
Compare
To improve the performance of the map cleanup, switch from the user-spaced loop to using BPF iterators. With BPF iterators, a BPF program can be run on each element in the map, and can thus be done in kernel-space. This should hopefully also avoid the issue the previous userspace loop had with resetting in case an element was removed by the BPF programs during the cleanup. Due to removal of userspace logic for map cleanup, no longer provide any debug information about how many entires there are in each map and how many of them were removed by the garbage collection. This will be added back in the next commit. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add some debug info to the periodical map cleanup process. Push debug information through the events perf buffer by using newly added map_clean_event. The old user space map cleanup process had some simple debug information that was lost when transitioning to using bpf_iter instead. Therefore, add back similar (but more extensive) debug information but now collected from the BPF-side. In addition to stats on entries deleted by the cleanup process, also include stats on entries deleted by ePPing itself due to matching (for timestamp entries) or detecting FIN/RST (for flow entries) Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add conditions that allows removing old flow and timestamp entries sooner. For flow map, have added conditions that allow unopened flows and ICMP flows to be removed earlier than open TCP flows (currently both set to 30 sec instead of 300 sec). For timestamp entries, allow them to be removed if they're more than TIMESTAMP_RTT_LIFETIME (currently 8) times higher than the flow's sRTT. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Also add description of three potential issues introduced by the changes to cleanup process. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
7bf5316
to
e7201c9
Compare
Hi again @tohojo, I've now added two commits at the beginning of this PR that fix some errors related to the setup and teardown of the map cleaning in the userspace program. The later commits are largely the same, but builds on top of these commits to properly destroy the links as you pointed out in your comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, just one small comment / question, otherwise LGTM.
For future reference, though, if you find yourself listing lots of changes done in a single commit (like in the second commit in this series), that is a hint that the commit is not really a single self-contained unit and you might want to split each change into a separate commit. Look for things like "additionally" and "finally" in the commit message to detect his :)
These commits moves the cleanup logic for stale entires from userspace to a BPF iter program. So instead of userspace periodically looping through the maps clearing out old entries, it only triggers the BPF programs periodically, and the BPF program iterates through all the entries removing those which are considered old.
Also adds some slightly more complex logic for when an entry can be removed, for example ICMP flows and unopened TCP flows can be cleared faster than an open TCP flow, and timestamp entries with low RTTs can be cleared much faster than the static 10s limit.
All these commits were created with the changes from PR#23 in mind, more specifically with those that are now in PR#32. While the first two commits could potentially work without PR#32, the commit with more aggressive cleanup is based on the existance of ex. ICMP flows and "unopened" flows, and therefore makes little sense to merge before PR#32. I will therefore leave this as a draft for now.