PPing core improvements #32

simosund · 2021-12-10T15:41:50Z

Adds RTT-based sampling, ICMP echo support and making ePPing work in both directions, i.e. doing both timestamping and matching on both ingress and egress.

Also contains a couple of commits with fixes to the BPF code.

Commits based PR#23 but slightly cleaned up.

This PR is based on top of PR#31. On request I could try and rebase it on top of master instead (will likely cause some conflicts), but I expect PR#31 to be merged first at which point I would have to rebase this to the new master which would effectively undo all the work in the first rebase.

tohojo · 2021-12-17T18:41:49Z

Right, so I'm afraid I won't have time to properly review this before the holidays, sorry about that :(

Will circle back to it in January :)

simosund · 2021-12-17T21:20:43Z

No worries, I've gone on vacation myself now. Merry christmas and happy new year!

tohojo · 2022-01-17T18:29:11Z

Hmm, seems Github didn't pick up on the first bit of this being merged. Could you please do a manual rebase onto master and re-push this?

simosund · 2022-01-17T18:40:41Z

Hmm, seems Github didn't pick up on the first bit of this being merged. Could you please do a manual rebase onto master and re-push this?

Hmm strange, maybe because this series of commits was missing the merge-commit? Anyways, have rebased on the new merged master, so should hopefully be fine now

tohojo

Most of the comments are relatively minor, I guess. My main overall comment is that you still have a tendency to lump unrelated things together in single commits, which makes reviewing harder than it needs to be. When splitting things into commits, try to think of it this way: each commit should be self-contained in that it doesn't require any context other than the code changes themselves and its commit message. You should generally be making the commits as small as possible, but not so small that they lack context. It's fine to do partial refactorings, though, as long as you call out upcoming changes ("change X in preparation for later doing Y"). In general, think of it as minimising the amount of stuff the reviewer has to keep in their head at once (which is also helpful when writing commit messages; they are there to help the reviewer, and yourself when you later go back and run git blame to figure out why some code is the way it is).

I'm also not a huge fan of the "issues with this commit" items at the bottom of each commit message. I view them as a bit of a cop-out: the issues were obviously not important enough to fix, or you would have done that already. And the commit message should be describing the change you are making, not all the other stuff you thought about while making that change :) Tracking TODO items is better done in a TODO file, or an issue tracker (and feel free to use the issue tracker in this repository, BTW). If you do use a TODO file, add a separate commit at the end of your series updating that; this also helps you scope out the series (or pull request, as it were). Like "This series did X, Y and Z, and now I'm updating the TODO list with future items A, B and C that I will tackle in a subsequent series).

Oh, and the above comments are mostly for future reference - you don't need to restructure the commits in this series! And yeah, splitting changes up in readable chunks can be a lot of tedious work, but it gets easier with practice; you'll learn to think in terms of the changes and how you will describe them, which can also improve the code (since you'll sometimes think about and fix issues while you're doing such a split).

tohojo · 2022-01-31T22:05:41Z

pping/pping_kern.c

+					BPF_NOEXIST) == 0) {
+			new_flow = true;
+
+			if (fe.event_info.event != FLOW_EVENT_OPENING) {


This seems a bit dodgy; why are you keeping an (uninitialised!) flow even structure in the global function state and filling it in piece-mal, instead of just allocating it in the block where you need it? That seems like it's prone to bugs.

So after cursing my old self any trying to figure out what I was thinking at the time, I believe that I initially did this in some attempt to optimize the code by only filling in the fields that where needed when they were needed. With the commit that made PPing "symmetrical" (doing both timestamping and matching on both ingress and egress) I essentially gave up on this idea, but for some reason kept passing this struct around between functions (even if the function itself filled all the members, meaning it could just as well allocate it itself).

I have addressed this in the commit that makes PPing symmetrical, so it is now only allocated and filled inside functions which directly push it to the perf-buffer (as the only reason I ever fill it is so I can push it on the perf-buffer). I've opted to not go back and try and fix it before this commit however, as I rather not spend energy rewriting code that is overwritten by later changes in this PR anyways (and it was also a problem which existed since before this PR).

tohojo · 2022-01-31T22:06:03Z

pping/pping_kern.c


+		f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
 		if (!f_state) // Creation failed


State creation failures should probably be notified somehow at least occasionally?

Hmm, ocassionally emit a warning to stderr about failing to track flow due to flow-state map being full? Definitly need to make sure that it goes out with a limited frequency though, so that not every single packet triggers a warning message which would tank performance in a period where PPing is likely already under a heavy load.

Ideally I guess one would perhaps get one message per flow which failed to create a new state, but I don't really see a way to track that I've already emitted a warning for a particular flow without keeping some state for that flow...but the reason for emitting this warning to begin with is that we're out of space for tracking flows.

So simply some time-based limit? Although I'm not really sure what frequency would make sense here. 1000 warnings/sec? 100 warnings/sec? 10 warnings/sec?

Once a second would probably be fine, at least in interactive mode... You could just keep a timestamp in a global variable and have logic like:

if (!f_state) { // creation failed if (last_warn_time < now - WARN_INTERVAL) { emit_warning(); // send a perf event of a special type, for instance last_warn_time = now; } }

Have added a new commit at the end that should address this. In addition to sending a warning if it fails to create a flow state entry (due to the flow state map being full), it also sends a warning in case it fails to create a timestamp entry.

Both are capped to only emitting one warning per second (currently hard-coded, but could of course be made configurable in case you believe it's something the user might want to change), but tracked separately for the flow state and timestamp map.

tohojo · 2022-01-31T22:13:15Z

pping/pping.c

@@ -106,6 +103,7 @@ static const struct option long_options[] = {
 	{ "cleanup-interval", required_argument, NULL, 'c' }, // Map cleaning interval in s, 0 to disable
 	{ "format",           required_argument, NULL, 'F' }, // Which format to output in (standard/json/ppviz)
 	{ "ingress-hook",     required_argument, NULL, 'I' }, // Use tc or XDP as ingress hook
+	{ "localfilt-off",    no_argument,       NULL, 'l' }, // Disable local filtering (will start to report "internal" RTTs)


localfilt-off is a terrible name :)
How about --include-local instead? Could this be -l as the shortopt...

Agreed. Think I first called it just localfilt, but figured that was a very misleadning name as the option turns it off not on, so my quick-fix was to just append off to the end =) Wanted to keep -l to be consistent with Kathie's PPing, but guess that's still fairly reasonable with --local-filt (which is also a much more descriptive name).

See I ended up writing --local-filt in my previous comment, but of course meant to change it to your suggested --include-local which I've now also done.

tohojo · 2022-01-31T22:19:59Z

pping/pping.h

@@ -135,4 +138,16 @@ union pping_event {
 	struct flow_event flow_event;
 };

+/*
+ * Copies the src to dest, but swapping place on saddr and daddr


Comments should explain the "why" of a piece of code, not just reiterate the "what" (for that you can just read the code). This function in itself is obvious enough from the name, but why is it needed?

Have changed the comment to instead try to motivate what the function is useful for.

tohojo · 2022-01-31T22:20:23Z

pping/pping_kern.c

@@ -25,6 +25,8 @@
 #define AF_INET6 10
 #define MAX_TCP_OPTIONS 10

+#define IPV6_FLOWINFO_MASK __cpu_to_be32(0x0FFFFFFF)


What's this for?

Don't really know. Pretty much just copied how you filled in the bpf_fib_lookup in sample/bpf/xdp_fwd_kern.c and then you used this thing so I pretty much followed blindly. Should probably take some time figuring out how bpf_fib_lookup struct is actually used like, but remeber that I thought the struct seemed rather complicated and was unsure which fields needed to be set and with what.

Yeah, good idea ;)
It's commented in the struct, to a certain extent: https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L6121

Have added a comment that this is for getting the flow label + traffic class from the IPv6 header which apparently is used in the lookup.

tohojo · 2022-01-31T22:24:19Z

pping/pping_kern.c

+	}
+
+	lookup.family = p_info->pid.flow.ipv;
+	lookup.l4_protocol = p_info->pid.flow.proto;


Are you sure you should be filling this if you're not filling in the ports?

No idea, as above, I pretty much just copy-pasted the one example I found of how this thing was used, and there the l4 protocol is filled in while the ports are set to 0.

Still not sure about this one honestly, both examples of it that I've found (xdp_fwd_kern and test_tc_neigh_fib) does it like that. Guess there might be forwarding rules based solely on the protocol, i.e. forward TCP traffic but not UDP traffic, without knowledge of the ports. So have left it like this for now.

Question is if it really matters for our use-case though, which is to find out if the packet is going to the same machine that PPing is running on or not. Perhaps it makes more sense to only provide it with IP addresses, as I guess the protocol shouldn't really affect if the local machine is the intended destination of the packet or not.

tohojo · 2022-01-31T22:34:17Z

pping/pping_kern.c

@@ -517,10 +648,11 @@ int pping_tc_egress(struct __sk_buff *skb)
 		.data_end = (void *)(long)skb->data_end,
 		.pkt_len = skb->len,
 		.nh = { .pos = pctx.data },
+		.ifindex = skb->ingress_ifindex,


Setting an ifindex member to the value of ingress_ifindex in an egress hook should make alarm bells go off ;)

AFAICT you're only using this for the fib lookup, so just leave it unset here (and rename the member to ingress_ifindex as well, for the avoidance of doubt)?

Good point.

tohojo · 2022-01-31T22:35:07Z

pping/pping_kern.c

+{
+	struct packet_info p_info = { 0 };
+	struct flow_state *f_state;
+	struct flow_event fe;


Even though you moved the code around, the comment about having this shared uninitialised variable on the stack still applies :)

tohojo · 2022-01-31T22:42:41Z

pping/pping_kern.c

+		fe->flow_event_type = FLOW_EVENT_OPENING;
+		fe->reason = f_state->opening_reason;
+		fe->source = EVENT_SOURCE_PKT_DEST;
+		fe->reserved = 0;


Just to illustrate my point about the flow events, the above is basically the same as the full initialiser, so there's no reason to pass around pointers to struct flow_event between the different functions. Just create it on the stack where you need it :)

tohojo · 2022-01-31T22:46:12Z

pping/pping_kern.c

+	if (p_id->flow.proto == IPPROTO_TCP)
+		err = parse_tcp_identifier(ctx, &saddr->port, &daddr->port, fei,
+					   &p_id->identifier);
+	else if (p_id->flow.proto == IPPROTO_ICMPV6 &&


I would expect ICMP packets to only be included if the user asked for it? I.e., there should be a command line option to switch between parsing different packet types (with an 'all' option for all supported flow types I suppose). This also gets you past the problem with the ppviz output not including any protocol information: You could just refuse to output that format if the user selects anything other than TCP :)

Hmm, while having it optional definitly makes sense, I would assume the "all" would be the default, like it would be for ex pcap (i.e. you can add filters for different protocols if you want, but by default it captures everything it can).

But are you suggesting we add a flag for each supported protocol the user wants to track (plus a flag for all), and then assume TCP if user specifies no flags? Or should TCP always be included by default and we instead have a flag if you explicitly want to exclude TCP, whereas for all the other protocols the flag is used to include it?

I think the first makes most sense. A flag for each protocol, a flag for all, and if nothing is specified for the protocol part use TCP as default. But if one or several protocols are explicitly specified this overrides the default and then you use what is specified. The default should be what you expect to be most commonly used I think.

I don't have any strong opinions on what the default should be, just that it should be configurable :)

Have gone ahead and added separate flags for enabling TCP (--tcp/-T), ICMP (--icmp/-C) and a shorthand for all supported protocols (--all/-A). If no flag is specified the user is notified that it assumes (only) TCP.

simosund · 2022-02-01T09:17:16Z

Thanks for the extensive feedback, will try to fix most of it tomorrow.

As for "issues with this commit" part, I don't intend for them to be cop outs, but rather I want to highlight things that I think might be potential issues. Genereally I don't consider them critical (in that case I would hopefully have fixed them), but by making them very visibel my hope is essentially that you might disagree and find that it completely breaks the application, in which case we can have a discussion about how to fix it ASAP. I also want to note them down while I still remeber them.

That said, I realize that the commit messages might not be the best place for that. While I actually have a TODO list I've been pretty bad at updating it, and generally view it more as a list of additional features rather than for tracking potential issues. I also have a list of potential concurrency issues in the README (which was initially spread out as a bunch of comments in the code), so maybe it would make sense to consolidate all that in a dedicated document, or add a separate section in the TODO list with them? While the issue tracker might also be a good way to do that, I have a feeling that it should mainly be used for issues in the current master, not for potential issues within a PR which may have been fixed by the time they're merged?

tohojo · 2022-02-01T22:27:03Z

Hmm, if the issues you point out are meant for review, I think it's better to include them in the pull request description (which would be the cover letter in an email submission). If they're in the commit, they're a bit ambiguous as well ("issues with this commit" - does that mean they will be fixed with a subsequent commit?). And if they're longer-term things, put them in the TODO or somewhere similar.

And yes, if you already have fixed queued up in a subsequent pull request, there's no reason to open an issue. But then that is what the text in the PR description should say. If they're longer term issues, they can be tracked in issues just fine - and if you're not sure, put them in the PR description and ask :)

The echoed TCP timestamp (TSecr) is only valid if the ACK flag is set. So make sure to only attempt to match on ACK packets. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

Allow pping to passivly monitor RTT for ICMP echo request/reply flows. Use the echo identifier as ports, and echo sequence as packet identifier. Additionally, add protocol to standard output format in order to be able to distinguish between TCP and ICMP flows. The ppviz format does not include protocol, making it impossible to distinguish between TCP and ICMP traffic. Will add warning if ppviz format is used together with ICMP traffic in the future. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

simosund · 2022-02-03T17:22:21Z

Hi again @tohojo. Think I've addressed most of your comments now (other than the main one about grouping unrelated changes together in the commits, as I've not split up any of the existing commits...). I have however removed the "issues with this commit" parts from the commit messages.

Most of the commits before "timestamping + matching on both ingress and egress" are largely the same, just that I've added an additional commit there which adds command line flags for which protocols PPing should track.

Have done some additional refactoring in the ingress+egress commit to make a separate function for sending flow-events (which allocate their own flow-event structs rather than passing the same one around) and some other minor fixes and cleanup. The "consider flow opened on reply" commit has also undergone a few changes to be more consistent with the changes made in ingress+egress commit.

Finally, I've added a couple of commits which adds warnings if case PPing fails to create map entries, as well as a separate commit where I add section on "potential problems" to the TODO document. There I've summarized the various issues were previously part of the commit messages.

Finally, some of the other notes that were part of the commit messages before but would probably fit better in the PR description are:

PPing is probably less performant now as it essentially does twice the amount of work (minus parsing the packet), with both timestamping and matching on each packet as well as updating the flow state in both directions.
The verifier has a much easier time verifying the BPF programs now (processes much fewer instructions), despite them being more complex.
PPing no longer works on a 5.4 kernel (or at least it didn't at the time of the original commits).

tohojo

A few nits on the mode selection, but otherwise I think we should get this merged so you can make progress on other items :)

pping/pping.c

Add command-line flags for each protocol that pping should attempt to parse and report RTTs for (currently -T/--tcp and -C/--icmp). If no protocol is specified assume TCP. To clarify this, output a message before start on how ePPing has been configured (stating output format, tracked protocols and which interface to run on). Additionally, as the ppviz format was only designed for TCP it does not have any field for which protocol an entry belongs to. Therefore, emit a warning in case the user selects the ppviz format with anything other than TCP. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

Only push flow events for opening/closing flows if the creation/deletion of the flow-state was successful (as indicated by the bpf_map_*_elem() return value). This should avoid outputting several flow creation/deletion messages in case multiple instances are trying to create/delete a flow concurrently, as could theoretically occur previously. Also set the last_timestamp value before creating a new flow, to avoid a race condition where the userspace cleanup might incorrectly determine that a flow is old before the last_timestamp value can be set. Explicitly skip the rate-limit for the first packet of a new flow to avoid it failing the rate-limit. This also fixes an issue where the first packet of a new flow would previously fail the rate-limit if the rate-limit was higher than current time uptime (CLOCK_MONOTONIC). Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

Add an option (-R, --rtt-rate) to adapt the rate sampling based on the RTT of the flow. The sampling rate will be C * RTT, where C is a configurable constant (ex 1.0 to get one sample every RTT), and RTT is either the current minimum (default) or smoothed RTT of the flow (chosen via the -t or --rtt-type option). The smoothed RTT (sRTT) is updated for each calculated RTT, and is calculated in a similar manner to srtt in the kernel's TCP stack. The sRTT is a moving average of all RTTs, and is calculated according to the formula: srtt = 7/8 * prev_srtt + 1/8 * rtt To allow the user to pass a non-integer C (ex 0.1 to get 10 RTT samples for every RTT-period), fixed-point arithmetic has been used in the eBPF programs (due to lack of support for floats). The maximum value for C has been limited to 10000 in order for it to be unlikely that the C * RTT calculation will overflow (with C = 10000, overflow will only occur if RTT > 28 seconds). Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

Perform both timestamping and matching on both ingress and egress hooks. This makes it more similar to Kathie's pping, allowing the tool to capture RTTs in both directions when deployed on just a single interface. Like Kathie's pping, by default filter out RTTs for packets going to the local machine (will only include local processing delays). This behavior can be disabled by passing the -l/--include-local option. As packets that are timestamped on ingress and matched on egress will include the local machines processing delay, add the "match_on_egress" member to the JSON output that can be used to differentiate between RTTs that include the local processing delay, and those which don't. Finally, report the source and destination addresses from the perspective of the reply packet, rather than the timestamped packet, to be consistent with Kathie's pping. Overall, refactor large parts of pping_kern to allow both timestamping and matching, as well as updating both the flow and reverse flow and handle flow-events related to them, in one go. Also update README to reflect changes. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

Wait with sending a flow open message until a reply has been seen for the flow. Likewise, only emit a flow closing event if the flow has first been opened (that is, a reply has been seen). This introduces potential (but unlikely) concurrency issues for flow opening/closing messages which are further described in the README. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

Refactor code for how events are handled in the user space application. Preparation for adding an additional event type which should not be handled by the normal functions for printing RTT and flow events. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

Send a warning notifying the user that PPing failed to create a flow/timestamp entry due to the corresponding map being full. To avoid sending a warning for every packet, only emit warnings every WARN_MAP_FULL_INTERVAL (which is currently hard-coded to 1s). Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

Collect potential issues under a new section in the TODO list. These are issues I generally don't think are that severe, but may still be useful to note down and keep in mind. Move the section on potential concurrency issues from README to the new section in the TODO-list. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

This was referenced Dec 10, 2021

PPing - Add introduction and figure description #33

Draft

PPing - Improve map cleanup #34

Merged

simosund force-pushed the pping_improve_capabilities branch 2 times, most recently from 7d7c8fe to 0a80d22 Compare December 17, 2021 17:38

simosund force-pushed the pping_improve_capabilities branch from 0a80d22 to 732e606 Compare January 4, 2022 16:28

simosund force-pushed the pping_improve_capabilities branch from 732e606 to 811ad5b Compare January 17, 2022 18:08

simosund force-pushed the pping_improve_capabilities branch from 811ad5b to 4a8d9fd Compare January 17, 2022 18:38

tohojo requested changes Jan 31, 2022

View reviewed changes

simosund added 2 commits February 2, 2022 10:14

pping: Only match TSecr in ACKs

af5e660

The echoed TCP timestamp (TSecr) is only valid if the ACK flag is set. So make sure to only attempt to match on ACK packets. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>

simosund force-pushed the pping_improve_capabilities branch from 4a8d9fd to 3ff729d Compare February 3, 2022 15:33

tohojo requested changes Feb 9, 2022

View reviewed changes

pping/pping.c Show resolved Hide resolved

pping/pping.c Outdated Show resolved Hide resolved

pping/pping.c Outdated Show resolved Hide resolved

simosund added 8 commits February 10, 2022 16:02

simosund force-pushed the pping_improve_capabilities branch from a042d04 to b8215e7 Compare February 10, 2022 15:19

tohojo approved these changes Feb 10, 2022

View reviewed changes

tohojo merged commit dd24bce into xdp-project:master Feb 10, 2022

simosund deleted the pping_improve_capabilities branch March 22, 2022 13:40


		f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
		if (!f_state) // Creation failed

PPing core improvements #32

PPing core improvements #32

Conversation

simosund commented Dec 10, 2021

tohojo commented Dec 17, 2021

simosund commented Dec 17, 2021

tohojo commented Jan 17, 2022

simosund commented Jan 17, 2022

tohojo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

simosund commented Feb 1, 2022

tohojo commented Feb 1, 2022

simosund commented Feb 3, 2022

tohojo left a comment

Choose a reason for hiding this comment