Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Miscellaneous pping fixes #23

Closed
wants to merge 14 commits into from

Conversation

simosund
Copy link
Contributor

Various different fixes for pping. Currently simply moved the ones that were previously part of the PR for improving the output, but had nothing to do with actually improving the output. Will likely add a bunch of other more or less unrelated improvements here as well.

@simosund
Copy link
Contributor Author

These fixes are based on the improve output PR, so once that is merged I can rebase this to only include the non-output stuff.

@simosund simosund force-pushed the pping_miscellaneous_fixes branch from e9f4308 to a7eec49 Compare June 23, 2021 13:28
@tohojo
Copy link
Member

tohojo commented Jun 24, 2021

Alright, merged the output PR; please rebase this one...

@simosund simosund force-pushed the pping_miscellaneous_fixes branch 2 times, most recently from 1291181 to 37b4aa8 Compare June 24, 2021 13:03
@simosund
Copy link
Contributor Author

Alright, merged the output PR; please rebase this one...

Ok, should be done now. And by merging it I guess you thought my rebased commit history was ok, but feel free to let me know if there were any problems with it or other ways I could improve it in the future.

I decided to start pushing commits that aren't entierly finished to my branch, partly so you can see what I'm working on if you're interested, and partly so that I have a remote backup also of unfinished work. I've marked those with WIP. This means I'll end up doing quite a few force-pushes (mainly to amend my latest WIP commit, but will probably need to do some additional cleanup before we merge anyways). If you think this is a bad idea, let me know and I'll stop.

@simosund simosund force-pushed the pping_miscellaneous_fixes branch 2 times, most recently from 4d8a2df to c09dc99 Compare June 30, 2021 16:29
@simosund
Copy link
Contributor Author

A couple of potential issues I've noticed that I wanted to write down while I still remeber them. If you have any thoughts on them, feel free to comment.

  • I consider the src/dst adress from the context of the sent (timestamped) packet, whereas Kathie considers them from the perspective of the response (matched) packet. In practice this means that I write out the addresses in reverse order of Kathie's pping. I know that I originally noticed and fixed this a while ago, but forgot it again when adding and rewriting all the output formats.

    • For the ppviz format this should definitly be changed so that we actually match the expected format
    • For the "standard"/default output format, it should probably also be changed as it is quite similar to the default output of Kathie's pping
    • For the JSON format, it is a bit unclear to me if it should be changed (as it's the only format that explicitly labels what is src and what is dest). As far as I'm concered, it makes more sense to look at it from the perspective of the sent packet (that is the packet that will experience the reported RTT) rather than the response packet as Kathie does. However it might be good to change it in order to be "consistent" with the other formats (although the other formats don't explicitly label what is the src and dest part, I think most people would assume that the src address comes first, and that is how it's refered to in the ppviz format documentation)
  • Are my choices for short/long option flags good (you can find them in long_options at the beginning of pping.c)? I've kinda just selected them add-hoc as I've implemented some new feature, and feel like they may not be optimal in the long run. If pping starts to be used by other people it's probably not a great idea to change flags around at that stage, so should probably try to plan it out a bit early. I should probably also write a better help-message, as right now it just lists the options without any explenation of what they do (can only be found as comments in the source code).

  • The user-space cleanup uses the last_timestamp of a flow, which indicates that last time a timestamp was created for the flow (or would have been created in case the timestamp map was full at the moment), to determine if the flow is active or not. If one use a very long sampling interval, it is possible that the userspace could clean up an active flow simply because its rate limit doesn't allow it to create new timestamps. This in turn would allow it to actually bypass the rate-limit as the next packet from the flow would then be considered to come from a new flow after the userspace deleted the previous one. This does not seem like a very serious issue to me because using such a low sampling rate (userspace only cleans up flows after 300 sec) seems quite unlikely. The primary reason I would see for it would be to support a very large amount of flows (make it very unlikely that the timestamp map fills up by simply producing very few timestamps per flow), in which case it might be contra-productive to keep around all the flow-states for a really long time anyways.

@simosund simosund force-pushed the pping_miscellaneous_fixes branch 3 times, most recently from afe8291 to c3d5df9 Compare July 12, 2021 16:27
simosund added 11 commits July 12, 2021 19:54
Unpin tc program directly after attaching it (rather than on program
shutdown), so that multiple instances of pping can be run
simultaneously (on different interfaces).

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add cleanup code to load_attach_bpfprogs function, so it should always
unpin tc program, and additionally detach the tc and xdp programs in
case of any failure.

Also rename some of the functions for attaching/detaching tc programs
to be more consistent with the xdp ones.

Still need to keep a copy of most of the cleanup code in main as well,
as the tc and xdp programs also need to be detached on program
shutdown or if later functions fail.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
libbpf v0.4 added an API for attaching/detaching TC-BPF programs. So
use the new API to attach the tc program instead of calling on an
external script (which uses the tc command line utility).

Avoid removing the clsact qdisc on program shutdown or error, as
there's currently no convenient way to ensure the qdisc isn't used by
other programs as well. This means pping will not completely clean up
after itself, but this is a safer alternative than always destroying
the qdsic as done by the external script, which may pull the rug out
underneath other programs using the qdisc.

Finally, remove the pin_dir member from the configuration as pping no
longer pins any programs or maps, and remove deleted tc loading
scripts from README.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The echoed TCP timestamp (TSecr) is only valid if the ACK flag is
set. So make sure to only attempt to match on ACK packets.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Allow pping to passivly monitor RTT for ICMP echo request/reply
flows. Use the echo identifier as ports, and echo sequence as packet
identifier.

Additionally, add protocol to standard output format in order to be
able to distinguish between TCP and ICMP flows.

Potential concerns with this commit:
- Is starting to approach verifier limit again (850k ins)
- ppviz format does not include protocol, so cannot distinguish
  between TCP and ICMP flows
- Cannot detect when ICMP flows close, so they will take up a
  flow-state entry until cleaned out by userspace
  - Userspace cleanup has a very long timeout of 5 minutes

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
For some machines, XDP may not be suitable due to ex. lack of XDP
support in NIC drivers or another program already being attached to
the XDP hook on the desired interface. Therefore, add an option to use
the tc-ingress hook instead of XDP to attach the pping ingress BPF
program on.

Also, change the tc programs to return TC_ACT_OK instead of
BPF_OK. While both should be compatible, the TC_ACT_* return codes
seem to be more commonly used for TC-BPF programs.

Concerns with this commit:
- While only the XDP XOR the tc ingress program is attached, both of
them are loaded
  - This also worsens the long and CPU heavy load when starting pping
- The error messages for XDP attach failure has gotten slightly less
descriptive. I plan to improve the code for attaching and detaching
XDP programs in a separate commit, and will then address that.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make several changes to functions related to attaching and detaching
the BPF programs:
- Check the BPF program id when detaching programs to ensure that the
  correct programs are removed.
- When attaching tc-programs, keep track of if the clsact qdisc was
  created or existed previously. Attempt to delete the qdisc if it was
  created and attaching failed. If the --force argument was given, also
  attempt to delete qdisc on shutdown in case it did not previously
  exist.
- Rely on XDP flags to replace existing XDP program if --force is used
  rather than explicitly detaching any XDP program first.
- Print out hints for why pping might have failed attaching the XDP
  program.

Also, use libbpf_strerror instead of strerror to better display
libbpf-specific error codes, and for more reliable error handling in
general (don't need to ensure the error codes are positive).

Finally, change return codes of tc programs to TC_ACT_UNSPEC from
TC_ACT_OK to allow other TC-BPF programs to be used on the same
interface as pping.

Concerns with this commit:
- The libbpf submodule has not yet been updated with the patch that
  makes bpf_tc_hook_create return EEXIST if the hook already
  exists. Therefore, pping will incorrectly think it created the qdisc
  when it previously existed, and may incorrectly attempt to delete it
  on failure or program shutdown.
- The tc_attach and xdp_attach functions attempt to return the u32
  prog_id in an int. In case the programs are assigned a very high
  id (> 2^31) this may cause it to be interpreted as an error instead.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The rate-limit and cleanup-interval arguments were only verified to be
positive. Add a check for an upper bound to avoid user being able to
pass values that result in an internal overflow.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Only push flow events for opening/closing flows if the
creation/deletion of the flow-state was successful (as indicated by
the bpf_map_*_elem() return value). This should avoid outputting
several flow creation/deletion messages in case multiple instances are
trying to create/delete a flow concurrently, as could theoretically
occur previously.

Also set the last_timestamp value before creating a new flow, to avoid
a race condition where the userspace cleanup might incorrectly
determine that a flow is old before the last_timestamp value can be
set. Explicitly skip the rate-limit for the first packet of a new flow
to avoid it failing the rate-limit. This also fixes an issue where the
first packet of a new flow would previously fail the rate-limit if the
rate-limit was higher than current time uptime (CLOCK_MONOTONIC).

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add an option (-R, --rtt-rate) to adapt the rate sampling based on the
RTT of the flow. The sampling rate will be C * RTT, where C is a
configurable constant (ex 1.0 to get one sample every RTT), and RTT
is either the current minimum (default) or smoothed RTT of the
flow (chosen via the -T or --rtt-type option).

The smoothed RTT (sRTT) is updated for each calculated RTT, and is
calculated in a similar manner to srtt in the kernel's TCP stack. The
sRTT is a moving average of all RTTs, and is calculated according to
the formula:

  srtt = 7/8 * prev_srtt + 1/8 * rtt

To allow the user to pass a non-integer C (ex 0.1 to get 10 RTT
samples for every RTT-period), fixed-point arithmetic have been used
in the eBPF programs (due to lack of support for floats). The maximum
value for C has been limited to 10000 in order for it to be unlikely
that the C * RTT calculation will overflow (with C = 10000, overflow
will only occur if RTT > 28 seconds).

Concerns with this commit:
- If RTT-based rate sampling is used, the update rate of flow RTT
  stats depends on the current value of the RTT. A flow with a high
  RTT will create RTT samples less frequently, and thus the RTT of
  that flow  will also update less frequently. This could potentially
  be undesired/unexpected behavior.
  - A consequence of this is that the sRTT could generally be expected
    to more quickly adept to growing RTTs than shrinking RTTs, as it
    will be updated more frequently if it starts with lower RTT
    values. The minimum RTT does not have this issue as it does not
    grow.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Perform both timestamping and matching on both ingress and egress
hooks. This makes it more similar to Kathie's pping, allowing the tool
to capture RTTs in both directions when deployed on just a single
interface.

Like Kathie's pping, by default filter out RTTs for packets going to
the local machine (will only include local processing delays). This
behavior can be disabled by passing the -l/--localfilt-off option.

As packets that are timestamped on ingress and matched on egress will
include the local machines processing delay, add the "match_on_egress"
member to the JSON output that can be used to differentiate between
RTTs that include the local processing delay, and those which don't.

Finally, report the source and destination addresses from the perspective
of the reply packet, rather than the timestamped packet, to be
consistent with Kathie's pping.

Overall, refactor large parts of pping_kern to allow both timestamping
and matching, as well as updating both the flow and reverse flow and
handle flow-events related to them, in one go. Also update README to
reflect changes.

Concerns with this commit:
- Performance may be worse due to the increased complexity of handling
  both directions of the flow. Additionally, if local-filtering is
  used (enabled by default), will also have to perform a fib-lookup.
- For the standard and ppviz output formats, it's not possible to tell
  if the reported RTT includes delays from local processing or not.
- No longer works kernel 5.4 (verifier seems to object against lookup
  of config.use_srtt).

Other notes:
- Verifier seems to have a much easier time verifying the refactored
  code, "only" processing 156k instructions (down from ~850k) despite
  the overall increase in complexity.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
@simosund simosund force-pushed the pping_miscellaneous_fixes branch from c3d5df9 to 3d1a250 Compare July 12, 2021 17:56
Wait with sending a flow open message until a reply has been seen for
the flow. Likewise, only emit a flow closing event if the flow has
first been opened (that is, a reply has been seen).

Concerns with this commit:
- Will no longer report any RTT for packets that close the flow (will
  delete the relevant flow_state before attempting to match the
  packet).
- Introduces potential concurrency issues for flow opening/closing
  messages (although I deem them unlikely). The README has been
  updated to describe these.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
@simosund
Copy link
Contributor Author

One small issue I've also noticed is that since the updated libbpf version, pping will now give out the error message "libbpf: Kernel error message: Exclusivity flag on, cannot modify" if there already is a clsact qdisc on the interface (libbpf seems to give that warning simply from attempting to create a tc hook, even if I handle the case with EEXIST). The program still runs as expected though.

Update README with an introduction motivating why an eBPF
implementation of pping is useful. Also add a figure description for
the design figure.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Flip the old design figure upside down, so that the userspace part
ends up on top and the kernel space part in the bottom.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
@simosund simosund marked this pull request as draft December 8, 2021 08:36
@simosund
Copy link
Contributor Author

simosund commented Dec 8, 2021

As quite a few commits with various unrelated changes and fixes have been added to this PR, is is probabably a bit of a nightmare to try and review. I will therefore try to split up the changes in this PR into multiple smaller PRs that I will start up in the comming days. Will keep this PR around as a draft for now simply as a point of reference for the other PRs.

@simosund
Copy link
Contributor Author

Everything in this PR has already been merged through other PRs, so closing this now.

@simosund simosund closed this Mar 10, 2022
@simosund simosund deleted the pping_miscellaneous_fixes branch May 31, 2022 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants