PPing - Add introduction and figure description #33

Draft
wants to merge 1 commit into
base: main
132 changes: 100 additions & 32 deletions pping/README.md
@@ -1,18 +1,70 @@
# PPing using XDP and TC-BPF
A re-implementation of [Kathie Nichols' passive ping
(pping)](https://github.com/pollere/pping) utility using XDP (on ingress) and
TC-BPF (on egress) for the packet capture logic.
# ePPing - extending PPing with BPF
A re-implementation of [Kathie Nichols' passive ping (PPing)][k-pping] utility
using XDP and TC-BPF for the packet capture logic.

## Why implement PPing in BPF?
When evaluating network performance the focus has traditionally been on
throughput. However, for many applications latency may be an equally or even
more important metric. Being able to measure network latency is therefore
essential for understanding network performance, and may also prove useful when
troubleshooting applications or network misconfigurations.

The most well known tool for measuring network latency is probably ping, which
reports a Round Trip Time (RTT) to a target node by sending a message and
measuring the time until it gets a response. Ping is universally available due
to being standardized in the ICMP protocol, which usually makes it a good first
choice for determining the idle latency between two nodes. Several other tools,
such as [hping][hping], [IRTT][IRTT] and [netlatency][netlatency], extend this
basic approach of sending active network probes to other protocols. However,
active network measurements have some drawbacks:

1. They inject additional packets on the network, and may therefore affect the
normal network traffic. While individual network probes likely have
negligible effects, frequent probes required to get high resolution RTT
metrics could have a considerable impact.
2. They need to send probes between each pair of nodes of interest. This can be
cumbersome and does not scale well to large networks with many nodes.
3. They report the latency experienced by the probe packet, which may differ
from the latency experienced by normal network traffic. Network probes may be
treated differently than other traffic due to, for example, AQM, load balancing
and various middleboxes. Additionally, network probes are typically
relatively sparse, and will likely fail to capture latency introduced by
bufferbloat if run on an idle network.

Passive network monitoring avoids these issues by using a different
approach. Instead of sending out additional network traffic, passive monitoring
inspects existing traffic. Passive Ping (PPing) thus reports RTTs by looking at
the latency experienced by existing traffic. PPing therefore adds no network
overhead, can report RTTs to any host whose traffic it can see in both
directions, regardless of whether it is run on an endhost or a middlebox, and
the reported RTTs correspond to the latency experienced by the real traffic.

Kathleen Nichols proved the feasibility of this approach by implementing
[PPing][k-pping] for TCP traffic, based on the TCP timestamp option. Kathie's C++
implementation, like most userspace programs, uses the traditional but rather
inefficient technique of copying packets to userspace and parsing them there. At
high line rates copying all packets to userspace is very resource demanding, and
it may not be possible for the program to keep up with the network traffic,
leading to it missing packets.

With ePPing we want to leverage the power of BPF to fix this inefficiency, and
thereby enable ePPing to work at much higher line rates while maintaining a low
monitoring overhead. Using BPF, the packets can be parsed directly in kernel
space while passing through the network stack, thereby avoiding copying packets
to userspace altogether. While we're at it, we are also adding some new
features, such as a JSON output mode and support for additional protocols
beyond TCP.

## Simple description
Passive Ping (PPing) is a simple tool for passively measuring per-flow RTTs. It
ePPing is a simple tool for passively measuring per-flow RTTs. It
can be used on endhosts as well as any (BPF-capable Linux) device which can see
both directions of the traffic (e.g. a router or middlebox). Currently it only works
for TCP traffic which uses the TCP timestamp option, but could be extended to
also work with for example TCP seq/ACK numbers, the QUIC spinbit and ICMP
echo-reply messages. See the [TODO-list](./TODO.md) for more potential features
(which may or may not ever get implemented).

The fundamental logic of pping is to timestamp a pseudo-unique identifier for
The fundamental logic of ePPing is to timestamp a pseudo-unique identifier for
outgoing packets, and then look for matches in the incoming packets. If a match
is found, the RTT is simply calculated as the time difference between the
current time and the stored timestamp.
@@ -29,10 +81,10 @@ matched differs from the one in Kathie's pping, and is further described in
[SAMPLING_DESIGN](./SAMPLING_DESIGN.md).

## Output formats
pping currently supports 3 different formats, *standard*, *ppviz* and *json*. In
general, the output consists of two different types of events, flow-events which
gives information that a flow has started/ended, and RTT-events which provides
information on a computed RTT within a flow.
ePPing currently supports three different formats: *standard*, *ppviz* and
*json*. In general, the output consists of two different types of events:
flow-events, which give information that a flow has started/ended, and
RTT-events, which provide information on a computed RTT within a flow.

### Standard format
The standard format is quite similar to the default output of Kathie's pping, and
@@ -51,9 +103,9 @@ An example of the format is provided below:

### ppviz format
The ppviz format is primarily intended to be used to generate data that can be
visualized by Kathie's [ppviz](https://github.com/pollere/ppviz) tool. The
format is essentially a CSV format, using a single space as the separator, and
is further described [here](http://www.pollere.net/ppviz.html).
visualized by Kathie's [ppviz][ppviz] tool. The format is essentially a CSV
format, using a single space as the separator, and is further described
[here](http://www.pollere.net/ppviz.html).

Note that the optional *FBytes*, *DBytes* and *PBytes* from the format
specification have not been included here, and do not appear to be used by
@@ -114,6 +166,13 @@ An example of a (pretty-printed) RTT-event is provided below:
## Design and technical description
![Design of eBPF pping](./eBPF_pping_design.png)

ePPing consists of two major components, the kernel space BPF program and
the userspace program. The BPF program parses incoming and outgoing packets, and
uses BPF maps to store packet timestamps as well as some state about each
flow. When the BPF program can match a reply packet against one of the stored
packet timestamps, it pushes the calculated RTT to the userspace program which
in turn prints it out.
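
As a rough illustration of the userspace half of this split, the sketch below formats one RTT event into a printable line. The struct `rtt_event_sketch`, its fields and the output layout are assumptions made for this example only. They do not reflect the actual event struct or output format used by ePPing, and the code that receives events from the kernel is left out entirely.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical event layout pushed from the BPF program to userspace.
 * The real ePPing event carries different (and more) fields. */
struct rtt_event_sketch {
    uint64_t timestamp_ns; /* when the reply packet was seen */
    uint64_t rtt_ns;       /* RTT calculated in the kernel */
    uint32_t flow_id;      /* stand-in for the flow's address/port tuple */
};

/* Userspace side: turn one event into a human-readable line with the RTT
 * in milliseconds (microsecond resolution). Returns the snprintf result,
 * i.e. the number of characters the full line needs. */
static int format_rtt_event(const struct rtt_event_sketch *ev, char *buf,
                            size_t len)
{
    assert(ev != NULL && buf != NULL);
    return snprintf(buf, len,
                    "flow %" PRIu32 ": rtt %" PRIu64 ".%03" PRIu64 " ms",
                    ev->flow_id, ev->rtt_ns / 1000000,
                    (ev->rtt_ns / 1000) % 1000);
}
```

Keeping the kernel-side event small and doing all formatting in userspace is one way to keep the per-packet work in the BPF program low, which is the kind of division of labour the paragraph above describes.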

### Files:
- **pping.c:** Userspace program that loads and attaches the BPF programs, pulls
the perf-buffer `rtt_events` to print out RTT messages and periodically cleans
@@ -168,7 +227,7 @@ correctly.
As the BPF programs may run concurrently on different CPU cores accessing these
global hash maps, this may result in some concurrency issues. In practice, I do
not believe these will occur particularly often, as I'm under the impression
that packets from the same flow will typically be processed by the some
that packets from the same flow will typically be processed by the same
CPU. Furthermore, most of the concurrency issues will not be that problematic
even if they do occur. For now, I've therefore left these concurrency issues
unattended, even if some of them could be avoided with atomic operations and/or
@@ -259,23 +318,32 @@ timestamps only being updated at a limited rate (1000 Hz).
Passively measuring the RTT for TCP traffic is not a novel concept, and there
exists a number of other tools that can do so. A good overview of how passive
RTT calculation using TCP timestamps (as in this project) works is provided in
[this paper](https://doi.org/10.1145/2523426.2539132) from 2013.

- [pping](https://github.com/pollere/pping): This project is largely a
re-implementation of Kathie's pping, but by using BPF and XDP as well as
implementing some filtering logic the hope is to be able to create a always-on
tool that can scale well even to large amounts of massive flows.
- [ppviz](https://github.com/pollere/ppviz): Web-based visualization tool for
the "machine-friendly" (-m) output from Kathie's pping tool. Running this
implementation of pping with --format="ppviz" will generate output that can be
used by ppviz.
- [tcptrace](https://github.com/blitz/tcptrace): A post-processing tool which
can analyze a tcpdump file and among other things calculate RTTs based on
seq/ACK numbers (`-r` or `-R` flag).
[this paper][passive-TCP-RTT] from 2013.

- [PPing][k-pping]: The original C++ implementation by Kathleen Nichols. Our
ePPing is largely a re-implementation of PPing in BPF and is thus heavily
inspired by Kathie's work.
- [ppviz][ppviz]: Web-based visualization tool for the "machine-friendly" (-m)
output from Kathie's PPing. Running ePPing with --format="ppviz" will
generate output that can be used by ppviz.
- [tcptrace][tcptrace]: A post-processing tool which can analyze a tcpdump file
and among other things calculate RTTs based on seq/ACK numbers (`-r` or `-R`
flag).
- **Dapper**: A passive TCP data plane monitoring tool implemented in P4 which
can among other things calculate the RTT based on the matching seq/ACK
numbers. [Paper](https://doi.org/10.1145/3050220.3050228). [Unofficial
source](https://github.com/muhe1991/p4-programs-survey/tree/master/dapper).
- [P4 Tofino TCP RTT measurement](https://github.com/Princeton-Cabernet/p4-projects/tree/master/RTT-tofino):
A passive TCP RTT monitor based on seq/ACK numbers implemented in P4 for
Tofino programmable switches. [Paper](https://doi.org/10.1145/3405669.3405823).
numbers. [Paper][dapper-paper]. [Unofficial source][dapper-source].
- [P4 Tofino TCP RTT measurement][P4RTT-source]: A passive TCP RTT monitor based
on seq/ACK numbers implemented in P4 for Tofino programmable
switches. [Paper][P4RTT-paper].

[passive-TCP-RTT]: https://doi.org/10.1145/2523426.2539132
[k-pping]: https://github.com/pollere/pping
[ppviz]: https://github.com/pollere/ppviz
[tcptrace]: https://github.com/blitz/tcptrace
[hping]: http://www.hping.org/
[IRTT]: https://github.com/heistp/irtt
[netlatency]: https://github.com/kontron/netlatency
[dapper-source]: https://github.com/muhe1991/p4-programs-survey/tree/master/dapper
[dapper-paper]: https://doi.org/10.1145/3050220.3050228
[P4RTT-source]: https://github.com/Princeton-Cabernet/p4-projects/tree/master/RTT-tofino
[P4RTT-paper]: https://doi.org/10.1145/3405669.3405823