
Commit dd24bce

Merge pull request #32 from simosund/pping_improve_capabilities

PPing core improvements

2 parents c14f52a + b8215e7

File tree

6 files changed: +925 -394 lines changed

pping/README.md: +49 -140
@@ -6,28 +6,36 @@ TC-BPF (on egress) for the packet capture logic.
 ## Simple description
 Passive Ping (PPing) is a simple tool for passively measuring per-flow RTTs. It
 can be used on endhosts as well as any (BPF-capable Linux) device which can see
-both directions of the traffic (e.g. a router or middlebox). Currently it only works
-for TCP traffic which uses the TCP timestamp option, but could be extended to
-also work with, for example, TCP seq/ACK numbers, the QUIC spinbit and ICMP
-echo-reply messages. See the [TODO-list](./TODO.md) for more potential features
-(which may or may not ever get implemented).
+both directions of the traffic (e.g. a router or middlebox). Currently it works for
+TCP traffic which uses the TCP timestamp option and ICMP echo messages, but
+could be extended to also work with, for example, TCP seq/ACK numbers, the QUIC
+spinbit and DNS queries. See the [TODO-list](./TODO.md) for more potential
+features (which may or may not ever get implemented).

 The fundamental logic of pping is to timestamp a pseudo-unique identifier for
-outgoing packets, and then look for matches in the incoming packets. If a match
-is found, the RTT is simply calculated as the time difference between the
-current time and the stored timestamp.
+packets, and then look for matches in the reply packets. If a match is found,
+the RTT is simply calculated as the time difference between the current time
+and the stored timestamp.

 This tool, just as Kathie's original pping implementation, uses TCP timestamps
-as identifiers. For outgoing packets, the TSval (which is a timestamp in and of
-itself) is timestamped. Incoming packets are then parsed for the TSecr, which
-are the echoed TSval values from the receiver. The TCP timestamps are not
-necessarily unique for every packet (they have a limited update frequency,
-which appears to be 1000 Hz for modern Linux systems), so only the first instance of
-an identifier is timestamped, and matched against the first incoming packet with
-the identifier. The mechanism to ensure only the first packet is timestamped and
-matched differs from the one in Kathie's pping, and is further described in
+as identifiers for TCP traffic. The TSval (which is a timestamp in and of
+itself) is used as an identifier and timestamped. Reply packets in the reverse
+flow are then parsed for the TSecr, which are the echoed TSval values from the
+receiver. The TCP timestamps are not necessarily unique for every packet (they
+have a limited update frequency, which appears to be 1000 Hz on modern Linux
+systems), so only the first instance of an identifier is timestamped, and
+matched against the first incoming packet with a matching reply identifier. The
+mechanism to ensure only the first packet is timestamped and matched differs
+from the one in Kathie's pping, and is further described in
 [SAMPLING_DESIGN](./SAMPLING_DESIGN.md).
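
Stripped to its essentials, that logic is just a hash map keyed on flow plus identifier. A minimal sketch in BPF-style C, with illustrative struct and map names rather than the actual definitions from `pping.h`/`pping_kern.c`:

```c
// Sketch of the fundamental timestamp/match logic. All names are
// illustrative stand-ins, not the real pping_kern.c definitions.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_tuple {
	__u32 saddr, daddr;
	__u16 sport, dport;
};

struct packet_id {
	struct flow_tuple flow;
	__u32 id; /* TSval on the outgoing side, TSecr on the reply side */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct packet_id);
	__type(value, __u64);
} packet_ts SEC(".maps");

/* Outgoing direction: timestamp the identifier. BPF_NOEXIST means only
 * the first packet carrying a given identifier gets a timestamp entry. */
static __always_inline void timestamp_packet(struct packet_id *pid)
{
	__u64 now = bpf_ktime_get_ns();

	bpf_map_update_elem(&packet_ts, pid, &now, BPF_NOEXIST);
}

/* Reply direction: key is the reversed flow plus the echoed identifier.
 * Returns the RTT in nanoseconds, or 0 if there is no match. */
static __always_inline __u64 match_packet(struct packet_id *pid)
{
	__u64 *ts = bpf_map_lookup_elem(&packet_ts, pid);

	if (!ts)
		return 0;
	return bpf_ktime_get_ns() - *ts;
}
```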
 
+For ICMP echo, it uses the ICMP echo identifier in place of port numbers, and
+the ICMP echo sequence number as the identifier to match against. Linux systems
+will typically use different echo identifiers for different instances of ping,
+and thus each ping instance will be recognized as a separate flow. Windows
+systems typically use a static echo identifier, and thus all instances of ping
+originating from a particular Windows host to the same target host will be
+considered a single flow.
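
Folding ICMP echo into the same scheme then only requires mapping the ICMP header fields onto the flow tuple and per-packet identifier. A sketch, reusing the illustrative `struct packet_id` from the block above:

```c
// Map ICMP echo fields onto the flow/identifier scheme (illustrative;
// struct packet_id is the sketch from above, not the real definition).
#include <linux/icmp.h>

static __always_inline void icmp_to_packet_id(struct icmphdr *icmp,
					      struct packet_id *pid)
{
	/* The echo identifier stands in for the port pair, so all echos
	 * sharing an identifier are treated as one flow. */
	pid->flow.sport = icmp->un.echo.id;
	pid->flow.dport = icmp->un.echo.id;
	/* The sequence number is the per-packet identifier that the echo
	 * reply will carry back unchanged. */
	pid->id = icmp->un.echo.sequence;
}
```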
+
 ## Output formats
 pping currently supports 3 different formats: *standard*, *ppviz* and *json*. In
 general, the output consists of two different types of events, flow-events which
@@ -41,12 +49,12 @@ single line per event.

 An example of the format is provided below:
 ```shell
-16:00:46.142279766 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from src
-16:00:46.147705205 5.425439 ms 5.425439 ms 10.11.1.1:5201+10.11.1.2:59528
-16:00:47.148905125 5.261430 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
-16:00:48.151666385 5.972284 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
-16:00:49.152489316 6.017589 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
-16:00:49.878508114 10.11.1.1:5201+10.11.1.2:59528 closing due to RST from dest
+16:00:46.142279766 TCP 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from dest
+16:00:46.147705205 5.425439 ms 5.425439 ms TCP 10.11.1.1:5201+10.11.1.2:59528
+16:00:47.148905125 5.261430 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
+16:00:48.151666385 5.972284 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
+16:00:49.152489316 6.017589 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
+16:00:49.878508114 TCP 10.11.1.1:5201+10.11.1.2:59528 closing due to RST from dest
 ```

 ### ppviz format
@@ -89,7 +97,7 @@ An example of a (pretty-printed) flow-event is provided below:
   "protocol": "TCP",
   "flow_event": "opening",
   "reason": "SYN-ACK",
-  "triggered_by": "src"
+  "triggered_by": "dest"
 }
 ```

@@ -107,7 +115,8 @@ An example of a (pretty-printed) RTT-event is provided below:
   "sent_packets": 9393,
   "sent_bytes": 492457296,
   "rec_packets": 5922,
-  "rec_bytes": 37
+  "rec_bytes": 37,
+  "match_on_egress": false
 }
 ```

@@ -116,136 +125,36 @@ An example of a (pretty-printed) RTT-event is provided below:

 ### Files:
 - **pping.c:** Userspace program that loads and attaches the BPF programs, pulls
-  the perf-buffer `rtt_events` to print out RTT messages and periodically cleans
+  the perf-buffer `events` to print out RTT messages and periodically cleans
   up the hash-maps from old entries. Also passes user options to the BPF
   programs by setting a "global variable" (stored in the programs .rodata
   section).
-- **pping_kern.c:** Contains the BPF programs that are loaded on tc (egress) and
-  XDP (ingress), as well as several common functions, a global constant `config`
-  (set from userspace) and map definitions. The tc program `pping_egress()`
-  parses outgoing packets for identifiers. If an identifier is found and the
-  sampling strategy allows it, a timestamp for the packet is created in
-  `packet_ts`. The XDP program `pping_ingress()` parses incoming packets for an
-  identifier. If found, it looks up the `packet_ts` map for a match on the
-  reverse flow (to match source/dest on egress). If there is a match, it
-  calculates the RTT from the stored timestamp and deletes the entry. The
-  calculated RTT (together with the flow-tuple) is pushed to the perf-buffer
-  `events`. Both `pping_egress()` and `pping_ingress()` can also push
-  flow-events to the `events` buffer.
+- **pping_kern.c:** Contains the BPF programs that are loaded on egress (tc) and
+  ingress (XDP or tc), as well as several common functions, a global constant
+  `config` (set from userspace) and map definitions. Essentially the same pping
+  program is loaded on both ingress and egress. All packets are parsed for both
+  an identifier that can be used to create a timestamp entry in `packet_ts`, and
+  a reply identifier that can be used to match the packet with a previously
+  timestamped one in the reverse flow. If a match is found, an RTT is calculated
+  and an RTT-event is pushed to userspace through the perf-buffer `events`. For
+  each packet with a valid identifier, the program also keeps track of and
+  updates the state of the flow and its reverse flow, stored in the `flow_state`
+  map.
 - **pping.h:** Common header file included by `pping.c` and
   `pping_kern.c`. Contains some common structs used by both (some of which are
   part of the maps).

 ### BPF Maps:
 - **flow_state:** A hash-map storing some basic state for each flow, such as the
   last seen identifier for the flow and when the last timestamp entry for the
-  flow was created. Entries are created by `pping_egress()`, and can be updated
-  or deleted by both `pping_egress()` and `pping_ingress()`. Leftover entries
-  are eventually removed by `pping.c`.
+  flow was created. Entries are created, updated and deleted by the BPF pping
+  programs. Leftover entries are eventually removed by userspace (`pping.c`).
 - **packet_ts:** A hash-map storing a timestamp for a specific packet
-  identifier. Entries are created by `pping_egress()` and removed by
-  `pping_ingress()` if a match is found. Leftover entries are eventually removed
-  by `pping.c`.
+  identifier. Entries are created by the BPF pping program if a valid identifier
+  is found, and removed if a match is found. Leftover entries are eventually
+  removed by userspace (`pping.c`).
 - **events:** A perf-buffer used by the BPF programs to push flow or RTT events
   to `pping.c`, which continuously polls the map and prints them out.
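
In libbpf's BTF-style map syntax, these three maps could be declared roughly as follows. This is a sketch: the real key/value structs live in `pping.h` and carry more fields, and `flow_tuple`/`packet_id` are the illustrative types from the first sketch above:

```c
// Illustrative BTF-style declarations of the three maps; simplified
// stand-ins for the real definitions in pping.h.
struct flow_state {
	__u64 last_timestamp;        /* when the last timestamp entry was made */
	__u32 last_id;               /* last seen identifier for the flow */
	__u64 sent_pkts, sent_bytes; /* flow statistics, reported */
	__u64 rec_pkts, rec_bytes;   /* with each RTT-event */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct flow_tuple);   /* one entry per flow */
	__type(value, struct flow_state);
} flow_state SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct packet_id);    /* flow + identifier */
	__type(value, __u64);             /* bpf_ktime_get_ns() at timestamping */
} packet_ts SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} events SEC(".maps");
```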

-### A note on concurrency
-The program uses "global" (not `PERCPU`) hash maps to keep state. As the BPF
-programs need to see the global view to function properly, using `PERCPU` maps
-is not an option. The program must be able to match against stored packet
-timestamps regardless of the CPU the packets are processed on, and must also
-have a global view of the flow state in order for the sampling to work
-correctly.
-
-As the BPF programs may run concurrently on different CPU cores accessing these
-global hash maps, this may result in some concurrency issues. In practice, I do
-not believe these will occur particularly often, as I'm under the impression
-that packets from the same flow will typically be processed by the same
-CPU. Furthermore, most of the concurrency issues will not be that problematic
-even if they do occur. For now, I've therefore left these concurrency issues
-unattended, even if some of them could be avoided with atomic operations and/or
-spinlocks, in order to keep things simple and not hurt performance.
-
-The (known) potential concurrency issues are:
-
-#### Tracking last seen identifier
-The tc/egress program keeps track of the last seen outgoing identifier for each
-flow, by storing it in the `flow_state` map. This is done to detect the first
-packet with a new identifier. If multiple packets are processed concurrently,
-several of them could potentially detect themselves as being first with the same
-identifier (which only matters if they also pass the rate-limit check),
-alternatively if the concurrent packets have different identifiers there may be
-a lost update (but for TCP timestamps, concurrent packets would typically be
-expected to have the same timestamp).
-
-A possibly more severe issue is out-of-order packets. If a packet with an old
-identifier arrives out of order, that identifier could be detected as a new
-identifier. If for example the following flow of four packets with just two
-different identifiers (id1 and id2) were to occur:
-
-id1 -> id2 -> id1 -> id2
-
-Then the tc/egress program would consider each of these packets to have a new
-identifier and try to create a new timestamp for each of them if the sampling
-strategy allows it. However, even if the sampling strategy allows it, the
-(incorrect) creation of timestamps for id1 and id2 the second time would only be
-successful in case the first timestamps for id1 and id2 have already been
-matched against (and thus deleted). Even if that is the case, they would only
-result in reporting an incorrect RTT in case there are also new matches against
-these identifiers.
-
-This issue could be avoided entirely by requiring that new-id > old-id instead
-of simply checking that new-id != old-id, as TCP timestamps should monotonically
-increase. That may however not be a suitable solution if/when we add support for
-other types of identifiers.
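
For 32-bit TCP timestamps, such a greater-than check would need to be wraparound-safe; the standard serial-number trick is a signed difference. A sketch of what that could look like:

```c
/* Wraparound-safe "new-id > old-id" check for 32-bit TCP timestamps,
 * using the same serial-number arithmetic as PAWS: subtract modulo 2^32
 * and interpret the result as signed. */
static __always_inline int ts_is_newer(__u32 new_id, __u32 old_id)
{
	return (__s32)(new_id - old_id) > 0;
}
```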
-
-#### Rate-limiting new timestamps
-In the tc/egress program packets to timestamp are sampled by using a per-flow
-rate-limit, which is enforced by storing when the last timestamp was created in
-the `flow_state` map. If multiple packets perform this check concurrently, it's
-possible that multiple packets think they are allowed to create timestamps
-before any of them are able to update the `last_timestamp`. When they update
-`last_timestamp` it might also be slightly incorrect, however if they are
-processed concurrently then they should also generate very similar timestamps.
-
-If the packets have different identifiers (which would typically not be
-expected for concurrent TCP timestamps), then this would allow some packets to
-bypass the rate-limit. By bypassing the rate-limit, the flow would use up some
-additional map space and report some additional RTT(s) more than expected
-(however the reported RTTs should still be correct).
-
-If the packets have the same identifier, they must first have managed to bypass
-the previous check for unique identifiers (see [previous point](#Tracking last
-seen identifier)), and only one of them will be able to successfully store a
-timestamp entry.
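
The window in question is the gap between reading and writing `last_timestamp`. Sketched with the illustrative `struct flow_state` from above:

```c
/* Check-then-update rate limiting. The comment marks the window where a
 * packet on another CPU can pass the same check before either store. */
static __always_inline int rate_limit_ok(struct flow_state *f_state,
					 __u64 now, __u64 rate_limit_ns)
{
	if (now - f_state->last_timestamp < rate_limit_ns)
		return 0;              /* too soon, do not timestamp */
	/* <-- concurrent packets may all reach this point */
	f_state->last_timestamp = now; /* plain store: lost updates possible */
	return 1;
}
```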
-
-#### Matching against stored timestamps
-The XDP/ingress program could potentially match multiple concurrent packets with
-the same identifier against a single timestamp entry in `packet_ts`, before any
-of them manage to delete the timestamp entry. This would result in multiple RTTs
-being reported for the same identifier, but if they are processed concurrently
-these RTTs should be very similar, so this would mainly result in over-reporting
-rather than reporting incorrect RTTs.
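
If this ever needed fixing, one option would be to rely on the return value of the delete itself, since for a hash map only one of several concurrent callers can successfully delete a given entry. A sketch (not what pping currently does), reusing the `packet_ts` sketch from above:

```c
/* Read the timestamp, then only treat it as a valid match if we are the
 * caller whose delete succeeded: bpf_map_delete_elem() returns 0 for
 * exactly one of several concurrent deleters of the same entry. */
static __always_inline int take_timestamp(struct packet_id *pid,
					  __u64 *ts_out)
{
	__u64 *ts = bpf_map_lookup_elem(&packet_ts, pid);

	if (!ts)
		return -1;
	*ts_out = *ts;
	return bpf_map_delete_elem(&packet_ts, pid); /* 0 only for the winner */
}
```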
-
-#### Updating flow statistics
-Both the tc/egress and XDP/ingress programs will try to update some flow
-statistics each time they successfully parse a packet with an
-identifier. Specifically, they'll update the number of packets and bytes
-sent/received. This is not done in an atomic fashion, so there could potentially
-be some lost updates resulting in an underestimate.
-
-Furthermore, whenever the XDP/ingress program calculates an RTT, it will check
-if this is the lowest RTT seen so far for the flow. If multiple RTTs are
-calculated concurrently, then several could pass this check concurrently and
-there may be a lost update. It should only be possible for multiple RTTs to be
-calculated concurrently in case either the [timestamp rate-limit was
-bypassed](#Rate-limiting new timestamps) or [multiple packets managed to match
-against the same timestamp](#Matching against stored timestamps).
-
-It's worth noting that with sampling the reported minimum-RTT is only an
-estimate anyway (it may never calculate the RTT for the packet with the true
-minimum RTT). And even without sampling there is some inherent sampling due to
-TCP timestamps only being updated at a limited rate (1000 Hz).
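
Of these, the packet/byte counters are the one case with a cheap textbook fix, as BPF supports atomic adds on map values. A sketch, again using the illustrative `struct flow_state`:

```c
/* __sync_fetch_and_add() compiles to an atomic add on the BPF target,
 * so concurrent CPUs cannot lose each other's increments. */
static __always_inline void count_packet(struct flow_state *f_state,
					 __u64 bytes)
{
	__sync_fetch_and_add(&f_state->sent_pkts, 1);
	__sync_fetch_and_add(&f_state->sent_bytes, bytes);
}
```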

 ## Similar projects
 Passively measuring the RTT for TCP traffic is not a novel concept, and there
