@@ -6,28 +6,36 @@ TC-BPF (on egress) for the packet capture logic.
## Simple description
Passive Ping (PPing) is a simple tool for passively measuring per-flow RTTs. It
can be used on endhosts as well as any (BPF-capable Linux) device which can see
- both directions of the traffic (ex router or middlebox). Currently it only works
- for TCP traffic which uses the TCP timestamp option, but could be extended to
- also work with for example TCP seq/ACK numbers, the QUIC spinbit and ICMP
- echo-reply messages. See the [TODO-list](./TODO.md) for more potential features
- (which may or may not ever get implemented).
+ both directions of the traffic (e.g. a router or middlebox). Currently it works
+ for TCP traffic which uses the TCP timestamp option and for ICMP echo messages,
+ but could be extended to also work with, for example, TCP seq/ACK numbers, the
+ QUIC spinbit and DNS queries. See the [TODO-list](./TODO.md) for more potential
+ features (which may or may not ever get implemented).

The fundamental logic of pping is to timestamp a pseudo-unique identifier for
- outgoing packets, and then look for matches in the incoming packets. If a match
- is found, the RTT is simply calculated as the time difference between the
- current time and the stored timestamp.
+ packets, and then look for matches in the reply packets. If a match is found,
+ the RTT is simply calculated as the time difference between the current time
+ and the stored timestamp.
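
As a rough sketch of that logic (with hypothetical helper names such as
`parse_identifier()` and `report_rtt()`; the real code lives in `pping_kern.c`
and differs in detail), the per-packet processing boils down to:

```c
#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: struct parsing_context, struct packet_id and the helper
 * functions are stand-ins, and packet_ts is assumed to be a hash map
 * keyed on flow + identifier. */
static void handle_packet(struct parsing_context *pctx)
{
	struct packet_id pid, reply_pid;
	__u64 now = bpf_ktime_get_ns();
	__u64 *sent_ts;

	/* Timestamp the pseudo-unique identifier of this packet */
	if (parse_identifier(pctx, &pid) == 0)
		bpf_map_update_elem(&packet_ts, &pid, &now, BPF_NOEXIST);

	/* Match the echoed identifier against a timestamp stored for
	 * the reverse flow */
	if (parse_reply_identifier(pctx, &reply_pid) == 0) {
		sent_ts = bpf_map_lookup_elem(&packet_ts, &reply_pid);
		if (sent_ts) {
			report_rtt(now - *sent_ts); /* RTT = now - stored timestamp */
			bpf_map_delete_elem(&packet_ts, &reply_pid);
		}
	}
}
```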

This tool, just as Kathie's original pping implementation, uses TCP timestamps
- as identifiers. For outgoing packets, the TSval (which is a timestamp in and off
- itself) is timestamped. Incoming packets are then parsed for the TSecr, which
- are the echoed TSval values from the receiver. The TCP timestamps are not
- necessarily unique for every packet (they have a limited update frequency,
- appears to be 1000 Hz for modern Linux systems), so only the first instance of
- an identifier is timestamped, and matched against the first incoming packet with
- the identifier. The mechanism to ensure only the first packet is timestamped and
- matched differs from the one in Kathie's pping, and is further described in
+ as identifiers for TCP traffic. The TSval (which is a timestamp in and of
+ itself) is used as the identifier and is timestamped. Reply packets in the
+ reverse flow are then parsed for the TSecr, which is the echoed TSval from the
+ receiver. The TCP timestamps are not necessarily unique for every packet (they
+ have a limited update frequency, which appears to be 1000 Hz on modern Linux
+ systems), so only the first instance of an identifier is timestamped and
+ matched against the first incoming packet with a matching reply identifier. The
+ mechanism to ensure only the first packet is timestamped and matched differs
+ from the one in Kathie's pping, and is further described in
[SAMPLING_DESIGN](./SAMPLING_DESIGN.md).
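
Both values travel in the TCP timestamp option (RFC 7323: kind 8, length 10). As
a point of reference, a minimal C view of that option, as one might define it
for parsing, looks like this:

```c
#include <linux/types.h>

/* TCP timestamp option (RFC 7323): kind = 8, length = 10 bytes. */
struct tcp_timestamp_opt {
	__u8 kind;      /* 8 */
	__u8 len;       /* 10 */
	__be32 tsval;   /* sender's timestamp: used as the outgoing identifier */
	__be32 tsecr;   /* echoed timestamp: used as the reply identifier */
} __attribute__((packed));
```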

+ For ICMP echo, pping uses the echo identifier in place of port numbers, and the
+ echo sequence number as the identifier to match against. Linux systems will
+ typically use different echo identifiers for different instances of ping, so
+ each ping instance will be recognized as a separate flow. Windows systems
+ typically use a static echo identifier, so all instances of ping between a
+ particular Windows host and the same target host will be considered a single flow.
+
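
In C terms, the ICMP echo fields involved are the following (this mirrors the
standard ICMP header layout, not any pping-specific type):

```c
#include <linux/types.h>

/* Standard ICMP echo header: pping, as described above, treats id like
 * a port number for the flow, and sequence as the matching identifier. */
struct icmp_echo_hdr {
	__u8 type;        /* ICMP_ECHO (8) or ICMP_ECHOREPLY (0) */
	__u8 code;        /* 0 for echo messages */
	__be16 checksum;
	__be16 id;        /* used in place of the port numbers */
	__be16 sequence;  /* used as the identifier to match on */
};
```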
## Output formats
pping currently supports 3 different formats: *standard*, *ppviz* and *json*. In
general, the output consists of two different types of events: flow-events which
@@ -41,12 +49,12 @@ single line per event.

An example of the format is provided below:
```shell
- 16:00:46.142279766 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from src
- 16:00:46.147705205 5.425439 ms 5.425439 ms 10.11.1.1:5201+10.11.1.2:59528
- 16:00:47.148905125 5.261430 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
- 16:00:48.151666385 5.972284 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
- 16:00:49.152489316 6.017589 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
- 16:00:49.878508114 10.11.1.1:5201+10.11.1.2:59528 closing due to RST from dest
+ 16:00:46.142279766 TCP 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from dest
+ 16:00:46.147705205 5.425439 ms 5.425439 ms TCP 10.11.1.1:5201+10.11.1.2:59528
+ 16:00:47.148905125 5.261430 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
+ 16:00:48.151666385 5.972284 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
+ 16:00:49.152489316 6.017589 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
+ 16:00:49.878508114 TCP 10.11.1.1:5201+10.11.1.2:59528 closing due to RST from dest
```

### ppviz format
@@ -89,7 +97,7 @@ An example of a (pretty-printed) flow-event is provided below:
  "protocol": "TCP",
  "flow_event": "opening",
  "reason": "SYN-ACK",
- "triggered_by": "src"
+ "triggered_by": "dest"
}
```

@@ -107,7 +115,8 @@ An example of a (pretty-printed) RTT-event is provided below:
  "sent_packets": 9393,
  "sent_bytes": 492457296,
  "rec_packets": 5922,
- "rec_bytes": 37
+ "rec_bytes": 37,
+ "match_on_egress": false
}
```

@@ -116,136 +125,36 @@ An example of a (pretty-printed) RTT-event is provided below:

### Files:
- **pping.c:** Userspace program that loads and attaches the BPF programs, pulls
- the perf-buffer `rtt_events` to print out RTT messages and periodically cleans
+ the perf-buffer `events` to print out RTT messages and periodically cleans
up the hash-maps from old entries. Also passes user options to the BPF
programs by setting a "global variable" (stored in the programs' .rodata
section).
- - **pping_kern.c:** Contains the BPF programs that are loaded on tc (egress) and
- XDP (ingress), as well as several common functions, a global constant `config`
- (set from userspace) and map definitions. The tc program `pping_egress()`
- parses outgoing packets for identifiers. If an identifier is found and the
- sampling strategy allows it, a timestamp for the packet is created in
- `packet_ts`. The XDP program `pping_ingress()` parses incoming packets for an
- identifier. If found, it looks up the `packet_ts` map for a match on the
- reverse flow (to match source/dest on egress). If there is a match, it
- calculates the RTT from the stored timestamp and deletes the entry. The
- calculated RTT (together with the flow-tuple) is pushed to the perf-buffer
- `events`. Both `pping_egress()` and `pping_ingress()` can also push flow-events
- to the `events` buffer.
+ - **pping_kern.c:** Contains the BPF programs that are loaded on egress (tc) and
+ ingress (XDP or tc), as well as several common functions, a global constant
+ `config` (set from userspace) and map definitions. Essentially the same pping
+ program is loaded on both ingress and egress. All packets are parsed for both
+ an identifier that can be used to create a timestamp entry in `packet_ts`, and
+ a reply identifier that can be used to match the packet against a previously
+ timestamped one from the reverse flow. If a match is found, an RTT is calculated
+ and an RTT-event is pushed to userspace through the perf-buffer `events`. For
+ each packet with a valid identifier, the program also keeps track of and updates
+ the state of the flow and its reverse flow, stored in the `flow_state` map.
- **pping.h:** Common header file included by `pping.c` and
`pping_kern.c`. Contains some common structs used by both (they are part of
the maps); a simplified example is sketched below.
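
A simplified illustration of what such shared definitions might look like
(hypothetical field names and layout, not the actual contents of `pping.h`):

```c
#include <linux/types.h>
#include <linux/in6.h>

/* Hypothetical sketch in the spirit of pping.h; the real structs differ. */
struct network_tuple {
	struct in6_addr saddr;   /* IPv4 addresses as IPv4-mapped IPv6 */
	struct in6_addr daddr;
	__u16 sport;             /* port, or ICMP echo identifier */
	__u16 dport;
	__u8 proto;
};

struct packet_id {
	struct network_tuple flow;
	__u32 identifier;        /* e.g. TCP TSval or ICMP sequence number */
};
```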

### BPF Maps:
- **flow_state:** A hash-map storing some basic state for each flow, such as the
last seen identifier for the flow and when the last timestamp entry for the
- flow was created. Entries are created by `pping_egress()`, and can be updated
- or deleted by both `pping_egress()` and `pping_ingress()`. Leftover entries
- are eventually removed by `pping.c`.
+ flow was created. Entries are created, updated and deleted by the BPF pping
+ programs. Leftover entries are eventually removed by userspace (`pping.c`).
- **packet_ts:** A hash-map storing a timestamp for a specific packet
- identifier. Entries are created by `pping_egress()` and removed by
- `pping_ingress()` if a match is found. Leftover entries are eventually removed
- by `pping.c`.
+ identifier. Entries are created by the BPF pping program if a valid identifier
+ is found, and removed when a match is found. Leftover entries are eventually
+ removed by userspace (`pping.c`).
- **events:** A perf-buffer used by the BPF programs to push flow or RTT events
to `pping.c`, which continuously polls the map and prints them out.
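
To make the map roles concrete, an illustrative set of declarations in libbpf's
BTF map syntax could look as follows (key/value types and sizes are guesses for
the sketch, reusing the hypothetical structs from above, not the actual
definitions in `pping_kern.c`):

```c
#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical per-flow state, matching the description above. */
struct flow_state {
	__u32 last_id;         /* last seen identifier */
	__u64 last_timestamp;  /* when the last timestamp entry was created */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct network_tuple);  /* the flow */
	__type(value, struct flow_state);   /* per-flow bookkeeping */
} flow_state SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct packet_id);  /* flow + identifier */
	__type(value, __u64);           /* timestamp (ns) for that identifier */
} packet_ts SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} events SEC(".maps");
```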

- ### A note on concurrency
- The program uses "global" (not `PERCPU`) hash maps to keep state. As the BPF
- programs need to see the global view to function properly, using `PERCPU` maps
- is not an option. The program must be able to match against stored packet
- timestamps regardless of the CPU the packets are processed on, and must also
- have a global view of the flow state in order for the sampling to work
- correctly.
-
- As the BPF programs may run concurrently on different CPU cores accessing these
- global hash maps, this may result in some concurrency issues. In practice, I do
- not believe these will occur particularly often, as I'm under the impression
- that packets from the same flow will typically be processed by the same
- CPU. Furthermore, most of the concurrency issues will not be that problematic
- even if they do occur. For now, I've therefore left these concurrency issues
- unattended, even if some of them could be avoided with atomic operations and/or
- spinlocks, in order to keep things simple and not hurt performance.
-
- The (known) potential concurrency issues are:
-
- #### Tracking last seen identifier
- The tc/egress program keeps track of the last seen outgoing identifier for each
- flow, by storing it in the `flow_state` map. This is done to detect the first
- packet with a new identifier. If multiple packets are processed concurrently,
- several of them could potentially detect themselves as being first with the same
- identifier (which only matters if they also pass the rate-limit check);
- alternatively, if the concurrent packets have different identifiers, there may
- be a lost update (but for TCP timestamps, concurrent packets would typically be
- expected to have the same timestamp).
-
- A possibly more severe issue is out-of-order packets. If a packet with an old
- identifier arrives out of order, that identifier could be detected as a new
- identifier. If for example the following flow of four packets with just two
- different identifiers (id1 and id2) were to occur:
-
- id1 -> id2 -> id1 -> id2
-
- Then the tc/egress program would consider each of these packets to have new
- identifiers and try to create a new timestamp for each of them if the sampling
- strategy allows it. However, even if the sampling strategy allows it, the
- (incorrect) creation of timestamps for id1 and id2 the second time would only be
- successful if the first timestamps for id1 and id2 have already been matched
- against (and thus deleted). Even if that is the case, they would only result in
- reporting an incorrect RTT if there are also new matches against these
- identifiers.
-
- This issue could be avoided entirely by requiring that new-id > old-id instead
- of simply checking that new-id != old-id, as TCP timestamps should monotonically
- increase. That may however not be a suitable solution if/when we add support for
- other types of identifiers.
-
- #### Rate-limiting new timestamps
- In the tc/egress program, packets to timestamp are sampled by using a per-flow
- rate-limit, which is enforced by storing when the last timestamp was created in
- the `flow_state` map. If multiple packets perform this check concurrently, it's
- possible that multiple packets think they are allowed to create timestamps
- before any of them are able to update the `last_timestamp`. When they update
- `last_timestamp` it might also be slightly incorrect; however, if they are
- processed concurrently then they should also generate very similar timestamps.
-
- If the packets have different identifiers (which would typically not be
- expected for concurrent TCP timestamps), then this would allow some packets to
- bypass the rate-limit. By bypassing the rate-limit, the flow would use up some
- additional map space and report some additional RTT(s) (however, the reported
- RTTs should still be correct).
-
- If the packets have the same identifier, they must first have managed to bypass
- the previous check for unique identifiers (see the
- [previous point](#tracking-last-seen-identifier)), and only one of them will be
- able to successfully store a timestamp entry.
-
- #### Matching against stored timestamps
- The XDP/ingress program could potentially match multiple concurrent packets with
- the same identifier against a single timestamp entry in `packet_ts`, before any
- of them manage to delete the timestamp entry. This would result in multiple RTTs
- being reported for the same identifier, but if they are processed concurrently
- these RTTs should be very similar, so this would mainly result in over-reporting
- rather than reporting incorrect RTTs.
-
- #### Updating flow statistics
- Both the tc/egress and XDP/ingress programs will try to update some flow
- statistics each time they successfully parse a packet with an
- identifier. Specifically, they'll update the number of packets and bytes
- sent/received. This is not done in an atomic fashion, so there could potentially
- be some lost updates resulting in an underestimate.
-
- Furthermore, whenever the XDP/ingress program calculates an RTT, it will check
- if this is the lowest RTT seen so far for the flow. If multiple RTTs are
- calculated concurrently, then several could pass this check concurrently and
- there may be a lost update. It should only be possible for multiple RTTs to be
- calculated concurrently if either the [timestamp rate-limit was
- bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match
- against the same timestamp](#matching-against-stored-timestamps).
-
- It's worth noting that with sampling the reported minimum-RTT is only an
- estimate anyway (it may never calculate the RTT for the packet with the true
- minimum RTT). And even without sampling there is some inherent sampling due to
- TCP timestamps only being updated at a limited rate (1000 Hz).

## Similar projects
Passively measuring the RTT for TCP traffic is not a novel concept, and there