
Commit a042d04

pping: Add section on potential issues to TODO-list
Collect potential issues under a new section in the TODO list. These are
issues I generally don't think are that severe, but may still be useful to
note down and keep in mind. Move the section on potential concurrency issues
from README to the new section in the TODO-list.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
1 parent 3ff729d commit a042d04

File tree

2 files changed, +175 -114 lines changed


pping/README.md

-111
@@ -155,117 +155,6 @@ An example of a (pretty-printed) RTT-event is provided below:
 - **events:** A perf-buffer used by the BPF programs to push flow or RTT events
   to `pping.c`, which continuously polls the map and prints them out.
 
-### A note on concurrency
-The program uses "global" (not `PERCPU`) hash maps to keep state. As the BPF
-programs need to see the global view to function properly, using `PERCPU` maps
-is not an option. The program must be able to match against stored packet
-timestamps regardless of the CPU the packets are processed on, and must also
-have a global view of the flow state in order for the sampling to work
-correctly.
-
-As the BPF programs may run concurrently on different CPU cores accessing these
-global hash maps, this may result in some concurrency issues. In practice, I do
-not believe these will occur particularly often, as I'm under the impression
-that packets from the same flow will typically be processed by the some
-CPU. Furthermore, most of the concurrency issues will not be that problematic
-even if they do occur. For now, I've therefore left these concurrency issues
-unattended, even if some of them could be avoided with atomic operations and/or
-spinlocks, in order to keep things simple and not hurt performance.
-
-The (known) potential concurrency issues are:
-
-#### Tracking last seen identifier
-The tc/egress program keeps track of the last seen outgoing identifier for each
-flow, by storing it in the `flow_state` map. This is done to detect the first
-packet with a new identifier. If multiple packets are processed concurrently,
-several of them could potentially detect themselves as being first with the same
-identifier (which only matters if they also pass rate-limit check as well),
-alternatively if the concurrent packets have different identifiers there may be
-a lost update (but for TCP timestamps, concurrent packets would typically be
-expected to have the same timestamp).
-
-A possibly more severe issue is out-of-order packets. If a packet with an old
-identifier arrives out of order, that identifier could be detected as a new
-identifier. If for example the following flow of four packets with just two
-different identifiers (id1 and id2) were to occur:
-
-id1 -> id2 -> id1 -> id2
-
-Then the tc/egress program would consider each of these packets to have new
-identifiers and try to create a new timestamp for each of them if the sampling
-strategy allows it. However even if the sampling strategy allows it, the
-(incorrect) creation of timestamps for id1 and id2 the second time would only be
-successful in case the first timestamps for id1 and id2 have already been
-matched against (and thus deleted). Even if that is the case, they would only
-result in reporting an incorrect RTT in case there are also new matches against
-these identifiers.
-
-This issue could be avoided entirely by requiring that new-id > old-id instead
-of simply checking that new-id != old-id, as TCP timestamps should monotonically
-increase. That may however not be a suitable solution for other types of
-identifiers.
-
-#### Rate-limiting new timestamps
-In the tc/egress program packets to timestamp are sampled by using a per-flow
-rate-limit, which is enforced by storing when the last timestamp was created in
-the `flow_state` map. If multiple packets perform this check concurrently, it's
-possible that multiple packets think they are allowed to create timestamps
-before any of them are able to update the `last_timestamp`. When they update
-`last_timestamp` it might also be slightly incorrect, however if they are
-processed concurrently then they should also generate very similar timestamps.
-
-If the packets have different identifiers, (which would typically not be
-expected for concurrent TCP timestamps), then this would allow some packets to
-bypass the rate-limit. By bypassing the rate-limit, the flow would use up some
-additional map space and report some additional RTT(s) more than expected
-(however the reported RTTs should still be correct).
-
-If the packets have the same identifier, they must first have managed to bypass
-the previous check for unique identifiers (see [previous
-point](#tracking-last-seen-identifier)), and only one of them will be able to
-successfully store a timestamp entry.
-
-#### Matching against stored timestamps
-The XDP/ingress program could potentially match multiple concurrent packets with
-the same identifier against a single timestamp entry in `packet_ts`, before any
-of them manage to delete the timestamp entry. This would result in multiple RTTs
-being reported for the same identifier, but if they are processed concurrently
-these RTTs should be very similar, so would mainly result in over-reporting
-rather than reporting incorrect RTTs.
-
-#### Updating flow statistics
-Both the tc/egress and XDP/ingress programs will try to update some flow
-statistics each time they successfully parse a packet with an
-identifier. Specifically, they'll update the number of packets and bytes
-sent/received. This is not done in an atomic fashion, so there could potentially
-be some lost updates resulting an underestimate.
-
-Furthermore, whenever the XDP/ingress program calculates an RTT, it will check
-if this is the lowest RTT seen so far for the flow. If multiple RTTs are
-calculated concurrently, then several could pass this check concurrently and
-there may be a lost update. It should only be possible for multiple RTTs to be
-calculated concurrently in case either the [timestamp rate-limit was
-bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match
-against the same timestamp](#matching-against-stored-timestamps).
-
-It's worth noting that with sampling the reported minimum-RTT is only an
-estimate anyways (may never calculate RTT for packet with the true minimum
-RTT). And even without sampling there is some inherent sampling due to TCP
-timestamps only being updated at a limited rate (1000 Hz).
-
-#### Outputting flow opening/closing events
-A flow is not considered opened until a reply has been seen for it. The
-`flow_state` map keeps information about if the flow has been opened or not,
-which is checked and updated for each reply. The check and update of this
-information is not performed atomically, which may result in multiple replies
-thinking they are the first, emitting multiple flow-opened events, in case they
-are processed concurrently.
-
-Likewise, when flows are closed it checks if the flow has been opened to
-determine if a flow closing message should be sent. If multiple replies are
-processed concurrently, it's possible one of them will update the flow-open
-information and emit a flow opening message, but another reply closing the flow
-without thinking it's ever been opened, thus not sending a flow closing message.
 
 ## Similar projects
 Passively measuring the RTT for TCP traffic is not a novel concept, and there

pping/TODO.md

+175 -3
Original file line numberDiff line numberDiff line change
@@ -25,7 +25,7 @@
2525
delayed ACKs, reordered or lost packets etc.
2626
- [x] Keep some per-flow state
2727
- Will likely be needed for the sampling
28-
- [ ] Could potentially include keeping track of average RTT, which
28+
- [x] Could potentially include keeping track of average RTT, which
2929
may be useful for some decisions (ex. how often to sample,
3030
when entry can be removed etc)
3131
- [x] Could potentially include keeping track of minimum RTT (as
@@ -41,8 +41,6 @@
4141
unnecessarily large, which slows down the cleaning and may block
4242
new entries
4343
- [ ] Use libxdp to load XDP program
44-
- [ ] Add support for other hooks
45-
- Ex TC-BFP on ingress instead of XDP?
4644

4745
## Done
4846
- [x] Clean up commits and add signed-off-by tags
@@ -63,3 +61,177 @@
   so that tools such as [ppviz](https://github.com/pollere/ppviz)
   works for both pping implementations.
 - [x] Add timestamps to output (as original pping)
+- [x] Add support for other hooks
+  - TC-BPF on ingress instead of XDP
+
+# Potential issues
+## Limited information in different output formats
+The ppviz format is a bit limited in what information it can
+include. One of these limitations is that it does not include any
+protocol information, as it was designed with only TCP in mind. When
+using PPing with protocols other than TCP, it may therefore not be
+possible to distinguish flows of different protocols. PPing will
+therefore emit a warning if attempting to use the ppviz format with
+protocols other than TCP, but will still allow it.
+
+Another piece of information tracked by PPing which can't be included
+in the ppviz format is whether the calculated RTT includes the local
+processing delay or not (that is, it was timestamped on ingress and
+matched on egress instead of being timestamped on egress and matched
+on ingress). Currently this information is only included in the JSON
+format, but could potentially be added to the standard format if
+deemed important.
+
+## Cannot detect end of ICMP "flow"
+ICMP is not a flow-based protocol, and therefore there is no signaling
+that the ICMP "flow" is about to close. Consequently, there is no way
+for PPing to automatically detect that an ICMP flow has stopped and
+delete its flow-state entry (and send a timely flow closing event).
+
+A consequence of this is that ICMP flow entries will stick around
+and occupy space in the flow state map until they are cleaned out by
+the periodic cleanup process. The current timeout for inactive flows
+is a very generous 5 minutes, which means a lot of short ICMP flows
+could quickly fill up the flow map and potentially block other flows
+for a long while.
+
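As an editorial illustration of the periodic cleanup described above (not code from the pping sources; the struct layout and the `last_seen_ns` field are hypothetical, while the libbpf map calls are real), a user-space sweep over a BPF hash map could look roughly like this:

```c
#include <bpf/bpf.h>
#include <linux/types.h>
#include <stdbool.h>
#include <time.h>

#define INACTIVE_TIMEOUT_NS (300ULL * 1000000000ULL) /* 5 minutes */

struct flow_key { __u8 bytes[32]; };      /* placeholder key layout */
struct flow_info { __u64 last_seen_ns; }; /* placeholder value layout */

static __u64 monotonic_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (__u64)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Walk the map once and delete entries idle for longer than the timeout. */
static void clean_flow_map(int map_fd)
{
	struct flow_key key, next_key;
	struct flow_info info;
	__u64 now = monotonic_ns();
	bool have_key = false;

	while (bpf_map_get_next_key(map_fd, have_key ? &key : NULL,
				    &next_key) == 0) {
		if (bpf_map_lookup_elem(map_fd, &next_key, &info) == 0 &&
		    now - info.last_seen_ns > INACTIVE_TIMEOUT_NS)
			bpf_map_delete_elem(map_fd, &next_key);
		key = next_key;
		have_key = true;
	}
}
```
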
+## RTT-based sampling
+The RTT-based sampling feature means that timestamp entries may only
+be created at an interval proportional to the flow's RTT. This allows
+flows with shorter RTTs to get more frequent RTT samples than flows
+with long RTTs. However, as the flow's RTT can only be updated based
+on the calculated RTT samples, this creates a situation where the
+RTT's update rate depends on itself. Flows with short RTTs will update
+the RTT more often, which in turn affects how often they can update
+the RTT.
+
+This mainly becomes problematic when basing the sampling rate on the
+sRTT, which may grow. In that case the sRTT will generally be prone to
+growing faster than it shrinks: if it starts out low it will quickly
+be updated to higher RTTs, but once it is high it will take longer for
+it to decrease to a lower RTT again.
+
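For illustration, a minimal sketch of an RTT-proportional sampling interval of the kind described above, assuming a hypothetical per-flow state with an sRTT estimate (this is not the actual pping implementation):

```c
#include <stdbool.h>
#include <stdint.h>

struct flow_rate_state {
	uint64_t srtt_ns;           /* smoothed RTT estimate for the flow */
	uint64_t last_timestamp_ns; /* when the last timestamp entry was created */
};

/* Classic EWMA update with alpha = 1/8, as used for TCP's sRTT. */
static void update_srtt(struct flow_rate_state *fs, uint64_t rtt_ns)
{
	if (fs->srtt_ns == 0)
		fs->srtt_ns = rtt_ns;
	else
		fs->srtt_ns = fs->srtt_ns - (fs->srtt_ns >> 3) + (rtt_ns >> 3);
}

/* Allow a new timestamp entry roughly once per (rate_factor * sRTT).
 * Because srtt_ns itself is only updated when an RTT sample is produced,
 * the sampling rate feeds back into itself, as described above. */
static bool allow_new_timestamp(struct flow_rate_state *fs, uint64_t now_ns,
				uint64_t rate_factor)
{
	uint64_t interval = rate_factor * fs->srtt_ns;

	if (now_ns - fs->last_timestamp_ns < interval)
		return false;
	fs->last_timestamp_ns = now_ns;
	return true;
}
```
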
+## Concurrency issues
+
+The program uses "global" (not `PERCPU`) hash maps to keep state. As
+the BPF programs need to see the global view to function properly,
+using `PERCPU` maps is not an option. The program must be able to
+match against stored packet timestamps regardless of the CPU the
+packets are processed on, and must also have a global view of the flow
+state in order for the sampling to work correctly.
+
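As a sketch of what a shared (non-`PERCPU`) hash map looks like in libbpf's BTF-map syntax; the key/value types and size below are placeholders, not the actual pping definitions:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct hypothetical_flow_key { __u64 a, b; };
struct hypothetical_flow_state { __u64 last_timestamp; };

struct {
	__uint(type, BPF_MAP_TYPE_HASH); /* one map shared across all CPUs */
	__uint(max_entries, 16384);
	__type(key, struct hypothetical_flow_key);
	__type(value, struct hypothetical_flow_state);
} flow_state SEC(".maps");

/* A BPF_MAP_TYPE_PERCPU_HASH would avoid cross-CPU races, but each CPU
 * would then only see its own entries, which breaks timestamp matching
 * and sampling as explained above. */
```
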
+As the BPF programs may run concurrently on different CPU cores
+accessing these global hash maps, this may result in some concurrency
+issues. In practice, I do not believe these will occur particularly
+often, as the hash-map entries are per-flow and I'm under the
+impression that packets from the same flow will typically be processed
+by the same CPU. Furthermore, most of the concurrency issues will not
+be that problematic even if they do occur. For now, I've therefore
+left these concurrency issues unattended, even if some of them could
+be avoided with atomic operations and/or spinlocks, in order to keep
+things simple and not hurt performance.
+
+The (known) potential concurrency issues are:
+
+### Tracking last seen identifier
+The tc/egress program keeps track of the last seen outgoing identifier
+for each flow, by storing it in the `flow_state` map. This is done to
+detect the first packet with a new identifier. If multiple packets are
+processed concurrently, several of them could potentially detect
+themselves as being first with the same identifier (which only matters
+if they also pass the rate-limit check). Alternatively, if the
+concurrent packets have different identifiers there may be a lost
+update (but for TCP timestamps, concurrent packets would typically be
+expected to have the same timestamp).
+
+A possibly more severe issue is out-of-order packets. If a packet with
+an old identifier arrives out of order, that identifier could be
+detected as a new identifier. If for example the following flow of
+four packets with just two different identifiers (id1 and id2) were to
+occur:
+
+id1 -> id2 -> id1 -> id2
+
+Then the tc/egress program would consider each of these packets to
+have new identifiers and try to create a new timestamp for each of
+them if the sampling strategy allows it. However, even if the sampling
+strategy allows it, the (incorrect) creation of timestamps for id1 and
+id2 the second time would only be successful in case the first
+timestamps for id1 and id2 have already been matched against (and thus
+deleted). Even if that is the case, they would only result in
+reporting an incorrect RTT in case there are also new matches against
+these identifiers.
+
+This issue could be avoided entirely by requiring that new-id > old-id
+instead of simply checking that new-id != old-id, as TCP timestamps
+should monotonically increase. That may however not be a suitable
+solution for other types of identifiers.
+
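A minimal sketch of the stricter new-id > old-id check suggested above, using wrap-around-safe comparison (the struct and `last_id` field are hypothetical, not the pping code):

```c
#include <stdbool.h>
#include <stdint.h>

struct flow_id_state {
	uint32_t last_id; /* last seen outgoing identifier (e.g. TSval) */
};

/* Returns true if this packet should be treated as carrying a new
 * identifier. The cast implements serial-number ("newer than")
 * comparison, which still works across 32-bit wrap-around. */
static bool is_new_identifier(struct flow_id_state *fs, uint32_t id)
{
	/* A plain "id != fs->last_id" check would also accept an old
	 * identifier arriving out of order: id1 -> id2 -> id1 -> id2
	 * would be seen as four new identifiers. */
	bool newer = (int32_t)(id - fs->last_id) > 0;

	if (newer)
		fs->last_id = id;
	return newer;
}
```
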
+### Rate-limiting new timestamps
+In the tc/egress program, packets to timestamp are sampled using a
+per-flow rate-limit, which is enforced by storing when the last
+timestamp was created in the `flow_state` map. If multiple packets
+perform this check concurrently, it's possible that several of them
+think they are allowed to create timestamps before any of them are
+able to update the `last_timestamp`. When they update `last_timestamp`
+it might also be slightly incorrect; however, if they are processed
+concurrently they should also generate very similar timestamps.
+
+If the packets have different identifiers (which would typically not
+be expected for concurrent TCP timestamps), then this would allow some
+packets to bypass the rate-limit. By bypassing the rate-limit, the
+flow would use up some additional map space and report some additional
+RTT(s) more than expected (however, the reported RTTs should still be
+correct).
+
+If the packets have the same identifier, they must first have managed
+to bypass the previous check for unique identifiers (see [previous
+point](#tracking-last-seen-identifier)), and only one of them will be
+able to successfully store a timestamp entry.
+
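To illustrate the window described above, a sketch of a per-flow rate-limit check in BPF-style C. The `flow_state` value layout shown here is hypothetical; `bpf_map_lookup_elem()` and `bpf_ktime_get_ns()` are real BPF helpers:

```c
#include <linux/bpf.h>
#include <stdbool.h>
#include <bpf/bpf_helpers.h>

struct flow_state_val {
	__u64 last_timestamp;
	/* ... other per-flow fields ... */
};

static __always_inline bool rate_limit_ok(void *flow_state_map,
					  void *flow_key, __u64 limit_ns)
{
	struct flow_state_val *fs;
	__u64 now = bpf_ktime_get_ns();

	fs = bpf_map_lookup_elem(flow_state_map, flow_key);
	if (!fs)
		return false;

	/* Window for the race: two packets on different CPUs can both
	 * read an old last_timestamp here and both pass the check ... */
	if (now - fs->last_timestamp < limit_ns)
		return false;

	/* ... before either of them performs this (non-atomic) update. */
	fs->last_timestamp = now;
	return true;
}
```
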
+### Matching against stored timestamps
+The XDP/ingress program could potentially match multiple concurrent
+packets with the same identifier against a single timestamp entry in
+`packet_ts`, before any of them manage to delete the timestamp
+entry. This would result in multiple RTTs being reported for the same
+identifier, but if they are processed concurrently these RTTs should
+be very similar, so this would mainly result in over-reporting rather
+than reporting incorrect RTTs.
+
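A sketch of the lookup-then-delete window on `packet_ts` described above (hypothetical value layout, real BPF helpers; not the actual pping ingress code):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct packet_ts_val {
	__u64 tx_timestamp;
};

/* Returns the RTT for a matched timestamp entry, or 0 if none found. */
static __always_inline __u64 match_and_clear(void *packet_ts_map,
					     void *ts_key, __u64 now)
{
	struct packet_ts_val *entry;
	__u64 rtt;

	entry = bpf_map_lookup_elem(packet_ts_map, ts_key);
	if (!entry)
		return 0;

	/* Another CPU may also find the same entry here, before the
	 * delete below, and report a second, nearly identical RTT. */
	rtt = now - entry->tx_timestamp;

	bpf_map_delete_elem(packet_ts_map, ts_key);
	return rtt;
}
```
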
+### Updating flow statistics
+Both the tc/egress and XDP/ingress programs will try to update some
+flow statistics each time they successfully parse a packet with an
+identifier. Specifically, they'll update the number of packets and
+bytes sent/received. This is not done in an atomic fashion, so there
+could potentially be some lost updates resulting in an underestimate.
+
+Furthermore, whenever the XDP/ingress program calculates an RTT, it
+will check if this is the lowest RTT seen so far for the flow. If
+multiple RTTs are calculated concurrently, then several could pass
+this check concurrently and there may be a lost update. It should only
+be possible for multiple RTTs to be calculated concurrently in case
+either the [timestamp rate-limit was
+bypassed](#rate-limiting-new-timestamps) or [multiple packets managed
+to match against the same
+timestamp](#matching-against-stored-timestamps).
+
+It's worth noting that with sampling the reported minimum-RTT is only
+an estimate anyway (the RTT may never be calculated for the packet
+with the true minimum RTT). And even without sampling there is some
+inherent sampling due to TCP timestamps only being updated at a
+limited rate (1000 Hz).
+
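For illustration, the difference between the plain read-modify-write counter update described above and an atomic one (field names are hypothetical; `__sync_fetch_and_add()` is supported in BPF programs):

```c
#include <linux/types.h>

struct flow_stats {
	__u64 packets;
	__u64 bytes;
};

static void count_packet_racy(struct flow_stats *st, __u64 pkt_len)
{
	/* Two CPUs doing this concurrently can lose one of the updates. */
	st->packets += 1;
	st->bytes += pkt_len;
}

static void count_packet_atomic(struct flow_stats *st, __u64 pkt_len)
{
	/* GCC/clang builtin atomics, at the cost of some performance. */
	__sync_fetch_and_add(&st->packets, 1);
	__sync_fetch_and_add(&st->bytes, pkt_len);
}
```
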
+### Outputting flow opening/closing events
+A flow is not considered opened until a reply has been seen for
+it. The `flow_state` map keeps information about whether the flow has
+been opened, which is checked and updated for each reply. The check
+and update of this information is not performed atomically, which may
+result in multiple concurrently processed replies each thinking it is
+the first, emitting multiple flow-opened events.
+
+Likewise, when a flow is closed the program checks whether the flow
+has been opened to determine if a flow closing message should be
+sent. If multiple replies are processed concurrently, it's possible
+that one of them updates the flow-open information and emits a flow
+opening message, while another reply closes the flow without
+considering it ever opened, and thus no flow closing message is sent.
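
A sketch of the non-atomic check-and-set on the flow-open flag described above (the `has_opened` field is hypothetical, not the pping code):

```c
#include <stdbool.h>

struct flow_open_state {
	bool has_opened;
};

/* Called for each reply. Two replies on different CPUs can both see
 * has_opened == false and both report a flow-opened event; an atomic
 * exchange (e.g. __sync_lock_test_and_set()) would close that window. */
static bool mark_opened_report_first(struct flow_open_state *fs)
{
	if (!fs->has_opened) {
		fs->has_opened = true;
		return true; /* caller emits the flow-opened event */
	}
	return false;
}
```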
