  delayed ACKs, reordered or lost packets etc.
- [x] Keep some per-flow state
  - Will likely be needed for the sampling
  - [x] Could potentially include keeping track of average RTT, which
    may be useful for some decisions (ex. how often to sample,
    when entry can be removed etc)
  - [x] Could potentially include keeping track of minimum RTT (as

[...]

    unnecessarily large, which slows down the cleaning and may block
    new entries
- [ ] Use libxdp to load XDP program

## Done
- [x] Clean up commits and add signed-off-by tags

[...]

  so that tools such as [ppviz](https://github.com/pollere/ppviz)
  work for both pping implementations.
- [x] Add timestamps to output (as original pping)
- [x] Add support for other hooks
  - TC-BPF on ingress instead of XDP

# Potential issues
## Limited information in different output formats
The ppviz format is somewhat limited in what information it can
include. One of these limitations is that it does not include any
protocol information, as it was designed with only TCP in mind. When
using PPing with protocols other than TCP, it may therefore not be
possible to distinguish flows of different protocols. PPing will
therefore emit a warning if the ppviz format is used with protocols
other than TCP, but will still allow it.

Another piece of information tracked by PPing which can't be included
in the ppviz format is whether the calculated RTT includes the local
processing delay or not (that is, whether it was timestamped on
ingress and matched on egress instead of being timestamped on egress
and matched on ingress). Currently this information is only included
in the JSON format, but could potentially be added to the standard
format if deemed important.

## Cannot detect end of ICMP "flow"
ICMP is not a flow-based protocol, so there is no signaling that an
ICMP "flow" is about to close. Consequently, there is no way for
PPing to automatically detect that an ICMP flow has stopped and
delete its flow-state entry (and send a timely flow-closing event).

A consequence of this is that ICMP flow entries will stick around and
occupy space in the flow-state map until they are cleaned out by the
periodic cleanup process. The current timeout for inactive flows is a
very generous 5 minutes, which means that many short ICMP flows could
quickly fill up the flow map and potentially block other flows for a
long while.
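
As an illustration, such a periodic cleanup could be done from
userspace roughly as sketched below. The struct definitions and the
`last_timestamp` field are placeholder assumptions for this sketch,
not the actual ePPing types:

```c
#include <bpf/bpf.h>
#include <linux/types.h>
#include <stdbool.h>

/* Placeholder definitions; the real ePPing structs differ */
struct network_tuple { __u32 saddr, daddr; __u16 sport, dport; __u8 proto; };
struct flow_state { __u64 last_timestamp; /* ns of last activity */ };

#define FLOW_TIMEOUT_NS (300ULL * 1000000000ULL) /* 5 minutes */

static void clean_flow_map(int map_fd, __u64 now)
{
	struct network_tuple key, next_key;
	struct flow_state state;
	bool first = true;

	/* Walk all entries and delete the inactive ones. Note that for
	 * hash maps, deleting the key used for iteration can make
	 * bpf_map_get_next_key() restart from the first key, so a real
	 * implementation needs to handle that (e.g. batch the deletes). */
	while (bpf_map_get_next_key(map_fd, first ? NULL : &key,
				    &next_key) == 0) {
		if (bpf_map_lookup_elem(map_fd, &next_key, &state) == 0 &&
		    now - state.last_timestamp > FLOW_TIMEOUT_NS)
			bpf_map_delete_elem(map_fd, &next_key);
		key = next_key;
		first = false;
	}
}
```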

## RTT-based sampling
The RTT-based sampling feature means that timestamp entries may only
be created at an interval proportional to the flow's RTT. This allows
flows with shorter RTTs to get more frequent RTT samples than flows
with long RTTs. However, as the flow's RTT can only be updated based
on the calculated RTT samples, this creates a situation where the
RTT's update rate depends on itself: flows with short RTTs will
update the RTT more often, which in turn affects how often they can
update the RTT.
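
As a rough sketch, the per-flow check could look something like the
following. The field and constant names are assumptions for
illustration (assuming `srtt == 0` means no RTT sample exists yet),
not the exact ePPing code:

```c
/* Fallback interval used before the flow has any RTT sample */
#define DEFAULT_SAMPLE_INTERVAL_NS (10ULL * 1000 * 1000)

/* Allow at most one new timestamp per sRTT for the flow */
static __always_inline bool sampling_allowed(const struct flow_state *f,
					     __u64 now)
{
	__u64 interval = f->srtt ? f->srtt : DEFAULT_SAMPLE_INTERVAL_NS;

	return now - f->last_timestamp >= interval;
}
```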

This mainly becomes problematic if the sampling rate is based on the
sRTT, which may grow. In that case the sRTT will generally be prone
to growing faster than it shrinks: if the flow starts out with a low
RTT it will quickly update to higher RTTs, but once the sRTT is high
it takes longer for it to decrease to a lower RTT again, as new
samples arrive less often.

## Concurrency issues

The program uses "global" (not `PERCPU`) hash maps to keep state. As
the BPF programs need to see the global view to function properly,
using `PERCPU` maps is not an option: the programs must be able to
match stored packet timestamps regardless of which CPU the packets
are processed on, and must likewise share the flow state for the
sampling to work correctly.
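
A sketch of such shared map declarations, using libbpf's BTF-style
syntax (the sizes are assumptions for illustration, and the key and
value structs are assumed to be defined elsewhere):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH); /* global view shared by all CPUs */
	__uint(max_entries, 16384);
	__type(key, struct network_tuple);   /* flow 5-tuple */
	__type(value, struct flow_state);
} flow_state SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct packet_id);       /* flow tuple + identifier */
	__type(value, __u64);                /* timestamp in ns */
} packet_ts SEC(".maps");
```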

As the BPF programs may run concurrently on different CPU cores
accessing these global hash maps, this may result in some concurrency
issues. In practice, I do not believe these will occur particularly
often, as the hash-map entries are per-flow, and I'm under the
impression that packets from the same flow will typically be
processed by the same CPU. Furthermore, most of the concurrency
issues will not be very problematic even if they do occur. For now
I've therefore left these concurrency issues unattended, even though
some of them could be avoided with atomic operations and/or
spinlocks, in order to keep things simple and not hurt performance.

The (known) potential concurrency issues are:

### Tracking last seen identifier
The tc/egress program keeps track of the last seen outgoing
identifier for each flow by storing it in the `flow_state` map. This
is done to detect the first packet with a new identifier. If multiple
packets are processed concurrently, several of them could potentially
detect themselves as being first with the same identifier (which only
matters if they also pass the rate-limit check). Alternatively, if
the concurrent packets have different identifiers there may be a lost
update (although for TCP timestamps, concurrent packets would
typically be expected to carry the same timestamp).

A possibly more severe issue is out-of-order packets. If a packet
with an old identifier arrives out of order, that identifier could be
detected as a new identifier. If, for example, the following flow of
four packets with just two different identifiers (id1 and id2) were
to occur:

id1 -> id2 -> id1 -> id2

Then the tc/egress program would consider each of these packets to
have a new identifier and try to create a new timestamp for each of
them, if the sampling strategy allows it. However, even if the
sampling strategy allows it, the (incorrect) creation of timestamps
for id1 and id2 the second time would only succeed if the first
timestamps for id1 and id2 have already been matched against (and
thus deleted). Even if that is the case, they would only result in
reporting an incorrect RTT if there are also new matches against
these identifiers.

This issue could be avoided entirely by requiring that new-id >
old-id instead of simply checking that new-id != old-id, as TCP
timestamps should increase monotonically. That may, however, not be a
suitable solution for other types of identifiers.
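
A sketch of such a "strictly newer" check for 32-bit TCP timestamps,
using serial-number-style arithmetic (cf. RFC 1982) so that TSval
wraparound is still handled:

```c
/* True if new_id is ahead of old_id on the circular 32-bit space */
static __always_inline bool identifier_is_newer(__u32 new_id, __u32 old_id)
{
	return (__s32)(new_id - old_id) > 0;
}
```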

### Rate-limiting new timestamps
In the tc/egress program, packets to timestamp are sampled by using a
per-flow rate limit, which is enforced by storing in the `flow_state`
map when the last timestamp was created. If multiple packets perform
this check concurrently, it's possible that several of them think
they are allowed to create timestamps before any of them manages to
update `last_timestamp`. When they update `last_timestamp` it might
also be slightly incorrect, although if they are processed
concurrently they should also generate very similar timestamps.

If the packets have different identifiers (which would typically not
be expected for concurrent TCP timestamps), this would allow some
packets to bypass the rate limit. By bypassing the rate limit, the
flow would use up some additional map space and report more RTTs than
expected (although the reported RTTs should still be correct).

If the packets have the same identifier, they must first have managed
to bypass the previous check for unique identifiers (see [previous
point](#tracking-last-seen-identifier)), and only one of them will be
able to successfully store a timestamp entry.
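
If the rate-limit race were deemed worth fixing, the check and update
could be fused with a compare-and-swap (BPF supports such atomics on
newer kernels, roughly v5.12+); a sketch with assumed field names:

```c
/* Only the CPU whose compare-and-swap succeeds may create a new
 * timestamp; concurrent packets that read the same last_timestamp
 * value lose the race and are rate-limited. */
static __always_inline bool allow_timestamp(struct flow_state *f,
					    __u64 now, __u64 interval)
{
	__u64 last = f->last_timestamp;

	if (now - last < interval)
		return false;

	return __sync_val_compare_and_swap(&f->last_timestamp,
					   last, now) == last;
}
```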

### Matching against stored timestamps
The XDP/ingress program could potentially match multiple concurrent
packets with the same identifier against a single timestamp entry in
`packet_ts` before any of them manages to delete the entry. This
would result in multiple RTTs being reported for the same identifier,
but as concurrently processed packets should yield very similar RTTs,
this would mainly cause over-reporting rather than incorrect RTTs.
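
The over-reporting could be avoided by letting the deletion itself
arbitrate, since `bpf_map_delete_elem()` succeeds exactly once per
entry; a sketch (`report_rtt()` is a hypothetical reporting helper):

```c
static __always_inline void match_packet(struct packet_id *pid, __u64 now)
{
	__u64 *tstamp = bpf_map_lookup_elem(&packet_ts, pid);

	if (!tstamp)
		return;

	__u64 rtt = now - *tstamp;

	/* The delete succeeds on exactly one CPU, so concurrent matches
	 * against the same entry report at most one RTT */
	if (bpf_map_delete_elem(&packet_ts, pid) == 0)
		report_rtt(pid, rtt); /* hypothetical helper */
}
```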

### Updating flow statistics
Both the tc/egress and XDP/ingress programs will try to update some
flow statistics each time they successfully parse a packet with an
identifier. Specifically, they'll update the number of packets and
bytes sent/received. This is not done in an atomic fashion, so there
could potentially be some lost updates resulting in an underestimate.
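
These lost updates could be avoided with BPF's built-in atomics, at
some performance cost; a sketch with assumed field names:

```c
static __always_inline void update_flow_stats(struct flow_state *f,
					      __u32 pkt_len)
{
	/* Atomic read-modify-write, so no updates are lost across CPUs */
	__sync_fetch_and_add(&f->packets, 1);
	__sync_fetch_and_add(&f->bytes, pkt_len);
}
```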

Furthermore, whenever the XDP/ingress program calculates an RTT, it
will check if this is the lowest RTT seen so far for the flow. If
multiple RTTs are calculated concurrently, several could pass this
check at the same time and there may be a lost update. It should only
be possible for multiple RTTs to be calculated concurrently in case
either the [timestamp rate-limit was
bypassed](#rate-limiting-new-timestamps) or [multiple packets managed
to match against the same
timestamp](#matching-against-stored-timestamps).
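
If needed, the minimum-RTT update could also be made robust with a
bounded compare-and-swap loop; a sketch, assuming a `__u64` `min_rtt`
field where 0 means "no RTT recorded yet":

```c
static __always_inline void update_min_rtt(struct flow_state *f, __u64 rtt)
{
	__u64 cur = f->min_rtt;

	/* Retry a few times; the bounded loop keeps the verifier happy */
	for (int i = 0; i < 4 && (!cur || rtt < cur); i++) {
		__u64 old = __sync_val_compare_and_swap(&f->min_rtt,
							cur, rtt);
		if (old == cur)
			break; /* our value was stored */
		cur = old;     /* someone else updated first; re-check */
	}
}
```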

It's worth noting that with sampling the reported minimum RTT is only
an estimate anyway (the RTT may never be calculated for the packet
with the true minimum RTT). And even without sampling there is some
inherent sampling due to TCP timestamps only being updated at a
limited rate (1000 Hz).

### Outputting flow opening/closing events
A flow is not considered opened until a reply has been seen for
it. The `flow_state` map keeps track of whether the flow has been
opened or not, which is checked and updated for each reply. This
check and update is not performed atomically, so if multiple replies
are processed concurrently, several of them may think they are the
first and emit multiple flow-opened events.

Likewise, when a flow is closed, the program checks whether the flow
has been opened to determine if a flow-closing message should be
sent. If multiple replies are processed concurrently, it's possible
that one of them updates the flow-open information and emits a
flow-opened message, while another reply closes the flow without
considering it ever opened, and thus no flow-closing message is sent.
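
If this were worth addressing, the first-reply check could likewise
be made atomic; a sketch, assuming a `__u64` `opened` flag in
`flow_state`:

```c
/* Returns true for exactly one reply: the one whose compare-and-swap
 * flips the flag from 0 to 1. Only that reply emits the flow-opened
 * event, and replies closing the flow see a consistent flag. */
static __always_inline bool mark_flow_opened(struct flow_state *f)
{
	return __sync_val_compare_and_swap(&f->opened, 0, 1) == 0;
}
```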