  delayed ACKs, reordered or lost packets etc.
- [x] Keep some per-flow state
  - Will likely be needed for the sampling
  - [x] Could potentially include keeping track of average RTT, which
    may be useful for some decisions (ex. how often to sample,
    when entry can be removed etc)
  - [x] Could potentially include keeping track of minimum RTT (as

[...]

    unnecessarily large, which slows down the cleaning and may block
    new entries
- [ ] Use libxdp to load XDP program

## Done
- [x] Clean up commits and add signed-off-by tags

[...]

  so that tools such as [ppviz](https://github.com/pollere/ppviz)
  work for both pping implementations.
- [x] Add timestamps to output (as original pping)
- [x] Add support for other hooks
  - TC-BPF on ingress instead of XDP

# Potential issues
## Limited information in different output formats
The ppviz format is somewhat limited in what information it can
include. One of these limitations is that it does not include any
protocol information, as it was designed with only TCP in mind. When
using PPing with protocols other than TCP, it may therefore not be
possible to distinguish flows of different protocols. PPing will
therefore emit a warning if the ppviz format is used with protocols
other than TCP, but will still allow it.

Another piece of information tracked by PPing which can't be included
in the ppviz format is whether the calculated RTT includes the local
processing delay or not (that is, whether it was timestamped on
ingress and matched on egress instead of being timestamped on egress
and matched on ingress). Currently this information is only included
in the JSON format, but could potentially be added to the standard
format if deemed important.

## Cannot detect end of ICMP "flow"
ICMP is not a flow-based protocol, so there is no signaling that an
ICMP "flow" is about to close. Consequently, there is no way for
PPing to automatically detect that an ICMP flow has stopped and
delete its flow-state entry (and send a timely flow-closing event).

A consequence of this is that ICMP flow entries will stick around and
occupy space in the flow-state map until they are cleaned out by the
periodic cleanup process. The current timeout for inactive flows is a
very generous 5 minutes, which means that many short ICMP flows could
quickly fill up the flow map and potentially block other flows for a
long while.
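
As an illustration, such a periodic cleanup could be done from
userspace roughly as sketched below. The struct definitions and the
`last_timestamp` field are placeholder assumptions for this sketch,
not the actual ePPing types:

```c
#include <bpf/bpf.h>
#include <linux/types.h>
#include <stdbool.h>

/* Placeholder definitions; the real ePPing structs differ */
struct network_tuple { __u32 saddr, daddr; __u16 sport, dport; __u8 proto; };
struct flow_state { __u64 last_timestamp; /* ns of last activity */ };

#define FLOW_TIMEOUT_NS (300ULL * 1000000000ULL) /* 5 minutes */

static void clean_flow_map(int map_fd, __u64 now)
{
	struct network_tuple key, next_key;
	struct flow_state state;
	bool first = true;

	/* Walk all entries and delete the inactive ones. Note that for
	 * hash maps, deleting the key used for iteration can make
	 * bpf_map_get_next_key() restart from the first key, so a real
	 * implementation needs to handle that (e.g. batch the deletes). */
	while (bpf_map_get_next_key(map_fd, first ? NULL : &key,
				    &next_key) == 0) {
		if (bpf_map_lookup_elem(map_fd, &next_key, &state) == 0 &&
		    now - state.last_timestamp > FLOW_TIMEOUT_NS)
			bpf_map_delete_elem(map_fd, &next_key);
		key = next_key;
		first = false;
	}
}
```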

## RTT-based sampling
The RTT-based sampling feature means that timestamp entries may only
be created at an interval proportional to the flow's RTT. This allows
flows with shorter RTTs to get more frequent RTT samples than flows
with long RTTs. However, as the flow's RTT can only be updated based
on the calculated RTT samples, this creates a situation where the
RTT's update rate depends on itself: flows with short RTTs will
update the RTT more often, which in turn affects how often they can
update the RTT.
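
As a rough sketch, the per-flow check could look something like the
following. The field and constant names are assumptions for
illustration (assuming `srtt == 0` means no RTT sample exists yet),
not the exact ePPing code:

```c
/* Fallback interval used before the flow has any RTT sample */
#define DEFAULT_SAMPLE_INTERVAL_NS (10ULL * 1000 * 1000)

/* Allow at most one new timestamp per sRTT for the flow */
static __always_inline bool sampling_allowed(const struct flow_state *f,
					     __u64 now)
{
	__u64 interval = f->srtt ? f->srtt : DEFAULT_SAMPLE_INTERVAL_NS;

	return now - f->last_timestamp >= interval;
}
```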

This mainly becomes problematic if the sampling rate is based on the
sRTT, which may grow. In that case the sRTT will generally be prone
to growing faster than it shrinks: if the flow starts out with a low
RTT it will quickly update to higher RTTs, but once the sRTT is high
it takes longer for it to decrease to a lower RTT again, as new
samples arrive less often.

## Concurrency issues

The program uses "global" (not `PERCPU`) hash maps to keep state. As
the BPF programs need to see the global view to function properly,
using `PERCPU` maps is not an option: the programs must be able to
match stored packet timestamps regardless of which CPU the packets
are processed on, and must likewise share the flow state for the
sampling to work correctly.
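
A sketch of such shared map declarations, using libbpf's BTF-style
syntax (the sizes are assumptions for illustration, and the key and
value structs are assumed to be defined elsewhere):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH); /* global view shared by all CPUs */
	__uint(max_entries, 16384);
	__type(key, struct network_tuple);   /* flow 5-tuple */
	__type(value, struct flow_state);
} flow_state SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct packet_id);       /* flow tuple + identifier */
	__type(value, __u64);                /* timestamp in ns */
} packet_ts SEC(".maps");
```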

As the BPF programs may run concurrently on different CPU cores
accessing these global hash maps, this may result in some concurrency
issues. In practice, I do not believe these will occur particularly
often, as the hash-map entries are per-flow, and I'm under the
impression that packets from the same flow will typically be
processed by the same CPU. Furthermore, most of the concurrency
issues will not be very problematic even if they do occur. For now
I've therefore left these concurrency issues unattended, even though
some of them could be avoided with atomic operations and/or
spinlocks, in order to keep things simple and not hurt performance.

The (known) potential concurrency issues are:

### Tracking last seen identifier
The tc/egress program keeps track of the last seen outgoing
identifier for each flow by storing it in the `flow_state` map. This
is done to detect the first packet with a new identifier. If multiple
packets are processed concurrently, several of them could potentially
detect themselves as being first with the same identifier (which only
matters if they also pass the rate-limit check). Alternatively, if
the concurrent packets have different identifiers there may be a lost
update (although for TCP timestamps, concurrent packets would
typically be expected to carry the same timestamp).

A possibly more severe issue is out-of-order packets. If a packet
with an old identifier arrives out of order, that identifier could be
detected as a new identifier. If, for example, the following flow of
four packets with just two different identifiers (id1 and id2) were
to occur:

id1 -> id2 -> id1 -> id2

Then the tc/egress program would consider each of these packets to
have a new identifier and try to create a new timestamp for each of
them, if the sampling strategy allows it. However, even if the
sampling strategy allows it, the (incorrect) creation of timestamps
for id1 and id2 the second time would only succeed if the first
timestamps for id1 and id2 have already been matched against (and
thus deleted). Even if that is the case, they would only result in
reporting an incorrect RTT if there are also new matches against
these identifiers.

This issue could be avoided entirely by requiring that new-id >
old-id instead of simply checking that new-id != old-id, as TCP
timestamps should increase monotonically. That may, however, not be a
suitable solution for other types of identifiers.
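
A sketch of such a "strictly newer" check for 32-bit TCP timestamps,
using serial-number-style arithmetic (cf. RFC 1982) so that TSval
wraparound is still handled:

```c
/* True if new_id is ahead of old_id on the circular 32-bit space */
static __always_inline bool identifier_is_newer(__u32 new_id, __u32 old_id)
{
	return (__s32)(new_id - old_id) > 0;
}
```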

### Rate-limiting new timestamps
In the tc/egress program, packets to timestamp are sampled by using a
per-flow rate limit, which is enforced by storing in the `flow_state`
map when the last timestamp was created. If multiple packets perform
this check concurrently, it's possible that several of them think
they are allowed to create timestamps before any of them manages to
update `last_timestamp`. When they update `last_timestamp` it might
also be slightly incorrect, although if they are processed
concurrently they should also generate very similar timestamps.

If the packets have different identifiers (which would typically not
be expected for concurrent TCP timestamps), this would allow some
packets to bypass the rate limit. By bypassing the rate limit, the
flow would use up some additional map space and report more RTTs than
expected (although the reported RTTs should still be correct).

If the packets have the same identifier, they must first have managed
to bypass the previous check for unique identifiers (see [previous
point](#tracking-last-seen-identifier)), and only one of them will be
able to successfully store a timestamp entry.
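
If the rate-limit race were deemed worth fixing, the check and update
could be fused with a compare-and-swap (BPF supports such atomics on
newer kernels, roughly v5.12+); a sketch with assumed field names:

```c
/* Only the CPU whose compare-and-swap succeeds may create a new
 * timestamp; concurrent packets that read the same last_timestamp
 * value lose the race and are rate-limited. */
static __always_inline bool allow_timestamp(struct flow_state *f,
					    __u64 now, __u64 interval)
{
	__u64 last = f->last_timestamp;

	if (now - last < interval)
		return false;

	return __sync_val_compare_and_swap(&f->last_timestamp,
					   last, now) == last;
}
```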

### Matching against stored timestamps
The XDP/ingress program could potentially match multiple concurrent
packets with the same identifier against a single timestamp entry in
`packet_ts` before any of them manages to delete the entry. This
would result in multiple RTTs being reported for the same identifier,
but as concurrently processed packets should yield very similar RTTs,
this would mainly cause over-reporting rather than incorrect RTTs.
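
The over-reporting could be avoided by letting the deletion itself
arbitrate, since `bpf_map_delete_elem()` succeeds exactly once per
entry; a sketch (`report_rtt()` is a hypothetical reporting helper):

```c
static __always_inline void match_packet(struct packet_id *pid, __u64 now)
{
	__u64 *tstamp = bpf_map_lookup_elem(&packet_ts, pid);

	if (!tstamp)
		return;

	__u64 rtt = now - *tstamp;

	/* The delete succeeds on exactly one CPU, so concurrent matches
	 * against the same entry report at most one RTT */
	if (bpf_map_delete_elem(&packet_ts, pid) == 0)
		report_rtt(pid, rtt); /* hypothetical helper */
}
```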

### Updating flow statistics
Both the tc/egress and XDP/ingress programs will try to update some
flow statistics each time they successfully parse a packet with an
identifier. Specifically, they'll update the number of packets and
bytes sent/received. This is not done in an atomic fashion, so there
could potentially be some lost updates resulting in an underestimate.
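
These lost updates could be avoided with BPF's built-in atomics, at
some performance cost; a sketch with assumed field names:

```c
static __always_inline void update_flow_stats(struct flow_state *f,
					      __u32 pkt_len)
{
	/* Atomic read-modify-write, so no updates are lost across CPUs */
	__sync_fetch_and_add(&f->packets, 1);
	__sync_fetch_and_add(&f->bytes, pkt_len);
}
```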

Furthermore, whenever the XDP/ingress program calculates an RTT, it
will check if this is the lowest RTT seen so far for the flow. If
multiple RTTs are calculated concurrently, several could pass this
check at the same time and there may be a lost update. It should only
be possible for multiple RTTs to be calculated concurrently in case
either the [timestamp rate-limit was
bypassed](#rate-limiting-new-timestamps) or [multiple packets managed
to match against the same
timestamp](#matching-against-stored-timestamps).
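
If needed, the minimum-RTT update could also be made robust with a
bounded compare-and-swap loop; a sketch, assuming a `__u64` `min_rtt`
field where 0 means "no RTT recorded yet":

```c
static __always_inline void update_min_rtt(struct flow_state *f, __u64 rtt)
{
	__u64 cur = f->min_rtt;

	/* Retry a few times; the bounded loop keeps the verifier happy */
	for (int i = 0; i < 4 && (!cur || rtt < cur); i++) {
		__u64 old = __sync_val_compare_and_swap(&f->min_rtt,
							cur, rtt);
		if (old == cur)
			break; /* our value was stored */
		cur = old;     /* someone else updated first; re-check */
	}
}
```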

It's worth noting that with sampling the reported minimum RTT is only
an estimate anyway (the RTT may never be calculated for the packet
with the true minimum RTT). And even without sampling there is some
inherent sampling due to TCP timestamps only being updated at a
limited rate (1000 Hz).

### Outputting flow opening/closing events
A flow is not considered opened until a reply has been seen for
it. The `flow_state` map keeps track of whether the flow has been
opened or not, which is checked and updated for each reply. This
check and update is not performed atomically, so if multiple replies
are processed concurrently, several of them may think they are the
first and emit multiple flow-opened events.

Likewise, when a flow is closed, the program checks whether the flow
has been opened to determine if a flow-closing message should be
sent. If multiple replies are processed concurrently, it's possible
that one of them updates the flow-open information and emits a
flow-opened message, while another reply closes the flow without
considering it ever opened, and thus no flow-closing message is sent.
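
If this were worth addressing, the first-reply check could likewise
be made atomic; a sketch, assuming a `__u64` `opened` flag in
`flow_state`:

```c
/* Returns true for exactly one reply: the one whose compare-and-swap
 * flips the flag from 0 to 1. Only that reply emits the flow-opened
 * event, and replies closing the flow see a consistent flag. */
static __always_inline bool mark_flow_opened(struct flow_state *f)
{
	return __sync_val_compare_and_swap(&f->opened, 0, 1) == 0;
}
```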