An ATM carrier is peculiar in that it encapsulates its payload in so-called cells, with one ATM cell of 53 bytes carrying 48 bytes of payload. Why? According to the "Network Sales and Services Handbook" by Matthew J. Castelli (Cisco Press, 2002), page 135 FAQ:
"The number 53 may be the first time a binary number is not used in networking, so how did ATM cells come to be 53 bytes in length?
During the standardization process, a conflict arose within the CCITT as to the payload size within an ATM cell. The U.S. wanted 64-byte payloads; the Europeans and Japanese wanted 32 payloads. The 48-byte payload plus the 5-byte header was the compromise. The 5-byte header was chosen because 10 percent of payload (5 bytes) was perceived as the upper bound on the acceptable overhead."
That in itself would not be a real problem, except that 10% overhead is quite a bit, given that ATM did not become universal and hence data drags in its own headers and overhead on top of this 10%. Often ATM is only used on the last-mile link, where basically none of the features ATM pays for with its overhead matter at all, so this 10% truly is lost. In theory the ITU ADSL2 standard G.992.3 (2009-04) Annex N allows the use of PTM instead of ATM on such last-mile links, reducing the overhead from 100 - 100*48/53 = 9.4% to 100 - 100*64/65 = 1.5%, but so far it seems this has not been implemented yet...
From a typical user's perspective this results in a 100 - 100*48/53 = 9.4% bandwidth cost purely for the "redundant" ATM overhead. But it gets worse: typically data payload packets are transported using the ATM adaptation layer 5 (AAL5 for short), which adds another 8 bytes at the end of the last ATM cell the payload occupies. Unfortunately this means that each AAL5-encapsulated packet will always occupy an integer number of ATM cells, with the last cell containing from 0 to 47 bytes of padding (well, not really: the padding bytes can actually be spread over the last two cells if, after the end of the payload, there is insufficient space for the AAL5 trailer). The result of this encapsulation method is that the time to transfer data over an ATM link does not simply increase linearly with the size of the data packets; since we always need to send an integer number of cells (and need to look at the last bytes of the last cell), the transfer times increase in steps. The spatial extent of these steps equals the maximum payload per cell, 48 bytes; the "height" of each step is the time required to transfer an additional ATM cell of 53 bytes at a given uni-directional bandwidth.
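To make this quantization concrete, here is a minimal Python sketch of the cell arithmetic (assuming plain AAL5 over ATM with its 8-byte trailer and no additional link-layer encapsulation; real DSL links typically add PPPoE/LLC headers on top, which just shift the steps):

```python
import math

ATM_CELL_SIZE = 53     # bytes on the wire per ATM cell
ATM_CELL_PAYLOAD = 48  # payload bytes per ATM cell
AAL5_TRAILER = 8       # AAL5 trailer appended after the packet

def atm_wire_bytes(packet_size):
    """Bytes actually sent for one AAL5-encapsulated packet: the packet plus
    trailer is padded out to an integer number of 53-byte cells."""
    cells = math.ceil((packet_size + AAL5_TRAILER) / ATM_CELL_PAYLOAD)
    return cells * ATM_CELL_SIZE

for size in (39, 40, 41):  # one extra payload byte can cost a whole extra cell
    print(size, atm_wire_bytes(size))  # -> 53, 53, 106
```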
So the whole idea of the ATM detector is to send a number of packets of all possible sizes in a given range and then see whether the transfer time increases in a step-wise fashion. It turns out that sending an ICMP echo request (colloquially "ping") to a nearby server yields, in the reported round-trip time (RTT), a usable estimate of the round-trip transfer time. Note that the RTT is an aggregate measure of the transfer time from the host to the server and back (plus a tiny bit of processing time on the server). As long as the ATM link is the slowest link in the path (the bottleneck link), this RTT gives a decent estimate of the accumulated delay of the ATM link. Since most ATM data links are provisioned with different incoming and outgoing bandwidths (typically incoming >> outgoing), the RTT does not really allow one to estimate the ATM link's bidirectional bandwidths. But since the ATM/AAL5 encapsulation happens on both "legs" of the link, and the aggregate transfer time is dominated by the slower of the two bandwidths, this luckily does not affect the suitability of ICMP RTTs as probes to measure ATM cell quantization.
Unfortunately, by using (even close-by) servers on the internet and local host computers (instead of just the endpoints of the ATM link) we introduce additional sources of variance into our measurement. Fortunately it seems we can overcome this by simply increasing the number of ICMP probes.
In theory the minimum transfer time (or minimum RTT over many ICMP probes) should be the best estimate for the transfer time; in reality, however, the mean, the geometric mean, and especially the median seem much cleaner.
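For illustration, here is a minimal Python sketch of such a probe sweep using the system ping binary (the target address, payload range, and probe count are arbitrary placeholders, not the detector's actual parameters):

```python
import re
import statistics
import subprocess

def probe_rtts(host, payload_size, count=10):
    """Send `count` ICMP echo requests with `payload_size` data bytes and
    return the reported RTTs in milliseconds."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-s", str(payload_size), host],
        capture_output=True, text=True,
    ).stdout
    return [float(m) for m in re.findall(r"time=([\d.]+)", out)]

for size in range(16, 16 + 2 * 48):        # sweep a bit more than two cells
    rtts = probe_rtts("192.0.2.1", size)   # placeholder address
    if rtts:
        print(size, min(rtts), statistics.median(rtts))
```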
Tl;dr: use the median (or one of the robust means).
The robust mean and the robust geometric mean are just the mean and geometric mean taken after excluding the 10% of the data points at both extremes of the RTT distribution (so in total this removes the most extreme 20%); see the sketch after the examples. Positive example:
Negative example:
Note how the stair fit actually has a negative slope; since that is "impossible" for ATM, this is certainly not showing ATM cell quantization (and how could it, since it was measured over a DOCSIS cable link?).
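For reference, the trimmed ("robust") statistics described above can be computed as in this sketch (the 10% trim fraction per tail follows the description; the sample data is illustrative):

```python
import statistics

def robust_mean(samples, trim=0.10, geometric=False):
    """Mean (or geometric mean) after dropping the most extreme `trim`
    fraction of samples at each end of the distribution."""
    s = sorted(samples)
    k = int(len(s) * trim)                 # samples to drop per tail
    core = s[k:len(s) - k] if k else s
    return statistics.geometric_mean(core) if geometric else statistics.fmean(core)

rtts = [10.1, 10.2, 10.2, 10.3, 10.4, 10.4, 10.5, 12.0, 31.0, 55.0]
print(robust_mean(rtts))                  # robust mean
print(robust_mean(rtts, geometric=True))  # robust geometric mean
```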
For downstream packets, the DOCSIS MAC header consists of the standard 6-byte MAC Header, plus 6 bytes for the DS EHDR and 5 bytes for the Downstream Privacy EH Element. So, 17+14+4=35 bytes of per-packet overhead for a downstream DOCSIS MAC Frame.
For upstream packets, there will typically be only the 6-byte MAC Header, plus the 4 byte Upstream Privacy EH version 2 Element, so 10+14+4=28 bytes of per-packet overhead for an Upstream DOCSIS MAC Frame.
The 14+4 denotes the 14-byte ethernet header (6-byte destination MAC, 6-byte source MAC, 2-byte ethertype) plus the 4-byte ethernet frame check sequence (FCS).
Interestingly, the above DOCSIS observations seem irrelevant for end customers, as the shaper used in DOCSIS systems to limit a user's maximal bandwidth completely ignores the DOCSIS overhead and only counts the ethernet frame including its 4-byte frame check sequence (FCS). To cite the relevant section from the DOCSIS standard (http://www.cablelabs.com/specification/docsis-3-0-mac-and-upper-layer-protocols-interface-specification/):
"C.2.2.7.2 Maximum Sustained Traffic Rate 632 This parameter is the rate parameter R of a token-bucket-based rate limit for packets. R is expressed in bits per second, and MUST take into account all MAC frame data PDU of the Service Flow from the byte following the MAC header HCS to the end of the CRC, including every PDU in the case of a Concatenated MAC Frame. This parameter is applied after Payload Header Suppression; it does not include the bytes suppressed for PHS. The number of bytes forwarded (in bytes) is limited during any time interval T by Max(T), as described in the expression: Max(T) = T * (R / 8) + B, (1) where the parameter B (in bytes) is the Maximum Traffic Burst Configuration Setting (refer to Annex C.2.2.7.3). NOTE: This parameter does not limit the instantaneous rate of the Service Flow. The specific algorithm for enforcing this parameter is not mandated here. Any implementation which satisfies the above equation is conformant. In particular, the granularity of enforcement and the minimum implemented value of this parameter are vendor specific. The CMTS SHOULD support a granularity of at most 100 kbps. The CM SHOULD support a granularity of at most 100 kbps. NOTE: If this parameter is omitted or set to zero, then there is no explicitly-enforced traffic rate maximum. This field specifies only a bound, not a guarantee that this rate is available."
So in essence DOCSIS users only need to account for 18 bytes of ethernet overhead in both ingress and egress directions under non-congested conditions.
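As a minimal sketch of what this accounting means for expected speedtest results (assuming a 1500-byte IPv4 packet and TCP with RFC 1323 timestamps, mirroring the approximate formula at the end of this page):

```python
# The shaper only meters ethernet header + FCS = 14 + 4 = 18 bytes, even
# though full DOCSIS MAC framing adds 35 (downstream) / 28 (upstream) bytes.
DOCSIS_ACCOUNTED_OVERHEAD = 18

def docsis_goodput(shaper_rate, ip_packet=1500, ip_hdr=20, tcp_hdr=32):
    """Expected TCP payload rate behind a DOCSIS shaper of `shaper_rate`."""
    payload = ip_packet - ip_hdr - tcp_hdr
    return shaper_rate * payload / (ip_packet + DOCSIS_ACCOUNTED_OVERHEAD)

print(round(docsis_goodput(100.0), 2))  # ~95.39 on a nominal 100 Mbps plan
```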
Overall GPON encapsulation seems not easily predictable (due to some potentially variable-length content per GTC frame), but at least it seems clear that (G)PON's GPON Encapsulation Method (GEM) will add 5 bytes to each ethernet packet, and will carry most of the ethernet overhead, including the 4-byte Frame Check Sequence (FCS), the 6-byte destination MAC, the 6-byte source MAC, and the 2-byte ethertype (total = 18 bytes), plus potentially one or two 4-byte VLAN tags. It will not include the 7-byte preamble, the 1-byte start of frame delimiter (SFD), and the 12-byte inter frame gap (IFG) (total = 20 bytes). Based on this information, for GPON the overhead on top of the MTU should be >= 18+5 = 23 bytes (18+4+5 = 27 bytes with a single VLAN tag, or 18+4+4+5 = 31 bytes with QinQ double VLAN tags). It is currently unclear to me how to account for the variable-length up- and downstream management messages that are sent in-band with the user data (downstream looks relatively benign, but upstream has too many moving parts).
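The fixed part of that arithmetic is simple enough to write down (a lower bound only; the in-band management traffic mentioned above is deliberately ignored here):

```python
GEM_HEADER = 5        # GEM header per ethernet packet
ETHERNET_IN_GEM = 18  # dst MAC + src MAC + ethertype + FCS, carried inside GEM
VLAN_TAG = 4

def gpon_overhead(vlan_tags=0):
    """Lower-bound per-packet overhead on top of the MTU-sized payload."""
    return GEM_HEADER + ETHERNET_IN_GEM + vlan_tags * VLAN_TAG

print(gpon_overhead(0), gpon_overhead(1), gpon_overhead(2))  # 23 27 31
```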
Measuring overhead precisely is hard, but here is a quick and dirty approach to at least sanity-check a selected per-packet-overhead estimate.
This is from the OpenWrt SQM Details wiki page, but since I wrote it over there, I can copy it over here, where I might edit it further. It assumes one uses the per-packet-overhead as part of competent traffic shaping (which IMHO is the most probable use-case for wanting to deduce the per-packet-overhead in the first place).
Now, the real challenge with the shaper gross rate and the per-packet-overhead is that they are not independent. Say a link has a true gross rate of 100 rate-units, a true per-packet-overhead of 100 bytes (the numbers are unrealistic, but allow for easier math), and a payload size of 1000 bytes; the expected throughput at the ethernet payload level is:
gross-rate * ((payload-size) / (payload-size + per-packet-overhead))
100 * ((1000) / (1000+100)) = 90.91
Now, any combination of gross shaper rate and per-packet-overhead that results in a throughput <= 90.91 will effectively remove bufferbloat (that is not fully correct for downstream shaping, but the logic also holds if we aim for, say, 90% of 90.91 instead).
So in the extreme we can set the per-packet-overhead to 0, as long as we also set the shaper gross speed to 90.91:
90.91 * (1000+0) / (1000) = 90.91
90.91 * ((1000) / (1000+0)) = 90.91
Or the other way around: if we set the per-packet-overhead to an absurd 1000 bytes, we will still see the expected throughput if we also configure the shaper gross rate at 181.82:
90.91 * (1000+1000) / (1000) = 181.82
181.82 * ((1000) / (1000+1000)) = 90.91
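This equivalence is easy to confirm numerically (a small sketch of the throughput formula above, evaluated for the three configurations from this example):

```python
def throughput(gross_rate, payload_size, overhead):
    """Expected payload throughput through a shaper with the given settings."""
    return gross_rate * payload_size / (payload_size + overhead)

# (gross rate, per-packet-overhead): true link, underestimate, overestimate
for rate, oh in ((100.0, 100), (90.91, 0), (181.82, 1000)):
    print(rate, oh, round(throughput(rate, 1000, oh), 2))  # all ~90.91
```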
To sanity-check whether a given combination of gross rate and per-packet-overhead seems sane (say, when there is too little information about the true link properties available to make an educated guess), one needs to repeat speedtests at different packet sizes. The following stanza added to /etc/firewall.user will use OpenWrt's MSS clamping to bidirectionally force the MSS to 216 (as e.g. macOS will not accept smaller MSS values, IIRC):
# special rules to allow MSS clamping for in and outbound traffic
# use ip6tables -t mangle -S ; iptables -t mangle -S to check
forced_MSS=216
# affects both down- and upstream, egress seems to require at least 216
iptables -t mangle -A FORWARD -p tcp -m tcp --tcp-flags SYN,RST SYN -m comment --comment "custom: Zone wan MTU fixing" -j TCPMSS --set-mss ${forced_MSS}
ip6tables -t mangle -A FORWARD -p tcp -m tcp --tcp-flags SYN,RST SYN -m comment --comment "custom6: Zone wan MTU fixing" -j TCPMSS --set-mss ${forced_MSS}
Now, if we plug this into the numbers from above we get (note, the MSS is the TCP payload size, which in the IPv4 case is 40 bytes smaller than the ethernet payload):
100 * ((216+40) / (216+40+100)) = 71.91 # as expected the throughput is smaller, since the fraction of overhead is simply larger
Now, if we underestimated the per-packet-overhead we get:
90.91 * ((216+40) / (216+40+0)) = 90.91
Since 91 >> 72 we will admit too much data into the link and will encounter bufferbloat.
And the reverse error:
181.82 * ((216+40) / (216+40+1000)) = 37.06
Here we do not get bufferbloat (since 37 << 72), but we sacrifice way too much throughput.
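The same small helper from above makes the separation obvious (evaluating the three candidate configurations at the clamped packet size, i.e. MSS 216 plus 40 bytes of TCP/IPv4 headers):

```python
def throughput(gross_rate, payload_size, overhead):
    return gross_rate * payload_size / (payload_size + overhead)

# True link vs. the two mis-estimates, now at the small packet size:
for rate, oh in ((100.0, 100), (90.91, 0), (181.82, 1000)):
    print(rate, oh, round(throughput(rate, 216 + 40, oh), 2))  # 71.91, 90.91, 37.06
```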
So the proposal is to "optimize" shaper gross-rate and per-packet-overhead at the normal MSS value and then measure at a considerably smaller MSS to confirm whether both bufferbloat and throughput are still acceptable.
Please note one additional challenge here: testing a saturating load with small(er) packets results in a considerably higher rate of packets the router needs to process (e.g. if you switch from MSS 1460 to MSS 146 you can expect ~10 times as many packets), and not all routers are capable of saturating a link with small packets. For this test it is therefore essential to confirm that the router does not run out of CPU cycles to process the data, and consequently that the measured throughput is close to the theoretically expected one.
Please note that to compare throughput measured with online speedtests to the theoretical prediction, the following approximate formula can be used:
gross-rate * ((IP-packet-size - IP-header-size - TCP-header-size) / (IP-packet-size + per-packet-overhead))
E.g. for an ethernet link (effectively 38B overhead) with a VLAN tag (4B) and PPPoE (6+2 = 8B), IPv4 (without options: 20B), and TCP (with RFC 1323 timestamps: 20+12 = 32B),
one can expect ~93% throughput:
100 * ((1500 - 8 - 20 - 20 - 12) / (1500 + 38 + 4)) = 93.39
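Wrapped up as a small helper (one way to map this example onto the formula above: count PPPoE's 8 bytes as part of the per-packet overhead, leaving a 1492-byte IP packet inside the PPPoE-reduced MTU; both views give the same 1440/1542 ratio):

```python
def expected_goodput(gross_rate, ip_packet, overhead, ip_hdr=20, tcp_hdr=32):
    """Approximate speedtest result: TCP payload over on-the-wire packet size."""
    return gross_rate * (ip_packet - ip_hdr - tcp_hdr) / (ip_packet + overhead)

# Ethernet 38B + VLAN 4B + PPPoE 8B = 50B overhead on a 1492-byte IP packet:
print(round(expected_goodput(100.0, 1492, 38 + 4 + 8), 2))  # ~93.39
```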