
Benchmarking CPU usage


This page explains how to measure the CPU usage of quicly serving data at a fixed rate, by describing how the benchmark numbers shown in https://github.com/h2o/quicly/pull/359 were taken.

Setup

       Sender                      Receiver
CPU    Intel Core i5 9400 @ 4GHz   AMD Ryzen 3 3200G @ 3.6GHz
NIC    Intel X550                  Aquantia aqc107

Linux 5.6.5 was used on both endpoints.

To eliminate confusion caused by tasks being distributed differently among multiple CPU cores, all CPU cores other than core zero were disabled on both the sender and the receiver. This was done by setting /sys/devices/system/cpu/cpuX/online to zero for every X other than zero. The CPU clock was pinned to the stated rate on each machine by adjusting /sys/devices/system/cpu/cpu*/cpufreq/scaling_(min|max)_freq.
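As a minimal sketch of that setup, the commands below assume a six-core machine (cpu0 through cpu5) being pinned to 4GHz; the core count and target frequency depend on the hardware.

# disable every core except cpu0
for i in 1 2 3 4 5; do echo 0 | sudo tee /sys/devices/system/cpu/cpu$i/online; done
# pin the remaining core to 4GHz (the sysfs files take the value in kHz)
echo 4000000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
echo 4000000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq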

The link speed of the receiver was capped to 5GbE using ethtool, because the quicly/cli command running as a client does not yet support GRO and therefore cannot keep up with data arriving at 10GbE line rate.
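As a sketch, assuming the receiver-side interface is named enp1s0f1, the link can be forced down with a command like the following; the exact flags accepted depend on the NIC driver.

# advertise only 5GbE so that the link negotiates below 10GbE
sudo ethtool -s enp1s0f1 speed 5000 duplex full autoneg on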

net.core.wmem_(default|max) were set to 2MB. net.core.rmem_(default|max) were set to 8MB.
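One way to apply these settings is via sysctl (values are in bytes):

sudo sysctl -w net.core.wmem_default=2097152 net.core.wmem_max=2097152
sudo sysctl -w net.core.rmem_default=8388608 net.core.rmem_max=8388608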

To change the GSO size used by the quicly/cli command, change #define MAX_BURST_PACKETS 10 in src/cli.c to the desired number of packets.
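For example, to use bursts of 20 packets (20 being an arbitrary illustrative value), the define can be rewritten before rebuilding:

sed -i 's/#define MAX_BURST_PACKETS 10/#define MAX_BURST_PACKETS 20/' src/cli.c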

Running Endpoints

quicly

On the sender side, run cli -k server.key -c server.crt -y aes128gcmsha256 -G 0.0.0.0 4433.

On the receiver side, run cli -M 10000000000 -m 10000000000 -p /20000000000 -O -u 1472 <sender-address> 4433. This instructs the receiver to fetch 20GB of data from the server, using UDP datagrams with a payload size of 1,472 bytes. Connection- and stream-level flow control windows are set to 10GB so that the transfer is not limited by flow control.

Note: quicly running as a server uses the payload size of the first datagram it receives for a given connection as the maximum payload size of the egress packets of that connection. Therefore, setting the UDP payload size of the client Initial packet changes the datagram size that the server uses (for testing jumbo packets, you also need to increase the maximum accepted payload size using the -U option).

picotls

On the receiver side, run cli -k server.key -c server.crt -y aes128gcmsha256 -B 0.0.0.0 4433.

On the sender side, run cli -B <receiver-address> 4433. This instructs the sender to keep pushing data at full speed. Once enough data has been collected, press Ctrl-C to stop the sender.

Changing MTU

At the IP level, the ifconfig command was used to change the MTU. In addition, the -u and -U options of quicly/cli were used to change the initial UDP payload size and the maximum acceptable UDP payload size, respectively.
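As an illustration, assuming the interface is enp1s0f1 and a jumbo MTU of 9,000 bytes is being tested, the IP-level MTU and the client options are adjusted together; 8,972 is 9,000 minus the 28 bytes of IPv4 and UDP headers.

sudo ifconfig enp1s0f1 mtu 9000
# then pass "-u 8972 -U 8972" to the quicly client in addition to the options shown above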

Collecting CPU and network usage

While running the endpoints at full speed, procstat was used to collect the amount of CPU time spent in each state as well as the number of packets and bytes transmitted.

To give an example, below is the output of the procstat command running sleep 1 while monitoring traffic on enp1s0f1.

$ procstat -nic enp1s0f1 sleep 1
user 0 nice 0 sys 0 idle 600 iowait 0 hardirq 0 softirq 0 rxpackets 3 rxbytes 180 txpackets 0 txbytes 0

After running sleep 1, the command reports that 600 ticks (6 cores * 100 ticks per second; one tick is 10ms) were spent in the idle state, while no measurable amount of time was spent in the other categories (user, nice, sys, iowait, hardirq, softirq). The CPU usage of a transport congesting the network is calculated as 1 - idle / (user + nice + sys + idle + iowait + hardirq + softirq).

In this example, the command is also reporting that 3 packets (180 bytes in total) were received on the NIC during that period.
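As a sketch, the calculation above can be applied to a captured line of output with awk; the field positions assume the exact output format shown above, and procstat.out is a hypothetical file holding one such line.

awk '{busy = $2 + $4 + $6 + $10 + $12 + $14; idle = $8;
      printf "CPU usage: %.1f%%\n", 100 * busy / (busy + idle)}' procstat.out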

Measuring the cost of crypto

The endpoints were run under the perf command, and the cost of the relevant functions was collected.
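A minimal sketch of such a run, assuming the process ID of the running cli command is known:

# sample the running endpoint for 10 seconds
sudo perf record -p <pid-of-cli> -- sleep 10
# inspect the per-function breakdown, including the crypto functions
sudo perf report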