Merge pull request #116 from SpareCores/DEV-334
intro binserve benchmarks
daroczig authored Sep 8, 2024
2 parents b80d5fe + 187ec44 commit 41bc135
Showing 7 changed files with 177 additions and 0 deletions.
177 changes: 177 additions & 0 deletions src/assets/articles/benchmark-static-webserver.md
@@ -0,0 +1,177 @@
---
# ~50 chars
title: "New Benchmarks: Static Web Server Workloads"
date: 2024-08-23
# ~100 character
teaser: "Measuring the performance of a static HTTP server being bombarded with requests for 1 to 512 kb files."
# 320x220
image: /assets/images/blog/thumbnails/static-http-server-benchmarks.webp
image_alt: Hundreds of laptops and PCs connected to a central server, symbolizing an HTTP server benchmark scenario.
author: Gergely Daroczi
tags: [benchmark, performance, score]
---

We recently received feedback on Twitter/X pointing out that comparing
vCPUs across different instance generations doesn't make much sense:

<div class="flex justify-center items-center mt-8 mb-6 w-full">
<a href="https://x.com/sszuecs/status/1825626542216511640"
target="_blank" rel="noopener"
class="max-w-[80%] !no-underline">
<img
title="Tweet stating how useless it is to compare servers based on their vCPU count, along with mentions to example benchmarks."
src="/assets/images/blog/binserve-twitter.webp"/>
</a>
</div>

And we completely agree! In fact, we never advocated for comparing
servers purely based on their specs. That's why we have already
covered [50+ benchmark scores](/article/cloud-compute-performance-benchmarks)
for the ~2000 monitored servers, including a highlighted CPU burning score that
is presented in all our comparison tables, even in the screenshot above.
However, the examples shared in the tweet inspired us to dig deeper:

<blockquote>
<div>
<p style="padding-top:5px; margin-bottom:0px; font-style: italic;">
I tested aws c5.large to c7i.large with redis, almost no gain and I
tested skipper (http proxy for kubernetes ingress) with c6g.large
compared to c7g.large -> 30% less cpu usage same work.
</p>
<p style="padding-bottom:5px; margin-top:10px;">
— Sandor Szücs (@sszuecs) on Aug 19, 2024
</p>
</div>
</blockquote>


We will follow up on the Redis use case, as we have several
database-related benchmarks on our roadmap, but wanted
to quickly react to the HTTP proxy workload.

## Static Web Serving

Probably the most popular web server and reverse proxy nowadays is
`nginx`, which is a fantastic tool with a lot of fancy features, but
provides mediocre performance with the default config, and
measurements highly depend on the actual configuration and
fine-tuning.

To simplify benchmarking, we chose
<a href="https://github.com/mufeedvh/binserve" target="_blank" rel="noopener"><code>binserve</code></a>,
a single-binary, very fast static web server written in Rust.
It scales surprisingly well without any tuning at all, so it probably
measures the general static web serving capabilities of a server much
better than any more complex `nginx` (or other) configuration. It also
stores the static files in memory, so the overhead of
filesystem/storage operations can be neglected.

## HTTP Benchmarking

To measure the performance of the web server, we decided to use
<a href="https://github.com/wg/wrk" target="_blank" rel="noopener"><code>wrk</code></a>,
which is a modern, multi-threaded HTTP benchmarking tool written in C.

We started `wrk` on the same server as `binserve`, and ran it for
10 seconds per run using a matrix of different numbers of client threads (1,
2, 4) and open connections (1, 2, 4, 8, 16, 32) to query small (1 kb,
16 kb, 64 kb) to large files (256 kb, 512 kb), as smaller file sizes
are likely to need more connections to saturate the machine.
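The matrix above can be sketched in a few lines of Python. This is only an illustration, not the actual `benchmark.py`: the base URL and the `/{size}k` file paths are hypothetical, and skipping combinations with fewer connections than threads is our assumption here (`wrk` refuses to start in that case). The `-t`, `-c` and `-d` flags are the standard `wrk` options for threads, connections and duration.

```python
from itertools import product

THREADS = (1, 2, 4)
CONNECTIONS = (1, 2, 4, 8, 16, 32)
FILE_SIZES_KB = (1, 16, 64, 256, 512)
DURATION = "10s"
BASE_URL = "http://localhost:8080"  # hypothetical address of the binserve instance


def wrk_commands():
    """Build one wrk command line per (file size, threads, connections) cell."""
    commands = []
    for size_kb, threads, connections in product(FILE_SIZES_KB, THREADS, CONNECTIONS):
        if connections < threads:
            # Assumption: skip cells that wrk would reject
            # (it needs at least one connection per thread).
            continue
        url = f"{BASE_URL}/{size_kb}k"  # hypothetical file naming
        commands.append(
            ["wrk", "-t", str(threads), "-c", str(connections), "-d", DURATION, url]
        )
    return commands


print(len(wrk_commands()), "benchmark runs in the matrix")
```

Each returned list could be passed to `subprocess.run` as-is, one run after the other, so that the runs do not compete with each other.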

Running both the web server and the HTTP benchmarking tool on the same
server is questionable: although it reduces the network overhead
and constraints, both tools compete for system resources, see e.g.:

<div class="flex justify-center items-center mt-8 mb-6 w-full">
<img
title="Checking top while running the benchmarks, showing roughly 100/70 split between the load of binserve and wrk."
src="/assets/images/blog/binserve-top.webp"/>
</div>

This is quite heavy client-side usage! So running both the server and
the client on the same node is definitely a tradeoff, but as we run
this benchmark in the same way on all instances, we consider
this a fair comparison.

We also recorded the server's and client's time spent executing in
user/system mode, so we can use that ratio for extrapolating the
expected server performance by trying to control for the client
resource usage.
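As a rough illustration of how such user/system times can be captured, the following Python sketch measures the CPU time of a child process via the standard `resource` module (Unix only). This is one possible approach and not necessarily what our `benchmark.py` does; the helper name is made up for this example.

```python
import resource
import subprocess
import sys


def child_cpu_seconds(cmd):
    """Run cmd and return the user + system CPU seconds it consumed.

    RUSAGE_CHILDREN only accounts for children that have terminated and
    been waited for, which subprocess.run takes care of.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return user + system


# Stand-in workload; in a benchmark this would be the wrk command line.
used = child_cpu_seconds([sys.executable, "-c", "sum(range(10**6))"])
print(f"child used {used:.3f} CPU seconds")
```

For a long-running server process that is not a child of the benchmark script, the same numbers would have to be read differently, e.g. from `/proc/<pid>/stat` before and after each run.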

Last methodological note: we did not ingest the benchmark scores of
all individual runs, as e.g. the number of threads used by `wrk` is
not a meaningful technical detail when it comes to evaluating the
static web server performance, so for each connection count and file
size combination, we simply kept the run of the thread count with the
highest RPS.
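The selection described above boils down to a "max per group" operation. A minimal sketch, with a hypothetical `Run` record and made-up numbers (the real ETL lives in the script linked below):

```python
from typing import NamedTuple


class Run(NamedTuple):
    threads: int
    connections: int
    size_kb: int
    rps: float


def best_per_combination(runs):
    """Keep the highest-RPS run for each (connections, file size) pair."""
    best = {}
    for run in runs:
        key = (run.connections, run.size_kb)
        if key not in best or run.rps > best[key].rps:
            best[key] = run
    return best


# Made-up example data, only to show the shape of the input.
runs = [
    Run(threads=1, connections=4, size_kb=1, rps=30_000.0),
    Run(threads=2, connections=4, size_kb=1, rps=41_000.0),
    Run(threads=4, connections=4, size_kb=1, rps=38_500.0),
    Run(threads=1, connections=4, size_kb=512, rps=4_200.0),
]
for key, run in best_per_combination(runs).items():
    print(key, "-> threads:", run.threads, "RPS:", run.rps)
```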

If you are interested in more details, I'd recommend checking the
actual benchmark script hosted in our `benchmark-web` Docker image
(<a href="https://github.com/SpareCores/sc-images/blob/main/images/benchmark-web/benchmark.py" target="_blank" rel="noopener">`benchmark.py`</a>) and the related ETL script
(<a href="https://github.com/SpareCores/sc-crawler/blob/9a49d76ff8379cbcddfbe5b348187c9809f24ecf/src/sc_crawler/inspector.py#L315-L376" target="_blank" rel="noopener">`inspector.py`</a>).

## Results

The original post mentioned a ~30% difference between `c6g.large` and
`c7g.large` when testing `skipper`, so we were excited to check if we
have similar results:

<div class="text-center m-2.5 mt-8 mb-6">
<img class="zoomin w-full"
title="Requests per second when querying binserve on a single connection per vCPU using wrk."
alt="Grouped bar chart showing the Requests per second when querying binserve on a single connection per vCPU using wrk on c6g.large, c7g.large, c5.large, and c7i.large servers at AWS."
src="/assets/images/blog/binserve-compare-plot.webp"/>
<p>Performance of querying binserve on a single connection per vCPU<br />(data collected and visualized by Spare Cores)</p>
</div>

Overall, `c7g.large` is definitely more powerful than `c6g.large`, but
the extra performance varies with a number of factors: for example, the
advantage is only around 12% (45.7k vs 40.9k RPS) when querying small,
1 kb files, while it's almost 40% (6.7k vs 4.8k RPS) when serving much
larger, 512 kb files. Similarly, more open connections show an even
more drastic picture:

<div class="text-center m-2.5 mt-8 mb-6">
<img class="zoomin w-full"
title="Requests per second when querying binserve on 16 connections per vCPU using wrk."
alt="Grouped bar chart showing the Requests per second when querying binserve on 16 connections per vCPU using wrk on c6g.large, c7g.large, c5.large, and c7i.large servers at AWS."
src="/assets/images/blog/binserve-compare-plot-16.webp"/>
<p>Performance of querying binserve on 16 connections per vCPU<br />(data collected and visualized by Spare Cores)</p>
</div>

With small files and 16 open connections, `c7g.large` peaks at over
120k requests per second (note the roughly 3x speed bump compared to the
above numbers): an almost 100% gain over `c6g.large`, and it even
outperforms the `c7i.large` in this specific workload.

So depending on the size of data to be served and the number of
concurrent connections, you might have better options either in the
ARM or x86 instances.
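For reference, the relative gains quoted above for the single-connection case can be reproduced from the reported RPS values with a one-line formula. The function name is ours; the inputs are the rounded numbers from the text (in thousands of requests per second).

```python
def relative_gain_pct(new_rps, old_rps):
    """Percentage gain of new_rps over old_rps."""
    return (new_rps / old_rps - 1.0) * 100.0


# c7g.large vs c6g.large, single connection per vCPU
small_files = relative_gain_pct(45.7, 40.9)  # 1 kb files
large_files = relative_gain_pct(6.7, 4.8)    # 512 kb files

print(f"1 kb files:   {small_files:.1f}% gain")   # 11.7% gain
print(f"512 kb files: {large_files:.1f}% gain")   # 39.6% gain
```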

## Server Performance

Again, the above RPS is **not** what you should expect from `binserve`
when running on the referenced server, since `wrk` consumed some of
the server's resources during the tests.

To this end, we estimated the expected server RPS by extrapolating the
measured RPS: we multiplied it by the ratio of the client's and
server's time spent executing in user/system mode. In other (stats)
terms, we tried to control for the client resource usage:

<div class="text-center m-2.5 mt-8 mb-6">
<img class="zoomin w-full"
title="Extrapolated requests per second when querying binserve on 16 connections per vCPU using wrk."
alt="Grouped bar chart showing the Extrapolated RPS when querying binserve on 16 connections per vCPU using wrk on c6g.large, c7g.large, c5.large, and c7i.large servers at AWS."
src="/assets/images/blog/binserve-compare-plot-16-extrapolated.webp"/>
<p>Extrapolated server performance on 16 connections per vCPU<br />(data collected and visualized by Spare Cores)</p>
</div>
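
One way to read the extrapolation described above is sketched below. Note the assumption: we interpret the ratio as the total CPU time (server plus client) divided by the server's CPU time, i.e. what the server might deliver if it also had the cycles spent by `wrk`. The function name and the example numbers are illustrative only; the authoritative formula is in the linked ETL script.

```python
def extrapolate_rps(measured_rps, server_cpu_s, client_cpu_s):
    """Scale measured RPS as if the client's CPU time went to the server.

    Assumption: ratio = (server + client CPU seconds) / server CPU seconds.
    """
    if server_cpu_s <= 0:
        raise ValueError("server CPU time must be positive")
    return measured_rps * (server_cpu_s + client_cpu_s) / server_cpu_s


# Illustrative numbers: a 100/70 server/client CPU split,
# similar to the `top` screenshot shown earlier.
estimate = extrapolate_rps(measured_rps=100_000, server_cpu_s=10.0, client_cpu_s=7.0)
print(f"extrapolated server RPS: {estimate:,.0f}")
```

This linear scaling ignores effects such as cache contention or kernel networking overhead, so the result is better treated as an optimistic estimate than a guarantee.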


## Further Metrics

For those more interested in throughput rather than the number of
requests per second, we have made both the raw and extrapolated values
available in our server details and server comparison pages. We have also
recorded the average latency as reported by `wrk`, which might be
useful depending on your use case.
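
As a back-of-the-envelope check, payload throughput can also be approximated from RPS and file size. The sketch below assumes that 1 kb means 1024 bytes and ignores HTTP headers, so it will be slightly below the transfer rate reported by `wrk`; the function name is ours.

```python
def payload_mib_per_s(rps, size_kb):
    """Approximate payload throughput in MiB/s, ignoring HTTP headers."""
    return rps * size_kb / 1024.0


# c7g.large, single connection per vCPU (rounded values from the Results section)
print(f"1 kb files:   {payload_mib_per_s(45_700, 1):.1f} MiB/s")
print(f"512 kb files: {payload_mib_per_s(6_700, 512):.1f} MiB/s")
```

So although the request rate drops sharply with larger files, the amount of data served per second is far higher.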
Binary file added src/assets/images/blog/binserve-compare-plot.webp
Binary file added src/assets/images/blog/binserve-top.webp
Binary file added src/assets/images/blog/binserve-twitter.webp