Merge pull request #116 from SpareCores/DEV-334
intro binserve benchmarks
Showing 7 changed files with 177 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,177 @@ | ||
--- | ||
# ~50 chars | ||
title: "New Benchmarks: Static Web Server Workloads" | ||
date: 2024-08-23 | ||
# ~100 character | ||
teaser: "Measuring the performance of a static HTTP server being bombarded with requests for 1 to 512 kb files." | ||
# 320x220 | ||
image: /assets/images/blog/thumbnails/static-http-server-benchmarks.webp | ||
image_alt: Hundreds of laptops and PCs connected to a central server, symbolizing an HTTP server benchmark scenario. | ||
author: Gergely Daroczi | ||
tags: [benchmark, performance, score] | ||
--- | ||
|
||
We recently received feedback on Twitter/X pointing out that comparing | ||
vCPUs across different instance generations doesn't make much sense: | ||
|
||
<div class="flex justify-center items-center mt-8 mb-6 w-full"> | ||
<a href="https://x.com/sszuecs/status/1825626542216511640" | ||
target="_blank" rel="noopener" | ||
class="max-w-[80%] !no-underline"> | ||
<img | ||
title="Tweet stating how useless it is to compare servers based on their vCPU count, along with mentions of example benchmarks." | ||
src="/assets/images/blog/binserve-twitter.webp"/> | ||
</a> | ||
</div> | ||
|
||
And we completely agree! In fact, we never advocated for comparing | ||
servers purely based on their specs. That's exactly why we already | ||
cover [50+ benchmark scores](/article/cloud-compute-performance-benchmarks) | ||
for the ~2000 monitored servers, including a highlighted CPU burning score that | ||
is presented in all our comparison tables, even in the screenshot above. | ||
However, the examples shared in the tweet inspired us to dig deeper: | ||
|
||
<blockquote> | ||
<div> | ||
<p style="padding-top:5px; margin-bottom:0px; font-style: italic;"> | ||
I tested aws c5.large to c7i.large with redis, almost no gain and I | ||
tested skipper (http proxy for kubernetes ingress) with c6g.large | ||
compared to c7g.large -> 30% less cpu usage same work. | ||
</p> | ||
<p style="padding-bottom:5px; margin-top:10px;"> | ||
— Sandor Szücs (@sszuecs) on Aug 19, 2024 | ||
</p> | ||
</div> | ||
</blockquote> | ||
|
||
|
||
We will follow up on the redis use case, as we have several | ||
database-related benchmarks on our roadmap, but wanted | ||
to quickly react to the HTTP proxy workload. | ||
|
||
## Static Web Serving | ||
|
||
Probably the most popular web server and reverse proxy nowadays is | ||
`nginx`, which is a fantastic tool with a lot of fancy features, but | ||
it provides mediocre performance with the default config, and | ||
measurements depend heavily on the actual configuration and | ||
fine-tuning. | ||
|
||
To simplify benchmarking, we chose | ||
<a href="https://github.com/mufeedvh/binserve" target="_blank" rel="noopener"><code>binserve</code></a>, | ||
a single-binary, very fast static web server written in Rust. | ||
It scales surprisingly well without any tuning at all, so it is probably | ||
a much better gauge of a server's general static web serving | ||
capabilities than any more complex `nginx` (or other) | ||
configuration. It also stores the static files in memory, so the | ||
overhead of filesystem/storage operations can be neglected. | ||
|
||
## HTTP Benchmarking | ||
|
||
To measure the performance of the web server, we decided to use | ||
<a href="https://github.com/wg/wrk" target="_blank" rel="noopener"><code>wrk</code></a>, | ||
which is a modern, multi-threaded HTTP benchmarking tool written in C. | ||
|
||
We started `wrk` on the same server as `binserve` and ran it for | ||
10 seconds per run using a matrix of different numbers of client threads (1, | ||
2, 4) and open connections (1, 2, 4, 8, 16, 32) to query small (1 kb, | ||
16 kb, 64 kb) to large files (256 kb, 512 kb), as smaller file sizes | ||
are likely to need more connections to saturate the machine. | ||
|
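The benchmark matrix described above can be sketched as follows. This is only an illustration, not the actual `benchmark.py`: the base URL and the file naming scheme are hypothetical, and skipping cells with fewer connections than threads is our assumption (`wrk` refuses to start when the connection count is lower than the thread count). Only the `-t`, `-c`, and `-d` flags of `wrk` are used.

```python
from itertools import product

# Matrix dimensions as listed in the text.
THREADS = (1, 2, 4)
CONNECTIONS = (1, 2, 4, 8, 16, 32)
FILE_SIZES_KB = (1, 16, 64, 256, 512)
DURATION_SECONDS = 10


def wrk_commands(base_url="http://localhost:8080"):
    """Yield one wrk command (as an argument list) per matrix cell.

    The base URL and the "<size>k" file names are hypothetical placeholders.
    """
    for threads, connections, size in product(THREADS, CONNECTIONS, FILE_SIZES_KB):
        # Assumption: skip cells where connections < threads,
        # because wrk rejects such a combination.
        if connections < threads:
            continue
        yield [
            "wrk",
            "-t", str(threads),
            "-c", str(connections),
            "-d", f"{DURATION_SECONDS}s",
            f"{base_url}/{size}k",
        ]


commands = list(wrk_commands())
print(len(commands), "runs, first one:", " ".join(commands[0]))
```

Each argument list could be passed to `subprocess.run` as is, so no shell quoting is needed.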
||
Running both the web server and the HTTP benchmarking tool on the same | ||
server is questionable: although it reduces the network overhead | ||
and constraints, both tools compete for system resources, see e.g.: | ||
|
||
<div class="flex justify-center items-center mt-8 mb-6 w-full"> | ||
<img | ||
title="Checking top while running the benchmarks, showing roughly 100/70 split between the load of binserve and wrk." | ||
src="/assets/images/blog/binserve-top.webp"/> | ||
</div> | ||
|
||
This is quite heavy client-side usage! So running both the server and | ||
the client on the same node is definitely a tradeoff, but since we run | ||
this benchmark in the same way on all the other instances, we consider | ||
this a fair comparison. | ||
|
||
We also recorded the server's and client's time spent executing in | ||
user/system mode, so we can use that ratio for extrapolating the | ||
expected server performance by trying to control for the client | ||
resource usage. | ||
|
||
Last methodological note: we did not ingest the benchmark scores of | ||
all individual runs, as e.g. the number of threads used by `wrk` is | ||
not a meaningful technical detail when it comes to evaluating the | ||
static web server performance. Instead, for each combination of connection | ||
count and file size, we simply kept the run with the thread count that | ||
achieved the highest RPS. | ||
|
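The selection step above boils down to a grouped maximum. Here is a minimal sketch, assuming each run is a dictionary with hypothetical `threads`, `connections`, `size_kb`, and `rps` keys. The sample values are made up for illustration and are not measurements, and the linked ETL script may organize this differently.

```python
def best_per_combination(runs):
    """Keep the highest-RPS run for each (connections, size_kb) pair."""
    best = {}
    for run in runs:
        key = (run["connections"], run["size_kb"])
        # Replace the stored run only when this one has a higher RPS.
        if key not in best or run["rps"] > best[key]["rps"]:
            best[key] = run
    return best


# Made-up sample runs, NOT real benchmark results.
sample_runs = [
    {"threads": 1, "connections": 4, "size_kb": 1, "rps": 1000.0},
    {"threads": 2, "connections": 4, "size_kb": 1, "rps": 1500.0},
    {"threads": 4, "connections": 4, "size_kb": 1, "rps": 1200.0},
    {"threads": 1, "connections": 4, "size_kb": 512, "rps": 300.0},
]

selected = best_per_combination(sample_runs)
for (connections, size_kb), run in sorted(selected.items()):
    print(connections, size_kb, run["threads"], run["rps"])
```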
||
If you are interested in more details, I'd recommend checking the | ||
actual benchmark script hosted in our `benchmark-web` Docker image | ||
(<a href="https://github.com/SpareCores/sc-images/blob/main/images/benchmark-web/benchmark.py" target="_blank" rel="noopener">`benchmark.py`</a>) and the related ETL script | ||
(<a href="https://github.com/SpareCores/sc-crawler/blob/9a49d76ff8379cbcddfbe5b348187c9809f24ecf/src/sc_crawler/inspector.py#L315-L376" target="_blank" rel="noopener">`inspector.py`</a>). | ||
|
||
## Results | ||
|
||
The original post mentioned a ~30% difference between `c6g.large` and | ||
`c7g.large` when testing `skipper`, so we were excited to check if we | ||
have similar results: | ||
|
||
<div class="text-center m-2.5 mt-8 mb-6"> | ||
<img class="zoomin w-full" | ||
title="Requests per second when querying binserve on a single connection per vCPU using wrk." | ||
alt="Grouped bar chart showing the requests per second when querying binserve on a single connection per vCPU using wrk on c6g.large, c7g.large, c5.large, and c7i.large servers at AWS." | ||
src="/assets/images/blog/binserve-compare-plot.webp"/> | ||
<p>Performance of querying binserve on a single connection per vCPU<br />(data collected and visualized by Spare Cores)</p> | ||
</div> | ||
|
||
Overall, `c7g.large` is definitely more powerful than `c6g.large`, but | ||
the extra performance varies with a number of factors: for example, the | ||
advantage is only around 12% (45.7k vs. 40.9k RPS) when querying small, | ||
1 kb files, while it's almost 40% (6.7k vs. 4.8k RPS) when serving much | ||
larger, 512 kb files. Similarly, more open connections show an even | ||
more drastic picture: | ||
|
||
<div class="text-center m-2.5 mt-8 mb-6"> | ||
<img class="zoomin w-full" | ||
title="Requests per second when querying binserve on 16 connections per vCPU using wrk." | ||
alt="Grouped bar chart showing the requests per second when querying binserve on 16 connections per vCPU using wrk on c6g.large, c7g.large, c5.large, and c7i.large servers at AWS." | ||
src="/assets/images/blog/binserve-compare-plot-16.webp"/> | ||
<p>Performance of querying binserve on 16 connections per vCPU<br />(data collected and visualized by Spare Cores)</p> | ||
</div> | ||
|
||
With small files and 16 open connections, `c7g.large` peaks at over | ||
120k requests per second (note the roughly 3x speed bump compared to the | ||
above numbers): an almost 100% gain over `c6g.large`, and it even | ||
outperforms the `c7i.large` in this specific workload. | ||
|
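For reference, the single-connection gains quoted earlier (around 12% and almost 40%) follow from a plain ratio of the rounded RPS values given in the text. The helper name below is ours, and the inputs are in thousands of requests per second:

```python
def relative_gain_percent(new_rps, old_rps):
    """Percentage by which new_rps exceeds old_rps."""
    return (new_rps / old_rps - 1) * 100


# Rounded values from the text, c7g.large vs. c6g.large (thousand RPS).
small_files_gain = relative_gain_percent(45.7, 40.9)  # 1 kb files
large_files_gain = relative_gain_percent(6.7, 4.8)    # 512 kb files

print(f"1 kb files: {small_files_gain:.1f}%")
print(f"512 kb files: {large_files_gain:.1f}%")
```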
||
So depending on the size of data to be served and the number of | ||
concurrent connections, the better option might be among either the | ||
ARM or the x86 instances. | ||
|
||
## Server Performance | ||
|
||
Again, the above RPS is **not** what you should expect from `binserve` | ||
when running on the referenced server, since `wrk` consumed some of | ||
the server's resources during the tests. | ||
|
||
To this end, we estimated the expected server RPS by extrapolating the | ||
measured RPS: we multiplied it by the ratio of the client's and | ||
server's time spent executing in user/system mode. In other (stats) | ||
terms, we tried to control for the client resource usage: | ||
|
||
<div class="text-center m-2.5 mt-8 mb-6"> | ||
<img class="zoomin w-full" | ||
title="Extrapolated requests per second when querying binserve on 16 connections per vCPU using wrk." | ||
alt="Grouped bar chart showing the extrapolated RPS when querying binserve on 16 connections per vCPU using wrk on c6g.large, c7g.large, c5.large, and c7i.large servers at AWS." | ||
src="/assets/images/blog/binserve-compare-plot-16-extrapolated.webp"/> | ||
<p>Extrapolated server performance on 16 connections per vCPU<br />(data collected and visualized by Spare Cores)</p> | ||
</div> | ||
|
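The extrapolation described above can be illustrated with a short sketch. This is one plausible reading of the description and not necessarily the exact formula in the ETL script: we assume the measured RPS is scaled up as if the CPU time consumed by `wrk` had been available to `binserve`. The function name, parameter names, and sample numbers are hypothetical.

```python
def extrapolate_rps(measured_rps, server_cpu_seconds, client_cpu_seconds):
    """Scale the measured RPS by (server + client) / server CPU time.

    Assumption: CPU time is the sum of user and system time, and the
    server would scale linearly if it also got the client's share.
    """
    if server_cpu_seconds <= 0:
        raise ValueError("server CPU time must be positive")
    return measured_rps * (1 + client_cpu_seconds / server_cpu_seconds)


# Made-up example loosely mirroring a 100/70 server/client load split.
estimate = extrapolate_rps(
    measured_rps=100_000, server_cpu_seconds=10.0, client_cpu_seconds=7.0
)
print(f"{estimate:.0f} extrapolated RPS")
```

The linear scaling assumption is a simplification, so such an estimate is best read as a rough upper bound instead of a guaranteed figure.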
||
|
||
## Further Metrics | ||
|
||
For those more interested in throughput rather than the number of | ||
requests per second, we have made both the raw and extrapolated values | ||
available on our server details and server comparison pages. We have also | ||
recorded the average latency as reported by `wrk`, which might be | ||
useful depending on your use case. |