Substring search is one of the most common operations in text processing, and one of the slowest.
StringZilla was designed to supersede LibC and implement those core operations in CPU-friendly manner, using branchless operations, SWAR, and SIMD assembly instructions.
Notably, Rust has a memchr
crate that provides a similar functionality, and it's used in many popular libraries.
This repository provides basic benchmarking scripts for comparing the throughput of stringzilla
and memchr
.
For normal order and reverse order search, over ASCII and UTF8 input data, the following numbers can be expected.
ASCII ⏩ | ASCII ⏪ | UTF8 ⏩ | UTF8 ⏪ | |
---|---|---|---|---|
Intel: | ||||
memchr |
5.89 GB/s | 1.08 GB/s | 8.73 GB/s | 3.35 GB/s |
stringzilla |
8.37 GB/s | 8.21 GB/s | 11.21 GB/s | 11.20 GB/s |
Arm: | ||||
memchr |
6.38 GB/s | 1.12 GB/s | 13.20 GB/s | 3.56 GB/s |
stringzilla |
6.56 GB/s | 5.56 GB/s | 9.41 GB/s | 8.17 GB/s |
Average | 1.2x faster | 6.2x faster | - | 2.8x faster |
For Intel the benchmark was run on AWS
r7iz
instances with Sapphire Rapids cores. For Arm the benchmark was run on AWSr7g
instances with Graviton 3 cores. The ⏩ signifies forward search, and ⏪ signifies reverse order search. At the time of writing, the latest versions ofmemchr
andstringzilla
were used - 2.7.1 and 3.3.0, respectively.
Before running benchmarks, you can test your Rust environment running:
cargo install cargo-criterion --locked
HAYSTACK_PATH=README.md cargo criterion --jobs 8
On Windows using PowerShell you'd need to set the environment variable differently:
$env:HAYSTACK_PATH="README.md"
cargo criterion --jobs 8
As part of the benchmark, the input "haystack" file is whitespace-tokenized into an array of strings. In every benchmark iteration, a new "needle" is taken from that array of tokens. All inclusions of that token in the haystack are counted, and the throughput is calculated. This generally results in very stable and predictable results. The benchmark also includes a warm-up, to ensure that the CPU caches are filled and the results are not affected by cold start or SIMD-related frequency scaling.
For benchmarks on ASCII data I've used the English Leipzig Corpora Collection. It's 124 MB in size, 1'000'000 lines long, and contains 8'388'608 tokens of mean length 5.
wget --no-clobber -O leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt
HAYSTACK_PATH=leipzig1M.txt cargo criterion --jobs 8
For richer mixed UTF data, I've used the XL Sum dataset for multilingual extractive summarization. It's 4.7 GB in size (1.7 GB compressed), 1'004'598 lines long, and contains 268'435'456 tokens of mean length 8. To download, unpack, and run the benchmarks, execute the following bash script in your terminal:
wget --no-clobber -O xlsum.csv.gz https://github.com/ashvardanian/xl-sum/releases/download/v1.0.0/xlsum.csv.gz
gzip -d xlsum.csv.gz
HAYSTACK_PATH=xlsum.csv cargo criterion --jobs 8