Add basic re2 comparison bench
Both re2/bench and examples/bench read a file of regexes (one per
line) and a file of user agents (one per line), build a filtered-regex
thingie, then for each user agent look for the first matching regex.

They both take a repeat count to go through the list multiple times,
as the sample of ~75k user agents is a bit short to get relevant
data (both bench programs go through it in about half a second). This
also amortises the setup cost against the processing cost: setup is
not irrelevant, but it's probably the least relevant part.

Measurements
============

Sadly, while the Rust API is short and sweet and convenient and C++ is
a bit of a hellscape (I still need to find how to properly parse
flags), it turns out re2 is *much* more efficient than this v0 of
regex-filtered: regex-filtered runs ~40% slower and needs ~40% more
cycles, which tracks (though it only retires ~20% more instructions).

Note that these are runs with 100 iterations in order to get a good
enough sampling and suppress the setup cost, as the matching is what
we really want. This is measured using `time(1)`[^1] on macOS 14.5
(Sonoma) with the `-l` option for expanded metadata:

re2
---

```
       46.99 real        46.87 user         0.02 sys
            53379072  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                4031  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  50  involuntary context switches
        555295294264  instructions retired
        151305788537  cycles elapsed
            46548672  peak memory footprint
```

regex-filtered
--------------

```
       64.95 real        64.67 user         0.02 sys
           145571840  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                9021  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  53  involuntary context switches
        656605947158  instructions retired
        208754382042  cycles elapsed
           142198656  peak memory footprint
```
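
As a sanity check on the percentages quoted above, here is a minimal sketch that recomputes the overheads from the two `time -l` outputs. The helper name `overhead_pct` is made up for illustration; the figures are copied from the measurements. It gives roughly 38% for wall time and cycles and roughly 18% for instructions retired.

```rust
/// Relative overhead of `b` over `a`, in percent.
fn overhead_pct(a: f64, b: f64) -> f64 {
    (b / a - 1.0) * 100.0
}

fn main() {
    // (label, re2, regex-filtered), copied from the `time -l` outputs.
    let rows = [
        ("real time (s)", 46.99, 64.95),
        ("instructions retired", 555_295_294_264.0, 656_605_947_158.0),
        ("cycles elapsed", 151_305_788_537.0, 208_754_382_042.0),
        ("peak memory footprint", 46_548_672.0, 142_198_656.0),
    ];
    for (label, re2, rf) in rows {
        println!("{label:>22}: {:+.0}%", overhead_pct(re2, rf));
    }
}
```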

regex-filtered is significantly slower
======================================

At a glance it's not clear where: initially I was thinking it was in
not reusing the matching atoms between runs, however in regex-filtered
that's an iterator, so there's no difference there. The other
transient allocations during matching should be:

- A vec for the entry ids (misnamed `matched_atom_ids`), preallocated
  in neither.
- A map for the work (`work`) -- this is initialised to an ordered set
  copy of `matched_atom_ids` so regex-filtered actually does that
  directly (though it then expands `matched_atom_ids`'s size to the
  number of entries, which is how `re2` preallocates `work`).
- A map of propagation counts, preallocated to the entries count in
  both.
- A matching regex set (`regexps_map`), preallocated to the total
  number of regex in.
- A `regexps` vector to store and return both unfiltered and filtered
  regex indexes, preallocated to the combination in regex-filtered but
  not in re2, however they should both end up at the same size since
  the combination (`unfiltered` + `regexps_map`) is what both are
  putting in.

OTOH re2 does use a dedicated and bespoke `SparseArray` for much of
the per-match work (`IntMap` is a `SparseArray<int>`), and after
looking closer that turns out to be quite relevant:
`SparseArray` is essentially the same concept as `IndexSet` except it
skips the hashing phase entirely by having a fixed size for the
frontend: there's a sparse array of size `len` which just contains an
index into the dense array, and the dense array keeps the ordering of
items and allows fast iteration.
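
To make the sparse/dense idea concrete, here is a minimal sketch of a sparse *set* over the ids `0..capacity`. It is not re2's actual `SparseArray` (which also stores a value per index and differs in detail); `SparseSet` and its methods are hypothetical names used only for illustration.

```rust
/// Sparse set over the ids `0..capacity`: insertion-ordered, deduplicated,
/// O(1) insert and lookup, no hashing. Panics if `id >= capacity`.
struct SparseSet {
    /// `sparse[id]` is the position of `id` in `dense`, if `id` is present.
    sparse: Vec<usize>,
    /// Inserted ids, in insertion order.
    dense: Vec<usize>,
}

impl SparseSet {
    fn with_capacity(capacity: usize) -> Self {
        Self { sparse: vec![0; capacity], dense: Vec::with_capacity(capacity) }
    }

    fn contains(&self, id: usize) -> bool {
        // A sparse slot is only trusted if the dense array points back at `id`.
        self.dense.get(self.sparse[id]) == Some(&id)
    }

    /// Returns `true` if `id` was newly inserted.
    fn insert(&mut self, id: usize) -> bool {
        if self.contains(id) {
            return false;
        }
        self.sparse[id] = self.dense.len();
        self.dense.push(id);
        true
    }

    /// O(1): stale `sparse` slots are harmless thanks to the check in `contains`.
    fn clear(&mut self) {
        self.dense.clear();
    }

    fn iter(&self) -> impl Iterator<Item = usize> + '_ {
        self.dense.iter().copied()
    }
}

fn main() {
    let mut set = SparseSet::with_capacity(8);
    for id in [5, 2, 5, 7] {
        set.insert(id);
    }
    println!("{:?}", set.iter().collect::<Vec<_>>());
}
```

The trade-off is memory: the sparse side is always `capacity` slots wide, whatever the number of items actually inserted.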

- For `matched_atom_ids`, it's useful to cheaply dedup new entries,
  which need to be kept in order. The code is already using an
  `IndexSet` but with the default hash, which is a huge overhead
  compared to "no hash"; switching to `FxHash` or `NoHash` would
  likely be a major improvement here, short of implementing our own
  `SparseArray`.
- For `count`, the `SparseArray` seems like an unnecessary
  complication altogether: this should just be a `vec` of size
  `entries.len()`, which removes useless overhead and that's it.
  Furthermore *that* could be stack-allocated up to a limit (hello
  tinyvec). This should be a significant gain compared to our current
  hashmap, which I'm realising is *really* sub-par.
- `regexps` would certainly benefit from the same as
  `matched_atom_ids`, especially compared to the current `HashSet`.
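
A rough sketch of the two cheaper options from the list, using only the standard library. `IdentityHasher`, `IdSet`, `dedup_in_order` and `count_hits` are hypothetical names, not regex-filtered's code. The hasher stands in for what a "no hash" hasher does for integer keys, and `IndexSet` takes a `BuildHasher` type parameter the same way `HashSet` does. Whether any of this is actually faster would have to be measured.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Pass-through hasher for `usize` keys: the "hash" is the key itself.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!("only usize keys are supported in this sketch");
    }
    fn write_usize(&mut self, n: usize) {
        self.0 = n as u64;
    }
}

type IdSet = HashSet<usize, BuildHasherDefault<IdentityHasher>>;

/// Deduplicates `ids`, keeping first-seen order (what `IndexSet` gives directly).
fn dedup_in_order(ids: &[usize]) -> Vec<usize> {
    let mut seen = IdSet::default();
    ids.iter().copied().filter(|&id| seen.insert(id)).collect()
}

/// Propagation counts as a dense `Vec` indexed by entry id instead of a map.
fn count_hits(n_entries: usize, hits: &[usize]) -> Vec<u32> {
    let mut count = vec![0u32; n_entries];
    for &id in hits {
        count[id] += 1;
    }
    count
}

fn main() {
    println!("{:?}", dedup_in_order(&[3, 1, 3, 2, 1]));
    println!("{:?}", count_hits(4, &[1, 1, 3]));
}
```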

regex-filtered needs 3x the memory
====================================

regex-filtered also needs a lot more memory (3x the original). This is
mostly though not exclusively in the setup phase:

re2 has a peak rss of:

-  42860544 with 0 iterations
-  54083584 with 1 iteration
-  53166080 with 10 iterations (there's some variation between runs)

regex-filtered has a peak rss of:

- 110968832 with 0 iterations
- 144326656 with 1 iteration
- 144965632 with 10 iterations

So after the first iteration both are mostly stable, and both grow by
~30% between the setup and the first iteration.
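
The growth figures can be rechecked from the peak RSS numbers listed above. This is only a small sketch: `ratio` is a made-up helper and the values are copied from the two lists.

```rust
/// How many times larger `b` is than `a`.
fn ratio(a: u64, b: u64) -> f64 {
    b as f64 / a as f64
}

fn main() {
    // Peak RSS in bytes as (0 iterations, 1 iteration).
    let re2 = (42_860_544u64, 54_083_584u64);
    let rf = (110_968_832u64, 144_326_656u64);

    println!("re2 setup -> first iteration:            x{:.2}", ratio(re2.0, re2.1));
    println!("regex-filtered setup -> first iteration: x{:.2}", ratio(rf.0, rf.1));
    println!("regex-filtered / re2 after setup:        x{:.2}", ratio(re2.0, rf.0));
    println!("regex-filtered / re2 after 1 iteration:  x{:.2}", ratio(re2.1, rf.1));
}
```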

That... might be in large part because regex-filtered uses `usize`
while re2 works off of `int` indices. The system works almost
entirely off of indices (of entries, of atoms, of regexes), which
would translate to a ~2x growth; the actual difference is 2.5x, so
there's clearly additional space being lost somewhere. The rest might
be `regex::Regex` being larger than `re2::RE2` (to investigate).
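
To illustrate the index-width argument, here is a small sketch comparing the element storage of an index table with `usize` and with `u32` entries. It assumes a 64-bit target, where `usize` is 8 bytes, and treats C++ `int` as 32 bits, which holds on the common platforms. `table_bytes` and the table length are hypothetical.

```rust
use std::mem::size_of;

/// Bytes needed for the elements of an index table holding `len` entries of type `T`.
fn table_bytes<T>(len: usize) -> usize {
    len * size_of::<T>()
}

fn main() {
    let len = 1_000_000; // hypothetical number of indices
    println!("usize indices: {} bytes", table_bytes::<usize>(len));
    println!("u32 indices:   {} bytes", table_bytes::<u32>(len));
    println!(
        "growth factor: x{}",
        size_of::<usize>() / size_of::<u32>()
    );
}
```

Narrowing the indices to `u32` would cap the number of entries, atoms and regexes at about 4 billion each, which is the same kind of bound `int` already imposes on re2.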

[^1] the difference is nowhere fine enough that we need something else
     to investigate it
masklinn committed Jun 22, 2024
1 parent b2b16b2 commit 1af15c1
Showing 8 changed files with 75,981 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,2 +1,6 @@
/target
Cargo.lock
.DS_Store
*.dSYM/
regex-filtered/re2/flake.lock
regex-filtered/re2/bench
3 changes: 3 additions & 0 deletions Cargo.toml
@@ -1,3 +1,6 @@
[workspace]
members = ["regex-filtered", "ua-parser"]
resolver = "2"

[profile.release]
debug = true
50 changes: 50 additions & 0 deletions regex-filtered/examples/bench.rs
@@ -0,0 +1,50 @@
use clap::Parser;
use std::path::PathBuf;
use std::io::{BufRead, BufReader};

#[derive(Parser)]
struct Args {
    /// regexes file (one per line)
    regexes: PathBuf,
    /// user agents (one per line)
    user_agents: PathBuf,
    #[arg(short, long, default_value_t = 1)]
    repetitions: usize,
    #[arg(short, long, default_value_t = false)]
    quiet: bool,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let Args {regexes, user_agents, repetitions, quiet} = Args::parse();

    let start = std::time::Instant::now();
    let regexes = BufReader::new(std::fs::File::open(regexes)?)
        .lines()
        .collect::<Result<Vec<String>, _>>()?;

    let f = regex_filtered::Builder::new()
        .push_all(&regexes)?
        .build()?;
    eprintln!("{} regexes in {}s", regexes.len(), start.elapsed().as_secs_f32());

    let start = std::time::Instant::now();
    let user_agents = BufReader::new(std::fs::File::open(user_agents)?)
        .lines()
        .collect::<Result<Vec<String>, _>>()?;
    eprintln!("{} user agents in {}s", user_agents.len(), start.elapsed().as_secs_f32());

    for _ in 0..repetitions {
        for ua in user_agents.iter() {
            let n = f.matching(ua).next();
            if !quiet {
                if let Some((n, _)) = n {
                    println!("{n:3}");
                } else {
                    println!();
                }
            }
        }
    }

    Ok(())
}
10 changes: 10 additions & 0 deletions regex-filtered/re2/Makefile
@@ -0,0 +1,10 @@
CXXFLAGS += -std=c++20 -Wall -Werror -g -fPIC -O3
LDFLAGS += -lre2

.PHONY: clean

bench: bench.cpp
	$(CXX) $(CXXFLAGS) $(LDFLAGS) $^ -o $@

clean:
	@rm bench
104 changes: 104 additions & 0 deletions regex-filtered/re2/bench.cpp
@@ -0,0 +1,104 @@
#include <chrono>
#include <fstream>
#include <iostream>
#include <re2/filtered_re2.h>
#include <re2/re2.h>
#include <re2/set.h>

using namespace std::chrono;
using namespace std::literals;

template<typename T>
std::ostream& operator<< (std::ostream& out, const std::vector<T>& v) {
    out << "[";
    bool first = true;
    for (T const &t: v) {
        if (first) {
            first = false;
        } else {
            out << ", ";
        }
        out << t;
    }
    out << "]";
    return out;
}

int main(const int argc, const char* argv[]) {
    if (argc < 4) {
        std::cerr << "error: ./bench regexes user_agents repetitions [quiet]" << std::endl;
        return 1;
    }
    bool quiet = argc == 5;

    std::ifstream regexes_f(argv[1]);

    re2::RE2::Options opt;
    re2::FilteredRE2 f(3);
    int id;

    std::string line;

    auto start = steady_clock::now();
    while(std::getline(regexes_f, line)) {
        re2::RE2::ErrorCode c;
        if((c = f.Add(line, opt, &id))) {
            std::cerr << "invalid regex " << line << std::endl;
            return 1;
        }
    }
    std::vector<std::string> to_match;
    f.Compile(&to_match);
    std::chrono::duration<float> diff = steady_clock::now() - start;
    std::cerr << f.NumRegexps() << " regexes "
              << to_match.size() << " atoms"
              << " in " << diff.count() << "s"
              << std::endl;

    opt.set_literal(true);
    opt.set_case_sensitive(false);
    start = steady_clock::now();
    re2::RE2::Set s(opt, RE2::UNANCHORED);
    for(auto const &atom: to_match) {
        // can't fail since literals
        assert(s.Add(atom, NULL) != -1);
    }
    assert(s.Compile());
    diff = steady_clock::now() - start;
    std::cerr << "\tprefilter built in " << diff.count() << "s" << std::endl;

    start = steady_clock::now();
    std::vector<std::string> user_agents;
    std::ifstream user_agents_f(argv[2]);
    while(std::getline(user_agents_f, line)) {
        user_agents.push_back(line);
    }
    diff = steady_clock::now() - start;
    std::cerr << user_agents.size()
              << " user agents in "
              << diff.count() << "s"
              << std::endl;

    int repetitions = std::stoi(argv[3]);
    std::vector<int> matching;
    for(int x = 0; x < repetitions; ++x) {
        for(size_t i = 0; i < user_agents.size(); ++i) {
            auto& ua = user_agents[i];
            matching.clear();
            int n = s.Match(ua, &matching);
            if (n) {
                n = f.FirstMatch(ua, matching);
            } else {
                n = -1;
            }
            if (!quiet) {
                if (n != -1) {
                    std::cout << std::setw(3) << n;
                }
                std::cout << std::endl;
            }
        }
    }

    return 0;
}
19 changes: 19 additions & 0 deletions regex-filtered/re2/flake.nix
@@ -0,0 +1,19 @@
{
  description = "C++ re2 bench";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
  };

  outputs = inputs@{ flake-parts, ... }:
    flake-parts.lib.mkFlake { inherit inputs; } {
      systems = [ "x86_64-linux" "aarch64-linux" "aarch64-darwin" "x86_64-darwin" ];
      perSystem = { config, self', inputs', pkgs, system, ... }: {
        devShells.default = pkgs.mkShell {
          packages = with pkgs; [
            re2
          ];
        };
      };
    };
}