Both re2/bench and examples/bench read a file of regexes (one per line) and a file of user agents (one per line), build a filtered-regex matcher, then for each user agent look for the first matching regex. They both take a repeat count to go through the list multiple times, as the sample of ~75k user agents is a bit short to get relevant data (both bench programs go through it in about half a second). This also amortises the setup cost relative to the processing cost; not that the setup cost is irrelevant, but it's probably the least relevant part.

Measurements
============

Sadly, while the Rust API is short, sweet, and convenient and the C++ one is a bit of a hellscape (I still need to figure out how to properly parse flags), it turns out re2 is *much* more efficient than this v0 of regex-filtered: regex-filtered runs ~40% slower and needs ~40% more cycles, which tracks (though it only retires ~20% more instructions).

Note that these are runs with 100 iterations in order to get good enough sampling and suppress the setup cost, as the matching is what we really want to measure. This is measured using `time(1)`[^1] on macOS 14.5 (Sonoma) with the `-l` option for expanded metadata:

re2
---

```
       46.99 real        46.87 user         0.02 sys
            53379072  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                4031  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  50  involuntary context switches
        555295294264  instructions retired
        151305788537  cycles elapsed
            46548672  peak memory footprint
```

regex-filtered
--------------

```
       64.95 real        64.67 user         0.02 sys
           145571840  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                9021  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  53  involuntary context switches
        656605947158  instructions retired
        208754382042  cycles elapsed
           142198656  peak memory footprint
```

regex-filtered is significantly slower
======================================

At a glance it's not clear where: initially I was thinking it was in not reusing the matching atoms between runs, however in regex-filtered that's an iterator, so there's no difference there. The other transient allocations during matching should be (sketched in code after this list):

- A vec for the entry ids (misnamed `matched_atom_ids`), preallocated in neither.
- A map for the work (`work`) -- this is initialised to an ordered set copy of `matched_atom_ids`, so regex-filtered actually does that directly (though it then expands `matched_atom_ids`'s size to the number of entries, which is how `re2` preallocates `work`).
- A map of propagation counts, preallocated to the entries count in both.
- A matching regex set (`regexps_map`), preallocated to the total number of regexes in both.
- A `regexps` vector to store and return both unfiltered and filtered regex indexes, preallocated to the combination in regex-filtered but not in re2; they should both end up at the same size however, since the combination (`unfiltered` + `regexps_map`) is what both are putting in.
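To make the list concrete, here's that per-match scratch state as a minimal Rust sketch. The field names follow the text above, but the concrete types and the constructor are assumptions, not regex-filtered's actual definitions:

```rust
use std::collections::{HashMap, HashSet};

use indexmap::IndexSet;

/// Hypothetical per-match scratch state mirroring the allocations listed
/// above; everything here is built fresh on each match.
struct MatchScratch {
    /// Entry ids, deduplicated but kept in insertion order (the misnamed
    /// `matched_atom_ids`); in regex-filtered this ordered set doubles as
    /// the `work` list. Preallocated in neither implementation.
    matched_atom_ids: IndexSet<usize>,
    /// Propagation counts, preallocated to the entry count in both.
    count: HashMap<usize, usize>,
    /// Indexes of the regexes whose prefilter was satisfied.
    regexps_map: HashSet<usize>,
    /// Result vector: `unfiltered` + `regexps_map`.
    regexps: Vec<usize>,
}

impl MatchScratch {
    fn new(n_entries: usize, n_regexes: usize) -> Self {
        Self {
            matched_atom_ids: IndexSet::new(),
            count: HashMap::with_capacity(n_entries),
            regexps_map: HashSet::with_capacity(n_regexes),
            regexps: Vec::with_capacity(n_regexes),
        }
    }
}
```

Note that `matched_atom_ids`, `count`, and `regexps_map` all pay hashing overhead on every insert and lookup, which is where re2 differs.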
OTOH re2 does use a dedicated and bespoke `SparseArray` for much of the per-match work (`IntMap` is a `SparseArray<int>`), and after looking closer that turns out to be quite relevant: `SparseArray` is essentially the same concept as `IndexSet` except it skips the hashing phase entirely by having a fixed size for the frontend, so there's a sparse array of size `len` which just contains an index into the dense array, which keeps the ordering of items and allows fast iteration (a minimal sketch of the concept closes this note).

- For `matched_atom_ids`, it's useful to cheaply dedup new entries, which need to be kept in order. The code is already using an `IndexSet`, but it's using the default hash, which is a huge overhead compared to "no hash"; switching to `FxHash` or `NoHash` would likely be a major improvement here, barring implementing our own `SparseArray`.
- For `count`, the `SparseArray` seems like an unnecessary complication altogether: this should just be a `vec` of size `entries.len()`, which removes useless overhead, and *that* could furthermore be stack-allocated up to a limit (hello tinyvec). This should be a significant gain compared to our current hashmap, which I'm realising is *really* sub-par.
- `regexps` would certainly benefit from the same as `matched_atom_ids`, especially compared to the current `HashSet`.

regex-filtered needs 3x the memory
==================================

regex-filtered also needs a lot more memory (3x the original). This is mostly though not exclusively in the setup phase; re2 has a peak rss of:

- 42860544 with 0 iterations
- 54083584 with 1 iteration
- 53166080 with 10 iterations (there's some variation between runs)

regex-filtered has a peak rss of:

- 110968832 with 0 iterations
- 144326656 with 1 iteration
- 144965632 with 10 iterations

So after the first iteration both are mostly stable, and both grow by ~30% between the setup and the first iteration. That might be in large part because regex-filtered uses `usize` indices while re2 works off of `int`: the system works almost entirely off of indices (of entries, of atoms, of regexes), so that alone would translate to a ~2x growth. The difference is actually 2.5x, so there's clearly additional space being lost somewhere; the rest might be `regex::Regex` being larger than `re2::RE2` (to investigate).

[^1]: the difference is nowhere near fine enough that we'd need something more precise than `time(1)` to investigate it
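For reference, here's a minimal sketch of the sparse-set concept `SparseArray` is built on. This is the idea only, not re2's actual code, and `SparseSet` is a made-up name:

```rust
/// A fixed-capacity ordered set of small integer ids: the sparse side maps
/// an id directly to its slot on the dense side, so membership checks and
/// inserts involve no hashing, and the dense side preserves insertion order.
struct SparseSet {
    sparse: Vec<usize>, // indexed by id, holds a position in `dense`
    dense: Vec<usize>,  // the ids, in insertion order
}

impl SparseSet {
    /// `capacity` must exceed every id that will ever be inserted.
    fn new(capacity: usize) -> Self {
        Self {
            sparse: vec![usize::MAX; capacity],
            dense: Vec::with_capacity(capacity),
        }
    }

    fn contains(&self, id: usize) -> bool {
        // A stale sparse slot can't lie: it either points past the dense
        // length or at a dense entry holding a different id.
        self.sparse[id] < self.dense.len() && self.dense[self.sparse[id]] == id
    }

    /// Returns true if `id` was newly inserted; O(1) and hash-free.
    fn insert(&mut self, id: usize) -> bool {
        if self.contains(id) {
            return false;
        }
        self.sparse[id] = self.dense.len();
        self.dense.push(id);
        true
    }

    /// O(1) reset between matches: forget the dense side, stale sparse
    /// slots are harmless thanks to the check in `contains`.
    fn clear(&mut self) {
        self.dense.clear();
    }

    /// Iterates in insertion order, like `IndexSet`.
    fn iter(&self) -> impl Iterator<Item = usize> + '_ {
        self.dense.iter().copied()
    }
}
```

Compared to the hashed containers, `insert`/`contains` are a bounds check and two loads, iteration stays ordered, and the whole structure can be allocated once and reset per match instead of being rebuilt.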