Major literal optimization refactoring. #190

Merged 1 commit on Mar 28, 2016
2 changes: 2 additions & 0 deletions Cargo.toml
@@ -17,6 +17,8 @@ finite automata and guarantees linear time matching on all inputs.
aho-corasick = "0.5"
# For skipping along search text quickly when a leading byte is known.
memchr = "0.1"
# For managing regex caches quickly across multiple threads.
mempool = "0.2"
# For parsing regular expressions.
regex-syntax = { path = "regex-syntax", version = "0.3.0" }
# For compiling UTF-8 decoding into automata.
64 changes: 38 additions & 26 deletions HACKING.md
@@ -32,7 +32,8 @@ The code to find prefixes and search for prefixes is in src/literals.rs. When
more than one literal prefix is found, we fall back to an Aho-Corasick DFA
using the aho-corasick crate. For one literal, we use a variant of the
Boyer-Moore algorithm. Both Aho-Corasick and Boyer-Moore use `memchr` when
appropriate.
appropriate. The Boyer-Moore variant in this library also uses elementary
frequency analysis to choose the right byte to run `memchr` with.
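
To make the frequency idea concrete, here is a minimal sketch of the rare-byte scan only. It leaves out the Boyer-Moore skip logic and is not the code in src/literals.rs. The `rank` table and the names `rarest_offset` and `find` are hypothetical, and a plain `Iterator::position` scan stands in for `memchr` so the example needs no external crate.

```rust
/// Hypothetical frequency rank: a higher value means the byte is assumed to
/// be more common in typical text. This table is an illustrative assumption.
fn rank(b: u8) -> u8 {
    match b {
        b' ' | b'e' | b't' | b'a' | b'o' | b'i' | b'n' => 255,
        b'a'..=b'z' => 200,
        b'A'..=b'Z' => 150,
        b'0'..=b'9' => 120,
        _ => 50,
    }
}

/// Returns the offset within `needle` of the byte assumed to be rarest.
/// Ties are resolved in favor of the earliest offset.
fn rarest_offset(needle: &[u8]) -> usize {
    needle
        .iter()
        .enumerate()
        .min_by_key(|&(_, &b)| rank(b))
        .map(|(i, _)| i)
        .unwrap_or(0)
}

/// Finds the first occurrence of `needle` in `haystack` by scanning for the
/// rarest needle byte and verifying the full needle around each candidate.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    if needle.len() > haystack.len() {
        return None;
    }
    let off = rarest_offset(needle);
    let rare = needle[off];
    // The rare byte of any match must lie in `off..=last`.
    let last = haystack.len() - needle.len() + off;
    let mut pos = off;
    while pos <= last {
        // Stand-in for memchr: locate the next candidate rare byte.
        let i = match haystack[pos..=last].iter().position(|&b| b == rare) {
            Some(i) => pos + i,
            None => return None,
        };
        let start = i - off;
        if &haystack[start..start + needle.len()] == needle {
            return Some(start);
        }
        pos = i + 1;
    }
    None
}
```

For the needle `o, w`, this sketch scans for the comma rather than the more common `o`, so fewer candidate positions need full verification.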

Of course, detecting prefix literals can only take us so far. Not all regular
expressions have literal prefixes. To remedy this, we try another approach to
@@ -53,10 +54,12 @@ text results in at most one new DFA state. It is made fast by caching states.
DFAs are susceptible to exponential state blow up (where the worst case is
computing a new state for every input byte, regardless of what's in the state
cache). To avoid using a lot of memory, the lazy DFA uses a bounded cache. Once
the cache is full, it is wiped and state computation starts over again.
the cache is full, it is wiped and state computation starts over again. If the
cache is wiped too frequently, then the DFA gives up and searching falls back
to one of the aforementioned algorithms.
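
The wipe-and-give-up policy can be sketched as follows. This is a simplified illustration, not the lazy DFA's actual cache: the type names, the use of a `HashMap`, and the fixed `max_wipes` threshold are all assumptions made for the example, and the criterion the library really uses for "wiped too frequently" may differ.

```rust
use std::collections::HashMap;

/// Outcome of asking the cache for a state (hypothetical type).
#[derive(Debug, PartialEq)]
enum Lookup {
    /// The identifier of a cached (or freshly added) state.
    State(usize),
    /// The cache was wiped too often; the caller should fall back to
    /// another matching engine.
    GaveUp,
}

/// A bounded state cache keyed by sets of NFA instruction indices.
struct BoundedCache {
    states: HashMap<Vec<usize>, usize>,
    capacity: usize,
    wipes: usize,
    max_wipes: usize, // assumed fixed threshold, for illustration only
}

impl BoundedCache {
    fn new(capacity: usize, max_wipes: usize) -> BoundedCache {
        BoundedCache { states: HashMap::new(), capacity, wipes: 0, max_wipes }
    }

    /// Returns the cached state for `key`, adding it if needed. When the
    /// cache is full it is wiped and refilled from scratch, unless it has
    /// already been wiped `max_wipes` times.
    fn get_or_insert(&mut self, key: Vec<usize>) -> Lookup {
        if let Some(&id) = self.states.get(&key) {
            return Lookup::State(id);
        }
        if self.states.len() >= self.capacity {
            if self.wipes >= self.max_wipes {
                return Lookup::GaveUp;
            }
            self.states.clear();
            self.wipes += 1;
        }
        let id = self.states.len();
        self.states.insert(key, id);
        Lookup::State(id)
    }
}
```

A caller that receives `Lookup::GaveUp` would abandon the DFA search and rerun the search with one of the other engines.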

All of the above matching engines expose precisely the matching semantics. This
is indeed tested. (See the section below about testing.)
All of the above matching engines expose precisely the same matching semantics.
This is indeed tested. (See the section below about testing.)

The following sub-sections describe the rest of the library and how each of the
matching engines are actually used.
@@ -70,6 +73,9 @@ encountered. Parsing is done in a separate crate so that others may benefit
from its existence, and because it is relatively divorced from the rest of the
regex library.

The regex-syntax crate also provides sophisticated support for extracting
prefix and suffix literals from regular expressions.

### Compilation

The compiler is in src/compile.rs. The input to the compiler is some abstract
@@ -162,7 +168,7 @@ knows what the caller wants. Using this information, we can determine which
engine (or engines) to use.

The logic for choosing which engine to execute is in src/exec.rs and is
documented on the Exec type. Exec values collection regular expression
documented on the Exec type. Exec values contain regular expression
Programs (defined in src/prog.rs), which contain all the necessary tidbits
for actually executing a regular expression on search text.

@@ -172,6 +178,14 @@ of src/exec.rs by far is the execution of the lazy DFA, since it requires a
forwards and backwards search, and then falls back to either the NFA algorithm
or backtracking if the caller requested capture locations.

The parameterization of every search is defined in src/params.rs. Among other
things, search parameters provide storage for recording capture locations and
matches (for regex sets). The existence and nature of storage is itself a
configuration for how each matching engine behaves. For example, if no storage
for capture locations is provided, then the matching engines can give up as
soon as a match is witnessed (which may occur well before the leftmost-first
match).
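
As a rough illustration of storage acting as configuration, the sketch below searches for an alternation of plain literals. It does not reflect the types in src/params.rs: the `search` function and its `slots` parameter are hypothetical. With no slots, it answers as soon as any alternative is seen anywhere in the text. With slots, it has to do the positional scan that pins down the leftmost-first match.

```rust
/// Searches `haystack` for any of the literal `alts` (hypothetical helper).
///
/// With `slots == None`, only a yes/no answer is needed, so the search
/// stops at the first alternative witnessed, wherever it occurs.
/// With `slots == Some(..)`, the start and end of the leftmost-first match
/// are recorded, which requires scanning positions from left to right and
/// trying alternatives in order at each position.
fn search(
    haystack: &str,
    alts: &[&str],
    slots: Option<&mut [Option<usize>; 2]>,
) -> bool {
    match slots {
        // No capture storage: any witness is good enough.
        None => alts.iter().any(|alt| haystack.contains(*alt)),
        // Capture storage provided: find the leftmost-first match.
        Some(slots) => {
            let bytes = haystack.as_bytes();
            for start in 0..=bytes.len() {
                for alt in alts {
                    if bytes[start..].starts_with(alt.as_bytes()) {
                        slots[0] = Some(start);
                        slots[1] = Some(start + alt.len());
                        return true;
                    }
                }
            }
            false
        }
    }
}
```

On the text `xxabcd` with alternatives `bcd` and `abc`, the slot-free call may witness `bcd` first and stop, while the call with slots reports the span of `abc`, which starts earlier.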

### Programs

A regular expression program is essentially a sequence of opcodes produced by
@@ -268,48 +282,46 @@ N.B. To run tests for the `regex!` macro, use:

The benchmarking in this crate is made up of many micro-benchmarks. Currently,
there are two primary sets of benchmarks: the benchmarks that were adopted at
this library's inception (in `benches/bench.rs`) and a newer set of benchmarks
this library's inception (in `benches/src/misc.rs`) and a newer set of benchmarks
meant to test various optimizations. Specifically, the latter set contain some
analysis and are in `benches/bench_sherlock.rs`. Also, the latter set are all
analysis and are in `benches/src/sherlock.rs`. Also, the latter set are all
executed on the same lengthy input whereas the former benchmarks are executed
on strings of varying length.

There is also a smattering of benchmarks for parsing and compilation.

Benchmarks are in a separate crate so that its dependencies can be managed
separately from the main regex crate.

Benchmarking follows a similarly wonky setup as tests. There are multiple
entry points:

* `bench_native.rs` - benchmarks the `regex!` macro
* `bench_dynamic.rs` - benchmarks `Regex::new`
* `bench_dynamic_nfa.rs` benchmarks `Regex::new`, forced to use the NFA
algorithm on every regex. (N.B. This can take a few minutes to run.)
* `bench_rust_plugin.rs` - benchmarks the `regex!` macro
* `bench_rust.rs` - benchmarks `Regex::new`
* `bench_rust_bytes.rs` - benchmarks `bytes::Regex::new`
* `bench_pcre.rs` - benchmarks PCRE
* `bench_onig.rs` - benchmarks Oniguruma

The PCRE benchmarks exist as a comparison point to a mature regular expression
library. In general, this regex library compares favorably (there are even a
few benchmarks that PCRE simply runs too slowly on or outright can't execute at
all). I would love to add other regular expression library benchmarks
(especially RE2), but PCRE is the only one with reasonable bindings.
The PCRE and Oniguruma benchmarks exist as a comparison point to a mature
regular expression library. In general, this regex library compares favorably
(there are even a few benchmarks that PCRE simply runs too slowly on or
outright can't execute at all). I would love to add other regular expression
library benchmarks (especially RE2).

If you're hacking on one of the matching engines and just want to see
benchmarks, then all you need to run is:

$ cargo bench --bench dynamic
$ ./run-bench rust

If you want to compare your results with older benchmarks, then try:

$ cargo bench --bench dynamic | tee old
$ ./run-bench rust | tee old
$ ... make it faster
$ cargo bench --bench dynamic | tee new
$ ./run-bench rust | tee new
$ cargo-benchcmp old new --improvements

The `cargo-benchcmp` utility is available here:
https://github.com/BurntSushi/cargo-benchcmp

To run the same benchmarks on PCRE, you'll need to use the sub-crate in
`regex-pcre-benchmark` like so:

$ cargo bench --manifest-path regex-pcre-benchmark/Cargo.toml

The PCRE benchmarks are separated from the main regex crate so that its
dependency doesn't break builds in environments without PCRE.
The `run-bench` utility can run benchmarks for PCRE and Oniguruma too. See
`./run-bench --help`.
1 change: 0 additions & 1 deletion benches/Cargo.toml
@@ -14,7 +14,6 @@ enum-set = "0.0.6"
lazy_static = "0.1"
onig = { version = "0.4", optional = true }
pcre = { version = "0.2", optional = true }
rand = "0.3"
regex = { version = "0.1", path = ".." }
regex_macros = { version = "0.1", path = "../regex_macros", optional = true }
regex-syntax = { version = "0.3", path = "../regex-syntax" }
1 change: 0 additions & 1 deletion benches/src/bench_onig.rs
@@ -12,7 +12,6 @@

#[macro_use] extern crate lazy_static;
extern crate onig;
extern crate rand;
extern crate test;

use std::ops::Deref;
1 change: 0 additions & 1 deletion benches/src/bench_pcre.rs
@@ -24,7 +24,6 @@
extern crate enum_set;
#[macro_use] extern crate lazy_static;
extern crate pcre;
extern crate rand;
extern crate test;

/// A nominal wrapper around pcre::Pcre to expose an interface similar to
1 change: 0 additions & 1 deletion benches/src/bench_rust.rs
@@ -11,7 +11,6 @@
#![feature(test)]

#[macro_use] extern crate lazy_static;
extern crate rand;
extern crate regex;
extern crate regex_syntax;
extern crate test;
1 change: 0 additions & 1 deletion benches/src/bench_rust_bytes.rs
@@ -11,7 +11,6 @@
#![feature(test)]

#[macro_use] extern crate lazy_static;
extern crate rand;
extern crate regex;
extern crate regex_syntax;
extern crate test;
1 change: 0 additions & 1 deletion benches/src/bench_rust_plugin.rs
@@ -12,7 +12,6 @@
#![plugin(regex_macros)]

#[macro_use] extern crate lazy_static;
extern crate rand;
extern crate regex;
extern crate regex_syntax;
extern crate test;
8 changes: 8 additions & 0 deletions benches/src/misc.rs
@@ -130,6 +130,14 @@ bench_match!(one_pass_long_prefix_not, regex!("^.bcdefghijklmnopqrstuvwxyz.*$"),
"abcdefghijklmnopqrstuvwxyz".to_owned()
});

bench_match!(long_needle1, regex!("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"), {
    repeat("a").take(100_000).collect::<String>() + "b"
});

bench_match!(long_needle2, regex!("bbbbbbbbbbbbbbbbbbbbbbbbbbbbbba"), {
    repeat("b").take(100_000).collect::<String>() + "a"
});

#[cfg(feature = "re-rust")]
#[bench]
fn replace_all(b: &mut Bencher) {
21 changes: 21 additions & 0 deletions benches/src/rust_compile.rs
@@ -29,6 +29,13 @@ fn compile_simple_bytes(b: &mut Bencher) {
});
}

#[bench]
fn compile_simple_full(b: &mut Bencher) {
    b.iter(|| {
        regex!(r"^bc(d|e)*$")
    });
}

#[bench]
fn compile_small(b: &mut Bencher) {
b.iter(|| {
@@ -45,6 +52,13 @@ fn compile_small_bytes(b: &mut Bencher) {
});
}

#[bench]
fn compile_small_full(b: &mut Bencher) {
    b.iter(|| {
        regex!(r"\p{L}|\p{N}|\s|.|\d")
    });
}

#[bench]
fn compile_huge(b: &mut Bencher) {
b.iter(|| {
@@ -60,3 +74,10 @@ fn compile_huge_bytes(b: &mut Bencher) {
Compiler::new().bytes(true).compile(&[re]).unwrap()
});
}

#[bench]
fn compile_huge_full(b: &mut Bencher) {
    b.iter(|| {
        regex!(r"\p{L}{100}")
    });
}
19 changes: 19 additions & 0 deletions regex-debug/Cargo.toml
@@ -0,0 +1,19 @@
[package]
publish = false
name = "regex-debug"
version = "0.1.0"
authors = ["The Rust Project Developers"]
license = "MIT/Apache-2.0"
repository = "https://github.com/rust-lang/regex"
documentation = "http://doc.rust-lang.org/regex"
homepage = "https://github.com/rust-lang/regex"
description = "A tool useful for debugging regular expressions."

[dependencies]
docopt = "0.6"
regex = { version = "0.1", path = ".." }
regex-syntax = { version = "0.3", path = "../regex-syntax" }
rustc-serialize = "0.3"

[profile.release]
debug = true