StrSearcher: Update substring search to use the Two Way algorithm
To improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.

Fixes rust-lang#25483, a performance bug: that particular case now completes faster
in optimized Rust than in Ruby (though both are within the same order of magnitude).

Many thanks to @gereeter, who helped me understand the reverse case
better and wrote the comment explaining `next_back` in the code.

I used quickcheck to fuzz test forward and reverse searching thoroughly.
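
The kind of property such a fuzz test checks can be sketched without quickcheck: compare `str::find` and `str::rfind` against a naive reference on many small pseudo-random inputs. This is only an illustration, not the test code from this commit; `naive_find`, `naive_rfind`, `next_u32`, `random_string` and `check_search` are hypothetical names, and the generator is a plain linear congruential one.

```rust
/// Naive reference: first byte offset where `needle` occurs in `haystack`.
fn naive_find(haystack: &str, needle: &str) -> Option<usize> {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    if n.len() > h.len() {
        return None;
    }
    (0..=h.len() - n.len()).find(|&i| &h[i..i + n.len()] == n)
}

/// Naive reference: last byte offset where `needle` occurs in `haystack`.
fn naive_rfind(haystack: &str, needle: &str) -> Option<usize> {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    if n.len() > h.len() {
        return None;
    }
    (0..=h.len() - n.len()).rev().find(|&i| &h[i..i + n.len()] == n)
}

/// Tiny deterministic generator (linear congruential), enough for a sketch.
fn next_u32(state: &mut u64) -> u32 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (*state >> 33) as u32
}

/// Random ASCII string over a two-letter alphabet, so repetitions are common.
fn random_string(state: &mut u64, max_len: u32) -> String {
    let len = next_u32(state) % (max_len + 1);
    (0..len)
        .map(|_| if next_u32(state) % 2 == 0 { 'a' } else { 'b' })
        .collect()
}

/// Returns true if the std searches agree with the naive reference on every case.
fn check_search(cases: usize) -> bool {
    let mut state = 0x5eed_u64;
    (0..cases).all(|_| {
        let haystack = random_string(&mut state, 40);
        let needle = random_string(&mut state, 6);
        haystack.find(needle.as_str()) == naive_find(&haystack, &needle)
            && haystack.rfind(needle.as_str()) == naive_rfind(&haystack, &needle)
    })
}

fn main() {
    assert!(check_search(10_000));
}
```

A two-letter alphabet is used on purpose: it produces highly periodic needles, which are the inputs most likely to exercise the period and memory handling of the searcher.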

The two way searcher implements both forward and reverse search,
but not double ended search. The forward and reverse parts of the two
way searcher are completely independent.
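
As a small illustration of the forward/reverse split, the sketch below uses the stable `str::match_indices` and `str::rmatch_indices` methods as they exist in current Rust (their item type differed when this commit was written). The helpers `forward_matches` and `reverse_matches` are hypothetical names. It shows that the two directions can report different non-overlapping matches for the same input.

```rust
/// Byte offsets of non-overlapping matches found scanning front to back.
fn forward_matches(haystack: &str, needle: &str) -> Vec<usize> {
    haystack.match_indices(needle).map(|(i, _)| i).collect()
}

/// Byte offsets of non-overlapping matches found scanning back to front.
fn reverse_matches(haystack: &str, needle: &str) -> Vec<usize> {
    haystack.rmatch_indices(needle).map(|(i, _)| i).collect()
}

fn main() {
    // Same haystack and needle as the `str_searcher_ascii_haystack_seq` test.
    assert_eq!(forward_matches("abbcbbbbd", "bb"), vec![1, 4, 6]);
    assert_eq!(reverse_matches("abbcbbbbd", "bb"), vec![6, 4, 1]);

    // With overlapping candidates the two directions disagree:
    // forward sees "[bb]b", reverse sees "b[bb]".
    assert_eq!(forward_matches("bbb", "bb"), vec![0]);
    assert_eq!(reverse_matches("bbb", "bb"), vec![1]);
}
```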

The two way searcher algorithm has very small, constant space overhead,
requiring no dynamic allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.
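
The `byteset` idea can be sketched in isolation. The mask `b & 0x3f` and the check on the last byte of the current window follow the code in the diff; the function names `byteset`, `may_contain` and `skipped_windows` are hypothetical, and the loop below only counts skips instead of performing a real search.

```rust
/// 64-bit fingerprint of the needle: bit `b & 0x3f` is set for each byte `b`.
fn byteset(needle: &[u8]) -> u64 {
    needle.iter().fold(0u64, |set, &b| set | (1u64 << (b & 0x3f)))
}

/// `false` means `byte` is definitely not in the needle; `true` means "maybe",
/// because bytes that share their low six bits collide.
fn may_contain(set: u64, byte: u8) -> bool {
    (set >> (byte & 0x3f)) & 1 != 0
}

/// Counts how many windows a forward scan can skip outright.
/// If the last byte of the window cannot occur in the needle, no match can
/// overlap that byte, so the scan may jump ahead by `needle.len()`.
fn skipped_windows(haystack: &[u8], needle: &[u8]) -> usize {
    let set = byteset(needle);
    let (mut pos, mut skipped) = (0, 0);
    while !needle.is_empty() && pos + needle.len() <= haystack.len() {
        if may_contain(set, haystack[pos + needle.len() - 1]) {
            pos += 1; // a real searcher would compare the window here
        } else {
            pos += needle.len();
            skipped += 1;
        }
    }
    skipped
}

fn main() {
    // None of the haystack bytes occur in the needle: every window is skipped.
    assert_eq!(skipped_windows(b"xxxxxxxxxxxx", b"abc"), 4);
    // 'a' (0x61) and '!' (0x21) share their low six bits: a false positive.
    assert!(may_contain(byteset(b"a"), b'!'));
}
```

The filter costs one shift and one mask per window and needs only a `u64` of state, which fits the constant-space property described above.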

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>();
```

For this particular case, two way is not much faster than the naive
implementation it replaces.
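
To try this case locally, a small harness along the following lines should be enough. `bad_case` is a hypothetical helper, the measurement is deliberately crude, and timings are only meaningful in an optimized build.

```rust
use std::time::Instant;

/// Builds the (haystack, needle) pair from the commit message.
fn bad_case() -> (String, String) {
    let haystack = (0..10_000).map(|_| "dac").collect::<String>();
    let needle = (0..100).map(|_| "bac").collect::<String>();
    (haystack, needle)
}

fn main() {
    let (haystack, needle) = bad_case();
    let start = Instant::now();
    let found = haystack.find(needle.as_str());
    // The haystack contains no 'b', so the search has to run to the end and fail.
    assert_eq!(found, None);
    println!("no match after {:?}", start.elapsed());
}
```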
Ulrik Sverdrup committed Jun 21, 2015
1 parent 9cc0b22 commit b890b7b
Showing 3 changed files with 475 additions and 426 deletions.
10 changes: 9 additions & 1 deletion src/libcollectionstest/str.rs
@@ -705,7 +705,7 @@ fn test_split_at() {
#[should_panic]
fn test_split_at_boundscheck() {
let s = "ศไทย中华Việt Nam";
let (a, b) = s.split_at(1);
s.split_at(1);
}

#[test]
@@ -1820,6 +1820,14 @@ mod pattern {
Match (4, 6),
Reject(6, 7),
]);
make_test!(str_searcher_ascii_haystack_seq, "bb", "abbcbbbbd", [
Reject(0, 1),
Match (1, 3),
Reject(3, 4),
Match (4, 6),
Match (6, 8),
Reject(8, 9),
]);
make_test!(str_searcher_empty_needle_ascii_haystack, "", "abbcbbd", [
Match (0, 0),
Reject(0, 1),
299 changes: 1 addition & 298 deletions src/libcore/str/mod.rs
@@ -15,13 +15,12 @@
#![doc(primitive = "str")]
#![stable(feature = "rust1", since = "1.0.0")]

use self::OldSearcher::{TwoWay, TwoWayLong};
use self::pattern::Pattern;
use self::pattern::{Searcher, ReverseSearcher, DoubleEndedSearcher};

use char::CharExt;
use clone::Clone;
use cmp::{self, Eq};
use cmp::Eq;
use convert::AsRef;
use default::Default;
use fmt;
@@ -33,7 +32,6 @@ use option::Option::{self, None, Some};
use raw::{Repr, Slice};
use result::Result::{self, Ok, Err};
use slice::{self, SliceExt};
use usize;

pub mod pattern;

@@ -870,301 +868,6 @@ impl<'a> DoubleEndedIterator for LinesAny<'a> {
}
}

/// The internal state of an iterator that searches for matches of a substring
/// within a larger string using two-way search
#[derive(Clone)]
struct TwoWaySearcher {
// constants
crit_pos: usize,
period: usize,
byteset: u64,

// variables
position: usize,
memory: usize
}

/*
This is the Two-Way search algorithm, which was introduced in the paper:
Crochemore, M., Perrin, D., 1991, Two-way string-matching, Journal of the ACM 38(3):651-675.
Here's some background information.
A *word* is a string of symbols. The *length* of a word should be a familiar
notion, and here we denote it for any word x by |x|.
(We also allow for the possibility of the *empty word*, a word of length zero).
If x is any non-empty word, then an integer p with 0 < p <= |x| is said to be a
*period* for x iff for all i with 0 <= i <= |x| - p - 1, we have x[i] == x[i+p].
For example, both 1 and 2 are periods for the string "aa". As another example,
the only period of the string "abcd" is 4.
We denote by period(x) the *smallest* period of x (provided that x is non-empty).
This is always well-defined since every non-empty word x has at least one period,
|x|. We sometimes call this *the period* of x.
If u, v and x are words such that x = uv, where uv is the concatenation of u and
v, then we say that (u, v) is a *factorization* of x.
Let (u, v) be a factorization for a word x. Then if w is a non-empty word such
that both of the following hold
- either w is a suffix of u or u is a suffix of w
- either w is a prefix of v or v is a prefix of w
then w is said to be a *repetition* for the factorization (u, v).
Just to unpack this, there are four possibilities here. Let w = "abc". Then we
might have:
- w is a suffix of u and w is a prefix of v. ex: ("lolabc", "abcde")
- w is a suffix of u and v is a prefix of w. ex: ("lolabc", "ab")
- u is a suffix of w and w is a prefix of v. ex: ("bc", "abchi")
- u is a suffix of w and v is a prefix of w. ex: ("bc", "a")
Note that the word vu is a repetition for any factorization (u,v) of x = uv,
so every factorization has at least one repetition.
If x is a string and (u, v) is a factorization for x, then a *local period* for
(u, v) is an integer r such that there is some word w such that |w| = r and w is
a repetition for (u, v).
We denote by local_period(u, v) the smallest local period of (u, v). We sometimes
call this *the local period* of (u, v). Provided that x = uv is non-empty, this
is well-defined (because each factorization has at least one repetition, as
noted above).
It can be proven that the following is an equivalent definition of a local period
for a factorization (u, v): any positive integer r such that x[i] == x[i+r] for
all i such that |u| - r <= i <= |u| - 1 and such that both x[i] and x[i+r] are
defined. (i.e. i >= 0 and i + r < |x|).
Using the above reformulation, it is easy to prove that
1 <= local_period(u, v) <= period(uv)
A factorization (u, v) of x such that local_period(u,v) = period(x) is called a
*critical factorization*.
The algorithm hinges on the following theorem, which is stated without proof:
**Critical Factorization Theorem** Any word x has at least one critical
factorization (u, v) such that |u| < period(x).
The purpose of maximal_suffix is to find such a critical factorization.
*/
impl TwoWaySearcher {
#[allow(dead_code)]
fn new(needle: &[u8]) -> TwoWaySearcher {
let (crit_pos_false, period_false) = TwoWaySearcher::maximal_suffix(needle, false);
let (crit_pos_true, period_true) = TwoWaySearcher::maximal_suffix(needle, true);

let (crit_pos, period) =
if crit_pos_false > crit_pos_true {
(crit_pos_false, period_false)
} else {
(crit_pos_true, period_true)
};

// This isn't in the original algorithm, as far as I'm aware.
let byteset = needle.iter()
.fold(0, |a, &b| (1 << ((b & 0x3f) as usize)) | a);

// A particularly readable explanation of what's going on here can be found
// in Crochemore and Rytter's book "Text Algorithms", ch 13. Specifically
// see the code for "Algorithm CP" on p. 323.
//
// What's going on is we have some critical factorization (u, v) of the
// needle, and we want to determine whether u is a suffix of
// &v[..period]. If it is, we use "Algorithm CP1". Otherwise we use
// "Algorithm CP2", which is optimized for when the period of the needle
// is large.
if &needle[..crit_pos] == &needle[period.. period + crit_pos] {
TwoWaySearcher {
crit_pos: crit_pos,
period: period,
byteset: byteset,

position: 0,
memory: 0
}
} else {
TwoWaySearcher {
crit_pos: crit_pos,
period: cmp::max(crit_pos, needle.len() - crit_pos) + 1,
byteset: byteset,

position: 0,
memory: usize::MAX // Dummy value to signify that the period is long
}
}
}

// One of the main ideas of Two-Way is that we factorize the needle into
// two halves, (u, v), and begin trying to find v in the haystack by scanning
// left to right. If v matches, we try to match u by scanning right to left.
// How far we can jump when we encounter a mismatch is all based on the fact
// that (u, v) is a critical factorization for the needle.
#[inline]
fn next(&mut self, haystack: &[u8], needle: &[u8], long_period: bool)
-> Option<(usize, usize)> {
'search: loop {
// Check that we have room to search in
if self.position + needle.len() > haystack.len() {
return None;
}

// Quickly skip by large portions unrelated to our substring
if (self.byteset >>
((haystack[self.position + needle.len() - 1] & 0x3f)
as usize)) & 1 == 0 {
self.position += needle.len();
if !long_period {
self.memory = 0;
}
continue 'search;
}

// See if the right part of the needle matches
let start = if long_period { self.crit_pos }
else { cmp::max(self.crit_pos, self.memory) };
for i in start..needle.len() {
if needle[i] != haystack[self.position + i] {
self.position += i - self.crit_pos + 1;
if !long_period {
self.memory = 0;
}
continue 'search;
}
}

// See if the left part of the needle matches
let start = if long_period { 0 } else { self.memory };
for i in (start..self.crit_pos).rev() {
if needle[i] != haystack[self.position + i] {
self.position += self.period;
if !long_period {
self.memory = needle.len() - self.period;
}
continue 'search;
}
}

// We have found a match!
let match_pos = self.position;
self.position += needle.len(); // add self.period for all matches
if !long_period {
self.memory = 0; // set to needle.len() - self.period for all matches
}
return Some((match_pos, match_pos + needle.len()));
}
}

// Computes a critical factorization (u, v) of `arr`.
// Specifically, returns (i, p), where i is the starting index of v in some
// critical factorization (u, v) and p = period(v)
#[inline]
#[allow(dead_code)]
#[allow(deprecated)]
fn maximal_suffix(arr: &[u8], reversed: bool) -> (usize, usize) {
let mut left: usize = !0; // Corresponds to i in the paper
let mut right = 0; // Corresponds to j in the paper
let mut offset = 1; // Corresponds to k in the paper
let mut period = 1; // Corresponds to p in the paper

while right + offset < arr.len() {
let a;
let b;
if reversed {
a = arr[left.wrapping_add(offset)];
b = arr[right + offset];
} else {
a = arr[right + offset];
b = arr[left.wrapping_add(offset)];
}
if a < b {
// Suffix is smaller, period is entire prefix so far.
right += offset;
offset = 1;
period = right.wrapping_sub(left);
} else if a == b {
// Advance through repetition of the current period.
if offset == period {
right += offset;
offset = 1;
} else {
offset += 1;
}
} else {
// Suffix is larger, start over from current location.
left = right;
right += 1;
offset = 1;
period = 1;
}
}
(left.wrapping_add(1), period)
}
}

/// The internal state of an iterator that searches for matches of a substring
/// within a larger string using a dynamically chosen search algorithm
#[derive(Clone)]
// NB: This is kept around for convenience because
// it is planned to be used again in the future
enum OldSearcher {
TwoWay(TwoWaySearcher),
TwoWayLong(TwoWaySearcher),
}

impl OldSearcher {
#[allow(dead_code)]
fn new(haystack: &[u8], needle: &[u8]) -> OldSearcher {
if needle.is_empty() {
// Handle specially
unimplemented!()
// FIXME: Tune this.
// FIXME(#16715): This unsigned integer addition will probably not
// overflow because that would mean that the memory almost solely
// consists of the needle. Needs #16715 to be formally fixed.
} else if needle.len() + 20 > haystack.len() {
// Use naive searcher
unimplemented!()
} else {
let searcher = TwoWaySearcher::new(needle);
if searcher.memory == usize::MAX { // If the period is long
TwoWayLong(searcher)
} else {
TwoWay(searcher)
}
}
}
}

#[derive(Clone)]
// NB: This is kept around for convenience because
// it is planned to be used again in the future
struct OldMatchIndices<'a, 'b> {
// constants
haystack: &'a str,
needle: &'b str,
searcher: OldSearcher
}

impl<'a, 'b> OldMatchIndices<'a, 'b> {
#[inline]
#[allow(dead_code)]
fn next(&mut self) -> Option<(usize, usize)> {
match self.searcher {
TwoWay(ref mut searcher)
=> searcher.next(self.haystack.as_bytes(), self.needle.as_bytes(), false),
TwoWayLong(ref mut searcher)
=> searcher.next(self.haystack.as_bytes(), self.needle.as_bytes(), true),
}
}
}

/*
Section: Comparing strings
*/