Skip to content

Develop a Strategy for Efficiently Handling Non-Ascii Regexes #16

@hyperpape

Description

@hyperpape

The typical approach for determining the next state while looping through characters is to look it up from an array, after determining the byteClass for that character. We determine the byteClass by looking up the character in a byte[129] array. Schematically:

while (state != -1 && index < length) {
    character = string.chatAt(index++);
    byteClass = byteClasses[Math.min(128, character)];
    nextState = states[stateNumber][byteClass];
}

If a regex contains non-ascii characters, then we would force the byte array to be significantly larger, and it's less obvious that this is a good strategy. As of today, we do not ever use byteClasses to look up non-ascii characters.

There are a few approaches we can take here:

  1. Some regexes may use non-ascii characters, but only from the early parts of the BMP (a regex matching Cyrillic characters, for instance). In this case, we could simply use a byteClass array larger than 128 elements, without too much overhead. This would require a degree of measurement to determine at what point the increased memory use becomes unacceptable.

  2. If we can efficiently remap characters into a smaller space, we can keep our byteClass array small. If a regex contains contiguous characters from a single alphabet, then we could just use subtraction to map it into the [0,n] interval. How well this works in any given case depends on how complex the mapping would be.

  3. Re2 and Rust's regex library have state machines that operate a byte at a time. How this would work with Java's internal use of UTF-16, I'm not yet sure.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions