Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Early Bail Out for Time-Consuming Regex Matches in pcre2 #2786

Closed
LangLangBart opened this issue Apr 23, 2024 · 4 comments
Closed

Comments

@LangLangBart
Copy link

Describe your feature request

When using rg with --engine pcre2 for regex patterns that include complex assertions such as lookaheads, the search can take an excessive amount of time, especially with very large input strings. It would be beneficial to have an option that allows rg to skip strings that exceed a certain processing time threshold, thus preventing long-running searches.

The rg command at the bottom of the code block below took around 10 minutes to complete.

command rg --version
# ripgrep 14.1.0

# features:-simd-accel,+pcre2
# simd(compile):+SSE2,+SSSE3,-AVX2
# simd(runtime):+SSE2,+SSSE3,-AVX2

# PCRE2 10.42 is available (JIT is available)

# Generate a random alphanumeric string and store it in a file
printf "%s" "$(LC_ALL=C tr -dc "[:alnum:]" </dev/urandom | head -c 700000)" >/tmp/rg_test
# Search the file using lookahead assertions
command rg --engine pcre2 '(?=.*test)(?=.*ripgrep)' /tmp/rg_test

An early bail-out option in rg would help users prioritize search speed when processing complex regex patterns on large data, avoiding long waits for unlikely matches.

Example Usage

command rg --engine pcre2 --max-match-time-pcre2 1000 '(?=.*test)(?=.*ripgrep)'  /tmp/rg_test

In this example, --max-match-time-pcre2 1000 is a new option that instructs rg to spend no more than 1000 milliseconds attempting to find a match in a line. If the match attempt exceeds this duration, rg will skip it.


Background Info

To provide some background information on why this feature request was created: After downloading the Mailing List Archive for zsh, rg was used to search for two words in the same line.

command rg --engine pcre2 '(?=.*export)(?=.*function)'

However, there is at least one base64 string with approximately 700,000 characters, which caused rg to take a considerable amount of time to process.
zsh.org/mla/workers/2015/msg03191.html


@ltrzesniewski
Copy link
Contributor

FWIW, here's a faster way to match two words in the same line: ^(?=.*?export)(?=.*?function)

@LangLangBart
Copy link
Author

^(?=.?export)(?=.?function)

It serves me well, thank you.

@BurntSushi
Copy link
Owner

BurntSushi commented Apr 23, 2024

Or also, rg export | rg function.

Anyway, PCRE2 doesn't support a way to say, "if the search is taking longer than X time, stop." I don't know of any regex engine that supports that. If you think for a moment about how something like that would be implemented, it's pretty clear why: imagine trying to check a timeout value when you're in the middle of some vectorized SIMD loop. It would trash performance. And this sort of timeout check would need to be in virtually every loop everywhere. Ain't going to happen.

PCRE2, like most regex engines, does expose other types of resource limits. But none of them really approximate "time."

Your best bet is to filter out longer lines yourself (rg -v '.{100}' or something), or use a technique that doesn't require using a backtracking engine. Hopefully some day #875 will happen and that will probably be the "right" way to solve your particular problem.

@ltrzesniewski
Copy link
Contributor

I don't mean to be pedantic, but I'll add a few comments since it's an interestic topic 🙂

The .NET regex engine supports timeouts (there's a constructor overload which takes a timeout parameter). This feature is mainly provided for searching with untrusted patterns in order to avoid DoS attacks. I haven't looked at how it's implemented but I believe a check is inserted when backtracking (and maybe in some other cases such as lookarounds). The check is simple enough but it still needs to retrieve the clock when enabled. In any case this article states:

Regex supports timeouts, and guarantees that it will only do at most O(n) work (where n is the length of the input) between timeout checks

As for PCRE2, it has a PCRE2_AUTO_CALLOUT flag which causes the engine to call a user-provided function before each pattern "item", and you can cancel execution from there, so in theory you could implement timeouts, but that would surely wreck the matching performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants