Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inadvertently case sensitive Boyer-Moore #39420

Merged
merged 4 commits into from
Jul 16, 2020

Conversation

danmoseley
Copy link
Member

Fix #39390

In this case the pattern "H#" would not match "#H#" iff RegexOptions.IgnoreCase | RegexOptions.Compiled.

Because the pattern contains a literal prefix (indeed it is the entire pattern) we will use Boyer-Moore to find the first instance of it. (One could imagine a more efficient way to search for a 2-character prefix.) Because the IgnoreCase was passed, we lowercase the pattern immediately to "h#", and when we match against a character in the text, we must lower case that character to compare it.

As a performance optimization, in the Compiled path, we avoid calling ToLower on the text candidate if we can cheaply verify that the character we are searching for is not be affected by case conversion. In this case, for example, we need not bother to lower case the text candidate character when we are searching for "#" because it is in a UnicodeCategory ("OtherPunctuation") which we know is not affected by case conversion. This optimization, like many others, does not exist in the non Compiled path.

The bug was that when deciding whether to lowercase the text candidate, instead of examining the character we were searching for, we were examining the last character of the prefix instead. In this repro case that is "#" so when searching for "H" we would not lower case it.

I added a test that fails without this fix.

@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@danmoseley
Copy link
Member Author

We need to get this into Preview 8.

@danmoseley
Copy link
Member Author

Separate from this PR, it would probably be good to add a test that searches for a random string against a text that may or may not contain it somewhere, and compare compiled with non compiled. Such a test could quickly have found this bug and might protect us against others. The comparison with non compiled is interesting because the implementation is so different.

@ghost
Copy link

ghost commented Jul 16, 2020

Tagging subscribers to this area: @eerhardt, @pgovind
Notify danmosemsft if you want to be subscribed.

Copy link
Contributor

@pgovind pgovind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM!

@danmoseley danmoseley merged commit c9627a1 into dotnet:master Jul 16, 2020
@danmoseley danmoseley deleted the regexbug branch July 16, 2020 20:58
@karelz karelz added this to the 5.0.0 milestone Aug 18, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
@danmoseley danmoseley restored the regexbug branch December 22, 2020 05:07
@danmoseley danmoseley deleted the regexbug branch September 6, 2021 23:33
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected regex match failure with Compiled and IgnoreCase on net 5
5 participants