Use StartsWith literal optimization for OrdinalIgnoreCase in RegexCompiler / source generator #66324

Closed
stephentoub opened this issue Mar 8, 2022 · 1 comment · Fixed by #66339
Assignees: stephentoub
Labels: area-System.Text.RegularExpressions, tenet-performance, untriaged
Milestone: 7.0.0

Comments

@stephentoub
Member

stephentoub commented Mar 8, 2022

#66095 is optimizing MemoryExtensions.StartsWith(span, "literal", OrdinalIgnoreCase) by auto-vectorizing cases where the literal string is all ASCII. In Regex, we can take advantage of this in the compiler and source generator by recognizing sequences of appropriate sets, e.g. "[Aa][Bb][Cc]", which our IgnoreCase option now produces (and will produce more robustly with #61048). Today we'll output individual checks for each set, e.g. (ch | 0x20) == 'a', but we can start outputting StartsWith calls with the corresponding multi-character literal string rather than performing individual comparisons against each char.

cc: @EgorBo, @joperezr
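
To make the proposal concrete, here is a minimal sketch of the two shapes of code for a pattern like "[Aa][Bb][Cc]". It is not the actual output of RegexCompiler or the source generator; the class and method names are hypothetical and chosen only for illustration. The first method mirrors the per-set checks described above, and the second collapses them into one StartsWith call with the multi-character literal.

```csharp
using System;

// Hypothetical illustration only; not code emitted by the Regex source generator.
public static class IgnoreCaseLiteralSketch
{
    // Per-character form: one check per set ([Aa], [Bb], [Cc]).
    // For an ASCII letter, (ch | 0x20) folds the uppercase form onto the lowercase one.
    public static bool MatchPerChar(ReadOnlySpan<char> span) =>
        span.Length >= 3 &&
        (span[0] | 0x20) == 'a' &&
        (span[1] | 0x20) == 'b' &&
        (span[2] | 0x20) == 'c';

    // Collapsed form: a single call against the whole literal, which also
    // performs the length check that the per-character form needs up front.
    public static bool MatchStartsWith(ReadOnlySpan<char> span) =>
        span.StartsWith("abc", StringComparison.OrdinalIgnoreCase);
}
```

The two forms agree here because each set contains exactly the two ASCII casings of a letter. Sets whose members do not correspond to an OrdinalIgnoreCase comparison would presumably have to stay as individual checks, which is what "appropriate sets" suggests.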

@stephentoub stephentoub added this to the 7.0.0 milestone Mar 8, 2022
@stephentoub stephentoub self-assigned this Mar 8, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 8, 2022
@ghost

ghost commented Mar 8, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 8, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 10, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Apr 10, 2022