Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Step over surrogate pairs on zero-lenth matches (fixes #100134) #100482

Merged

Conversation

IllusionMH
Copy link
Contributor

@IllusionMH IllusionMH commented Jun 18, 2020

This PR fixes #100134

Problem lies in fact that regexp with u flag will change lastIndex to the start of surrogate pair if started after high pair:

const unicodeRe = /()/gu;
const regularRe = /()/g;
unicodeRe.lastIndex = 1;
regularRe.lastIndex = 1;
console.log('unicode', unicodeRe.lastIndex, unicodeRe.exec('💻').index, unicodeRe.lastIndex); // unicode 1 0 0
console.log('regular', regularRe.lastIndex, regularRe.exec('💻').index, regularRe.lastIndex); // regular 1 1 1

Which ended attempts to step forward for zero-length matches in endless loop.

This PR takes into account surrogate pairs and propagates lastIndex accordingly.
Endless loop during find matches is fixed in this PR.

Not in this PR:

  1. In some cases (for files with a lot of matches) it's impossible to use Find next button to get to the next match after surrogate pair. Can give a try in this PR if needed, but would prefer to leave it out of scope of this PR
    (new fallback logic not triggered in that case, but should use similar strategy - remember last position and advance for zero-length matches, but this should survive reset)
    What do you think?

  2. Inconsistency of handling matches of for surrogate pairs between JS RegExp's with u flag and ripgrep
    I can create issue with more details but simple example will be for file with 1💻 replace using RegExp 1. - Find in files breaks surrogate, while replace in editor replaces whole emoji

// we attempt to recover from that by advancing by one
this._searchRegex.lastIndex += 1;
// we attempt to recover from that by advancing by two if surrogate pair found and by one otherwise
if (strings.getNextCodePoint(text, textLength, this._searchRegex.lastIndex) > 0xFFFF) {
Copy link
Contributor Author

@IllusionMH IllusionMH Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one looks closest alternative to "missing" isSurrogatePair helper (without recreating logic that involves isHighSurrogate and isLowSurrogate)

@alexdima
Copy link
Member

Looks great! Thank you!

@alexdima alexdima added this to the June 2020 milestone Jun 22, 2020
@alexdima alexdima merged commit 60da9d8 into microsoft:master Jun 22, 2020
@IllusionMH IllusionMH deleted the search-step-over-surrogates-100134 branch June 22, 2020 20:57
@IllusionMH
Copy link
Contributor Author

@alexdima thank you for review!

Could you please provide more insights why when there is large number of matches in file (no in string with current match) it will re-evaluate matches (reset/next) starting from current match position?

Had no problem with small files with few matches - Find next can step over emoji and doesn't trigger reset/next

@github-actions github-actions bot locked and limited conversation to collaborators Aug 6, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regexp freezes VS Code when code has emoji
2 participants