Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the string search behavior when using ICU #57078

Merged
merged 8 commits into from
Aug 17, 2021

Conversation

tarekgh
Copy link
Member

@tarekgh tarekgh commented Aug 9, 2021

Fixes #43736

In .NET 5.0, we have switched to use the ICU library for Globalization support. We got some issues with the string search operations like IndexOf, LastIndexOf, IsSuffix, and IsPrefix APIs. The linked issues containing some details about such problems. One example, when searching for "\n" in the string "\r\n" get string not found as a result.
The underlying cause of this behavior is ICU internally is using a break iterator which disallows breaking between some characters.
The provided solution here is we are customizing the break iterator to get the desired behavior.

@ghost
Copy link

ghost commented Aug 9, 2021

Tagging subscribers to this area: @tarekgh, @safern
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #43736
Fixes #43897

In .NET 5.0, we have switched to use the ICU library for Globalization support. We got some issues with the string search operations like IndexOf, LastIndexOf, IsSuffix, and IsPrefix APIs. The linked issues containing some details about such problems. One example, when searching for "\n" in the string "\r\n" get string not found as a result.
The underlying cause of this behavior is ICU internally is using a break iterator which disallows breaking between some characters.
The provided solution here is we are customizing the break iterator to get the desired behavior.

Author: tarekgh
Assignees: -
Labels:

area-System.Globalization

Milestone: -

@tarekgh tarekgh added this to the 6.0.0 milestone Aug 9, 2021
@tarekgh
Copy link
Member Author

tarekgh commented Aug 9, 2021

@safern could you please have a look?

CC @danmoseley @ericstj @stephentoub @GrabYourPitchforks

@tarekgh tarekgh self-assigned this Aug 9, 2021
@tarekgh tarekgh requested a review from safern August 9, 2021 21:58
@GrabYourPitchforks
Copy link
Member

Haven't looked at the code yet, but this seems like an extremely risky change to make right before an RC snap. Are we sure we want this for 6.0?

@tarekgh
Copy link
Member Author

tarekgh commented Aug 10, 2021

Haven't looked at the code yet, but this seems like an extremely risky change to make right before an RC snap. Are we sure we want this for 6.0?

Why you think it is that risky? I am not seeing the same. Even I think this can meet the serving bar if we need to service 5.0.

Note that this is affecting the breaking between the characters and not really the collation. That is why I am seeing the risk is not high here.

@tarekgh
Copy link
Member Author

tarekgh commented Aug 11, 2021

@GrabYourPitchforks just checking if you have any more comments?

@safern could you please have a look when you have a chance?

@GrabYourPitchforks
Copy link
Member

I don't have any other feedback. I still maintain that this change has the potential to introduce subtle breaks since we're introducing NLS-like behaviors across all platforms, even non-Windows platforms which always used ICU under the covers. And that normally we'd want such a change to bake for several previews before it goes in for real. I don't feel confident that we understand the potential fallout well enough to introduce this in RC, let alone in-place servicing for downlevel platforms.

All of that said, if we still believe it's the right change for the product, I'll accept it. The code looks solid enough. :)

@danmoseley
Copy link
Member

@stephentoub @marklio what are your thoughts on the breakingness here.

@danmoseley
Copy link
Member

I'd like to get more perspectives before merging, if thats OK?

@GrabYourPitchforks
Copy link
Member

@danmoseley Check your inbox. Tarek sent an excellent email that laid out the problem and why this solution was chosen. My response to that email boiled down to "let's make sure we understand these three things first, and if we're good with them then hit GO!"

@danmoseley danmoseley added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Aug 12, 2021
@danmoseley
Copy link
Member

Marking while we cross i's and dot t's.

@tarekgh
Copy link
Member Author

tarekgh commented Aug 16, 2021

@aolszowka you have reacted to a couple of comments here. do you have any feedback we need to consider with this change? or you are just watching the progress?

@aolszowka
Copy link

@tarekgh

@aolszowka you have reacted to a couple of comments here. do you have any feedback we need to consider with this change? or you are just watching the progress?

I am just watching this progress. I reported one of the original repros on the ICU behaviors here: #43802. I no longer work for that particular company with that code base that would have been greatly affected.

I can say that these types of low level changes often result in extremely high profile breaks that doesn't generate a lot of good will at the C-Level, making the case for staying current on .NET difficult. From the community response in the various threads I don't think we're alone in that situation.

I would echo @GrabYourPitchforks comments, and would be interested to see what persuaded a change to say 'ship it':

I don't have any other feedback. I still maintain that this change has the potential to introduce subtle breaks since we're introducing NLS-like behaviors across all platforms, even non-Windows platforms which always used ICU under the covers. And that normally we'd want such a change to bake for several previews before it goes in for real. I don't feel confident that we understand the potential fallout well enough to introduce this in RC, let alone in-place servicing for downlevel platforms.

The feedback I have given to at least alert developers (at compile time) of these types of gotchas seems to have been back burner-ed #47775. In my opinion, in a perfect world these types of changes would roll out first as a compiler warning for a few months to encourage devs to study potential problem areas for remediation (or at least submit knowledge bases for areas to look out for) and then perform a runtime roll out. However historically these Analyzers have been an after thought, only after some hard-won in the wild knowledge has cause incidents.

@tarekgh tarekgh removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Aug 17, 2021
@tarekgh
Copy link
Member Author

tarekgh commented Aug 17, 2021

The failure in the CI leg runtime (Libraries Test Run checked coreclr windows x86 Release) is unrelated to the change and tracked by #57452

@adamsitnik
Copy link
Member

This is pure ICU magic!

@ghost ghost locked as resolved and limited conversation to collaborators Sep 26, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Breaking change with string.IndexOf(string) from .NET Core 3.0 -> .NET 5.0
5 participants