Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[browser][non-icu] HybridGlobalization checking for prefix/suffix #85093

Merged
merged 11 commits into from
Apr 22, 2023

Conversation

ilonatommy
Copy link
Member

@ilonatommy ilonatommy commented Apr 20, 2023

Implements a chunk of web-api based globalization. Is a part of HybridGlobalization feature and contributes to #79989.

Old, icu-based private API: GlobalizationNative_StartsWith, GlobalizationNative_EndsWith

New, non-icu private API: Interop.JsGlobalization.StartsWith, Interop.JsGlobalization.EndsWith

Affected public API (see: tests in CompareInfoTests.IsPrefix.cs, CompareInfoTests.IsSuffix):

  • CompareInfo.IsPrefix
  • CompareInfo.IsSuffix
  • String.StartsWith
  • String.EndsWith

It is faster than existing ICU version (that does not support vectorization) but the supported scope is reduced.

Test name time ICU4C [ms] time HG [ms] increase by [times]
String, CompareInfo IsPrefix 5.3002 1.0729 0.20
String, CompareInfo IsSuffix 10.7907 1.0765 0.09
String, String StartsWith 5.4354 1.1588 0.21
String, String EndsWith 12.3411 1.1217 0.09

All changes in behavior are listed in docs\design\features\hybrid-globalization.md.

cc @SamMonoRT
cc @lewing: weightless chars removal is done by a regex after decoding the string

@ghost
Copy link

ghost commented Apr 20, 2023

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

Issue Details

Implements a chunk of web-api based globalization. Is a part of HybridGlobalization feature and contributes to #79989.

Old, icu-based private API: GlobalizationNative_StartsWith, GlobalizationNative_EndsWith

New, non-icu private API: Interop.JsGlobalization.StartsWith, Interop.JsGlobalization.EndsWith

Affected public API (see: tests in CompareInfoTests.IsPrefix.cs, CompareInfoTests.IsSuffix):

  • CompareInfo.IsPrefix
  • CompareInfo.IsSuffix
  • String.StartsWith
  • String.EndsWith

ToDo: paste perf test results here

All changes in behavior are listed in docs\design\features\hybrid-globalization.md.

cc @SamMonoRT
cc @lewing: weightless chars removal is done by a regex after decoding the string

Author: ilonatommy
Assignees: ilonatommy
Labels:

area-System.Globalization

Milestone: -

Copy link
Member

@pavelsavara pavelsavara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

src/mono/wasm/runtime/net6-legacy/hybrid-globalization.ts Outdated Show resolved Hide resolved
@ilonatommy
Copy link
Member Author

ilonatommy commented Apr 20, 2023

@pavelsavara
V8 does not have TextEncoder. Chrome has it, but it works different than

for (let i = 0; i < <any>end - <any>start; i += 2) {
        const char = Module.getValue(<any>start + i, "i16");
        str += String.fromCharCode(char);
}

and wrong (see failures: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-85093-merge-f3fb5a2097a34532ba/WasmTestOnBrowser-Hybrid.WASM.Tests/1/console.f54019e3.log?helixlogtype=result). From this reason, I am reverting the changes to decoding algorithm. I will use the above loop for every host type.

Edit: to check out more testcases that start failing when we use the Encoder see my attempt to use it in comparison in the PR: #85098

@tarekgh tarekgh added the arch-wasm WebAssembly architecture label Apr 20, 2023
@ghost
Copy link

ghost commented Apr 20, 2023

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

Implements a chunk of web-api based globalization. Is a part of HybridGlobalization feature and contributes to #79989.

Old, icu-based private API: GlobalizationNative_StartsWith, GlobalizationNative_EndsWith

New, non-icu private API: Interop.JsGlobalization.StartsWith, Interop.JsGlobalization.EndsWith

Affected public API (see: tests in CompareInfoTests.IsPrefix.cs, CompareInfoTests.IsSuffix):

  • CompareInfo.IsPrefix
  • CompareInfo.IsSuffix
  • String.StartsWith
  • String.EndsWith

It is faster than existing ICU version (that does not support vectorization) but the supported scope is reduced.

| Test name | time ICU4C [ms] | time HG [ms] | increase by [times] |
|-:|-:|-:|-:|
| String, CompareInfo IsPrefix | 5.3002 | 4.1998 | 0.79 |
| String, CompareInfo IsSuffix | 10.7907 | 4.1528 | 0.38 |
| String, String StartsWith | 5.4354 | 4.2034 | 0.77 |
| String, String EndsWith | 12.3411 | 4.0324 | 0.32 |

The above are results on the version with TextEncoder. Current version results:

Test name time ICU4C [ms] time HG [ms] increase by [times]
String, CompareInfo IsPrefix 3.1098 2.8683 0.92
String, CompareInfo IsSuffix 3.0338 2.8077 0.92
String, String StartsWith 3.3057 2.8643 0.86
String, String EndsWith 3.6848 2.7959 0.75

All changes in behavior are listed in docs\design\features\hybrid-globalization.md.

cc @SamMonoRT
cc @lewing: weightless chars removal is done by a regex after decoding the string

Author: ilonatommy
Assignees: ilonatommy
Labels:

arch-wasm, area-System.Globalization

Milestone: -

@pavelsavara
Copy link
Member

TextEncoder ... wrong

Still LGTM, but I would love this to get explained.

  • What makes it wrong and why ?
  • Is is wrong also in other use-cases, such as JS interop ?

@ilonatommy
Copy link
Member Author

  • What makes it wrong and why ?
  • Is is wrong also in other use-cases, such as JS interop ?

I am in the process of #85098 investigation (there the only change is the decoding so it's easier). When I have conculsions, I'll post there and ping you.

@ilonatommy
Copy link
Member Author

ilonatommy commented Apr 21, 2023

I am in the process of #85098 investigation (there the only change is the decoding so it's easier). When I have conculsions, I'll post there and ping you.

The other PR had a logical error that was not connected with the way TextDecoder works. What is going on here:

  • we pass two strings: "%uD800 %uDC00" and "%uD800"
  • they arrive in JS correctly: as [55 296, 56 320] and [55 296]
  • decoder is able to decode the string with 2 codes but for 1 code it overflows:

image

65533 -> %uFFFD: Replacement character - used to replace an incoming character whose value is unknown or unrepresentable in Unicode.
Any code >= 55296 behaves this way, even though we can still fit them in 16 bits.
image

@ilonatommy
Copy link
Member Author

TextEnxcoder is fine, its behavior just does not match NLS's behavior. V8 behaves like Windows's NLS.

@ilonatommy ilonatommy merged commit fa5d12e into dotnet:main Apr 22, 2023
@ghost ghost locked as resolved and limited conversation to collaborators May 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants