Globalization Size #225

235 changes: 235 additions & 0 deletions accepted/2021/Globalization/GlobalizationSize.md
@@ -0,0 +1,235 @@
# Globalization and Size on Disk

There have been several discussions about globalization support in .NET, mainly about the level of globalization functionality provided and the size on disk that this functionality consumes. Various ideas have been floated for addressing the issues raised.
The goal of this document is to list those issues and get all parties to agree on the problem. It also suggests different ways to address them, which may help shape the final plan.

### Background

.NET supports globalization functionality such as date and number formatting, calendars, locale data, string collation, string casing, string normalization, and Internationalized Domain Names (IDN).
Almost all of this functionality needs globalization data, plus code that processes the data to perform the expected operation.

Since .NET Framework v4.0, the framework has mainly depended on the underlying operating system to perform such operations. Before v4.0, the framework carried the globalization data itself, along with the code to handle it (e.g. collation algorithms). Carrying the globalization data was a big burden on the framework because of the maintainability and servicing costs, while the operating systems and standards (e.g. Unicode/ICU) already provided everything we needed.

In .NET Core, we continued with the same strategy. On Windows we depended on the NLS Win32 APIs, and on other platforms (e.g. Linux, macOS) we depended on the [ICU](https://github.com/unicode-org/icu) library.

In .NET 5.0 we went one step further and added [support for using the ICU library](https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu) on Windows, and we made this behavior the default. We also added the [ICU app-local](https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu#app-local-icu) feature, which allows applications to use a different ICU version than the system-wide one if needed.

In .NET Core, we have also supported the [Globalization Invariant Mode](https://github.com/dotnet/runtime/blob/main/docs/design/features/globalization-invariant-mode.md), which applications that don't care much about globalization can opt into. The benefit of enabling this mode is avoiding any dependency on the system globalization libraries or ICU. Invariant mode was mainly introduced to support smaller container images (e.g. the Alpine Linux image). Another benefit of Invariant mode is that it guarantees consistent app behavior across different operating systems and platforms.

### Concerned Scenarios

Although globalization support works fine for desktop and server applications, it is a concern on other platforms, mainly because it requires installing the ICU library. The ICU module sizes are a concern and would not be acceptable, especially on mobile platforms.

```
ICU module sizes from MS-ICU Linux x64 package:

libicudata.so.68.2.0.6 28,320,992
libicui18n.so.68.2.0.6 4,220,936
libicuuc.so.68.2.0.6 2,479,488
Member

@jkotas jkotas Jun 8, 2021

What are the sizes of ICU globalization support for Xamarin apps and Wasm?

Member

@filipnavara filipnavara Jun 8, 2021

iOS arm64

icudt.dat 1,713,152
icudt_CJK.dat 1,006,224
icudt_EFIGS.dat 602,096
icudt_no_CJK.dat 1,271,296
libicudata.a 720
libicui18n.a 3,536,472
libicuuc.a 2,286,376

Android arm64

icudt.dat 1,512,896
icudt_CJK.dat 966,080
icudt_EFIGS.dat 559,568
icudt_no_CJK.dat 1,082,128
libicudata.a 1,116
libicui18n.a 7,105,274
libicuuc.a 4,267,870

Browser WASM

icudt.dat 1,512,896
icudt_CJK.dat 966,080
icudt_EFIGS.dat 559,568
icudt_no_CJK.dat 1,082,128
libicui18n.a 4,077,896
libicuuc.a 2,510,674

The other architectures have similar sizes so I am omitting them for clarity. Only one of the icudt*.dat files is linked to the app. They are user selectable through .csproj property and contain differently trimmed configurations (full; Chinese/Japanese/Korean, English/French/Italian/German/Spanish, full without Chinese/Japanese/Korean). Notably it's weird that the iOS data file is bigger and it leads me to believe it's some build configuration error (cc @akoeplinger @directhex).

Member

@filipnavara filipnavara Jun 8, 2021

I am also not quite sure why the Android libraries would be twice as big as on iOS. That would warrant some explanation too.

UPD: It looks like the Android ones may have debug symbols and the iOS ones don't. This is not 100% confirmed, but stripping the Android ones reduces the size by around 50%, and `file` output on the original files in libicui18n.a shows something like

> lib % file ios-arm64/native/lib/smpdtfmt.ao 
ios-arm64/native/lib/smpdtfmt.ao: Mach-O 64-bit object arm64
> lib % file android-arm64/native/lib/smpdtfmt.ao 
android-arm64/native/lib/smpdtfmt.ao: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), not stripped

Member

Also, .a files are not what actually ships in the app. It would be more interesting to look at the size contribution after the .a files get statically linked into the app.

Member

That's not trivial to measure but xamarin/xamarin-macios#10249 (comment) has some numbers. The .a files already contain heavily stripped ICU4C with disabled features.


Right, I'm an idiot, the transport only contains .a files so we should be stripping whatever those get linked into post-hoc

@CoffeeFlux CoffeeFlux Jun 8, 2021

The native ICU bits end up being about 380KB compressed on wasm.

Member

> It would be more interesting to look at the size contribution after the .a files get statically linked into the app.

For Blazor WASM - when we go from the "default/full config" to the "min config" (which sets GlobalizationMode.Invariant, plus other things), dotnet.wasm drops from (all sizes are .br compressed) 737.0 KB to 384.0 KB.

@CoffeeFlux can get how much of that is just ICU, but my understanding is the majority of that size decrease is ICU getting linked out. This means the ICU library is contributing about as much code size as all of the mono runtime.

Member Author

One thing to note, at least for WASM scenarios: I am seeing some people complain about missing globalization data for their scenarios, which suggests we need better coverage, or at least the option of better coverage.

Member Author

I'll include the numbers reported here by @filipnavara and @CoffeeFlux in the doc. Thanks for providing these.


Thanks to @filipnavara and @CoffeeFlux for providing the following info:

iOS arm64:
icudt.dat 1,713,152
icudt_CJK.dat 1,006,224
icudt_EFIGS.dat 602,096
icudt_no_CJK.dat 1,271,296
libicudata.a 720
libicui18n.a 3,536,472
libicuuc.a 2,286,376

Android arm64:
icudt.dat 1,512,896
icudt_CJK.dat 966,080
icudt_EFIGS.dat 559,568
icudt_no_CJK.dat 1,082,128
libicudata.a 1,116
libicui18n.a 7,105,274
libicuuc.a 4,267,870

Browser WASM:
icudt.dat 1,512,896
icudt_CJK.dat 966,080
icudt_EFIGS.dat 559,568
icudt_no_CJK.dat 1,082,128
libicui18n.a 4,077,896
libicuuc.a 2,510,674

The native ICU bits end up being about 380KB compressed on wasm.

```
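As a rough cross-check of the figures above, the following Python sketch totals the Linux x64 module sizes and compares a few of the mobile numbers. It assumes the listed values are byte counts (the listing does not state a unit), and the dictionary and helper names are made up for illustration.

```python
# Sizes copied from the listing above; assumed (not confirmed) to be bytes.
LINUX_X64 = {
    "libicudata.so": 28_320_992,
    "libicui18n.so": 4_220_936,
    "libicuuc.so": 2_479_488,
}

# Data-file variants for Android arm64 / Browser WASM (identical in the listing).
ICUDT = {
    "icudt.dat": 1_512_896,
    "icudt_CJK.dat": 966_080,
    "icudt_EFIGS.dat": 559_568,
    "icudt_no_CJK.dat": 1_082_128,
}

# Static library sizes used to compare Android against iOS.
IOS_LIBS = {"libicui18n.a": 3_536_472, "libicuuc.a": 2_286_376}
ANDROID_LIBS = {"libicui18n.a": 7_105_274, "libicuuc.a": 4_267_870}


def to_mib(size_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return size_bytes / (1024 * 1024)


linux_total = sum(LINUX_X64.values())
print(f"Linux x64 ICU total: {linux_total:,} bytes ({to_mib(linux_total):.1f} MiB)")

# How much smaller each trimmed data file is than the full one.
full = ICUDT["icudt.dat"]
for name, size in ICUDT.items():
    saved = 100 * (full - size) / full
    print(f"{name}: {size:,} bytes, {saved:.0f}% smaller than full")

# Android-to-iOS size ratio for each static library.
for name in IOS_LIBS:
    ratio = ANDROID_LIBS[name] / IOS_LIBS[name]
    print(f"{name}: Android/iOS = {ratio:.2f}x")
```

Under that assumption the three Linux x64 modules total 35,021,416 bytes (about 33.4 MiB), which is in line with the "over 30MB" figure raised in the review discussion, and the Android static libraries come out at roughly 1.9x to 2.0x the iOS ones.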

- **Mobile Platforms**:
  - **Android**: Although Android comes with ICU libraries that .NET can use, OEMs can choose not to include them in their device images.
  - **iOS/MacCatalyst**: These platforms don't come with an ICU library that .NET can use, so .NET has to provide an ICU package for them.
- **WebAssembly**: Clients have to be small and run inside the browser, which does not allow access to any libraries outside the browser. That means WebAssembly clients need to include an ICU package for globalization support.

Member

To what extent do browsers provide globalization functionality that we could leverage without having to include it in the app?

Member Author

@lewing you may have more info about that, as you have investigated such options before. Could you please share what you found back then?

tarekgh marked this conversation as resolved.
- Users of **Alpine Linux containers**, which enable `Globalization Invariant Mode`, are asking for better globalization support without much increase in image size.

Contributor

@marek-safar marek-safar Jun 8, 2021

What about self-contained console apps, is it always desired to bundle full ICU?

Member Author

It depends on the app scenario. If the app doesn't need globalization features, it can enable Invariant mode. If it does need the globalization features, then it will need to bundle the ICU package. This is the same as what we have today.

Contributor

> then need to bundle the ICU package

Right, which makes even a trivial self-contained Console.WriteLine app over 30MB bigger.

Member Author

Why do you think a trivial self-contained Console.WriteLine app will need globalization in the first place?

Contributor

Because it's the default and we don't make it in any way easy for developers to move away from it.

Member

What would you like to see happen to make this easy?

Contributor

I can see several things we could do better

  • Tell developers which APIs or code patterns in their applications are fully or partially incompatible with invariant mode (different calendars, currency names access, string normalization, etc.).
  • Write code fixers/suggestions to use Invariant culture or prefer APIs which are not culture-specific more aggressively.
  • Consider making Invariant mode the default, with an opt-out, for some project types (e.g. console apps).


### Current Status

In .NET 5.0/6.0 we introduced the ICU app-local feature, which allows applications to publish any ICU package as part of the app. To address the size issue for mobile apps and WebAssembly clients, the Xamarin team created size-trimmed ICU packages targeting the different mobile platforms and WebAssembly. Although this helped us move forward, it still takes a lot of effort and more needs to be done. It became clear that trimming ICU4C is not easy. It is not as simple as trimming the data: the ICU code is also large and needs trimming too, and trimming ICU code is not a simple task. Code that was thought to be isolated and was trimmed caused problems later because other parts depended on it. It looks like a full code analysis is needed to ensure that whatever we trim is safe to remove and is not needed by other parts.

One suggestion is to enhance the Globalization Invariant mode so that it provides better globalization support. We are already tracking some of this work for .NET 6.0 in this [issue](https://github.com/dotnet/runtime/issues/43774). Looking at the scenarios we want to enhance/support, I am not convinced enhancing the globalization mode is the path we need to pursue. The reason is that, to address all concerned scenarios, we need to offer a range of trimmed data and functionality options. For example, we need to offer packages for a single culture or a group of cultures: European-only cultures, CJK cultures, Bidi Cultures, the full list of cultures, etc. We also need to provide functionality options, e.g. cultural collation support, normalization support, IDN support, casing support, etc. If we can do that, users are free to select the level of functionality they want, and the size cost of every choice is clear. Users will have full control over the functionality level and the size cost. To support that, it makes sense to follow the same ICU app-local idea rather than just adding more functionality to Invariant mode. It would also be better to keep the definition of Globalization Invariant Mode clear and unambiguous: this mode is for applications that don't need globalization support, and nothing else.

Member

How does one tell whether or not the app and its dependencies need the different functionality options, like normalization support, etc.? It is nice to give full control to users; but we also need to have a guidance how to take advantage of this support with confidence.


This is the key. It's not obvious to developers when the various ICU components are used, so I think we will want some idea of how to actually go about changing that.

Member Author

> How does one tell whether or not the app and its dependencies need the different functionality options, like normalization support, etc.? It is nice to give full control to users; but we also need to have a guidance how to take advantage of this support with confidence.

That makes sense:

  • The easy one would be collation. If the user does not care about linguistic operations and ordinal operations are good enough, then collation can be excluded altogether.
  • Normalization would be needed in advanced scenarios like transformation engines (e.g. text to speech, advanced DB searches, semantic string comparisons, etc.). Normalization is also an important part of IDN.
  • IDN support, which needs normalization too for advanced scenarios (to fully support non-ASCII URLs).

I'll add this info to the doc.
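
To illustrate the point about IDN depending on normalization, here is a small Python sketch. It uses Python's standard `unicodedata` module and built-in `punycode` codec purely as stand-ins; it says nothing about how .NET or ICU implement IDN, and the helper name `to_ascii_label` is hypothetical. Real IDNA processing involves more steps (case mapping, prohibited-character checks) than are shown here.

```python
import unicodedata


def to_ascii_label(label: str, normalize: bool = True) -> str:
    """Encode one domain label as an ACE ("xn--") label.

    Simplified sketch: real IDNA processing does more than normalize.
    """
    if normalize:
        label = unicodedata.normalize("NFC", label)
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


precomposed = "b\u00fccher"  # u-umlaut as a single code point
decomposed = "bu\u0308cher"  # 'u' followed by a combining diaeresis

# With normalization, both spellings map to the same ASCII label.
assert to_ascii_label(precomposed) == to_ascii_label(decomposed)

# Without it, visually identical input produces different labels.
assert to_ascii_label(precomposed, normalize=False) != to_ascii_label(
    decomposed, normalize=False
)

print(to_ascii_label(precomposed))
```

The two inputs render identically on screen, so without a normalization step the same visible host name could resolve to two different ASCII names.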

Member

The problem is NuGet packages. How does the user know what a NuGet package needs?

For example, Uri parsing needs FormC normalization and I believe that it is slightly broken with invariant mode today. How does one discover situations like this without trial and error?

Member

Another simple case to answer: Does Maui need collation? If basic Maui does not need collation, are there Maui controls that do need collation?

Member Author

The only way to do that is to document what could be broken when excluding any functionality, maybe in the NuGet spec files themselves.

The other idea would be to have only 2 types of packages: one with the full functionality and one without, i.e. the second would be missing collation, normalization, etc. That would be closer to Invariant mode, except that the package would carry some locale data.

Unfortunately, I don't have solid data, so we may delay support for cutting functionality until we get concrete requests for it.

Member Author

@tarekgh tarekgh Jun 8, 2021

> Another simple case to answer: Does Maui need collation? If basic Maui does not need collation, are there Maui controls that do need collation?

I believe UI apps in general will need globalization support. Apps usually have user-facing controls that show a list sorted in the user's language. Formatting and displaying dates and numbers is common there too. In general, I think Maui would be a first-class platform that needs globalization.

Member

@jkotas jkotas Jun 8, 2021

> European-only cultures, CJK cultures, Bidi Cultures

These options have been available in Xamarin for a while. Do we have any data about how often app developers choose to leverage them? How common is it for Xamarin app developers to ship different builds of the app for different regions?

Member Author

@marek-safar do we have such data? Or do you know if anyone has collected such data before?

Member

> Looking at the scenarios we want to enhance/support, I am not convinced enhancing the globalization mode is the path we need to pursue.

Does this mean this document is suggesting / your recommendation is that we close dotnet/runtime#43774 and not do anything around improving the invariant mode? Having read the rest of the doc, I understand the recommendation is to pursue ICU4X for .NET 7, but I can't discern whether that's instead of or in addition to work for .NET 6.

Member

My understanding of the plan for .NET 6 was (at a minimum) to complete this part of dotnet/runtime#43774:

> Make basic lower/upper casing of non-ASCII characters work. The experience is pretty bad today if you are non-English speaking and like to use your native alphabet. The size footprint for doing this does not seem prohibitive.

Member

I think the meaning of "I am not convinced enhancing the globalization mode is the path we need to pursue." was about adding incremental features trying to get a hybrid between "Invariant mode" and "full globalization support". For example:

  • Supporting just locale Date/Number formatting data, but not the full-blown string collation

I think if we want to pursue some of these hybrid features, we would need user data/evidence that shows they would be useful to people.

Member Author

I meant that we should not try to add more globalization features to Invariant mode, and that includes not adding even locale data. If we need to add locale data, we'll need to ship it in NuGet packages (as we need to avoid servicing and maintaining the data ourselves). If we use NuGet packages to carry the data, that means we are really using the app-local feature and there is no reason to use Invariant mode in the first place.

I believe the ask in dotnet/runtime#43774 is still reasonable for invariant mode, at least the first part, as casing is more of a string manipulation feature than core globalization. For the second part, we will support it with a config switch which can be enabled only on the platforms that need it. That is the discussion we had.
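
The casing part of that ask can be illustrated with a short sketch. The Python below is only an analogy: `ascii_only_upper` is a hypothetical helper that mimics casing limited to the ASCII range (the limitation the quoted ask implies for Invariant mode at the time), while Python's built-in `str.upper` stands in for casing that also covers non-ASCII letters. Neither reflects the actual .NET implementation.

```python
def ascii_only_upper(text: str) -> str:
    """Upper-case only a-z and leave every other character untouched."""
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch
        for ch in text
    )


# "cafe" with e-acute, and "uber" with u-umlaut, written with escapes
# so the code points are unambiguous.
sample = "caf\u00e9 \u00fcber"

print(ascii_only_upper(sample))  # accented letters keep their lower case
print(sample.upper())            # accented letters are upper-cased too
```

With ASCII-only casing the accented letters are left in lower case in the middle of an otherwise upper-cased word, which is the kind of experience the issue describes for users writing in their native alphabet.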


### Apple OS's

Although we cannot access ICU directly on iOS or MacCatalyst, we may explore using the OS Cocoa APIs directly. If this idea succeeds, we may consider it for macOS too, and not only for iOS and MacCatalyst.

@filipnavara has done an initial scan of the available APIs; here is the info:

```
GlobalizationNative_NormalizeString
- NSString decomposedStringWithCanonicalMapping (form D)
- NSString decomposedStringWithCompatibilityMapping (form KD)
- NSString precomposedStringWithCanonicalMapping (form C)
- NSString precomposedStringWithCompatibilityMapping (form KC)

GlobalizationNative_IsNormalized
- ?? (could be mapped to GlobalizationNative_NormalizeString but inefficient)

GlobalizationNative_WindowsIdToIanaId [Not supported on app-local iOS/WASM ICU]
GlobalizationNative_IanaIdToWindowsId [Not supported on app-local iOS/WASM ICU]

GlobalizationNative_GetTimeZoneDisplayName [Not supported on app-local iOS/WASM ICU]
- NSTimeZone name

GlobalizationNative_GetLocaleInfoString
- NSLocale
- LocaleString_LocalizedDisplayName => localizedStringForLocaleIdentifier
- LocaleString_EnglishDisplayName => localeIdentifier / localizedStringForLocaleIdentifier?
- LocaleString_NativeDisplayName => localeIdentifier / localizedStringForLocaleIdentifier?
- LocaleString_LocalizedLanguageName => localizedStringForLanguageCode
- LocaleString_EnglishLanguageName => localizedStringForLanguageCode
- LocaleString_NativeLanguageName => localizedStringForLanguageCode
- LocaleString_EnglishCountryName => localizedStringForCountryCode
- LocaleString_NativeCountryName => localizedStringForCountryCode
- LocaleString_DecimalSeparator => decimalSeparator
- LocaleString_ThousandSeparator => groupingSeparator
- LocaleString_Digits => ?
- LocaleString_MonetarySymbol => currencySymbol
- LocaleString_CurrencyEnglishName => localizedStringForCurrencyCode
- LocaleString_CurrencyNativeName => localizedStringForCurrencyCode
- LocaleString_Iso4217MonetarySymbol => currencySymbol (?)
- LocaleString_MonetaryDecimalSeparator => ?
- LocaleString_MonetaryThousandSeparator => ?
- LocaleString_AMDesignator => calendarIdentifier -> NSCalendar AMSymbol
- LocaleString_PMDesignator => calendarIdentifier -> NSCalendar PMSymbol
- LocaleString_PositiveSign => ?
- LocaleString_NegativeSign => ?
- LocaleString_Iso639LanguageTwoLetterName => languageCode
- LocaleString_Iso639LanguageThreeLetterName => ?
- LocaleString_Iso3166CountryName => countryCode (?)
- LocaleString_Iso3166CountryName2 => ?
- LocaleString_NaNSymbol => ?
- LocaleString_PositiveInfinitySymbol => ?
- LocaleString_ParentName => ?
- LocaleString_PercentSymbol => ?
- LocaleString_PerMilleSymbol => ?

GlobalizationNative_GetLocaleTimeFormat
- NSDateFormatter?

GlobalizationNative_GetLocaleInfoInt
- NSLocale

GlobalizationNative_GetLocaleInfoGroupingSizes
- ??

GlobalizationNative_GetLocales
GlobalizationNative_GetLocaleName
GlobalizationNative_GetDefaultLocaleName
GlobalizationNative_IsPredefinedLocale
- NSLocale

GlobalizationNative_ToAscii
GlobalizationNative_ToUnicode
- No equivalent API for IDN/Punycode?

GlobalizationNative_GetSortHandle
- return reference to NSLocale

GlobalizationNative_CloseSortHandle
- release reference to NSLocale

GlobalizationNative_GetSortVersion
- ??

GlobalizationNative_CompareString
- [NSString compare:options:range:locale:](https://developer.apple.com/documentation/foundation/nsstring/1414561-compare?language=objc)

GlobalizationNative_IndexOf
- [NSString rangeOfString:options:range:locale:](https://developer.apple.com/documentation/foundation/nsstring/1417348-rangeofstring?language=objc)

GlobalizationNative_LastIndexOf
- same as GlobalizationNative_IndexOf w/ NSBackwardsSearch

GlobalizationNative_StartsWith
- can be implemented through GlobalizationNative_CompareString?

GlobalizationNative_EndsWith
- can be implemented through GlobalizationNative_CompareString?

GlobalizationNative_GetSortKey
- wcsxfrm_l?

GlobalizationNative_ChangeCase
- NSString uppercaseStringWithLocale
- NSString lowercaseStringWithLocale

GlobalizationNative_ChangeCaseInvariant
- NSString uppercaseString
- NSString lowercaseString

GlobalizationNative_ChangeCaseTurkish
- Implemented in S.G.N using u_tolower/u_toupper, so easy to replicate

GlobalizationNative_InitOrdinalCasingPage
- Implemented in S.G.N using u_toupper, so easy to replicate

GlobalizationNative_GetCalendars
GlobalizationNative_GetCalendarInfo
GlobalizationNative_EnumCalendarInfo
GlobalizationNative_GetLatestJapaneseEra
GlobalizationNative_GetJapaneseEraStartDate
- NSCalendar (TBD: details)
```
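
The normalization entries at the top of the scan map onto the four standard Unicode normalization forms (C, D, KC, KD). As a language-neutral illustration, the Python sketch below uses the standard `unicodedata` module (not the Cocoa APIs) to show what each form does, and how an `IsNormalized`-style check can be built on top of a normalize call, which is the fallback the scan notes as possible but inefficient. The function name `is_normalized_via_normalize` is hypothetical.

```python
import unicodedata


def is_normalized_via_normalize(text: str, form: str) -> bool:
    """Check normalization by normalizing and comparing.

    This always builds a normalized copy, even when the input is already
    normalized, which is why the fallback is comparatively inefficient.
    """
    return unicodedata.normalize(form, text) == text


composed = "\u00e9"     # e-acute as one code point
decomposed = "e\u0301"  # 'e' + combining acute accent
ligature = "\ufb01"     # "fi" ligature, a compatibility character

# Canonical forms: C composes, D decomposes.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Compatibility forms additionally fold characters such as ligatures.
assert unicodedata.normalize("NFKC", ligature) == "fi"
assert unicodedata.normalize("NFKD", ligature) == "fi"
assert unicodedata.normalize("NFC", ligature) == ligature

# The normalize-and-compare check agrees with the expectations above.
assert is_normalized_via_normalize(composed, "NFC")
assert not is_normalized_via_normalize(decomposed, "NFC")
```

A native implementation would presumably follow the same shape: call the matching `NSString` mapping method and compare the result with the input.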

### ICU4X

Contributor

I'd love to see a paragraph about the high-level goals for globalization dependencies in the future. A few ideas of what could be covered:

  • Will ICU be always required dependency
  • What pay-as-you-go schema are we shooting for
  • Will we add any tooling/logic to collect used cultures
  • etc.

Member Author

It would be better if we first agree on the goals. The discussion in #225 (comment) is about that, and I'd love to hear your opinion on Jan's suggestion.


Interestingly, there is a newly launched open-source project called [ICU4X](https://github.com/unicode-org/icu4x#icu4x----) that provides globalization support libraries (similar to ICU). ICU4X has exactly the goals we need to achieve for our concerned scenarios.

```
ICU4X will provide an ECMA-402-compatible API surface in the target client-side platforms, including the web platform, iOS, Android, WearOS, WatchOS, Flutter, and Fuchsia, supported in programming languages including Rust, JavaScript, Objective-C, Java, Dart, and C++.

The design goals of ICU4X are:
- Small and modular code
- Pluggable locale data
- Availability and ease of use in multiple programming languages
- Written by i18n experts to encourage best practices
```

That is exactly what we need (or what we are trying to achieve).
Eric Erhardt already did some investigation and arranged a meeting with the project committee to learn more about the project and its status. Getting in touch and learning more about the project was very helpful. It is a promising project, managed by experts, and it uses CLDR data, which means it still adheres to the standards.

The catch is that ICU4X is still in its early stages and does not yet provide all the functionality .NET needs for globalization support. The project committee is willing to take and prioritize requests from different parties. In the meeting, we already communicated which ICU4C functionality .NET currently uses and wants to see supported in ICU4X. For example, collation support is one of the topics we brought to the committee's attention.

### Plan Suggestion

ICU4X is the promising path we should pursue to address all concerned scenarios. We can wait a little longer for the missing functionality to be implemented in the project and then integrate it into .NET, as we did with ICU4C.

Contributor

I'd like to see a more detailed analysis of where ICU4X can actually help. From the current description, we are trading one generic library for a different generic library.

Member Author

@tarekgh tarekgh Jun 8, 2021

Don't the listed ICU4X goals cover that?

> The design goals of ICU4X are:
> - Small and modular code
> - Pluggable locale data
> - Availability and ease of use in multiple programming languages
> - Written by i18n experts to encourage best practices

I'll add a paragraph describing the differences and gain we get when using ICU4X

Contributor

> I'll add a paragraph describing the differences and gain we get when using ICU4X

It'd be also helpful where their goals align with ours and where don't


> It'd be also helpful where their goals align with ours and where don't

(ICU4X core team member) - our goals can be fine tuned to match the needs of the industry. The project is young and we're intentionally flexible. I'd love to see such analysis, but I'd also like to suggest that once you identify any misalignments, consider discussing them with us for potential re-alignment by extension of our goals to supply your needs.

Member

If we plan to be a consumer of ICU4X, should we also plan on being an investor? Where are the resources for ICU4X coming from?


ICU4X current effort is driven by engineers from Google and Mozilla under the umbrella of a Unicode Charter.

We're curating the project for easy onboarding and put effort to stay inclusive for new contributors.
You can find the PRD and 1.0 Roadmap and if you'd like to contribute anything else, we'd be excited to work with you and incorporate it into our planning.


Member

The reason for doing this is performance / size. As such, this should be heavily backed by numbers. I think the priority should be to get estimates for where we may land by doing this work.

  • The code size for full .NET globalization support: current stripped ICU vs. estimate for ICU4X

  • The data size for full .NET globalization support: current stripped ICU vs. estimate for ICU4X

  • The code modularity: pick a flagship scenario for code modularity, and show current vs. estimate for ICU4X

  • The data modularity: pick a flagship scenario for data modularity, and show current vs. estimate for ICU4X

Member

> estimate for ICU4X

Unfortunately, this is close to impossible at the current time. The features we need just aren't there. So getting any sort of size estimate is purely guessing. We would need to implement the features first.

Member

Do they have some interesting large chunk implemented that we can triangulate with the same chunk of classic ICU?

It is hard to tell whether ICU4X is a promising path without any numbers.

Member

Looking through the features on their releases, DateTime formatting might be a candidate (that's what I was trying to prototype against a few months ago), but we don't really use ICU for the formatting logic, just to get the data.

@CoffeeFlux, Jun 8, 2021

Yeah, I think it might be possible to get numbers for something but I'm not sure they would be very meaningful at this point. The ICU4X project, given its current state, is inherently speculative.

Member

> @jkotas is Blazor in the first category for you?

Yes, I would expect a typical Blazor app to be in the first category.

Hi, this is Shane from the ICU4X team. I'm happy to get you ballpark figures for code+data size of ICU4C vs ICU4X.

@zbraniecki, Jun 11, 2021

The Mozilla I18n team is currently running early-signal performance testing of ICU4C vs ICU4X for our JS engine and internal needs. We hope to be able to provide some integration experience and performance/memory numbers within a month. If you're interested, we will be hosting a Deep Dive session in July where we plan to discuss the results (we'll also post them in the ICU4X repository).

Thanks! And, to follow up from above, I posted some ballpark code size figures here:

unicode-org/icu4x#788

Member

> @danroth27 can you share more about the globalization scenarios you expect from Blazor users? How many will care enough about size to want to have less than full globalization? And for those, how many would be satisfied with minimal/no globalization support vs configuring some custom subset if available.

Web developers are generally very sensitive to app download size, and front-end frameworks compete to be as small as possible. So in general I believe web devs will want just enough functionality to run their app and nothing more. There is a subset of web scenarios where app download size is less important, but it is the minority scenario. I suspect there are a significant number of web scenarios that could get by with just invariant mode and some lightweight globalization helpers based on browser platform APIs, but @jkotas is probably right that the majority of web apps need broad globalization support, at least enough to handle their target audiences.

- We need to look more closely at which functionality is currently supported and which is missing, so we have the full list of features we need.
- We need to stay in touch with the ICU4X committee to communicate our requests and understand when such features can be available.
- We need to see whether we can allocate some resources to help with that project, especially with the missing features we need in .NET.
- We need to look at the scope of work to integrate ICU4X into the .NET runtime.
- We need to see whether we can have a process that creates different NuGet packages from the ICU4X repo with data and code customization.

Member

ICU4X is written in Rust. I am sure that it will create its share of interesting toolchain problems.

We put together a minimal proof of concept calling into icu4x from wasm, so it is at least possible (though we ran into a solvable problem with separate allocators fighting each other).

Member

> I am sure that it will create its share of interesting toolchain problems.

It does - I can confirm that from my prototyping.

One mitigation is that we can build the pieces of ICU4X separately from dotnet/runtime, just like how we consume ICU4C today.

For WASM - a problem I immediately hit was memory allocator differences between ICU4X and mono. We would need to solve that issue in order to use any Rust library in Blazor WASM.

Member Author

@eerhardt I recall that when we talked to the ICU4X committee, they mentioned we can customize the memory allocator too. If that's true, we have a way to make this work, right?

Tarek, that's correct. Manish mentioned a way to tell it to use the global allocator, but I don't think anyone has actually spent more time on the prototype to actually validate that. I wouldn't consider it a major concern.
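
The allocator mismatch discussed above can be illustrated with a minimal C sketch. It assumes a native library that lets the host install its own allocation hooks, so that every buffer is allocated and freed by the same allocator. All names here (`host_allocator`, `lib_set_allocator`, `lib_duplicate`, `lib_release`, and the counting helpers) are hypothetical; they are not ICU4X, Mono, or dotnet/runtime APIs, and how ICU4X actually exposes allocator selection has not been validated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pair of hooks a host runtime could hand to a native library. */
typedef struct {
    void *(*alloc)(size_t size);
    void (*release)(void *ptr);
} host_allocator;

/* The library defaults to the C allocator until the host installs its own. */
static host_allocator g_alloc = { malloc, free };

/* Hypothetical entry point: must be called before the library allocates. */
void lib_set_allocator(host_allocator a) {
    assert(a.alloc != NULL && a.release != NULL);
    g_alloc = a;
}

/* Hypothetical library call that returns memory owned by the caller. */
char *lib_duplicate(const char *text) {
    size_t n = strlen(text) + 1;
    char *copy = g_alloc.alloc(n);
    if (copy != NULL) {
        memcpy(copy, text, n);
    }
    return copy;
}

/* Frees with the same allocator that produced the buffer. */
void lib_release(char *ptr) {
    g_alloc.release(ptr);
}

/* Example host allocator that counts live blocks (stands in for the runtime's). */
static size_t g_live_blocks = 0;

static void *counting_alloc(size_t size) {
    void *p = malloc(size);
    if (p != NULL) {
        g_live_blocks++;
    }
    return p;
}

static void counting_release(void *ptr) {
    if (ptr != NULL) {
        g_live_blocks--;
    }
    free(ptr);
}

host_allocator counting_host_allocator(void) {
    host_allocator a = { counting_alloc, counting_release };
    return a;
}

size_t host_live_blocks(void) {
    return g_live_blocks;
}
```

The point of the design is only that allocation and release always go through one pair of hooks; whether that is achievable for a Rust library inside Blazor WASM is exactly what the prototype would need to confirm.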


.NET 7.0 would be the best release in which to start investing in that and to begin consuming at least the available parts of ICU4X.

### Alternative Plans

Contributor

Are there other alternatives? What about mixing native APIs with custom code/data, or using custom compression for the data, as other options?

Member Author

@tarekgh, Jun 8, 2021

What do you mean by custom comparison? Do you mean we implement the Unicode Collation Algorithm ourselves? I would strongly oppose that, as it is complicated and would need a huge effort to maintain. ICU spent a long time refining it to get it to its current shape. Or do you have other ideas?

Member

"custom compression", not "custom comparison".

I was actually wondering whether the data are compressed and how. Maybe that would explain the difference in data file sizes between the Android and iOS builds, if a different compression level or library was used.

Mixing native APIs (as mentioned by @spouliot above) may actually be a path to explore. Sure, you may get slightly different behavior, but the difference will likely be smaller than with the NLS->ICU switch, since the underlying iOS APIs use the ICU data.

Member Author

Oh, thank you. I misread it. Sorry about that.


Other ideas would be similar to what ICU4X is trying to do, but perhaps with a narrower scope. Here are the options we can try if we don't go with ICU4X:

- Invest more in ICU4C trimming. That would require spending more resources on code analysis to figure out how to trim the ICU code to a level that satisfies the size requirements.

Contributor

I think we can reject that as an alternative plan. We already spent months of work on this with no measurable improvements.

- Write a code wrapper around the CLDR or ICU data, so that we do not use ICU code to access the data. This can work for locale data, but I don't think it is a good option for other functionality like collation, where the code is more complicated and not easy to re-implement.

ICU is part of the Mac ecosystem. However, we cannot use it directly: it's considered private (and using it would result in app rejections).

However, it means that for macOS (including Catalyst), iOS, tvOS (and even watchOS), Apple provides its own APIs built on top of ICU (code and/or data).

It might be possible to replace the globalization code to use those APIs (for the mentioned OSes) instead of providing both ICU code and data.

Since most of the code (and data) is part of the OS then the resulting app size should be reduced and closer to the "legacy" numbers. E.g. xamarin/xamarin-macios#10249 (comment)

I believe the problem with this tends to be APIs not lining up or returning different data from what we've established as the standard on other platforms?

Member Author

If the Mac APIs ultimately depend on ICU, then I expect the behavior will be close to what we get when using ICU directly.

@CoffeeFlux do you have more information about the differences?

Member

@filipnavara, Jun 8, 2021

I had a very rough look at how the Cocoa API surface maps to the System.Globalization.Native API surface. It seems to be quite promising even if not a full match. The missing parts could probably be filled in with custom data.


GlobalizationNative_NormalizeString

  • NSString decomposedStringWithCanonicalMapping (form D)
  • NSString decomposedStringWithCompatibilityMapping (form KD)
  • NSString precomposedStringWithCanonicalMapping (form C)
  • NSString precomposedStringWithCompatibilityMapping (form KC)

GlobalizationNative_IsNormalized

  • ?? (could be mapped to GlobalizationNative_NormalizeString but inefficient)

GlobalizationNative_WindowsIdToIanaId [Not supported on app-local iOS/WASM ICU]
GlobalizationNative_IanaIdToWindowsId [Not supported on app-local iOS/WASM ICU]

GlobalizationNative_GetTimeZoneDisplayName [Not supported on app-local iOS/WASM ICU]

  • NSTimeZone name

GlobalizationNative_GetLocaleInfoString

  • NSLocale
    • LocaleString_LocalizedDisplayName => localizedStringForLocaleIdentifier
    • LocaleString_EnglishDisplayName => localeIdentifier / localizedStringForLocaleIdentifier?
    • LocaleString_NativeDisplayName => localeIdentifier / localizedStringForLocaleIdentifier?
    • LocaleString_LocalizedLanguageName => localizedStringForLanguageCode
    • LocaleString_EnglishLanguageName => localizedStringForLanguageCode
    • LocaleString_NativeLanguageName => localizedStringForLanguageCode
    • LocaleString_EnglishCountryName => localizedStringForCountryCode
    • LocaleString_NativeCountryName => localizedStringForCountryCode
    • LocaleString_DecimalSeparator => decimalSeparator
    • LocaleString_ThousandSeparator => groupingSeparator
    • LocaleString_Digits => ?
    • LocaleString_MonetarySymbol => currencySymbol
    • LocaleString_CurrencyEnglishName => localizedStringForCurrencyCode
    • LocaleString_CurrencyNativeName => localizedStringForCurrencyCode
    • LocaleString_Iso4217MonetarySymbol => currencySymbol (?)
    • LocaleString_MonetaryDecimalSeparator => ?
    • LocaleString_MonetaryThousandSeparator => ?
    • LocaleString_AMDesignator => calendarIdentifier -> NSCalendar AMSymbol
    • LocaleString_PMDesignator => calendarIdentifier -> NSCalendar PMSymbol
    • LocaleString_PositiveSign => ?
    • LocaleString_NegativeSign => ?
    • LocaleString_Iso639LanguageTwoLetterName => languageCode
    • LocaleString_Iso639LanguageThreeLetterName => ?
    • LocaleString_Iso3166CountryName => countryCode (?)
    • LocaleString_Iso3166CountryName2 => ?
    • LocaleString_NaNSymbol => ?
    • LocaleString_PositiveInfinitySymbol => ?
    • LocaleString_ParentName => ?
    • LocaleString_PercentSymbol => ?
    • LocaleString_PerMilleSymbol => ?

GlobalizationNative_GetLocaleTimeFormat

  • NSDateFormatter?

GlobalizationNative_GetLocaleInfoInt

  • NSLocale

GlobalizationNative_GetLocaleInfoGroupingSizes

  • ??

GlobalizationNative_GetLocales
GlobalizationNative_GetLocaleName
GlobalizationNative_GetDefaultLocaleName
GlobalizationNative_IsPredefinedLocale

  • NSLocale

GlobalizationNative_ToAscii
GlobalizationNative_ToUnicode

  • No equivalent API for IDN/Punycode?

GlobalizationNative_GetSortHandle

  • return reference to NSLocale

GlobalizationNative_CloseSortHandle

  • release reference to NSLocale

GlobalizationNative_GetSortVersion

  • ??

GlobalizationNative_CompareString

GlobalizationNative_IndexOf

GlobalizationNative_LastIndexOf

  • same as GlobalizationNative_IndexOf w/ NSBackwardsSearch

GlobalizationNative_StartsWith

  • can be implemented through GlobalizationNative_CompareString?

GlobalizationNative_EndsWith

  • can be implemented through GlobalizationNative_CompareString?

GlobalizationNative_GetSortKey

  • wcsxfrm_l?

GlobalizationNative_ChangeCase

  • NSString uppercaseStringWithLocale
  • NSString lowercaseStringWithLocale

GlobalizationNative_ChangeCaseInvariant

  • NSString uppercaseString
  • NSString lowercaseString

GlobalizationNative_ChangeCaseTurkish

  • Implemented in S.G.N using u_tolower/u_toupper, so easy to replicate

GlobalizationNative_InitOrdinalCasingPage

  • Implemented in S.G.N using u_toupper, so easy to replicate

GlobalizationNative_GetCalendars
GlobalizationNative_GetCalendarInfo
GlobalizationNative_EnumCalendarInfo
GlobalizationNative_GetLatestJapaneseEra
GlobalizationNative_GetJapaneseEraStartDate

  • NSCalendar (TBD: details)
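
The `GlobalizationNative_ChangeCaseTurkish` entry above is described as built on `u_tolower`/`u_toupper` and easy to replicate. A minimal C sketch of that idea follows, assuming the only Turkish-specific rules are the dotted/dotless I pair. `simple_upper`/`simple_lower` are illustrative stand-ins for the ICU calls (ASCII plus the two default mappings for U+0130 and U+0131), `change_case_turkish` is a hypothetical helper, and working on whole code points rather than UTF-16 units is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ICU's u_toupper: ASCII letters plus dotless i (U+0131) -> 'I'. */
static uint32_t simple_upper(uint32_t cp) {
    if (cp >= 0x0061 && cp <= 0x007A) {
        return cp - 0x20;
    }
    return cp == 0x0131 ? 0x0049 : cp;
}

/* Stand-in for ICU's u_tolower: ASCII letters plus dotted I (U+0130) -> 'i'. */
static uint32_t simple_lower(uint32_t cp) {
    if (cp >= 0x0041 && cp <= 0x005A) {
        return cp + 0x20;
    }
    return cp == 0x0130 ? 0x0069 : cp;
}

/* Turkish rule: 'i' (U+0069) uppercases to dotted capital I (U+0130). */
static uint32_t turkish_upper(uint32_t cp) {
    return cp == 0x0069 ? 0x0130 : simple_upper(cp);
}

/* Turkish rule: 'I' (U+0049) lowercases to dotless small i (U+0131). */
static uint32_t turkish_lower(uint32_t cp) {
    return cp == 0x0049 ? 0x0131 : simple_lower(cp);
}

/* Hypothetical helper: maps len code points from src into dst. */
void change_case_turkish(const uint32_t *src, uint32_t *dst, size_t len, int to_upper) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        dst[i] = to_upper ? turkish_upper(src[i]) : turkish_lower(src[i]);
    }
}
```

On Apple platforms the stand-ins would have to be replaced by whatever default casing source is chosen there; the sketch only shows that the Turkish part is a two-case override on top of it.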

Member Author

That is good input to have. This path is worth exploring further.

Member

Most importantly it seems to cover the collations which are hard to implement. The things I could not find at first glance seem to be simple data retrieval.

Member Author

@filipnavara I'll add a section to the doc for the option of using the Cocoa APIs on Apple OSes.

> It might be possible to replace the globalization code to use those API (for the mentioned OS) instead of providing both ICU code and data.

Based on Mozilla's experience, one consideration I must share is that doing so will lead to your software having diverging capabilities between platforms where you use an ICU4C that you vendor in and platforms where you rely on the OS one.

The ICU API does change. In particular, active large-scale software often goes through a cycle where a need for an ICU4C change is identified, requested, and supplied; then, if ICU4C can be vendored in, the next release contains the new code. If you rely on the OS ICU, that path may stretch to many years of waiting, which you may only recognize as a drag factor late, once you are affected.

Member Author

@tarekgh, Jun 11, 2021

@zbraniecki I agree with your consideration. What we are trying to do here is weigh the options and look at the pros and cons of each. But in general, if consistency across platforms and OSes is an available option at a reasonable cost for the scenarios we are targeting (e.g. mobile and WebAssembly scenarios), then that should be the option we go with.

Contributor

Can this be expanded beyond a one-line description with a comparison against the other plans?


Any option we choose here will need an automated process to extract the needed data and code and pack it in a NuGet package. Also, whichever plan we choose would be targeted at the next release, as this is not trivial work to do for .NET 6.0.