Globalization Size #225
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,235 @@ | ||
# Globalization and Size on Disk | ||
|
||
There are ongoing discussions about globalization support in .NET, mainly about the level of globalization functionality and the size on disk that this functionality consumes. Various ideas have been floated for addressing the issues raised. | ||
The goal of this document is to list these issues and get all parties to agree on the problem. It also suggests different ways to address the issues, which may help form the final plan. | ||
|
||
### Background | ||
|
||
.NET supports globalization functionality such as date and number formatting, calendars, locale data, string collation, string casing, string normalization, and Internationalized Domain Names (IDN). | ||
Almost all of this functionality needs globalization data, plus code that handles the data to perform the expected operation. | ||
|
||
Since .NET Framework v4.0, the framework has mainly depended on the underlying operating system to perform such operations. Before v4.0, the framework carried its own globalization data and the code to handle it (e.g. collation algorithms). Carrying the globalization data was a big maintenance and servicing burden on the framework, while the operating systems and standards (e.g. Unicode/ICU) already provided everything we need. | ||
|
||
In .NET Core, we continued with the same strategy. When running on Windows we depended on the NLS Win32 APIs, and when running on other platforms (e.g. Linux, macOS) we depended on the [ICU](https://github.com/unicode-org/icu) library. | ||
|
||
In .NET 5.0 we went one step further and added [support for using the ICU library](https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu) when running on Windows, and we made this the default behavior. We also added the [ICU app-local](https://docs.microsoft.com/en-us/dotnet/standard/globalization-localization/globalization-icu#app-local-icu) feature, which allows applications to use a different ICU version than the system-wide one if needed. | ||
|
||
In .NET Core, we have also supported the [Globalization Invariant Mode](https://github.com/dotnet/runtime/blob/main/docs/design/features/globalization-invariant-mode.md), which applications that don't care much about globalization can opt into. The benefit of enabling this mode is avoiding any dependency on the system globalization support or on ICU. Invariant mode was mainly introduced to support smaller container images (e.g. the Alpine Linux image). Another benefit is consistent app behavior across different OS's and platforms. | ||
|
||
### Concerned Scenarios | ||
|
||
Although globalization support works fine for desktop and server applications, it is a concern on other platforms, mainly because it requires installing the ICU library. The ICU module sizes are concerning and are not acceptable, especially on mobile platforms. | ||
|
||
``` | ||
ICU module sizes from MS-ICU Linux x64 package: | ||
|
||
libicudata.so.68.2.0.6 28,320,992 | ||
libicui18n.so.68.2.0.6 4,220,936 | ||
libicuuc.so.68.2.0.6 2,479,488 | ||
|
||
Thanks to @filipnavara and @CoffeeFlux for providing the following info: | ||
|
||
iOS arm64: | ||
icudt.dat 1,713,152 | ||
icudt_CJK.dat 1,006,224 | ||
icudt_EFIGS.dat 602,096 | ||
icudt_no_CJK.dat 1,271,296 | ||
libicudata.a 720 | ||
libicui18n.a 3,536,472 | ||
libicuuc.a 2,286,376 | ||
|
||
Android arm64: | ||
icudt.dat 1,512,896 | ||
icudt_CJK.dat 966,080 | ||
icudt_EFIGS.dat 559,568 | ||
icudt_no_CJK.dat 1,082,128 | ||
libicudata.a 1,116 | ||
libicui18n.a 7,105,274 | ||
libicuuc.a 4,267,870 | ||
|
||
Browser WASM: | ||
icudt.dat 1,512,896 | ||
icudt_CJK.dat 966,080 | ||
icudt_EFIGS.dat 559,568 | ||
icudt_no_CJK.dat 1,082,128 | ||
libicui18n.a 4,077,896 | ||
libicuuc.a 2,510,674 | ||
|
||
The native ICU bits end up being about 380KB compressed on wasm. | ||
|
||
``` | ||
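To put the listing above in perspective, the per-platform totals can be added up. The following is a minimal sketch, assuming that `icudt.dat` is the full data file and that the `_CJK`, `_EFIGS`, and `_no_CJK` files are alternative subsets, so only one data file is counted per platform. The static library sizes are pre-link sizes, so the final contribution to an app may be smaller. The `FOOTPRINT` table and the `total_bytes` helper are illustrative names, not part of any .NET tooling.

```python
# Sizes in bytes, copied from the listing above.
# Assumption: only the full data file (icudt.dat) is counted for the
# mobile and WASM builds; the other icudt_*.dat files are alternatives.
FOOTPRINT = {
    "Linux x64 (MS-ICU)": {
        "libicudata.so": 28_320_992,
        "libicui18n.so": 4_220_936,
        "libicuuc.so": 2_479_488,
    },
    "iOS arm64": {
        "icudt.dat": 1_713_152,
        "libicudata.a": 720,
        "libicui18n.a": 3_536_472,
        "libicuuc.a": 2_286_376,
    },
    "Android arm64": {
        "icudt.dat": 1_512_896,
        "libicudata.a": 1_116,
        "libicui18n.a": 7_105_274,
        "libicuuc.a": 4_267_870,
    },
    "Browser WASM": {
        "icudt.dat": 1_512_896,
        "libicui18n.a": 4_077_896,
        "libicuuc.a": 2_510_674,
    },
}


def total_bytes(platform: str) -> int:
    """Sum the sizes of all modules listed for one platform."""
    return sum(FOOTPRINT[platform].values())


def to_mib(size_in_bytes: int) -> float:
    """Convert bytes to mebibytes (1 MiB = 1,048,576 bytes)."""
    return size_in_bytes / (1024 * 1024)


if __name__ == "__main__":
    for name in FOOTPRINT:
        size = total_bytes(name)
        print(f"{name:20} {size:>12,} bytes  ({to_mib(size):.1f} MiB)")
```

Under these assumptions the Linux x64 package comes to roughly 33 MiB, almost all of it data, while the trimmed mobile and WASM builds are dominated by code rather than data.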
|
||
- **Mobile Platforms**: | ||
- **Android**: Although the Android OS comes with ICU libraries that .NET can use, OEMs can choose not to include these libraries in their device images. | ||
- **iOS/MacCatalyst**: These platforms don't come with an ICU library that .NET can use, so .NET has to provide an ICU package for them. | ||
- **WebAssembly**: Clients have to be small and run inside the browser, which does not allow access to any libraries outside the browser. That means WebAssembly clients need to include an ICU package for globalization support. | ||
Review comment: To what extent do browsers provide globalization functionality that we could leverage without having to include it in the app?
Review comment: @lewing you may have some more info about that, as you have investigated such options before. Could you please share what you found back then?
tarekgh marked this conversation as resolved.
|
||
- Users of **Alpine Linux containers** that enable `Globalization Invariant Mode` ask for better globalization support without increasing the image size much. | ||
Review comment: What about self-contained console apps, is it always desired to bundle full ICU?
Review comment: It depends on the app scenario. If the app doesn't need globalization features, it can enable Invariant mode. If it does need them, it needs to bundle the ICU package. This is like what we have today.
Review comment: Right, which makes even a trivial self-contained Console.WriteLine app over 30MB bigger
Review comment: Why do you think a trivial self-contained Console.WriteLine app will need globalization in the first place?
Review comment: Because it's the default and we don't make it in any way easy for developers to move away from it.
Review comment: What would you like to see happen to make this easy?
Review comment: I can see several things we could do better
|
||
|
||
### Current Status | ||
|
||
In .NET 5.0/6.0 we introduced the ICU app-local feature, which allows applications to publish any ICU package as part of the app. To solve the size issue for mobile apps and WebAssembly clients, the Xamarin team worked to create size-trimmed ICU packages targeting different mobile platforms and WebAssembly clients. Although this helped us move forward, it still takes a lot of effort and more needs to be done. It became clear that trimming ICU4C is not easy. It is not as simple as trimming the data: the ICU code is also big and needs trimming too, and that is not a simple task. Code that was thought to be isolated and was trimmed caused problems later because other parts used it. It looks like full code analysis is needed to ensure that whatever we trim is safe to remove and not needed by other parts. | ||
|
||
One suggestion is to enhance the Globalization Invariant mode so that it provides better globalization support. We are already tracking some of this work for .NET 6.0 in this [issue](https://github.com/dotnet/runtime/issues/43774). Looking at the scenarios we want to enhance or support, I am not convinced enhancing the globalization mode is the path we need to pursue. The reason is that, to address all concerned scenarios, we need to offer a range of trimmed data and functionality options. For example, we need to offer packages for one culture or a group of cultures: European-only cultures, CJK cultures, Bidi cultures, the full list of cultures, etc. We also need to provide functionality options, e.g. cultural collation support, normalization support, IDN support, casing support, etc. If we can do that, users have the freedom to select the level of functionality they want, and the size cost of every choice is clear. To support that, it makes sense to follow the same ICU app-local idea rather than just adding more functionality to the Invariant mode. It is also better to keep the definition of Globalization Invariant Mode clear and unambiguous: this mode is for applications that don't need globalization support, and not for anything else. | ||
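The idea of letting users pick a culture group with a known size cost can be sketched as a small selection routine. This is only an illustration: the file names and sizes are the Browser WASM data files from the listing earlier in this document, but the culture coverage of each file (`EFIGS`, `CJK`, and the "everything except CJK" rule) is an assumption made for the example, and `smallest_data_file` is a hypothetical helper, not an existing API.

```python
# Assumed language groups; the real contents of each data file may differ.
EFIGS = {"en", "fr", "it", "de", "es"}
CJK = {"zh", "ja", "ko"}

# (file name, size in bytes, predicate telling whether a language is covered)
DATA_FILES = [
    ("icudt_EFIGS.dat", 559_568, lambda lang: lang in EFIGS),
    ("icudt_CJK.dat", 966_080, lambda lang: lang == "en" or lang in CJK),
    ("icudt_no_CJK.dat", 1_082_128, lambda lang: lang not in CJK),
    ("icudt.dat", 1_512_896, lambda lang: True),
]


def smallest_data_file(cultures):
    """Return (file name, size) of the smallest file covering all cultures.

    Culture names such as "fr-FR" are reduced to their language part ("fr").
    """
    languages = {culture.split("-")[0].lower() for culture in cultures}
    candidates = [
        (size, name)
        for name, size, covers in DATA_FILES
        if all(covers(lang) for lang in languages)
    ]
    # The full data file covers everything, so candidates is never empty.
    size, name = min(candidates)
    return name, size


if __name__ == "__main__":
    for wanted in (["en-US", "fr-FR"], ["ja-JP"], ["ar-SA"], ["ja-JP", "ar-SA"]):
        print(wanted, "->", smallest_data_file(wanted))
```

A real design would also have to cover the functionality axis (collation, normalization, IDN, casing), which this sketch leaves out.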
Review comment: How does one tell whether or not the app and its dependencies need the different functionality options, like normalization support, etc.? It is nice to give full control to users; but we also need to have guidance on how to take advantage of this support with confidence.
Review comment: This is the key. It's not obvious to developers when the various ICU components are used, so I think we will want some idea of how to actually go about changing that.
Review comment: That makes sense:
I'll add this info to the doc
Review comment: The problem is NuGet packages. How does the user know what the NuGet package needs? For example, Uri parsing needs FormC normalization and I believe that it is slightly broken with invariant mode today. How does one discover situations like this without trial and error?
Review comment: Another simple case to answer: Does Maui need collation? If basic Maui does not need collation, are there Maui controls that do need collation?
Review comment: The only way to do that is documenting what could be broken when excluding any functionality. Maybe in the NuGet spec files themselves. The other idea would be to have only 2 types of packages. One with the full functionality and the other without, i.e. the second would be missing collation, normalization, etc. That would be closer to Invariant mode except the package will carry some locale data. Unfortunately, I don't have solid data, so we may delay support for cutting functionality till we get concrete requests for that.
Review comment: I believe UI apps in general will need globalization support. Usually, apps have user-facing controls which have a list sorted in the user's language. Formatting and displaying dates and numbers is common there too. In general, I think Maui would be the first-class platform needing globalization.
Review comment: These options have been available in Xamarin for a while. Do we have any data about how often app developers choose to leverage them? How common is it for Xamarin app developers to ship different builds of the app for different regions?
Review comment: @marek-safar do we have such data? Or do you know if anyone collected such data before?
Review comment: Does this mean this document is suggesting / your recommendation is that we close dotnet/runtime#43774 and not do anything around improving the invariant mode? Having read the rest of the doc, I understand the recommendation is to pursue ICU4X for .NET 7, but I can't discern whether that's instead of or in addition to work for .NET 6.
Review comment: My understanding of the plan for .NET 6 was (at a minimum) to complete this part of dotnet/runtime#43774:
Review comment: I think the meaning of "I am not convinced enhancing the globalization mode is the path we need to pursue." was about adding incremental features trying to get a hybrid between "Invariant mode" and "full globalization support". For example:
I think if we want to pursue some of these hybrid features, we would need user data/evidence that shows they would be useful to people.
Review comment: I meant not trying to add more globalization features to Invariant mode, and that includes not adding even locale data. If we need to add locale data, we'll need to pack it in NuGet packages (as we need to avoid the servicing and maintainability burden of the data). If we use NuGet packages to support the data, that means we are really using the app-local feature and there is no reason to use Invariant mode in the first place. I believe the ask in dotnet/runtime#43774 is still reasonable for invariant mode. At least the first part, as casing is not really core globalization so much as a string manipulation feature. For the second part, we will support it with a config switch which can be enabled only on the platforms that need it. That is the discussion we had.
|
||
|
||
### Apple OS's | ||
|
||
Although we cannot access ICU directly on iOS or MacCatalyst, we may explore using the OS Cocoa APIs directly. If this idea succeeds, we may consider it on macOS too, not only on iOS and MacCatalyst. | ||
|
||
@filipnavara has done an initial scan of the available APIs, and here is the info: | ||
|
||
``` | ||
GlobalizationNative_NormalizeString | ||
- NSString decomposedStringWithCanonicalMapping (form D) | ||
- NSString decomposedStringWithCompatibilityMapping (form KD) | ||
- NSString precomposedStringWithCanonicalMapping (form C) | ||
- NSString precomposedStringWithCompatibilityMapping (form KC) | ||
|
||
GlobalizationNative_IsNormalized | ||
- ?? (could be mapped to GlobalizationNative_NormalizeString but inefficient) | ||
|
||
GlobalizationNative_WindowsIdToIanaId [Not supported on app-local iOS/WASM ICU] | ||
GlobalizationNative_IanaIdToWindowsId [Not supported on app-local iOS/WASM ICU] | ||
|
||
GlobalizationNative_GetTimeZoneDisplayName [Not supported on app-local iOS/WASM ICU] | ||
- NSTimeZone name | ||
|
||
GlobalizationNative_GetLocaleInfoString | ||
- NSLocale | ||
- LocaleString_LocalizedDisplayName => localizedStringForLocaleIdentifier | ||
- LocaleString_EnglishDisplayName => localeIdentifier / localizedStringForLocaleIdentifier? | ||
- LocaleString_NativeDisplayName => localeIdentifier / localizedStringForLocaleIdentifier? | ||
- LocaleString_LocalizedLanguageName => localizedStringForLanguageCode | ||
- LocaleString_EnglishLanguageName => localizedStringForLanguageCode | ||
- LocaleString_NativeLanguageName => localizedStringForLanguageCode | ||
- LocaleString_EnglishCountryName => localizedStringForCountryCode | ||
- LocaleString_NativeCountryName => localizedStringForCountryCode | ||
- LocaleString_DecimalSeparator => decimalSeparator | ||
- LocaleString_ThousandSeparator => groupingSeparator | ||
- LocaleString_Digits => ? | ||
- LocaleString_MonetarySymbol => currencySymbol | ||
- LocaleString_CurrencyEnglishName => localizedStringForCurrencyCode | ||
- LocaleString_CurrencyNativeName => localizedStringForCurrencyCode | ||
- LocaleString_Iso4217MonetarySymbol => currencySymbol (?) | ||
- LocaleString_MonetaryDecimalSeparator => ? | ||
- LocaleString_MonetaryThousandSeparator => ? | ||
- LocaleString_AMDesignator => calendarIdentifier -> NSCalendar AMSymbol | ||
- LocaleString_PMDesignator => calendarIdentifier -> NSCalendar PMSymbol | ||
- LocaleString_PositiveSign => ? | ||
- LocaleString_NegativeSign => ? | ||
- LocaleString_Iso639LanguageTwoLetterName => languageCode | ||
- LocaleString_Iso639LanguageThreeLetterName => ? | ||
- LocaleString_Iso3166CountryName => countryCode (?) | ||
- LocaleString_Iso3166CountryName2 => ? | ||
- LocaleString_NaNSymbol => ? | ||
- LocaleString_PositiveInfinitySymbol => ? | ||
- LocaleString_ParentName => ? | ||
- LocaleString_PercentSymbol => ? | ||
- LocaleString_PerMilleSymbol => ? | ||
|
||
GlobalizationNative_GetLocaleTimeFormat | ||
- NSDateFormatter? | ||
|
||
GlobalizationNative_GetLocaleInfoInt | ||
- NSLocale | ||
|
||
GlobalizationNative_GetLocaleInfoGroupingSizes | ||
- ?? | ||
|
||
GlobalizationNative_GetLocales | ||
GlobalizationNative_GetLocaleName | ||
GlobalizationNative_GetDefaultLocaleName | ||
GlobalizationNative_IsPredefinedLocale | ||
- NSLocale | ||
|
||
GlobalizationNative_ToAscii | ||
GlobalizationNative_ToUnicode | ||
- No equivalent API for IDN/Punycode? | ||
|
||
GlobalizationNative_GetSortHandle | ||
- return reference to NSLocale | ||
|
||
GlobalizationNative_CloseSortHandle | ||
- release reference to NSLocale | ||
|
||
GlobalizationNative_GetSortVersion | ||
- ?? | ||
|
||
GlobalizationNative_CompareString | ||
- [NSString compare:options:range:locale:](https://developer.apple.com/documentation/foundation/nsstring/1414561-compare?language=objc) | ||
|
||
GlobalizationNative_IndexOf | ||
- [NSString rangeOfString:options:range:locale:](https://developer.apple.com/documentation/foundation/nsstring/1417348-rangeofstring?language=objc) | ||
|
||
GlobalizationNative_LastIndexOf | ||
- same as GlobalizationNative_IndexOf w/ NSBackwardsSearch | ||
|
||
GlobalizationNative_StartsWith | ||
- can be implemented through GlobalizationNative_CompareString? | ||
|
||
GlobalizationNative_EndsWith | ||
- can be implemented through GlobalizationNative_CompareString? | ||
|
||
GlobalizationNative_GetSortKey | ||
- wcsxfrm_l? | ||
|
||
GlobalizationNative_ChangeCase | ||
- NSString uppercaseStringWithLocale | ||
- NSString lowercaseStringWithLocale | ||
|
||
GlobalizationNative_ChangeCaseInvariant | ||
- NSString uppercaseString | ||
- NSString lowercaseString | ||
|
||
GlobalizationNative_ChangeCaseTurkish | ||
- Implemented in S.G.N using u_tolower/u_toupper, so easy to replicate | ||
|
||
GlobalizationNative_InitOrdinalCasingPage | ||
- Implemented in S.G.N using u_toupper, so easy to replicate | ||
|
||
GlobalizationNative_GetCalendars | ||
GlobalizationNative_GetCalendarInfo | ||
GlobalizationNative_EnumCalendarInfo | ||
GlobalizationNative_GetLatestJapaneseEra | ||
GlobalizationNative_GetJapaneseEraStartDate | ||
- NSCalendar (TBD: details) | ||
``` | ||
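The `GlobalizationNative_IsNormalized` fallback noted in the list above (normalize the string, then compare it with the input) can be sketched as follows. This is an illustration only: it uses Python's standard `unicodedata.normalize` as a stand-in for the `NSString` mapping methods, and `is_normalized_via_normalize` is a hypothetical helper name. The approach gives the right answer but builds a normalized copy on every call, which is the inefficiency the list points out.

```python
import unicodedata

# Normalization forms and the NSString methods the list above maps them to.
# NFD  -> decomposedStringWithCanonicalMapping
# NFKD -> decomposedStringWithCompatibilityMapping
# NFC  -> precomposedStringWithCanonicalMapping
# NFKC -> precomposedStringWithCompatibilityMapping
SUPPORTED_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_normalized_via_normalize(text: str, form: str = "NFC") -> bool:
    """Report whether text is already in the given normalization form.

    Implemented by normalizing and comparing, which allocates a new string
    each time. A dedicated check could stop at the first difference.
    """
    if form not in SUPPORTED_FORMS:
        raise ValueError(f"unsupported normalization form: {form}")
    return unicodedata.normalize(form, text) == text


if __name__ == "__main__":
    composed = "\u00e9"      # 'é' as a single code point
    decomposed = "e\u0301"   # 'e' followed by a combining acute accent
    print(is_normalized_via_normalize(composed, "NFC"))
    print(is_normalized_via_normalize(decomposed, "NFC"))
```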
|
||
### ICU4X | ||
Review comment: I'd love to see a paragraph about high-level goal for globalization dependencies for the future. Few ideas what could be covered
Review comment: Would be better if we first agree on the goals. The discussion #225 (comment) is about that and I'd love to hear your opinion about Jan's suggestion.
|
||
|
||
Interestingly, there is a newly launched open-source project called [ICU4X](https://github.com/unicode-org/icu4x#icu4x----) that introduces globalization support libraries (similar to ICU). ICU4X has exactly the goals we need to achieve for our concerned scenarios. | ||
|
||
``` | ||
ICU4X will provide an ECMA-402-compatible API surface in the target client-side platforms, including the web platform, iOS, Android, WearOS, WatchOS, Flutter, and Fuchsia, supported in programming languages including Rust, JavaScript, Objective-C, Java, Dart, and C++. | ||
|
||
The design goals of ICU4X are: | ||
- Small and modular code | ||
- Pluggable locale data | ||
- Availability and ease of use in multiple programming languages | ||
- Written by i18n experts to encourage best practices | ||
``` | ||
|
||
That is exactly what we need (or what we are trying to achieve). | ||
Eric Erhardt has already done some investigation and arranged a meeting with the project committee to learn more about the project and its status. Getting in touch and learning more about the project was very helpful. It is a promising project, managed by experts and using CLDR data, which means it still sticks to the standards. | ||
|
||
The catch is that ICU4X is still in its early stages and does not yet provide all the functionality .NET needs for globalization support. The project committee is willing to take and prioritize requests from different parties. In the meeting we already communicated which ICU4C functionality .NET currently uses and wants to see supported in ICU4X. For example, collation support is one of the topics we brought to the committee's attention. | ||
|
||
### Plan Suggestion | ||
|
||
ICU4X is the promising path we should pursue to address all concerned scenarios. We can wait a little longer for the missing functionality to be implemented in the project and then integrate it into .NET as we did with ICU4C. | ||
Review comment: I'd like to see more detailed analysis of where ICU4X can actually help. From the current description, we are trading one generic library for a different generic library.
Review comment: Don't the listed ICU4X goals mention that?
I'll add a paragraph describing the differences and the gain we get when using ICU4X
Review comment: It'd also be helpful to see where their goals align with ours and where they don't
Review comment: (ICU4X core team member) - our goals can be fine tuned to match the needs of the industry. The project is young and we're intentionally flexible. I'd love to see such analysis, but I'd also like to suggest that once you identify any misalignments, consider discussing them with us for potential re-alignment by extension of our goals to supply your needs.
Review comment: If we plan to be a consumer of ICU4X, should we also plan on being an investor? Where are the resources for ICU4X coming from?
Review comment: ICU4X current effort is driven by engineers from Google and Mozilla under the umbrella of a Unicode Charter. We're curating the project for easy onboarding and put effort to stay inclusive for new contributors.
|
||
|
||
Review comment: The reason for doing this is performance / size. As such, this should be heavily backed by numbers. I think the priority should be to get estimates for where we may land by doing this work.
Review comment: Unfortunately, this is close to impossible at the current time. The features we need just aren't there. So getting any sort of size estimate is purely guessing. We would need to implement the features first.
Review comment: Do they have some interesting large chunk implemented that we can triangulate with the same chunk of classic ICU? It is hard to tell whether ICU4X is a promising path without any numbers.
Review comment: Looking through the features on their releases, DateTime formatting might be a candidate (that's what I was trying to prototype against a few months ago), but we don't really use ICU for the formatting logic, just to get the data.
Review comment: Yeah, I think it might be possible to get numbers for something but I'm not sure they would be very meaningful at this point. The ICU4X project, given its current state, is inherently speculative.
Review comment: Yes, I would expect a typical Blazor app to be in the first category.
Review comment: Hi, this is Shane from the ICU4X team. I'm happy to get you ballpark figures for code+data size of ICU4C vs ICU4X.
Review comment: Mozilla I18n Team is currently running an early signal performance testing of ICU4C vs ICU4X for our JS engine and internal needs. We hope to be able to provide some integration experience and performance/memory numbers within a month. If you're interested, we will be hosting a Deep Dive session in July where we plan to discuss the results (we'll also post them in ICU4X repository).
Review comment: Thanks! And, to follow up from above, I posted some ballpark code size figures here:
Review comment: Web developers are generally very sensitive to app download size and front end frameworks compete to be as small as possible. So in general I believe web devs will want just enough functionality to run their app and nothing more. There is a subset of web scenarios where app download size is less important, but it is the minority scenario. I suspect there are a significant number of web scenarios that could get by with just invariant mode and some light weight globalization helpers based on browser platform APIs, but @jkotas is probably right that the majority of web apps need broad globalization support, at least enough to handle their target audiences.
|
||
- We need to look more closely at which functionality is currently supported and which is missing, so we have the full list of features we need. | ||
- We need to stay in touch with the ICU4X committee, communicating our requests and understanding when such features can be available. | ||
- We need to see whether we can assign some resources to help with the project, especially with the missing features .NET needs. | ||
- We need to look at the scope of work to integrate ICU4X into the .NET runtime. | ||
- We need to see whether we can have a process for creating different NuGet packages from the ICU4X repo with data and code customization. | ||
Review comment: ICU4X is written in Rust. I am sure that it will create its share of interesting toolchain problems.
Review comment: We put together a minimal proof of concept calling into icu4x from wasm, so it is at least possible (though we ran into a solvable problem with separate allocators fighting each other).
Review comment: It does - I can confirm that from my prototyping. One mitigation is that we can build the pieces of ICU4X separately from For WASM - a problem I immediately hit was memory allocator differences between ICU4X and mono. We would need to solve that issue in order to use any Rust library in Blazor WASM.
Review comment: @eerhardt I recall when we talked to the ICU4X committee they mentioned we can customize the memory allocator too. If that's true, we have a way to make this work, right?
Review comment: Tarek, that's correct. Manish mentioned a way to tell it to use the global allocator, but I don't think anyone has actually spent more time on the prototype to actually validate that. I wouldn't consider it a major concern.
|
||
|
||
.NET 7.0 would be the best time to start investing in this and to start consuming at least the available parts of ICU4X. | ||
|
||
### Alternative Plans | ||
Are there other alternatives? What about mixing native APIs with custom code/data, or using custom compression for the data, as other options?

"custom compression", not "custom comparison". I was actually wondering whether the data are compressed and how. Maybe that would explain the difference in data file sizes between the Android/iOS builds, if a different compression level/library was used. Mixing native APIs (as mentioned by @spouliot above) may actually be a way to explore. Sure, you may get slightly different behavior, but the difference will likely be smaller than with the NLS->ICU switch, since the underlying iOS APIs use the ICU data.

oh, thank you. I misread it. sorry about that.
||
|
||
Other ideas are similar to what ICU4X is trying to do, but with a narrower scope. Here are the options we can try if we do not go with ICU4X: | |
|
||
- Invest more in ICU4C trimming. That will require spending more resources on code analysis to figure out how to trim the ICU code down to a level that satisfies the size requirements. | |
I think we can reject that as an alternative plan. We already spent months of work on this with no measurable improvements.
||
- Write a code wrapper around the CLDR or ICU data, so that we do not use ICU code to access the data. This can work for locale data, but I don't think it is a good option for other functionality such as collation, where the code is more complicated and not easy to re-implement. | |
ICU is part of the Mac ecosystem. However, we cannot use it directly: it's considered private (and using it would result in app rejections). However, it means that for macOS (including Catalyst), iOS, tvOS (and even watchOS) Apple provides its own API built on top of ICU (code and/or data). It might be possible to replace the globalization code to use those APIs (for the mentioned OSes) instead of providing both ICU code and data. Since most of the code (and data) is part of the OS, the resulting app size should be reduced and closer to the "legacy" numbers. E.g. xamarin/xamarin-macios#10249 (comment)

I believe the problem with this tends to be APIs not lining up or returning different data from what we've established as the standard on other platforms?

If the Mac APIs ultimately depend on ICU, then I expect the behavior will be close to what we get when using ICU directly. @CoffeeFlux do you have more information about the differences?

I had a very rough look at how the Cocoa API surface maps to the System.Globalization.Native API surface. It seems to be quite promising even if not a full match. The missing parts could probably be filled in with custom data.

GlobalizationNative_NormalizeString
GlobalizationNative_IsNormalized
GlobalizationNative_WindowsIdToIanaId [Not supported on app-local iOS/WASM ICU]
GlobalizationNative_GetTimeZoneDisplayName [Not supported on app-local iOS/WASM ICU]
GlobalizationNative_GetLocaleInfoString
GlobalizationNative_GetLocaleTimeFormat
GlobalizationNative_GetLocaleInfoInt
GlobalizationNative_GetLocaleInfoGroupingSizes
GlobalizationNative_GetLocales
GlobalizationNative_ToAscii
GlobalizationNative_GetSortHandle
GlobalizationNative_CloseSortHandle
GlobalizationNative_GetSortVersion
GlobalizationNative_CompareString
GlobalizationNative_IndexOf
GlobalizationNative_LastIndexOf
GlobalizationNative_StartsWith
GlobalizationNative_EndsWith
GlobalizationNative_GetSortKey
GlobalizationNative_ChangeCase
GlobalizationNative_ChangeCaseInvariant
GlobalizationNative_ChangeCaseTurkish
GlobalizationNative_InitOrdinalCasingPage
GlobalizationNative_GetCalendars
That is good input to have. This path is worth exploring further.

Most importantly it seems to cover the collations which are hard to implement. The things I could not find at first glance seem to be simple data retrieval.

@filipnavara I'll add a section to the doc for the option of using the Cocoa APIs on Apple OSes.

Based on Mozilla experience, one consideration I must share is that doing that will lead to your software having diverging capabilities between platforms where you use an ICU4C that you vendor in and platforms where you rely on the OS one. The ICU API does change, and in particular active large-scale software often goes through a cycle where a need for an ICU4C is identified, requested, supplied and then, if the ICU4C can be vendored in, the next release contains new code. If you rely on the OS ICU, the path may extend to many years of waiting, which may be a late realization of a drag factor if you were to be affected.

@zbraniecki I agree with your consideration. What we are trying to do here is weigh the options and see the pros and cons of every option. But in general, if consistency across platforms and OSes is an available option with reasonable cost for the scenarios we are targeting (e.g. mobile and WebAssembly scenarios), then that should be the option we go with.

Can this be expanded beyond a one-line description with a comparison against the other plans?
||
|
||
Any option we choose here will need an automated process to extract the needed data and code and pack them into a NuGet package. Also, whichever plan we choose would target the next release, as this is not trivial work to fit into .NET 6.0. |
What are the sizes of ICU globalization support for Xamarin apps and Wasm?

| File | iOS arm64 | Android arm64 | Browser WASM |
|---|---:|---:|---:|
| icudt.dat | 1,713,152 | 1,512,896 | 1,512,896 |
| icudt_CJK.dat | 1,006,224 | 966,080 | 966,080 |
| icudt_EFIGS.dat | 602,096 | 559,568 | 559,568 |
| icudt_no_CJK.dat | 1,271,296 | 1,082,128 | 1,082,128 |
| libicudata.a | 720 | 1,116 | (not listed) |
| libicui18n.a | 3,536,472 | 7,105,274 | 4,077,896 |
| libicuuc.a | 2,286,376 | 4,267,870 | 2,510,674 |

The other architectures have similar sizes so I am omitting them for clarity. Only one of the `icudt*.dat` files is linked to the app. They are user selectable through a .csproj property and contain differently trimmed configurations (full; Chinese/Japanese/Korean; English/French/Italian/German/Spanish; full without Chinese/Japanese/Korean). Notably, it's weird that the iOS data file is bigger, which leads me to believe it's some build configuration error (cc @akoeplinger @directhex).
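
As a rough cross-check of the sizes listed above, the following sketch compares the Android and iOS arm64 artifacts. The numbers are copied from that listing; treating them as bytes, and the `SIZES` and `ratio` names, are assumptions made purely for illustration.

```python
# Sizes copied from the listing above (assumed to be bytes).
SIZES = {
    "iOS arm64": {
        "icudt.dat": 1_713_152,
        "libicui18n.a": 3_536_472,
        "libicuuc.a": 2_286_376,
    },
    "Android arm64": {
        "icudt.dat": 1_512_896,
        "libicui18n.a": 7_105_274,
        "libicuuc.a": 4_267_870,
    },
}


def ratio(file_name: str, numerator: str, denominator: str) -> float:
    """Return size on the numerator platform divided by size on the denominator platform."""
    return SIZES[numerator][file_name] / SIZES[denominator][file_name]


if __name__ == "__main__":
    # Static libraries: how much larger are the Android builds?
    for name in ("libicui18n.a", "libicuuc.a"):
        print(f"{name}: Android/iOS = {ratio(name, 'Android arm64', 'iOS arm64'):.2f}")
    # Data file: how much larger is the iOS build?
    print(f"icudt.dat: iOS/Android = {ratio('icudt.dat', 'iOS arm64', 'Android arm64'):.2f}")
```

On these numbers the Android static libraries come out at roughly 1.9 to 2.0 times the iOS ones, while the full iOS data file is about 13% larger than the Android one.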
I am also not quite sure why the Android libraries would be twice as big as the iOS ones. That would warrant some explanation too.
UPD: Looks like the Android ones may have debug symbols and the iOS ones don't. This is not 100% confirmed, but stripping the Android ones reduces the size by around 50% and `file` output on the original files in `libicui18n.a` shows something like
Also, the .a files are not what actually ships in the app. It would be more interesting to look at the size contribution after the .a files get statically linked into the app.
That's not trivial to measure but xamarin/xamarin-macios#10249 (comment) has some numbers. The .a files already contain heavily stripped ICU4C with disabled features.
Right, I'm an idiot, the transport only contains .a files so we should be stripping whatever those get linked into post-hoc
The native ICU bits end up being about 380KB compressed on wasm.
For Blazor WASM - when we go from the "default/full config" to the "min config" (which sets GlobalizationMode.Invariant, plus other things), `dotnet.wasm` drops from `737.0 KB` to `384.0 KB` (all sizes are .br compressed).
@CoffeeFlux can get how much of that is just ICU, but my understanding is the majority of that size decrease is ICU getting linked out. This means the ICU library is contributing about as much code size as all of the mono runtime.
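
The arithmetic behind that comparison can be sketched as follows. The two sizes are the Brotli-compressed `dotnet.wasm` figures quoted above; attributing the whole difference to ICU is the comment's working assumption rather than a measurement, and the names `FULL_CONFIG_KB`, `MIN_CONFIG_KB` and `size_drop` are illustrative only.

```python
from typing import Tuple

# Brotli-compressed dotnet.wasm sizes quoted above, in KB.
FULL_CONFIG_KB = 737.0  # "default/full config"
MIN_CONFIG_KB = 384.0   # "min config" (GlobalizationMode.Invariant, plus other things)


def size_drop(full_kb: float, min_kb: float) -> Tuple[float, float]:
    """Return (absolute drop in KB, drop as a fraction of the full size)."""
    drop = full_kb - min_kb
    return drop, drop / full_kb


if __name__ == "__main__":
    drop_kb, fraction = size_drop(FULL_CONFIG_KB, MIN_CONFIG_KB)
    print(f"drop: {drop_kb:.1f} KB ({fraction:.1%} of the full config)")
    # Upper bound on the ICU share, relative to what remains without it.
    print(f"drop / remaining: {drop_kb / MIN_CONFIG_KB:.2f}")
```

The drop is 353 KB, about 48% of the full configuration and roughly 0.92 times what remains. That matches the "about as much as the mono runtime" estimate only to the extent that most of the drop really is ICU.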
One thing to note, at least for WASM scenarios: I am seeing some people complain about missing globalization data for their scenarios, which suggests we need better coverage, or at least an option to get better coverage.
I'll add the numbers reported here by @filipnavara and @CoffeeFlux to the doc. Thanks for providing these.