Proposal: Enable globalization invariant mode for all runtime images #1877

Closed
richlander opened this issue Apr 29, 2020 · 27 comments


@richlander
Member

richlander commented Apr 29, 2020

Proposal: Enable globalization invariant mode for all runtime images

We propose to reduce runtime image size by ~12MB (compressed; ~31MB uncompressed) by no longer installing the ICU package in Debian- and Ubuntu-based images and instead relying on globalization invariant mode by default. On Linux, the .NET runtime and libraries depend on ICU for globalization behaviors (sorting, time zones, currency symbols, date formats, ...). We already enable globalization invariant mode and do not install ICU in Alpine runtime images.

We propose to (A) take advantage of this size improvement for Debian and Ubuntu images, and (B) make .NET images symmetric across Linux distros. In short, we like what we did for Alpine, but no longer want Alpine to be a special case.

All Linux-based .NET SDK images will continue to contain ICU. For example, Alpine .NET SDK images contain ICU, even though Alpine runtime images do not. As a point of policy for SDK images, we value UX over size, and intend for SDK images to provide a "batteries included" model. This is, in part, because it is more inconvenient, for users, to add packages to SDK images for some scenarios. This is a tradeoff, as it adds an unfortunate point of asymmetry between runtime and SDK images, but one that we believe is warranted.

We made an analogous change in #1848 where we removed a Debian- and Ubuntu-specific layer that Alpine did not have. After that change, Debian and Ubuntu SDK images are smaller, and the layering across .NET SDK images for Linux distros is now the same.

Context

As part of the .NET Core 2.0 release, we created globalization invariant mode. This feature, when enabled, removes any dependence on external libraries for globalization information by using the invariant behavior for all globalization-sensitive APIs (like sorting, understanding time zones and writing currency symbols). For many applications, this mode is a win because they are not dependent on globalization concepts and behaviors.

This new mode was developed at the same time as we added support for the Alpine Linux distro. The Alpine project is known for publishing small container images, and we wanted to do everything we could to make Alpine-based .NET Core images small. We decided to take advantage of globalization invariant mode and not install ICU in Alpine images by default, and instead let users who need globalization enable it for themselves. This seemed like a great trade-off at the time, and we haven't heard any negative feedback on it. We have however heard that many people are happy with .NET Alpine images, and have seen their usage grow considerably.
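
For reference, opting back in to globalization on an Alpine runtime image usually takes two Dockerfile lines. This is a minimal sketch, assuming the `3.1-alpine` runtime tag that appears in the size listing in this issue and Alpine's `icu-libs` package; adjust the tag to match your app.

```dockerfile
FROM mcr.microsoft.com/dotnet/core/runtime:3.1-alpine

# Install the ICU libraries that the Alpine runtime image omits.
RUN apk add --no-cache icu-libs

# The Alpine image sets this variable to true; set it to false to turn invariant mode off.
ENV DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=false
```

Both lines are needed: installing ICU alone does not help while the environment variable still forces invariant mode.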

Size details

We built the dotnetapp sample a few different ways and published the results at richlander/dotnetapp. The tags listing provides the compressed sizes. The same images are displayed below, with uncompressed size information.

rich@mazama:/mnt/d/git/dotnet-docker/samples/dotnetapp$ docker images
REPOSITORY                              TAG                            IMAGE ID            CREATED             SIZE
richlander/dotnetapp                    debian                         c83a4ad65881        54 minutes ago      190MB
richlander/dotnetapp                    latest                         c83a4ad65881        54 minutes ago      190MB
richlander/dotnetapp                    alpine-globalization-enabled   1aa6fb6af249        2 hours ago         119MB
richlander/dotnetapp                    alpine                         75670cc0cd25        2 hours ago         87.3MB
mcr.microsoft.com/dotnet/core/runtime   3.1-alpine                     50c357d06fee        5 days ago          87.2MB
alpine                                  latest                         f70734b6a266        5 days ago          5.61MB

Legend:

@stephentoub
Member

My main concerns would be:

  • With the SDK images containing ICU and the runtime images not, it becomes more likely that relevant bugs will make it to production.
  • As you highlighted, Alpine is known for small size, with various optimizations up and down the stack to get it there, so it's less of a surprise that globalization data isn't available by default there. I worry it'll be more surprising for other images, where absolute minimal footprint isn't as primary a goal.
  • This will end up effectively being a breaking change for someone moving to the newer versions of these images. We'd need to shout from the rooftops about the change to avoid related issues, and even then I expect there will be a tail of issues here that find their way back through support / GitHub.
  • Does 12MB out of 190MB (if I'm understanding the numbers correctly) move the needle sufficiently for this to be "worth it"?

@danmoseley
Member

  • For someone who needs ICU today, they may not be aware what it is. Or, know how to add it to their dockerfile, but not to set the other environment variables. Presumably we would document the necessary dockerfile lines for all the major distros?
  • You mention that folks seem happy with the Alpine images. Do we have data on how many are using them as-is vs. adding ICU to them? Perhaps they just like Alpine, but do need ICU?
  • Do we have any data on how commonly apps are satisfied with invariant mode? Have we asked the community for example?
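
If the proposal were adopted, the equivalent lines for a Debian-based image might look like the sketch below. It is hypothetical: the Debian images currently ship ICU, the `3.1` tag is assumed to be the Debian-based default, and the ICU package name is versioned per Debian release (`libicu63` is assumed here, matching Debian 10).

```dockerfile
FROM mcr.microsoft.com/dotnet/core/runtime:3.1

# Install ICU; the package name is release-specific (libicu63 assumes Debian 10 "buster").
RUN apt-get update \
    && apt-get install -y --no-install-recommends libicu63 \
    && rm -rf /var/lib/apt/lists/*

# Only needed if the base image turns invariant mode on, as this proposal suggests.
ENV DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=false
```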

@tarekgh
Member

tarekgh commented Apr 29, 2020

I don't think this will be a good idea. From what I am seeing, almost 90% of users will need to turn off the invariant mode and install the needed ICU packages. I have seen issues where users had invariant mode turned on and ran into problems, and it was not easy for them to figure out what was going on.

@richlander
Member Author

I get all of this feedback. However, Alpine usage is growing. What do we do when half of our pulls are Alpine? Would that change the dynamic?

We don't have data on whether people use Alpine images as is or add ICU on top. This is the best we have: https://github.com/search?q=ENV+DOTNET_SYSTEM_GLOBALIZATION_INVARIANT+false&type=Code

My motivation is to enable pay-for-play, at the possible expense of extra work and some confusion. Is 12MB worth it? Yes. This win has been valuable for Alpine, and I no longer want Debian and Ubuntu to have asymmetry with Alpine. That asymmetry isn't justified.

@stephentoub
Member

What do we do when half of our pulls are Alpine? Would that change the dynamic?

I'm missing why that's relevant. There are many other differences between the distros, no?

@jkotas
Member

jkotas commented Apr 29, 2020

I believe we would need to make changes to invariant mode to make this viable:

Alpine usage is growing.

Is there a way to get distribution between English vs. non-English speaking countries for usage of our Alpine images? My hypothesis is that our Alpine images are used relatively less in non-English speaking countries.

@richlander
Member Author

I'm missing why that's relevant. There are many other differences between the distros, no?

True. But this isn't one of them. It's an arbitrary choice we made for one distro.

My hypothesis is that our Alpine images are used relatively less in non-English speaking countries.

Great thought. I'll see if we have any information that can at least point us in that direction.

First, producing container images for a platform is very hard. Since Docker has a single line of inheritance, you have to make a variety of trade-offs. In general, it makes sense to decide up front what you value and then use that value orientation for every single decision. Otherwise, you end up with something that has a bunch of interesting characteristics but is "blah" in aggregate. Clearly, we've decided that size is our #1 metric.

In short, you have the following three choices, pick two:

  • Limit image variants
  • Limit image size
  • Improve ease of use

We value those attributes in that order.

From what I am seeing, almost 90% of users will need to turn off the invariant mode and install the needed ICU packages.

This is a great point. Even if 90% of users needed invariant mode disabled, I'd still have this plan. I'm focused on building a competitive product that makes .NET a great choice for those 10% of users that need the smallest size possible.

I think of this topic as being directly connected to Jan's form factors doc. Based on the way Docker works, we certainly could create multiple sets of images that effectively implement multiple form factors, but we're not going to. We're going to do one, and it's going to focus on getting images smaller and smaller.

We're going to make this change. We just need to decide when.

Let's make invariant mode better. I hadn't thought of wasm being aligned with invariant mode.

@jkotas
Member

jkotas commented Apr 30, 2020

We value those attributes in that order.

We strike a balance between these attributes by having Ubuntu-, Debian- and Alpine-based images. Why do we have all 3 instead of just 1? I believe that it is because the Ubuntu and Debian ones are easier to use than the Alpine one.

@GrabYourPitchforks
Member

I have seen issues where users had invariant mode turned on and ran into problems, and it was not easy for them to figure out what was going on.

@tarekgh Can you give some examples? Are these things we'd be able to work around within the runtime itself? @jkotas had mentioned allowing case conversion of non-ASCII characters. If we carried this data it would only be a few KB. But if common scenarios require customers to install ICU anyway then I have a hard time justifying us carrying around our own copy of the data.

@richlander Is this part of a larger effort to shrink size-on-disk for the Alpine distro? I've had some offline conversations with folks re: having "fast" (but large) and "small" (but slower) versions of our code paths. The idea is that we'd ifdef in whichever one was appropriate for the target platform. I haven't done significant analysis on how much footprint this would save overall so I don't know if it's worth pursuing.

@richlander
Member Author

Why do we have all 3 instead of just 1?

Ha! I wish we could have just one.

The short version is this:

  • Debian is the historical default distro on Docker Hub. We adopted that when we started in 2015. If we were to start again, and we had the 5.0 (not 1.0) product, we might make a different set of choices. While I've demonstrated an acceptance of breaking changes to deliver value, switching away from Debian as the default is a much larger breaking change than I would ever accept.
  • Alpine got added because it offered a very different set of characteristics than Debian. It also got added in a time when we started to receive vulnerability scans with vulnerabilities in .NET Core Linux images. So, Alpine was an important choice for size and security (and user asks exactly aligned with that).
  • Ubuntu was a bet. We saw a ton of usage of Ubuntu in other modalities (general use, and APT pulls of .NET Core). In Docker (for .NET), Ubuntu usage is small. It probably "pays its way" in terms of being warranted. That said, we only support LTS versions as a result. Part of the issue is that Debian is good enough for Ubuntu users in many cases. Many of the same scripts will just work (same package manager, shell, ...).
  • CentOS is the one distro we've talked about adding. Like I said with Ubuntu, we have signal on CentOS. CentOS is attractive, because, like Alpine, it is different than Debian. It's also a good connection with the Red Hat ecosystem. That all said, I don't think we have enough signal on CentOS yet, which is why we haven't published those images yet.

It's amazing to reason about pull behavior across Docker and APT, as two examples. The patterns are super different and the OSes people prefer (in aggregate) are super different. And what people value in those modalities is super different. For example, we see pretty much constant pulls in Docker, day in, day out. For APT, we see a huge surge of pulls in the first 36 hours after a release, and then back to a much lower constant set of pulls after about 5 days.

@richlander
Member Author

Is this part of a larger effort to shrink size-on-disk for the Alpine distro?

No, it is specifically not that. We already did that, starting with Alpine with .NET Core 2.1. This is about applying that same win to Debian and Ubuntu.

@jkotas had mentioned allowing case conversion of non-ASCII characters. If we carried this data it would only be a few KB. But if common scenarios require customers to install ICU anyway then I have a hard time justifying us carrying around our own copy of the data.

ICU is 30MB+ (uncompressed). It's worth talking about ways to avoid it. We don't necessarily need to ship those data files in the runtime. We could download them for the Docker scenario. We download plenty of things today, at docker build time, and are happy to add more if there is value.

Also, we shouldn't be making optimization choices around small numbers of KBs to the product in isolation. On the runtime team, we blow those away with our crossgen choices (in either direction). For example, we used partial crossgen in 3.0 to save about 10MB in container images. We can pay for your data file cost with change we find behind the couch. We have a bunch more crossgen work planned for 5.0. We don't have any insight on size impact yet.

@richlander
Member Author

@GrabYourPitchforks -- It would be awesome to have this information:

  • Scenarios that we can cover with a data file.
  • Example APIs for those scenarios.
  • Example APIs from peer application platforms that rely primarily on data files to satisfy a given scenario.
  • Examples of data files that satisfy our needs.
  • Cases that still require a full globalization stack (ICU, Windows APIs, ...)

@marek-safar

I believe we would need to make changes to invariant mode to make this viable

It'd be nice to make invariant mode more developer-friendly. At the same time, we are also having a conversation with @danmosemsft's team about how to make globalization support more configurable, which could help here as well. The current setup, where you go with either no globalization or full-blown ICU, is not enough for the growing number of form factors and scenarios .NET is targeting.

@tarekgh
Member

tarekgh commented Apr 30, 2020

To answer @GrabYourPitchforks's question:

@tarekgh Can you give some examples? Are these things we'd be able to work around within the runtime itself?

One example: a user reported that resource lookup was not working on one of their machines but worked fine on others. The user had no idea about invariant mode and didn't know what was wrong. Resource lookup depends on the culture parent chain, which of course is not provided in invariant mode, so the lookup fails to get the right resources.

@GrabYourPitchforks
Member

@richlander anything that involves non-linguistic case comparison will work. Consider the following examples.

// In Invariant mode, returns "MAñANA"  <-- note the 'ñ' was left unchanged
// Under ICU / NLS, returns "MAÑANA"
// Under invariant mode with our own casing data, returns "MAÑANA"
string result = "mañana".ToUpperInvariant();

// In Invariant mode, returns false
// Under ICU / NLS, returns true
// Under invariant mode with our own casing data, returns true
bool areEqual = string.Equals("mañana", "MAÑANA", StringComparison.OrdinalIgnoreCase);

By carrying our own casing data, we can determine that 'ñ' and 'Ñ' are actually the same character (with different casing). This means that ToUpperInvariant and string.Equals(..., OrdinalIgnoreCase) will behave as expected.

This does not include support for normalization or linguistic comparisons. Consider the following examples.

// In Invariant mode, returns false
// Under ICU / NLS, returns true
// Under invariant mode with our own casing data, returns false
bool areEqual = string.Equals("ss", "ß", StringComparison.InvariantCulture);

// In Invariant mode, returns false
// Under ICU / NLS, returns true
// Under invariant mode with our own casing data, returns false
bool areEqual = string.Equals("encyclopaedia", "encyclopædia", StringComparison.InvariantCulture);

InvariantCulture is a linguistic comparison, which means that it needs to account for the fact that "ss" and "ß" are semantically identical; as are "ae" and "æ". Our casing data does not handle these conditions.

For servers this is generally OK. Most server applications deal with things like identifiers, usernames, filenames, paths, etc.; so they should only ever be using Ordinal or OrdinalIgnoreCase, not any other StringComparison. (ToUpperInvariant and ToLowerInvariant would likewise work. Despite their names, their behavior maps roughly to OrdinalIgnoreCase and has nothing whatsoever to do with CultureInfo.InvariantCulture. It's confusing but it's what we're stuck with.)

For clients this is a bit more problematic. A client app would want localization and would want culture-aware textual analysis. If I visit https://en.wikipedia.org/wiki/Encyclopedia and CTRL-F and type "encyclopædia" into my browser's search box, I want it to find both "encyclopædia" and "encyclopaedia" on the page. Something like this would require the full power of ICU / NLS.

Servers that need to display data in a localized fashion also fall under this latter category. If the visitor is browsing from the United States, I want to display pricing using the U.S. currency symbol ('$') and with decimals formatted in a manner familiar to a U.S. audience. If the visitor is browsing from Japan, I want to display pricing using the yen currency symbol ('¥') and with digits formatted appropriately. If you need this kind of localization data, you'll require the full power of ICU / NLS.

Does this help clarify the scenarios a bit?

@MichaelSimons
Member

This isn't being considered for 5.0 but is something we are interested in driving post-5.0.

@tarekgh
Member

tarekgh commented Jun 10, 2020

Linking to issue dotnet/runtime#37349 for awareness: IDN functionality differs in invariant mode, which can lead to wrong behavior in parts of the networking stack that depend on IDN.

@richlander
Member Author

@tarekgh
Member

tarekgh commented Jun 16, 2020

Note that having TZData is not related to enabling globalization invariant mode. TZData is a separate set of bits that you install to get time zone support.

@richlander
Member Author

richlander commented Jun 16, 2020

Great point. It's not directly related, as you say. My point is that it is a near-neighbor problem, with similar characteristics and UX.

I'd like to start an early 6.0 proposal along the lines of Jan's comment. We should include tzdata in that.

I was just talking to the wasm team about this. They expressed that they are struggling with ICU (significantly more than the Docker scenario) and would appreciate a better solution for 6.0 that doesn't require ICU.

Cool?

@tarekgh
Member

tarekgh commented Jun 16, 2020

I was just talking to the wasm team about this. They expressed that they are struggling with ICU (significantly more than the Docker scenario) and would appreciate a better solution for 6.0 that doesn't require ICU.

Is there more info here? What are they struggling with in ICU? In general, it is good that we start on a 6.0 proposal now, as you mentioned, so we have enough time to react to the needed changes.

Yes, cool :-)

@richlander
Member Author

Same reason ... size impact. Size constraints of wasm are like 10x more restrictive than containers. More concretely, the wasm team is slicing and dicing ICU itself to reduce size. This isn't a great model. Mono libraries have NLS-style in-product tables/data (actually stale data copied from ICU), but the wasm project is leaving that behind since it is moving to corefx.

MichaelSimons added this to the 6.0 milestone Nov 11, 2020
mthalman moved this to Backlog in .NET Docker Dec 1, 2021
mthalman modified the milestones: 6.0, .NET 8 Oct 19, 2022
@mthalman
Member

@richlander - This has been dormant for a while now. Any thoughts on this for .NET 8?

@tarekgh
Member

tarekgh commented Oct 19, 2022

+@steveisok @lewing to advise if they are still running into the size problems.

@mthalman are you running into some issue because of the size?

@steveisok
Member

+@steveisok @lewing to advise if they are still running into the size problems.

We pull in icu from dotnet/icu, so I do not think our workloads would be negatively impacted.

@lewing ?

@mthalman
Member

This would also be impacted by whatever outcome we have from #4162. If we have a distroless Alpine offering, then we may want to make different choices with the full version of Alpine, like including icu.

mthalman changed the title from "Proposal: Enable globalization invariant mode for all .NET 5.0+ runtime images" to "Proposal: Enable globalization invariant mode for all runtime images" Oct 26, 2022
@richlander
Member Author

We're no longer pursuing this.
