Starting with localization #3997

Open
tertsdiepraam opened this issue Oct 3, 2022 · 5 comments


@tertsdiepraam
Member

TL;DR: I want to add a new util for locale generation and provide locale-aware functionality in uucore.


uutils is currently following the C locale for most of its operations and the locale settings of the system are mostly ignored. This has led to issues and PRs like these:

We've mostly been putting this off because the necessary libraries were missing in Rust, but this has recently changed with the release of icu4x. It covers many of the things we need, like locale-aware datetime formatting, locale-aware collation, etc.

However, it requires data to operate on, and that data differs from what locale-gen and friends usually generate (if I understand correctly). There are essentially two viable ways to include data with icu4x[1]:

  1. Store a blob on the filesystem to read at runtime (BlobDataProvider).
  2. Encode the data as Rust code included in the binary (BakedDataProvider).

Since we don't know up front what locales we might need, I think we need to use the BlobDataProvider and allow the user to generate their own locale data on command. So, I propose we do the following:

  1. Add a new util, called locale-gen or something similar
    • This util downloads and stores the locale data in a global directory (I'm not sure where, could also be controlled by an environment variable).
    • This util would be a wrapper around the icu_datagen crate[2].
    • It could also read from system config files and install any necessary locales based on the system config automatically.
    • Since this util needs access to the internet, we will run into issues similar to those we had with uudoc back when it automatically downloaded examples, so it needs to be optional.[3]
  2. Create locale-aware functionality in uucore as much as possible, so that the utils themselves don't have to bother with checking the right environment variables, loading the icu data, etc.
    • For example, to determine the collation locale, the LC_ALL, LC_COLLATE and LANG env vars need to be checked, in that order of precedence.
    • For the utils, we then just expose a sort/collate function that checks (and caches) the locale and performs the correct collation.
  3. Change the utils to use the locale-aware functions provided by uucore.
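
A minimal sketch of the environment check described in step 2, using only the standard library. It assumes the usual POSIX precedence (LC_ALL first, then the category variable, then LANG, with empty values treated as unset and "C" as the fallback). The names `resolve_category` and `collation_locale` are hypothetical and are not existing uucore functions; the lookup is passed in as a closure so the logic can be exercised without touching the real environment.

```rust
use std::env;
use std::sync::OnceLock;

/// Resolve the effective locale name for one category (e.g. "LC_COLLATE").
/// Precedence: LC_ALL, then the category variable, then LANG.
/// Empty values count as unset; the fallback is the "C" locale.
fn resolve_category<F>(category: &str, lookup: F) -> String
where
    F: Fn(&str) -> Option<String>,
{
    for name in ["LC_ALL", category, "LANG"] {
        if let Some(value) = lookup(name) {
            if !value.is_empty() {
                return value;
            }
        }
    }
    String::from("C")
}

/// Hypothetical helper a util could call: reads the process environment
/// once and caches the resolved collation locale for later calls.
fn collation_locale() -> &'static str {
    static CACHE: OnceLock<String> = OnceLock::new();
    CACHE.get_or_init(|| resolve_category("LC_COLLATE", |name| env::var(name).ok()))
}
```

A util would then only call `collation_locale()` and hand the result to whatever collation backend is chosen; loading the icu4x data for that locale is deliberately left out of this sketch.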

Do you see any problems with this approach? Are there alternatives we should explore first?

Footnotes

  1. They also have FsDataProvider which is meant for development only.

  2. This crate also has a CLI, but we need to tailor it for use with coreutils, by setting nicer defaults for our purpose.

  3. icu_datagen uses reqwest, which will lead to similar problems as in https://github.com/uutils/coreutils/pull/3184

@tertsdiepraam
Member Author

There is also rust_icu, which is a wrapper around ICU4C, which works without additional datagen, but it's a big C dependency. So I guess we have to choose between C code or custom datagen?

@tertsdiepraam
Member Author

I'm no longer sure rust_icu works without datagen. icu4c also has a different data format from POSIX. I think the only future-proof way forward is to embrace icu4x's data format. I wonder if the Unicode folks are willing to spec out some standard location for this data and provide some tools for managing it. It'd be nice if all applications built using icu4x that want to store the data in the filesystem could share their data.
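
Until such a standard location exists, a lookup order could be sketched roughly as follows. Everything specific here is an assumption made for illustration: the `UUTILS_ICU_DATA_DIR` override variable, the `icu4x` directory name, and the `/usr/share/icu4x` system fallback are hypothetical and not specified by icu4x, Unicode, or uutils.

```rust
use std::path::PathBuf;

/// Pick a directory for shared icu4x data blobs.
/// Order (all names hypothetical): explicit override variable,
/// then a per-user XDG data directory, then a system-wide default.
fn data_dir<F>(lookup: F) -> PathBuf
where
    F: Fn(&str) -> Option<String>,
{
    // Treat empty variables the same as unset ones.
    let non_empty = |name: &str| lookup(name).filter(|value| !value.is_empty());

    if let Some(dir) = non_empty("UUTILS_ICU_DATA_DIR") {
        return PathBuf::from(dir);
    }
    if let Some(base) = non_empty("XDG_DATA_HOME") {
        return PathBuf::from(base).join("icu4x");
    }
    PathBuf::from("/usr/share/icu4x")
}
```

In a real program the closure would be `|name| std::env::var(name).ok()`; taking it as a parameter keeps the function testable.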

@VorpalBlade
Contributor

VorpalBlade commented Feb 4, 2024

I was running into essentially the same problem for my own command line tools.

  • Did you figure out a standard location to store the data?
  • What about translations, icu4x seems to handle everything except for LC_MESSAGES? Or am I missing something?
  • Could you consider putting the logic for locale env parsing, etc in a separate crate rather than uucore, so other projects outside of uutils can reuse it (without copy pasting code)? It would be good to be able to solve this for all sorts of POSIX command line tools rather than reinvent the wheel every time. Especially with proper support for mixed locales (as you are considering it seems, and I use, but few others care about it).

@tertsdiepraam
Member Author

> Did you figure out a standard location to store the data?

Not yet. We should start talking to some people about that :)

> What about translations, icu4x seems to handle everything except for LC_MESSAGES? Or am I missing something?

Translations are out of scope for us for a while, I think, but if you want them, Project Fluent is the gold standard there as far as I know.

> Could you consider putting the logic for locale env parsing, etc in a separate crate rather than uucore, so other projects outside of uutils can reuse it (without copy pasting code)?

If there is a significant amount of code, it should definitely go in a separate crate.

> Especially with proper support for mixed locales (as you are considering it seems, and I use, but few others care about it).

Yeah I think we should support mixed locales. At least, if by mixed locale you mean that for example collation is done in one locale and number formatting in another or something like that. icu4x can do all of that I believe.

@VorpalBlade
Contributor

> Yeah I think we should support mixed locales. At least, if by mixed locale you mean that for example collation is done in one locale and number formatting in another or something like that. icu4x can do all of that I believe.

Exactly. I use LC_MESSAGES in English (for searchability and because translations tend to be poor), but I use sv_SE.UTF-8 for everything else, except for collation, where I prefer C.UTF-8 for case-sensitive sorting.
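
For a setup like this, each category's value first has to be split into its parts before it can be mapped to whatever the backend expects. Below is a rough, standard-library-only sketch of parsing POSIX-style names of the form `language[_territory][.codeset][@modifier]`. The type and function names are hypothetical, and `bcp47_tag` is a simplification: it ignores the modifier and returns `None` for "C" and "POSIX", which have no language tag and would need special handling.

```rust
/// Parts of a POSIX-style locale name: language[_territory][.codeset][@modifier].
#[derive(Debug, PartialEq)]
struct PosixLocale {
    language: String,
    territory: Option<String>,
    codeset: Option<String>,
    modifier: Option<String>,
}

/// Split `s` at the first `sep`, returning the head and the optional tail.
fn split_tail(s: &str, sep: char) -> (&str, Option<String>) {
    match s.split_once(sep) {
        Some((head, tail)) => (head, Some(tail.to_string())),
        None => (s, None),
    }
}

/// Parse names such as "sv_SE.UTF-8", "C.UTF-8" or "sr_RS@latin".
fn parse_posix_locale(name: &str) -> PosixLocale {
    // Peel off the suffixes from right to left: @modifier, .codeset, _territory.
    let (rest, modifier) = split_tail(name, '@');
    let (rest, codeset) = split_tail(rest, '.');
    let (language, territory) = split_tail(rest, '_');
    PosixLocale {
        language: language.to_string(),
        territory,
        codeset,
        modifier,
    }
}

impl PosixLocale {
    /// Simplified language tag such as "sv-SE"; None for the C/POSIX locale.
    fn bcp47_tag(&self) -> Option<String> {
        match self.language.as_str() {
            "C" | "POSIX" | "" => None,
            lang => Some(match &self.territory {
                Some(territory) => format!("{lang}-{territory}"),
                None => lang.to_string(),
            }),
        }
    }
}
```

With the configuration above, the collation category would parse to language "C" with codeset "UTF-8" and no tag, while the other categories would yield "sv-SE", so each category can be handled independently.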
