Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace DataLocale by LanguageIdentifier on DataRequest #4984

Closed
wants to merge 20 commits into from

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Jun 1, 2024

Unicode extension keys are now modeled as DataKeyAttributes.

#3632

@robertbastian robertbastian changed the title Replace DataLocale by LanguageIdentifier on DataRequest, Locale on API Replace DataLocale by LanguageIdentifier on DataRequest, ~Locale on API~ Jun 1, 2024
@robertbastian robertbastian changed the title Replace DataLocale by LanguageIdentifier on DataRequest, ~Locale on API~ Replace DataLocale by LanguageIdentifier on DataRequest Jun 1, 2024
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: It looks approximately what I expect to see but it's a big PR so I haven't finished looking at every file

components/calendar/src/any_calendar.rs Outdated Show resolved Hide resolved
@robertbastian robertbastian requested a review from sffc June 3, 2024 15:30
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Down through provider/core; next stop provider/datagen

Comment on lines +162 to +170
let metadata_payload: DataPayload<crate::provider::CollationMetadataV1Marker> = provider
.load(req)
.or_else(|_| {
provider.load(DataRequest {
langid: req.langid,
..Default::default()
})
})?
.take_payload()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion, here and elsewhere: perhaps refactor to make more clear why you are making two requests

Suggested change
let metadata_payload: DataPayload<crate::provider::CollationMetadataV1Marker> = provider
.load(req)
.or_else(|_| {
provider.load(DataRequest {
langid: req.langid,
..Default::default()
})
})?
.take_payload()?;
let metadata_payload: DataPayload<crate::provider::CollationMetadataV1Marker> = provider
.load(req_specific)
.or_else(|_| {
provider.load(req_default)
})?
.take_payload()?;

Default::default(),
)
.unwrap();
assert_writeable_eq!(
dtf.format(&datetime),
*expected,
"\n\
locale: `{}`,\n\
locale: `{:?}`,\n\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought/Issue: We probably want to keep a Display impl on Preferences?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to make that assumption now. It's not a problem to use Debug in a test.

Comment on lines -68 to -69
// Always retain -u-sd
SUBDIVISION_KEY => true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought/Issue: We don't currently use -u-sd but it would be nice to support it in LocaleFallbacker for future compatibility

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new requirement, please don't block the PR on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't block the PR on it but it's not a new requirement; the old fallbacker supported -u-sd so it's a regression in behavior that it doesn't support it now and doesn't have a clear path to supporting it...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a design discussion in March to drop unicode keywords from the fallbacker and this was not brought up since then. This will require a lot of rethinking.

Comment on lines -298 to -306
TestCase {
// TODO(#4413): -u-rg is not yet supported; when it is, this test should be updated
input: "en-u-rg-gbxxxx",
requires_data: false,
extension_key: None,
fallback_supplement: None,
expected_language_chain: &["en"],
expected_region_chain: &["en"],
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: We should fix this in 2.0 because it might require an architecture change in LocaleFallabcker. #4413

("en-019", "", "Hello from 🌎"), // AMERICAS
("en-142", "", "Hello from 🌏"), // ASIA
("en-GB", "", "Hello from 🇬🇧"), // GREAT BRITAIN
("en-GB", "england", "Hello from 🏴󠁧󠁢󠁥󠁮󠁧󠁿"), // ENGLAND
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: england should remain a subdivision, not a key attribute... please at least add a TODO(#4413)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is as designed

@sffc
Copy link
Member

sffc commented Jun 3, 2024

Reasons we may want to keep DataLocale and not use LanguageIdentifier:

  1. We can make it Copy by supporting only a single variant
  2. We can support subdivision
  3. We can add other helper functions and are not bound to the constraints of LanguageIdentifier

@robertbastian
Copy link
Member Author

Reasons we may want to keep DataLocale and not use LanguageIdentifier:

We had agreed on this and I have implemented this, which was a non-trivial amount of work.

We can make it Copy by supporting only a single variant

What does this gain us?

We can support subdivision

Ok, but there are other ways as well

We can add other helper functions and are not bound to the constraints of LanguageIdentifier

Having logic duplicated between DataLocale and icu_locale really bothered me about the old DataLocale, because that was code duplication, as well as double the API surface, with subtle API differences. I don't see LanguageIdentifier as a constraint, if we need more functionality we should add it there so that other users can benefit from it (like I added is_und in this PR).

@robertbastian robertbastian requested a review from sffc June 3, 2024 17:06
@sffc
Copy link
Member

sffc commented Jun 3, 2024

Reasons we may want to keep DataLocale and not use LanguageIdentifier:

We had agreed on this and I have implemented this, which was a non-trivial amount of work.

We hadn't considered the restriction of being unable to support u-sd as easily unfortunately

We can make it Copy by supporting only a single variant

What does this gain us?

It's a little cheaper 🤷

We can support subdivision

Ok, but there are other ways as well

How? Let's discuss

We can add other helper functions and are not bound to the constraints of LanguageIdentifier

Having logic duplicated between DataLocale and icu_locale really bothered me about the old DataLocale, because that was code duplication, as well as double the API surface, with subtle API differences. I don't see LanguageIdentifier as a constraint, if we need more functionality we should add it there so that other users can benefit from it (like I added is_und in this PR).

I understand this point of view but we made the conscious decision to duplicate the API surface in 1.0 because we established the understanding that the needs of locales in the context of data providers was different than the needs of locales in the context of UTS 35. (I was hopeful that the new DataRequest could maybe allow us to align, but I've never been fully convinced that merging DataLocale back into LanguageIdentifier was worthwhile)

@sffc
Copy link
Member

sffc commented Jun 3, 2024

^ If we can demonstrate that there's a good way to support -u-sd and -u-rg in this design, I'm onboard

@Manishearth
Copy link
Member

(currently feel like no strong opinion on whether we should land this. happy to take part in discussions, but no need to wait for my review, quickly going through this I don't see a reason why I would block this)

@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Jun 4, 2024
@sffc
Copy link
Member

sffc commented Jun 6, 2024

Current thinking of @sffc @zbraniecki:

  • Subtags could be about preferences/hints and data identity.
  • -u-sd is either a preference (I want Scottish unit preferences) or a data identity (I want Catalan as spoken in Valencia)
  • Variants are a grab bag that encode different concepts such as orthography (POLYTON), subdivision (VALENCIA), and operating system (MACOS), and probably other things, too. (full list)
  • @zbraniecki notes that subdivision suffixes are 1-3 alphanums. This means in principle we could encode them in an ICU4X language identifier following a region, such as en-US-ca, without risking conflict with variants or other subtags. When converting to Locale, we make it en-US-u-sd-usca, and when converting to LanguageIdentifier, we drop the subdivision. ca_ES_VALENCIA would eventually become ca-ES-vc. Would also be nice to drop the variant subtag.
  • We could consider using this new type in Preferences. Multi-variants would be mapped to their respective proper keywords, with keywords added as necessary.
  • @sffc notes that variants could be used by prepending "sd", such as en-US-sdusca (guaranteed to be 5-7 alphanums), if we wanted to stick with the traditional LanguageIdentifier representation.
  • Would like to bring this to the CLDR Design WG for further discussion.

@sffc
Copy link
Member

sffc commented Jun 6, 2024

@zbraniecki
Copy link
Member

Subtle adjustment:

-u-sd is either a preference (I want Scottish unit preferences) or a data identity (This is Valencian data)

@sffc
Copy link
Member

sffc commented Jun 11, 2024

We discussed this at the CLDR design meeting. There wasn't a clear conclusion on what fields should be in data locale, but it was evident that this could change over time. An approach that seems like it would make sense for us would be to restrict to a single subtag on top of LSR which contains an implementation-defined enumeration of data identity. For example, we lump all variants and subdivisions into our own namespace.

@robertbastian
Copy link
Member Author

Discussion

#[derive(Copy, ...)]
pub struct DataLocale {
    language: icu_locale_core::subtags::Language,
    script: Option<icu_locale_core::subtags::Script>,
    region: Option<icu_locale_core::subtags::Region>,
    subdivision: Option<icu_locale_core::extensions::unicode::SubdivisionSuffix>,
    single_variant: Option<icu_locale_core::subtags::Variant>,
}
  • simplify DataLocale API by removing extensions stuff
  • have getters and setters for the singleton fields, as well as is_und/is_empty/is_default
  • move to icu_preferences?
  • fallback still uses DataLocale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss-priority Discuss at the next ICU4X meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants