Remove generic metazone values that match location values #5751

Merged
merged 5 commits into from
Oct 30, 2024

@robertbastian robertbastian commented Oct 30, 2024

Data size change: -100KB

The generic non-location format ("Central European Time", "Armenia Time") falls back to the location format ("Zurich Time", "Armenia Time"), so generic metazone values that are identical to the location value are duplicates and can be removed.
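
As a rough illustration of the idea, the sketch below drops every generic metazone name that exactly matches the location name it would fall back to, and then resolves a name with that fallback. The flat `BTreeMap` data model, the shared string keys, and the function names `dedupe_generic_names` and `generic_name` are assumptions made for this example; they are not the ICU4X data structs or the datagen code in this PR.

```rust
use std::collections::BTreeMap;

/// Removes generic metazone names that are identical to the location name
/// stored under the same key, and returns how many entries were removed.
///
/// Hypothetical, simplified model: both maps use the same string ID as key.
fn dedupe_generic_names(
    metazone_names: &mut BTreeMap<String, String>,
    location_names: &BTreeMap<String, String>,
) -> usize {
    let before = metazone_names.len();
    // Keep an entry only if it differs from the location-format fallback.
    metazone_names.retain(|id, name| location_names.get(id) != Some(&*name));
    before - metazone_names.len()
}

/// Looks up the generic non-location name first and falls back to the
/// location name when no metazone entry exists.
fn generic_name<'a>(
    id: &str,
    metazone_names: &'a BTreeMap<String, String>,
    location_names: &'a BTreeMap<String, String>,
) -> Option<&'a str> {
    metazone_names
        .get(id)
        .or_else(|| location_names.get(id))
        .map(String::as_str)
}
```

Because the lookup falls back to the location name, removing an exact duplicate does not change what `generic_name` returns for that key; only the stored data shrinks.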

There are some entries that I feel should be deduplicated, but aren't (metazone name listed first):

  • The metazone name is more specific than the location name. This is odd, because the location format should use the region name when the country has only one time zone. We do that, but the metazone value then exists and overrides it:
    • Apia Time vs Samoa Time
    • Pyongyang Time vs North Korea Time
    • Taipei Time vs Taiwan Time
    • Petropavlovsk-Kamchatski Time vs Kamchatka Time
  • The generic metazone name includes "Standard". This is a data issue; the likely cause is that for metazones without DST the only entry is the standard one, which might confuse linguists:
    • Guam Standard Time vs Guam Time
    • Singapore Standard Time vs Singapore Time
  • The match is not exact. Some of this data definitely needs to be cleaned up, but other cases might be fixed by using a different region display name in datagen:
    • Brunei Darussalam Time vs Brunei Time
    • Cocos Islands Time vs Cocos (Keeling) Islands Time
    • Dumont-d’Urville Time vs Dumont d’Urville Time
    • East Timor Time vs Timor-Leste Time
    • Easter Island Time vs Easter Time
    • Fernando de Noronha Time vs Noronha Time
    • French Southern & Antarctic Time vs French Southern Territories Time
    • Hong Kong Time vs Hong Kong SAR China Time
    • Indian Ocean Time vs British Indian Ocean Territory Time
    • Lanka Time vs Sri Lanka Time
    • Macao Time vs Macao SAR China Time
    • Myanmar Time vs Myanmar (Burma) Time
    • North Mariana Islands Time vs Northern Mariana Islands Time
    • Philippine Time vs Philippines Time
    • Pitcairn Time vs Pitcairn Islands Time
    • Ponape Time vs Pohnpei Time
    • South Georgia Time vs South Georgia & South Sandwich Islands Time
    • Wake Island Time vs Wake Time
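
A list like the one above could be produced by a small reporting helper. The sketch below is only an assumption about how such a report might be built, not something this PR contains: the `Mismatch` enum and `classify` function are hypothetical names, and the " Standard " check only makes sense for English data.

```rust
/// Why a generic metazone name was (or was not) deduplicated.
#[derive(Debug, PartialEq, Eq)]
enum Mismatch {
    /// Names are identical; the metazone value can be dropped.
    Exact,
    /// The metazone name equals the location name plus the word "Standard".
    ExtraStandard,
    /// Any other difference; needs human review.
    Other,
}

/// Compares a generic metazone name with the location name it would fall
/// back to. English-only heuristic, for reporting purposes.
fn classify(metazone_name: &str, location_name: &str) -> Mismatch {
    if metazone_name == location_name {
        Mismatch::Exact
    } else if metazone_name.replacen(" Standard ", " ", 1) == location_name {
        Mismatch::ExtraStandard
    } else {
        Mismatch::Other
    }
}
```

Only the `Exact` case is safe to remove automatically; the other two categories are the ones listed above as candidates for a data cleanup.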

@Manishearth Manishearth left a comment

Code seems fine, but I'm not fully clear on the implications of this; I defer to @sffc or other tz reviewers.

sffc commented Oct 30, 2024

> Singapore Standard Time vs Singapore Time

Wikipedia says both names are valid: https://en.wikipedia.org/wiki/Singapore_Time

In general I think it's probably safest to only deduplicate in cases of an exact match, as much as I'd like to be more aggressive, and then discuss this list with CLDR to find more opportunities for deduplication.

sffc commented Oct 30, 2024

> Hong Kong SAR China Time
> Myanmar (Burma) Time

These look pretty silly, which raises the question of whether we ever want the Location format to be chosen, or whether we should instead push people toward an improved Generic Non-Location format that supports Generic Partial Location.

    .filter_map(|(region, value)| {
        Some((
            icu::locale::subtags::Region::try_from_str(region).ok()?,
            value.as_str(),
        ))
    })
    // Overwrite with short names, as we want to use those
Member

Issue: Can you cite the algorithm where we want short names? I assume you did it maybe to normalize "Myanmar (Burma)" to "Myanmar" and things like that, but normalizing "United Kingdom" to "UK" might not be desirable?

Member Author

UTS-35 does not specify which names to use. I think UK Time is better than United Kingdom Time, because it is shorter.
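
To make the "overwrite with short names" step in this thread concrete, here is a minimal sketch of the pattern: collect the long region display names into a map, then extend it with the short names so that a short name replaces the long one wherever it exists. The function name `region_names`, the slice-of-tuples inputs, and the use of plain string region codes are assumptions for illustration; the actual datagen code parses codes into `icu::locale::subtags::Region` and reads its names from CLDR data.

```rust
use std::collections::BTreeMap;

/// Builds a region code -> display name map, preferring short names.
///
/// Hypothetical inputs: `(region code, display name)` pairs for the long
/// and short name variants.
fn region_names<'a>(
    long: &'a [(&'a str, &'a str)],
    short: &'a [(&'a str, &'a str)],
) -> BTreeMap<&'a str, &'a str> {
    // Start from the long names...
    let mut names: BTreeMap<&str, &str> = long.iter().copied().collect();
    // ...then overwrite with short names; `extend` replaces existing keys.
    names.extend(short.iter().copied());
    names
}
```

With this ordering, a region that has no short variant keeps its long name, while "Hong Kong SAR China" becomes "Hong Kong" and "United Kingdom" becomes "UK", which is exactly the trade-off questioned in the review comment.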

@robertbastian
Member Author

> These look pretty silly

Well, they are "Hong Kong Time" and "Macao Time" as of the latest commit.

@robertbastian
Member Author

> In general I think it's probably safest to only deduplicate in cases of an exact match, as much as I'd like to be more aggressive, and then discuss this list with CLDR to find more opportunities for deduplication.

That's what I'm doing. These are observations that we need to bring to CLDR.

@robertbastian robertbastian requested a review from sffc October 30, 2024 17:01
@sffc sffc left a comment

Let's open a CLDR issue to discuss both the short vs. long location names and the other issues you found with the CLDR Design WG.

@robertbastian robertbastian merged commit 8889471 into unicode-org:main Oct 30, 2024
28 checks passed
@robertbastian robertbastian deleted the mzdedupe branch October 30, 2024 17:30