CLDR-14453 Adding IANA zone.tab mapping in timezone.xml #3105

Merged
yumaoka merged 2 commits into unicode-org:main from iana-zone-map
Aug 10, 2023

Conversation

yumaoka
Member

@yumaoka yumaoka commented Jul 19, 2023

CLDR-14453

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

Contributor

@justingrant justingrant left a comment

This looks like great progress, thanks @yumaoka! Are there other changes you have planned beyond what's in here already?

Coincidentally, @anba and I are discussing whether the ECMAScript spec should refer to CLDR instead of IANA as a "source of truth" for which IDs are available in ECMAScript and which ones are canonical. See tc39/ecma402#806.

@anba - you may want to review this PR too.

<type name="ugkla" description="Kampala, Uganda" alias="Africa/Kampala"/>
<type name="umawk" description="Wake Island, U.S. Minor Outlying Islands" alias="Pacific/Wake"/>
<type name="umjon" description="Johnston Atoll, U.S. Minor Outlying Islands" alias="Pacific/Johnston"/>
<type name="umjon" description="Johnston Atoll, U.S. Minor Outlying Islands" deprecated="true" iana="ushnl"/>
Contributor

iana="ushnl"

Did you mean preferred="ushnl" ?

Member Author

You're correct. Fixed the error.

@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@yumaoka yumaoka removed the incomplete label Aug 9, 2023
@yumaoka yumaoka changed the title from "CLDR-14453 [WIP] Adding IANA zone.tab mapping in timezone.xml" to "CLDR-14453 Adding IANA zone.tab mapping in timezone.xml" Aug 9, 2023
pedberg-icu
pedberg-icu previously approved these changes Aug 9, 2023
@@ -69,6 +69,9 @@ CLDR data files are interpreted according to the LDML specification (http://unic
<!ATTLIST type since CDATA #IMPLIED >
<!--@MATCH:version-->
<!--@METADATA-->
<!ATTLIST type iana CDATA #IMPLIED >
<!--@MATCH:any-->
Member

This would be better as a regex that matched the structure. eg something like [A-Za-z_]+(/[A-Za-z_]+)*
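
A quick way to sanity-check a pattern like the one suggested above is to run it against a few identifiers. The sketch below uses Python's `re.fullmatch` with the pattern copied verbatim from the comment; the sample IDs are picked for illustration, and the pattern that was finally committed is not shown in this thread.

```python
import re

# Pattern from the review comment ("something like"), used verbatim.
IANA_ID = re.compile(r"[A-Za-z_]+(/[A-Za-z_]+)*")


def matches(tz_id: str) -> bool:
    """Return True when the whole string fits the suggested structure."""
    return IANA_ID.fullmatch(tz_id) is not None


# Sample identifiers chosen for illustration only.
for tz_id in ["Africa/Kampala", "America/Argentina/Buenos_Aires",
              "Etc/GMT+5", "America/Port-au-Prince"]:
    print(tz_id, matches(tz_id))
```

Under this exact pattern, identifiers containing digits, `+`, or `-` (such as `Etc/GMT+5`) are rejected, so a stricter-than-`any` match would probably need a wider character class.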

Member Author

@macchiati Updated. Please check.

@anba

anba commented Aug 9, 2023

Atlantic/Jan_Mayen is treated as a link to Arctic/Longyearbyen, but per IANA Atlantic/Jan_Mayen is either a link to Europe/Berlin, or a link to Europe/Oslo, or a proper zone.

@justingrant
Contributor

I just noticed that WET, EET, CET, MET are canonical in IANA but are missing from timezone.xml. I assume these should be added?

Other than those, are there any other Zones in IANA that are missing from timezone.xml?

@justingrant
Contributor

justingrant commented Aug 9, 2023

Atlantic/Jan_Mayen is treated as a link to Arctic/Longyearbyen, but per IANA Atlantic/Jan_Mayen is either a link to Europe/Berlin, or a link to Europe/Oslo, or a proper zone.

If the intent is for zones of the same country code (SJ in this case) to always share the same canonical ID, then resolving to Arctic/Longyearbyen does sound like the correct behavior here. Maybe this should prompt a change to the definition of which IDs CLDR considers canonical? I wrote an initial guess at how this could be described. Would the rules below work?

CLDR canonical IDs are:

  1. All IDs that are Zones in IANA TZDB with default build options
  2. All IDs in zone.tab
  3. Etc/Unknown

All other Links in IANA must be aliases in CLDR. If the Link in IANA corresponds to a single ISO 3166-1 country code, then it must be an alias in CLDR to the ID listed in zone.tab for that country code. For example, Atlantic/Jan_Mayen should resolve to Arctic/Longyearbyen.

(EDIT: changed above to accommodate links that might not correspond to a country code, like Etc/Universal)
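
The proposed rules can be sketched as a set union plus a per-country lookup. Everything in the tables below is a tiny hypothetical excerpt, and the `LINK_COUNTRY` mapping is an assumption made for illustration; real inputs would be derived from the IANA TZDB and zone.tab.

```python
# Hypothetical, heavily truncated inputs (not real TZDB extracts).
IANA_ZONES = {"Europe/Berlin", "America/New_York"}
ZONE_TAB = {"SJ": ["Arctic/Longyearbyen"], "DE": ["Europe/Berlin"]}
# Assumed mapping from a Link name to the single country code it belongs to.
LINK_COUNTRY = {"Atlantic/Jan_Mayen": "SJ"}


def canonical_ids() -> set:
    """Rules 1-3: IANA Zones, every ID in zone.tab, and Etc/Unknown."""
    ids = set(IANA_ZONES)
    for tz_ids in ZONE_TAB.values():
        ids.update(tz_ids)
    ids.add("Etc/Unknown")
    return ids


def resolve(tz_id: str):
    """Return the canonical ID for tz_id, or None if this sketch cannot decide."""
    if tz_id in canonical_ids():
        return tz_id
    country = LINK_COUNTRY.get(tz_id)
    # Only unambiguous when zone.tab lists exactly one ID for the country.
    if country is not None and len(ZONE_TAB.get(country, [])) == 1:
        return ZONE_TAB[country][0]
    return None  # Links without a single country code are not covered here.


print(resolve("Atlantic/Jan_Mayen"))
```

Countries with several zone.tab entries, and Links such as Etc/Universal that have no country code, would need additional rules that this sketch deliberately leaves out.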

@yumaoka
Member Author

yumaoka commented Aug 9, 2023

Atlantic/Jan_Mayen is treated as a link to Arctic/Longyearbyen, but per IANA Atlantic/Jan_Mayen is either a link to Europe/Berlin, or a link to Europe/Oslo, or a proper zone.

If the intent is for zones of the same country code (SJ in this case) to always share the same canonical ID, then resolving to Arctic/Longyearbyen does sound like the correct behavior here.

Right. This is the reason why it's not an alias of Europe/Berlin.

Maybe this should prompt a change to the definition of which IDs CLDR considers canonical? I wrote an initial guess at how this could be described. Would the rules below work?

All IDs that are Zones in IANA TZDB with default build options

This one is a bit tricky. I did not make EST a canonical zone. It is currently defined as an alias of Etc/GMT+5 in CLDR timezone.xml. The same is true for CST MST HST.

EST5EDT CST6CDT MST7MDT PST8PDT - these legacy zones are currently defined as "canonical", because these are purely artificial and different from other canonical zones associated with a region. For example, EST5EDT is different from America/New_York for dates before 1970. However, at least I thought it's not worth separating EST from Etc/GMT+5.

On the other hand, both Etc/UTC and Etc/GMT are CLDR canonical zones. Theoretically they are different, and quite a few people argue they should be kept different. Unlike EST, they do not come from legacy system requirements, and CLDR handles them as separate zones.

I think it's probably easier to spell out all exceptions, because these exceptions for legacy zones probably won't change in the future.

@@ -69,6 +69,9 @@ CLDR data files are interpreted according to the LDML specification (http://unic
<!ATTLIST type since CDATA #IMPLIED >
<!--@MATCH:version-->
<!--@METADATA-->
<!ATTLIST type iana CDATA #IMPLIED >
Member

Looks good.

@yumaoka
Member Author

yumaoka commented Aug 9, 2023

I just noticed that WET, EET, CET, MET are canonical in IANA but are missing from timezone.xml. I assume these should be added?

Your point is right. I'm now wondering if it still makes sense to keep these legacy zone IDs.

There is a problem with making CET an alias of Europe/Berlin, because the UTC offsets in these two zones were not always the same. For one zone to be an alias of another, the two zones should have exactly the same UTC offset at every point in time.
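
The alias condition can be sketched as a sampled comparison of UTC offsets. The helper names below are hypothetical, and sampling can only disprove equivalence: agreement at the sampled instants does not show that two zones match at every point in time.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Iterable


def offsets_agree(tz_a: tzinfo, tz_b: tzinfo, instants: Iterable[datetime]) -> bool:
    """Return True if both zones report the same UTC offset at every sampled instant."""
    for instant in instants:
        if instant.astimezone(tz_a).utcoffset() != instant.astimezone(tz_b).utcoffset():
            return False
    return True


def yearly_samples(first_year: int, last_year: int) -> list:
    """Two UTC instants per year (January and July); a coarse, assumed sampling."""
    return [datetime(year, month, 1, tzinfo=timezone.utc)
            for year in range(first_year, last_year + 1)
            for month in (1, 7)]


# Fixed-offset stand-ins keep the example independent of any installed tzdata.
plus_one = timezone(timedelta(hours=1))
plus_two = timezone(timedelta(hours=2))
print(offsets_agree(plus_one, plus_one, yearly_samples(1950, 2023)))
print(offsets_agree(plus_one, plus_two, yearly_samples(1950, 2023)))
```

Objects such as `zoneinfo.ZoneInfo("CET")` and `zoneinfo.ZoneInfo("Europe/Berlin")` could be passed in the same way; the result would then depend on the version of the time zone data installed on the machine.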

CLDR is not the source of offset transition rules; it provides localized names. If we don't have a CLDR zone ID corresponding to the IANA zone CET, CLDR cannot provide localized display names for it. At this moment, CET is missing. Now, the question is whether we really need to assign a localized name to this zone.

If an application that depends on CLDR wants to show localized names for time zones, it would exclude these "legacy" zones. Such an application may limit the set of zones to those in zone.tab. EST, CST6CDT, CET... these are not included in zone.tab.
CLDR also provides a way to display a time zone based on its UTC offset. So even if a user selects CET as the system time zone, and an application uses that configuration to display dates, it should be able to show at least something like "UTC+1:00".
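
The offset-based fallback mentioned here can be sketched as a small formatter. `format_utc_offset` is a hypothetical helper that only mimics the "UTC+1:00" shape quoted above; it does not reproduce CLDR's actual localized offset patterns.

```python
from datetime import timedelta


def format_utc_offset(offset: timedelta) -> str:
    """Render an offset in the 'UTC+1:00' style used in the comment above."""
    total_seconds = int(offset.total_seconds())
    sign = "-" if total_seconds < 0 else "+"
    hours, remainder = divmod(abs(total_seconds), 3600)
    minutes = remainder // 60
    return f"UTC{sign}{hours}:{minutes:02d}"


print(format_utc_offset(timedelta(hours=1)))
print(format_utc_offset(timedelta(hours=-5)))
```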

For the above reasons, I now think we probably don't need these legacy zones (EST5EDT, and some others) as a part of the CLDR canonical set.

If we want a consistent policy, we have two options.

  • Add WET/CET/MET/EET to timezone.xml as canonical time zones. If we do so, all zone IDs available in the default IANA time zone build are covered, either as canonical zones or aliases of other zones in timezone.xml.
  • Deprecate EST5EDT/CST6CDT/MST7MDT. We may also drop aliases - EST/MST/HST.

Adding WET and some others is relatively easy. I will bring this question to CLDR team, and handle it in a separate PR.

@justingrant
Contributor

justingrant commented Aug 10, 2023

There's one more oddball canonical ID: "Factory". Should we also omit that one? I admit that I don't know what that Zone is for. Do you?

For the above reasons, I now think we probably don't need these legacy zones (EST5EDT, and some others) as a part of the CLDR canonical set.

👍 I like the idea of removing weird legacy IDs like "PST8PDT", because it'd clean up the output of ECMAScript's Intl.supportedValuesOf('timeZone') which only returns canonical IDs. It'd also make it easier to make progress on tc39/ecma402#778 by limiting the cross-browser differences to Etc/GMT* and UTC.

I'd also want to hear what @sffc and @anba think about this proposal.

  • Add WET/CET/MET/EET to timezone.xml as canonical time zones. If we do so, all zone IDs available in the default IANA time zone build are covered, either as canonical zones or aliases of other zones in timezone.xml.
  • Deprecate EST5EDT/CST6CDT/MST7MDT. We may also drop aliases - EST/MST/HST.

I assume by "deprecate" you mean that we'd make those IDs into aliases of other canonical IDs. If this assumption is correct, then this seems reasonable to me.

I assume that single-offset POSIX names like CET would, per your comment above, be resolved to "Etc/GMT*" names, while the 4 multiple-offset POSIX names would be resolved to their appropriate counterparts like "America/New_York" or "America/Chicago"?

I think it's probably easier to spell out all exceptions, because these exceptions for legacy zones probably won't change in the future.

I'm OK with spelling out exceptions, but I'd also like to (if possible) document general principles or rules that drive these exceptions. This will be helpful to explain the exceptions to others. Are the rules below (adapted from tc39/ecma402#825) an accurate way to document the changes planned? These include the exceptions noted above and my understanding of your proposed POSIX solution.


Each Zone in the IANA Time Zone Database must be primary in CLDR ("primary" means that it is either listed in an iana attribute, or if iana is not present then is listed first in the alias attribute) and each Link name in the IANA Time Zone Database must be a non-primary identifier that resolves to its corresponding Zone name, with the following exceptions:

  • Any Link name in the TZ column of zone.tab of the IANA Time Zone Database must be primary.
  • Any Link name that represents a geographical area entirely contained within the territory of a single ISO 3166-1 country code must resolve to a primary identifier that also represents a geographical area entirely contained within the territory of the same ISO 3166-1 country code. For example, "Atlantic/Jan_Mayen" must resolve to "Arctic/Longyearbyen".
  • Legacy POSIX identifiers that refer to a single fixed UTC offset ("CET", "EET", "EST", "HST", "MET", "MST", and "WET") must be non-primary identifiers that resolve to a Zone name starting with "Etc/GMT" for the same UTC offset. For example, "EST" must resolve to "Etc/GMT+5".
  • Legacy POSIX identifiers that refer to US Zones with multiple UTC offsets ("EST5EDT", "CST6CDT", "MST7MDT", and "PST8PDT") must be non-primary identifiers that resolve to the most-populous US Zones that use the same pair of UTC offsets: "America/New_York", "America/Chicago", "America/Denver", and "America/Los_Angeles", respectively.
  • "Etc/Unknown" is a primary identifier that is used when a requested identifier is not present in the IANA Time Zone Database, for example because the identifier is misspelled or because the implementation has not been updated with the latest version of the IANA Time Zone Database.

(EDIT: changed "canonical" to "primary" in the text above to match ECMAScript terminology as well as terms used in https://unicode-org.atlassian.net/browse/ICU-22452)
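
The definition of "primary" given above (the `iana` attribute if present, otherwise the first entry in `alias`) can be sketched with Python's standard XML parser. The first `<type>` element is taken from the diff in this PR; the second one and the wrapping `<key>` element are illustrative assumptions rather than text copied from timezone.xml.

```python
import xml.etree.ElementTree as ET

# First <type> is from the PR diff; the second is an assumed, illustrative entry.
SAMPLE = """<key name="tz">
  <type name="ugkla" description="Kampala, Uganda" alias="Africa/Kampala"/>
  <type name="inccu" description="Kolkata, India" alias="Asia/Calcutta Asia/Kolkata" iana="Asia/Kolkata"/>
</key>"""


def primary_id(type_elem):
    """Return the iana attribute if set, else the first alias, else None."""
    iana = type_elem.get("iana")
    if iana:
        return iana
    aliases = type_elem.get("alias", "").split()
    return aliases[0] if aliases else None


def non_primary_ids(type_elem):
    """Return every alias of the element that is not its primary identifier."""
    primary = primary_id(type_elem)
    return [a for a in type_elem.get("alias", "").split() if a != primary]


root = ET.fromstring(SAMPLE)
primary = {t.get("name"): primary_id(t) for t in root.iter("type")}
print(primary)
```

Keeping the lookup in one function means the exceptions listed above only need to be expressed in the data, not in the code that reads it.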

@yumaoka
Member Author

yumaoka commented Aug 10, 2023

@justingrant I will discuss with CLDR folks and decide what to do for CET, etc. I'm leaning toward deprecating the existing legacy IDs.
Because CLDR to ICU integration round 1 is starting, I want to introduce the structure now. I will merge the PR, then follow up on the outstanding issue.

The policy of maintaining CLDR canonical zones might be simply documented in the process doc - https://cldr.unicode.org/development/updating-codes/update-time-zone-data-for-zoneparser

@yumaoka yumaoka merged commit c25cebc into unicode-org:main Aug 10, 2023
@yumaoka yumaoka deleted the iana-zone-map branch August 10, 2023 18:55
@justingrant
Contributor

Because CLDR to ICU integration round 1 is starting, I want to introduce the structure now. I will merge the PR, then follow up on the outstanding issue.

Do the bullet points at the end of #3105 (comment) match your opinion of how these CET, PST8PDT, etc. IDs should be mapped to primary IDs? If yes, then I'll update tc39/ecma402#825 to align with those bullet points.

@justingrant I will discuss with CLDR folks and decide what to do for CET, etc. I'm leaning toward deprecating the existing legacy IDs.

👍

The policy of maintaining CLDR canonical zones might be simply documented in the process doc - https://cldr.unicode.org/development/updating-codes/update-time-zone-data-for-zoneparser

👍

@justingrant
Contributor

There's one more oddball canonical ID: "Factory". Should we also omit that one? I admit that I don't know what that Zone is for.

Here's more info about this Zone, from https://mm.icann.org/pipermail/tz/2023-August/033032.html:

There's a Zone called "Factory". What is its purpose, and how is it used?

It is intended for use as a factory default, to clearly indicate an
unconfigured system rather than one that is intentionally configured
to run on UTC.

@anba

anba commented Aug 15, 2023

I assume that single-offset POSIX names like CET would, per your comment above, be resolved to "Etc/GMT*" names, while the 4 multiple-offset POSIX names would be resolved to their appropriate counterparts like "America/New_York" or "America/Chicago"?

As mentioned in #3105 (comment), WET, CET, MET, and EET aren't fixed offset time zones → https://github.com/eggert/tz/blob/c3e966c59b02b1f47f0b7b0e4aa6a86563c07062/europe#L736-L739.

5 participants