No mechanism to indicate what the "default language" of a description is [I18N] #635

Closed
aphillips opened this issue May 7, 2019 · 15 comments
Labels
i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. Needs review Issue was fixed, but is still open for post-merge reviews

Comments

@aphillips

Section 5.2.1 "Thing"
https://cdn.staticaly.com/gh/w3c/wot-thing-description/TD-TAG-review/index.html?env=dev#thing

Provides additional (human-readable) information based on a default language.

The optional field description is described as above, but there appears to be no mechanism defined for declaring what language the "default language" is. It is possible that the JSON-LD @context mechanism could be used to supply an @language for a description. If that is the preferred or intended mechanism, it should be called out. Otherwise there should be a mechanism, possibly at the document level, for declaring the default language using a BCP47 language tag.

@aphillips aphillips changed the title No mechanism to indicate what the "default language" of a description is No mechanism to indicate what the "default language" of a description is [I18N] May 7, 2019
@mkovatsc
Contributor

mkovatsc commented May 7, 2019

Thank you for your review! We have been working hard recently and updated the spec already to document the @language mechanism accordingly.

We also added text on the possibility to use content negotiation such as the Accept-Language header field of HTTP.

I will cite the assertions in this Issue.
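
As an illustration of the content-negotiation route mentioned above, a minimal sketch of a client asking for a TD in a preferred language might look like this. The URL is a hypothetical placeholder, the request is only built and never sent, and whether a given server honours `Accept-Language` is an assumption.

```python
from urllib.request import Request

# Hypothetical TD location, used purely for illustration.
TD_URL = "https://example.com/things/lamp"


def build_td_request(url: str, languages: str) -> Request:
    """Build (but do not send) a GET request carrying an Accept-Language preference."""
    return Request(url, headers={"Accept-Language": languages})


# Prefer Austrian German, then any German, then English.
request = build_td_request(TD_URL, "de-AT, de;q=0.8, en;q=0.5")
```

The quality values (`q=`) only express a preference order; the server stays free to answer in whatever language it actually has.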

@mkovatsc
Contributor

mkovatsc commented May 7, 2019

We need some rewrite of the text after the table in 5.3.1.1 Thing. I started sketching the new text, statement by statement:

The @context name-value pair MUST contain the string https://www.w3.org/2019/td/v1 either directly when of type string or as first element when of type Array.

When @context is an Array, the string https://www.w3.org/2019/td/v1 MAY be followed by elements of type anyURI or Map in any order.

Maps contained in an @context Array MAY have name-value pairs,
where the value is a namespace IRI of type anyURI and the name a Term or prefix defined for that namespace,
while it is RECOMMENDED to include only one Map in the Array that holds all defined name-value pairs.

One Map contained in an @context Array SHOULD contain a name-value pair,
where the name is @language and the value a well-formed language tag as defined by [[!BCP47]],
which defines the default language for the Thing Description instance.

The default language is used to compute the base direction for all human-readable values except for MultiLanguage Maps:

...continue with the bullet point list
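
To make the sketched statements concrete, here is a rough Python check of the two main rules: the TD context URL must come first, and the default language is read from an `@language` entry in a Map inside the `@context` Array. The function names, the `ex` prefix, and the sample description are assumptions made for this example, not text from the spec.

```python
from typing import Any, Dict, Optional

TD_CONTEXT = "https://www.w3.org/2019/td/v1"


def has_td_context(td: Dict[str, Any]) -> bool:
    """True if @context is the TD context string, or an Array whose first element is it."""
    context = td.get("@context")
    if isinstance(context, str):
        return context == TD_CONTEXT
    if isinstance(context, list):
        return len(context) > 0 and context[0] == TD_CONTEXT
    return False


def default_language(td: Dict[str, Any]) -> Optional[str]:
    """Return @language from the first Map in an @context Array that defines it."""
    context = td.get("@context")
    if not isinstance(context, list):
        return None
    for entry in context:
        if isinstance(entry, dict) and isinstance(entry.get("@language"), str):
            return entry["@language"]
    return None


# Illustrative TD fragment; the "ex" prefix and the description are made up.
td = {
    "@context": [
        TD_CONTEXT,
        {"ex": "https://example.com/ns#", "@language": "de-AT"},
    ],
    "description": "Lampe im Wohnzimmer",
}
```

Keeping the prefix definitions and `@language` in a single Map follows the RECOMMENDED form in the sketch above.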

@aphillips
Author

The default language is used to compute the base direction for all human-readable values except for MultiLanguage Maps:

While this reflects the current state of affairs in JSON standards-based document formats, it's not a particularly desirable recommendation and the I18N WG is actively working to find a "better path". In addition, I'd point out that "compute the base direction" needs a definition. Therefore I'd suggest that you reference our document String-Meta, particularly the best practice documented at #script_subtag, which describes how one would do this. In keeping with the allowed-but-not-loveable nature, I'd suggest:

The default language MAY be used to compute the base direction [[String-Meta]] for human-readable text values not otherwise associated with a language tag (such as MultiLanguage Maps).

@mkovatsc
Contributor

mkovatsc commented May 8, 2019

See #643 (comment) which is how it continues (indicated by "...continue with the bullet point list")

@mmccool
Contributor

mmccool commented May 8, 2019

One issue is that we expect the JSON-LD 1.1 WG to add some means to specify text direction explicitly in an @context, perhaps with the addition of an @dir tag. That would be great, but unfortunately it does not exist yet and the JSON-LD 1.1 draft spec explicitly calls out that it does not currently provide a means to explicitly specify text direction. However, we don't want to define our own way of doing it since then we would be in potential conflict with whatever the JSON-LD 1.1 group decides, and we want TDs to work with generic JSON-LD processors. So, one reason we decided on the "infer from a language tag with a script subtag as necessary" approach is that if there is a way to specify text direction explicitly as metadata in the final JSON-LD 1.1 standard, we can allow it (and give it priority if that metadata is present) and update our spec to just use the current infer-from-the-language-and-script-tags as a fallback plan.
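
The "explicit metadata first, infer from the script subtag as a fallback" plan described above could be sketched roughly as follows. The `explicit_dir` parameter is hypothetical and merely stands in for whatever mechanism JSON-LD 1.1 may eventually define; the set of right-to-left scripts is a small illustrative subset, not a complete list.

```python
from typing import Optional

# Small illustrative subset of right-to-left scripts (ISO 15924 codes); not exhaustive.
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa", "Nkoo"}


def script_subtag(tag: str) -> Optional[str]:
    """Return the script subtag of a language tag in title case, or None if absent."""
    for subtag in tag.split("-")[1:]:
        if len(subtag) == 1:
            # A singleton starts an extension or private-use sequence; stop looking.
            break
        if len(subtag) == 4 and subtag.isascii() and subtag.isalpha():
            return subtag.title()
    return None


def base_direction(tag: Optional[str], explicit_dir: Optional[str] = None) -> Optional[str]:
    """Prefer explicit direction metadata; otherwise infer from the script subtag."""
    if explicit_dir in ("ltr", "rtl"):
        return explicit_dir  # hypothetical future metadata takes priority
    if tag is None:
        return None
    script = script_subtag(tag)
    if script is None:
        return None  # nothing to infer from; the caller needs another strategy
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

With these assumptions, `az-Arab` yields `rtl` and `az-Latn` yields `ltr`, while a bare `en` yields `None` because the tag carries no script subtag.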

@mmccool mmccool added the i18n-needs-resolution Issue the Internationalization Group has raised and looks for a response on. label May 8, 2019
mkovatsc added a commit that referenced this issue May 8, 2019
@mkovatsc mkovatsc added Needs review Issue was fixed, but is still open for post-merge reviews and removed PR needed labels May 8, 2019
@mkovatsc
Contributor

mkovatsc commented May 8, 2019

@aphillips , please have a look at the new definitions in 5.3.1.1 Thing (after the table) and 5.3.1.7 MultiLanguage.

@aphillips
Author

@mkovatsc My bad for not looking at the bulleted list.

@mmccool I fully agree with your comment and appreciate the care the WG applied here.

@aphillips
Author

In 5.3.1.1 I see:

where the name is the Term @language and the value a well-formed language tag as defined by [BCP47], potentially including a script subtag (e.g., en, de, ja, zh-Hans, zh-Hant, az-Arab).

The call out about script subtags seems overly specific. Do you really need to call that out? It's on your mind now because of the thread about direction, but I think it's a distraction. I would also show some examples with region subtags and maybe even a variant. Perhaps:

where the name is the Term @language and the value a well-formed language tag as defined by [BCP47] (e.g., en, de-AT, gsw-CH, zh-Hans, zh-Hant-HK, sl-nedis).
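
For reference, the proposed example tags can be checked against a simplified version of the BCP 47 `langtag` shape. This is only a sketch: the pattern deliberately leaves out extensions, private-use and grandfathered tags, and "well-formed" here means syntactic shape only, not that the subtags are registered.

```python
import re

# Simplified "langtag" shape: language, optional script, optional region, variants.
# Extensions, private-use and grandfathered tags are deliberately left out.
LANGTAG = re.compile(
    r"""
    (?P<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_well_formed(tag: str) -> bool:
    """True if the whole tag matches the simplified langtag shape."""
    return LANGTAG.fullmatch(tag) is not None


# The examples proposed in the comment above.
EXAMPLES = ["en", "de-AT", "gsw-CH", "zh-Hans", "zh-Hant-HK", "sl-nedis"]
results = {tag: is_well_formed(tag) for tag in EXAMPLES}
```

A production implementation would more likely rely on a dedicated BCP 47 library than on a hand-written pattern.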

@aphillips
Author

The direction computing stuff in 5.3.1.1 I have these comments:

  1. approach is misspelled.
  2. The following quoted text forces a default of LTR. I would probably encourage using "first strong" detection instead:

If no language tag is given, the base direction MUST be assumed to be LTR (left-to-right). This implies that if the language used in human-readable text uses a script that is written RTL (right-to-left), the default language needs to be specified explicitly, so that an appropriate base direction can be inferred.

  3. I would reduce the MUST to SHOULD. I would also allow CLDR's "likely subtag" algorithm to be used. This is especially helpful for Chinese cases where some systems use region subtags to imply the script (e.g. zh-CN => zh-Hans-CN etc.). Note that Azerbaijani is (if rarely these days) also written in Cyrillic (az-Cyrl).

In cases where a language can be written in more than one script with different base directions, the corresponding language tag given in @language or MultiLanguage Maps MUST include a script subtag, so that an appropriate base direction can be inferred. An example is Azeri, which is written LTR when Latin script is used (specified using az-Latn) and RTL when Arabic script is used (specified using az-Arab).

  4. I think the following recommendations are counterproductive. I like that you point out the problem, but the types of strings used here might very naturally include brand names, trademarks, version numbers, etc.

TD Processors should also be aware of certain special cases that can arise in processing bidirectional text. In particular, producers of TDs should avoid numbers with embedded spaces in bidirectional text. Strings starting with embedded text using a script with a writing direction opposite to that of the base direction (for example, English words embedded in Arabic text) or with multidigit numbers should be avoided if possible.

I would instead provide guidance to producers and consumers, perhaps as follows:

TD Processors should be aware of certain special cases when processing bidirectional text. They should take care to use bidi isolation when presenting strings to users, particularly when embedding in surrounding text. Mixed direction text can occur in any language, even when the language is properly identified.

TD producers should attempt to provide mixed direction strings in a way that can be displayed successfully by a naive user agent. For example, if an RTL string begins with an LTR run (such as a number or a brand or trade name in Latin script), including an RLM character at the start of the string or wrapping opposite direction runs in bidi controls can assist in proper display.
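
A minimal sketch of the "first strong" detection suggested in item 2, using Python's `unicodedata` module, also shows why a leading RLM helps in the case just described. It is a simplification: it scans for the first character of bidi class L, R or AL and ignores the isolate-skipping rules of the full Unicode Bidirectional Algorithm. The sample strings are made up.

```python
import unicodedata
from typing import Optional


def first_strong_direction(text: str) -> Optional[str]:
    """Return 'ltr' or 'rtl' from the first strongly directional character, else None."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R
ARABIC_WORD = "\u0645\u0635\u0628\u0627\u062d"  # Arabic word for "lamp"

# An RTL string that starts with a Latin-script brand name (made-up example).
mixed = "ACME " + ARABIC_WORD

plain_guess = first_strong_direction(mixed)          # the Latin run wins
marked_guess = first_strong_direction(RLM + mixed)   # the RLM wins
```

Here `plain_guess` comes out as `ltr` even though the string is meant to be RTL, whereas `marked_guess` is `rtl` because the RLM is the first strong character.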

@r12a any comments?

@aphillips
Author

On 5.3.1.7:

  1. Same comment about script subtags as I mentioned above.
  2. Should there be a requirement that language tags not be repeated? (e.g. you can't have two strings with the tag en-GB)
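
If such a requirement were adopted, a consumer could enforce it while parsing. The sketch below is one possible approach, not anything the spec defines: it uses `json.loads` with an `object_pairs_hook` and compares names case-insensitively, since BCP 47 tags are case-insensitive. The function names are made up, and the hook applies to every JSON object in the input, so it is meant for a flat MultiLanguage Map.

```python
import json
from typing import Any, Dict, List, Tuple


def reject_duplicate_tags(pairs: List[Tuple[str, Any]]) -> Dict[str, Any]:
    """object_pairs_hook that raises ValueError when two names differ at most by case."""
    seen = set()
    result: Dict[str, Any] = {}
    for name, value in pairs:
        key = name.lower()
        if key in seen:
            raise ValueError(f"duplicate language tag: {name}")
        seen.add(key)
        result[name] = value
    return result


def parse_multilanguage(text: str) -> Dict[str, Any]:
    """Parse a MultiLanguage Map, rejecting repeated language tags."""
    return json.loads(text, object_pairs_hook=reject_duplicate_tags)


descriptions = parse_multilanguage('{"en-GB": "Colour sensor", "de": "Farbsensor"}')
```

Without the hook, `json.loads` silently keeps the last value for a repeated name, which is why the check has to happen during parsing.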

@mkovatsc
Contributor

mkovatsc commented May 8, 2019

Pushing an update with your proposed changes in 5 min.

@mmccool
Contributor

mmccool commented May 9, 2019

  • At least add an assertion that if the script subtag is NOT necessary, it should not be included
  • SHOULD use first-strong detection in the absence of other information
  • Coordinate this with: MAY use first-strong even if language information is available or cannot be used (e.g., on a constrained device).
  • Remove "avoid strings starting with ..."; make sure String-Meta is referenced as a guide

Look again at this suggestion:
I would instead provide guidance to producers and consumers, perhaps as follows:

TD Processors should be aware of certain special cases when processing bidirectional text. They should take care to use bidi isolation when presenting strings to users, particularly when embedding in surrounding text. Mixed direction text can occur in any language, even when the language is properly identified.

TD producers should attempt to provide mixed direction strings in a way that can be displayed successfully by a naive user agent. For example, if an RTL string begins with an LTR run (such as a number or a brand or trade name in Latin script), including an RLM character at the start of the string or wrapping opposite direction runs in bidi controls can assist in proper display.

@mkovatsc
Contributor

mkovatsc commented May 9, 2019

Remove "avoid strings starting with ..."; make sure String-Meta is referenced as a guide

Note that I removed the critical text on avoiding some text constructs already. We have this NOTE, which should be replaced by the processor/producer paragraphs, I guess:

Great care has to be given when assigning bidirectional text to the human-readable metadata of Thing Descriptions. Producers of such texts are advised to include bidi controls as appropriate to try to ensure proper display. Consumers of such texts are advised to apply bidi isolation when including human-readable metadata of TDs in other text (e.g., for Web user interface). Strings on the Web: Language and Direction Metadata [string-meta] provides some guidance and illustrates a number of pitfalls when using bidirectional text.
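
The two pieces of advice in that NOTE can be illustrated with a few lines of Python. This is only a sketch under assumptions: the helper names and the sample text are invented, and real producers and consumers may need more than a single leading mark or a single isolate pair.

```python
RLM = "\u200f"  # RIGHT-TO-LEFT MARK
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def mark_rtl(text: str) -> str:
    """Producer side: prefix an RLM so an RTL string that starts with an LTR run reads as RTL."""
    return text if text.startswith(RLM) else RLM + text


def isolate(text: str) -> str:
    """Consumer side: wrap a string in FSI ... PDI before embedding it in other text."""
    return f"{FSI}{text}{PDI}"


# Made-up RTL description that begins with a Latin-script brand name.
description = mark_rtl("ACME \u0645\u0635\u0628\u0627\u062d")

# Embedding the description in surrounding UI text without letting it reorder its neighbours.
ui_line = f"Device: {isolate(description)} (online)"
```

In HTML user interfaces the same isolation is usually achieved with markup such as the `bdi` element or the `dir` attribute rather than with control characters.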

@aphillips
Author

I've reviewed all of the edits pertaining to the original issue here (defining the default language) and the various other suggestions on this thread. I'm satisfied with the results.

@sebastiankb
Contributor

Thank you for your feedback. I will close this issue.


4 participants