Only two character BCP47 language tags work, ISO-639-3 tags do not work. #108

Closed
MattMatic opened this issue Nov 21, 2024 · 12 comments · Fixed by harfbuzz/harfbuzz#4953

Comments

@MattMatic
Contributor

Harfbuzzjs is behaving as if HB_NO_LANGUAGE_LONG is defined, and only 2 character BCP47 tags work.

Reproduce

  • NotoSansDevanagari-Regular.ttf (FWIW I tried with older versions and the latest Google Font)
  • Text string: "५ल" (\u096B\u0932)
  • Script: "deva"
  • Language: try different languages...

The first glyph should change when language = "nep" (ISO-639-3) or "ne" (BCP47).
The second glyph should change when language = "mar" (ISO-639-3) or "mr" (BCP47).

However, harfbuzzjs falls back to the default language when the language string is set to "nep" or "mar".
It only works with "ne" or "mr".

Tools to test

Harfbuzz code

Looking at the HarfBuzz code, hb_ot_tags_from_language does the work: it looks up the buffer's language tag in ot_languages2 to map the BCP47 tag to the OpenType tag. Only if HB_NO_LANGUAGE_LONG is undefined does it also try ot_languages3 and, if there is still no match, treat the language string as an ISO-639-3 code converted to uppercase.
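To make that lookup order concrete, here is a rough JavaScript model of it. This is only a sketch of the behaviour described above, not the HarfBuzz source: the function name `otTagFromLanguage`, the `noLanguageLong` option, and the tiny tables (holding only the entries mentioned in this issue) are all hypothetical.

```javascript
// Simplified model of the lookup order described above. NOT the HarfBuzz
// source: names and tables are illustrative assumptions.
const OT_LANGUAGES2 = new Map([["ne", "NEP "], ["mr", "MAR "]]); // 2-letter tags
const OT_LANGUAGES3 = new Map([["dty", "NEP "]]);                // 3-letter tags

function otTagFromLanguage(lang, { noLanguageLong = false } = {}) {
  // Only the first subtag is considered in this sketch.
  const first = lang.toLowerCase().split("-")[0];

  // Step 1: two-letter tags are always looked up.
  if (first.length === 2 && OT_LANGUAGES2.has(first)) {
    return OT_LANGUAGES2.get(first);
  }

  // Models a build with HB_NO_LANGUAGE_LONG defined: nothing else is tried.
  if (noLanguageLong) return null;

  if (first.length === 3) {
    // Step 2: three-letter table.
    if (OT_LANGUAGES3.has(first)) return OT_LANGUAGES3.get(first);
    // Step 3: assume ISO-639-3, upper-case it, pad to four characters.
    return first.toUpperCase() + " ";
  }

  return null; // no match: the default language system is used
}

console.log(otTagFromLanguage("ne"));                            // "NEP "
console.log(otTagFromLanguage("nep"));                           // "NEP "
console.log(otTagFromLanguage("nep", { noLanguageLong: true })); // null
```

With `noLanguageLong: true` the model reproduces the symptom reported here: "ne" and "mr" resolve, while "nep", "mar" and "dty" fall through to the default.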

I haven't (yet) tried building hb.wasm...

@khaledhosny
Contributor

hb_language_from_string() is documented to take only a BCP 47 language tag as input; there is no mention of ISO-639-3 tags, so I’m not sure whether setting the language using ISO-639-3 tags is intentionally supported.

@MattMatic
Contributor Author

MattMatic commented Nov 21, 2024

Harfbuzz has to support OpenType tags, as Uniscribe only uses OpenType and not BCP47.

I tested against a Windows HarfBuzz/Cairo test tool (built using out-the-box settings in vcpkg), plus a similar Uniscribe test tool, and harfbuzzjs. The Windows tool uses hb_language_from_string too.

Have I missed some other way of setting the buffer language?

@khaledhosny
Contributor

> Harfbuzz has to support OpenType tags, as Uniscribe only uses OpenType and not BCP47.

HarfBuzz does not take OpenType language tags or script tags as the buffer language and script; it takes BCP 47 tags for language and ISO 15924 tags for script. That is an intentional design. Using OpenType language or script tags is a mistake (not an uncommon one), and it sometimes works by coincidence.

@MattMatic
Contributor Author

Now I have to work out why "nep" and "mar" actually work in the HarfBuzz Windows tool 🫢

BCP 47 + ISO 15924 make perfect sense.
Appreciate the clarification!

@MattMatic
Contributor Author

MattMatic commented Nov 21, 2024

Update: just noticed that hb_ot_tags_from_language is deprecated. Oops. 😶

The Windows app is using HarfBuzz 8.3.0. Wondering why that appears to accept OpenType language tags the same as Uniscribe does..?

@simoncozens
Collaborator

OK, I'm confused as well.

$ hb-shape examples/NotoSansDevanagari-VF.ttf --no-glyph-names '५ल'
[118=0+520|83=1+678]
$ hb-shape examples/NotoSansDevanagari-VF.ttf --no-glyph-names '५ल' --language=NEP
[123=0+520|83=1+678]
$ hb-shape examples/NotoSansDevanagari-VF.ttf --no-glyph-names '५ल' --language=ne
[123=0+520|83=1+678]
$ hb-shape examples/NotoSansDevanagari-VF.ttf --no-glyph-names '५ल' --language=x-hbot-4e455020 # "NEP "
[123=0+520|83=1+678]
  // The same test in harfbuzzjs:
  for (const lang of ["", "NEP", "ne", "x-hbot-4E455020"]) {
    const buffer = hb.createBuffer();
    buffer.addText("५ल");
    buffer.setScript("Deva");
    buffer.setDirection("ltr");
    if (lang) buffer.setLanguage(lang);
    hb.shape(font, buffer);
    const result = buffer.json(font);
    console.log(result[0].g); // glyph ID of the first glyph
    buffer.destroy();
  }
118
118
123
118
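For reference, the `x-hbot-4e455020` tag used above is just the four bytes of the OpenType tag "NEP " written as hex after the `x-hbot-` prefix. A minimal sketch of building such a tag follows; the helper name `otTagToPrivateLanguage` is hypothetical and not part of harfbuzzjs.

```javascript
// Hypothetical helper (not part of harfbuzzjs): build the private-use language
// tag that carries a raw OpenType language tag, e.g. "NEP " -> "x-hbot-4e455020".
function otTagToPrivateLanguage(otTag) {
  // OpenType tags are four bytes; shorter tags are padded with spaces.
  const padded = otTag.padEnd(4, " ");
  if (otTag.length < 1 || !/^[\x20-\x7e]{4}$/.test(padded)) {
    throw new RangeError("expected 1 to 4 printable ASCII characters");
  }
  // Write each character code as two lowercase hex digits.
  const hex = Array.from(padded, (ch) =>
    ch.charCodeAt(0).toString(16).padStart(2, "0")
  ).join("");
  return `x-hbot-${hex}`;
}

console.log(otTagToPrivateLanguage("NEP")); // "x-hbot-4e455020"
```

The result could then be passed to `buffer.setLanguage(...)`, assuming a build in which private subtags are handled; in the harfbuzzjs run above they were not.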

@MattMatic
Contributor Author

That's the same confusing result I'm getting here...
At the moment I can't clearly see how hb-shape et al. convert "ne" to the OpenType "NEP " language tag for lookups. 😵‍💫

@MattMatic
Contributor Author

HB uses ot_languages2 and ot_languages3 to locate the OpenType table.

Another example is dty, a three-letter tag that maps to "NEP ".
I.e. it should select gid 123 but does not.

It appears that three-letter BCP-47 tags are not detected either. Only two-letter BCP-47 tags work.

Those tables are only used in hb-ot-tag.cc, in hb_ot_tag_to_language and hb_ot_tags_from_language.
And the only way that ot_languages3 isn't referenced is when HB_NO_LANGUAGE_LONG is defined. 🤔
The handling for x-hbot-4e455020-style tags seems to be disabled too.

Have I missed something?

@khaledhosny
Contributor

We define HB_TINY, which in turn defines HB_NO_LANGUAGE_LONG. That define comes from harfbuzz/harfbuzz#3665, so it is a file-size-saving measure at the expense of losing some functionality. Personally, I don’t like it when optimizations affect functionality, so I’m in favor of undefining HB_NO_LANGUAGE_LONG when building hb.wasm, but I’ll wait for @behdad’s opinion on this.

@behdad
Member

behdad commented Nov 21, 2024

I'm fine removing this from HB_TINY. Thanks.

@khaledhosny
Contributor

HB_NO_LANGUAGE_LONG is actually defined by HB_LEAN (which HB_TINY defines). It also defines HB_NO_LANGUAGE_PRIVATE_SUBTAG, which is probably why x-hbot-* tags are not working here as well. Since the private tags are needed to use OpenType language tags directly (which is the real issue here), should I remove HB_NO_LANGUAGE_PRIVATE_SUBTAG from HB_LEAN too, or only undefine it when building hb.wasm?

@behdad
Member

behdad commented Nov 23, 2024

Fine with both. Sorry for the trouble.

khaledhosny added a commit to harfbuzz/harfbuzz that referenced this issue Nov 24, 2024
Remove HB_NO_LANGUAGE_LONG and HB_NO_LANGUAGE_PRIVATE_SUBTAG defines to
support language tags longer than 2 letters and private language tags
(needed to set language using OpenType language tags) respectively.

HB_LEAN is used when smaller binary size is desired, but in general it
should not produce different shaping output.

Fixes harfbuzz/harfbuzzjs#108
behdad pushed a commit to harfbuzz/harfbuzz that referenced this issue Nov 25, 2024
Remove HB_NO_LANGUAGE_LONG and HB_NO_LANGUAGE_PRIVATE_SUBTAG defines to
support language tags longer than 2 letters and private language tags
(needed to set language using OpenType language tags) respectively.

HB_LEAN is used when smaller binary size is desired, but in general it
should not produce different shaping output.

Fixes harfbuzz/harfbuzzjs#108
khaledhosny added a commit that referenced this issue Nov 25, 2024
Switch HarfBuzz submodule to main branch to have the fix and add tests
for it.

Related to #108