[Myanmar] Syllable matching and punctuation #164

Open
wezm opened this issue Jun 21, 2024 · 28 comments

@wezm

wezm commented Jun 21, 2024

I'm working on Myanmar shaping in Allsorts and have a query about how punctuation should be handled in syllable splitting. There are these punctuation characters in the Myanmar character tables but they don't seem to be matched by any rules.

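| Codepoint | Unicode category | Shaping class | Mark-placement subclass | Glyph            |
|:----------|:-----------------|:--------------|:------------------------|:-----------------|
| U+104A    | Punctuation      | null          | null                    | ၊ Little Section |
| U+104B    | Punctuation      | null          | null                    | ။ Section        |
| U+104C    | Punctuation      | null          | null                    | ၌ Locative       |
| U+104D    | Punctuation      | null          | null                    | ၍ Completed      |
| U+104F    | Punctuation      | null          | null                    | ၏ Genitive       |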

I've run my implementation against this text "ပို၍စောစီးစွာပေးပါက" and ၍ is tripping it up. It has no shaping class/rules that match it in the syllable identification details.

There are these two notes though:

Assigned codepoints with a null in the Shaping class column evoke no special behavior from the shaping engine.

and

A sequence that does not match any of these expressions should be regarded as broken. The shaping engine may make a best-effort attempt to shape the broken sequence, but making guarantees about the correctness or appearance of the final result is out of scope for this document.

I'm wondering how these characters should be handled, since their use doesn't feel like a broken expression?

One other note: ။ and ၊ are referenced in the non-terminal `_punc_ = "Little Section" | "Section"`; however, `_punc_` does not appear to be used. Wondering if that's intended?

Edit: I see the following on the OpenType Myanmar page:

Simple non-compounding cluster

<P | S | R | WJ| WS | O | D0 >

Punctuation (P), symbols (S), reserved characters from the Myanmar block (R), word joiner (WJ), white space (WS), and other SCRIPT_COMMON characters (O) contain one character per cluster.

Which suggests ၍ and friends should be accepted as clusters by themselves.
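One way to read the quoted rule, as a minimal sketch: each punctuation codepoint from the table above forms a cluster of its own, and everything else is left for the regular syllable expressions. The function name and the set are hypothetical and are not taken from Allsorts, HarfBuzz, or either spec.

```python
# Hypothetical sketch: Myanmar punctuation from the table above, treated as
# one-character clusters per the "simple non-compounding cluster" rule.
STANDALONE_PUNCTUATION = {0x104A, 0x104B, 0x104C, 0x104D, 0x104F}


def split_standalone(text):
    """Split text into runs, isolating each standalone punctuation character.

    Other characters stay in undivided runs; a real shaper would go on to
    apply the full syllable expressions to those runs.
    """
    clusters = []
    run = ""
    for ch in text:
        if ord(ch) in STANDALONE_PUNCTUATION:
            if run:
                clusters.append(run)
                run = ""
            clusters.append(ch)
        else:
            run += ch
    if run:
        clusters.append(run)
    return clusters
```

With this reading, ၍ (U+104D) in the sample text would come out as its own cluster rather than breaking the surrounding syllables.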

@n8willis
Owner

Taking a look now! Thanks for the report & detail here; it's not a page that I think a lot of third-party readers have gone through yet....

@n8willis
Owner

So, just briefly, HarfBuzz merges the _punc_ class in with the generic bases (_gb_), which would allow them to also match the more complex syllable expressions; it also merges U+104C-104F into a single syllable-modifier / bindu class that includes several things, like Shan tones, that are treated distinctly in the official MS / OTL docs. Those, therefore, match in expressions that are defined for the Shan tones and other modifiers, and don't match where the "symbol" class would (as standalone).

It's not clear to me yet if there is a need for that, or if it got rolled in for simplification. There are several issue threads / discussions from c. 2022 where the original Myanmar shaper in HarfBuzz was getting refined to be more robust (it was originally based on the Indic2 shaper, AIUI) and some of that work involved trying to trim down the overall number of codepoint classes, which was high in comparison to some of its neighbors.

I've found a few language sources to poke into if I can get my head around them, though. Because, to be honest, I started to wonder if Unicode really got it right with calling U+104C-F "punctuation" in the first place. HarfBuzz merging those in with syllable modifiers sounds more like a reasonable re-classification, rather than a "byte-saving optimization"....

@n8willis n8willis self-assigned this Jun 25, 2024
@wezm
Author

wezm commented Jun 25, 2024

Thanks for looking into it

@wezm
Author

wezm commented Jun 26, 2024

Another query/thing I've run into. In the stage 2 initial-reordering step, some characters aren't being tagged, such as those with the NUMBER category and some punctuation like hyphen and en dash.

@n8willis
Owner

As a side note, would it be more useful to drop the usage of the term "syllable"? E.g., in favor of something more technically precise, like "cluster"?

I probably went with syllable initially for reasons of new-reader-familiarity, but that does come at a cost....

@wezm
Author

wezm commented Jun 26, 2024

As a side note, would it be more useful to drop the usage of the term "syllable"? E.g., in favor of something more technically precise, like "cluster"?

I don't have strong feelings one way or another but https://learn.microsoft.com/en-us/typography/script-development/myanmar#analyzing-the-characters uses "syllable clusters", "character clusters", and just plain "cluster", so perhaps cluster is the more consistent choice.

@wezm
Author

wezm commented Jul 5, 2024

I think there's an omission in the matching rules. _sm_ is unreferenced. I think that _v_* in:

Tcomplex= _asat_* Med Vmain Vpost* Pwo* _v_* Z?

should be something like (_v_ | _sm_)* to give:

Tcomplex= _asat_* Med Vmain Vpost* Pwo* (_v_ | _sm_)* Z?

For this example "င်္က္ကျြွှေို့်ာှီ့ၤဲံ့းႍ" this change would allow the last character to be matched, which it does not currently:

        | U+1004 | Letter    | CONSONANT         | _null_                       | Nga                    |  _ra_         ⎫
        | U+103A | Mark [Mn] | PURE_KILLER       | TOP_POSITION                 | Asat                   |  _asat_       ⎬ Kinzi (K)
        | U+1039 | Mark [Mn] | INVISIBLE_STACKER | _null_                       | Virama                 |  _halant_     ⎭
        | U+1000 | Letter    | CONSONANT         | _null_                       | Ka                     |  C
        | U+1039 | Mark [Mn] | INVISIBLE_STACKER | _null_                       | Virama                 |  _halant_
        | U+1000 | Letter    | CONSONANT         | _null_                       | Ka                     |  C
        | U+103B | Mark [Mc] | CONSONANT_MEDIAL  | RIGHT_POSITION               | Sign Medial Ya         |  _my_         ⎫
        | U+103C | Mark [Mc] | CONSONANT_MEDIAL  | TOP_LEFT_AND_BOTTOM_POSITION | Sign Medial Ra         |  _mr_         ⎬ Med
        | U+103D | Mark [Mn] | CONSONANT_MEDIAL  | BOTTOM_POSITION              | Sign Medial Wa         |  _mw_         ⎟
        | U+103E | Mark [Mn] | CONSONANT_MEDIAL  | BOTTOM_POSITION              | Sign Medial Ha         |  _mh_         ⎭
        | U+1031 | Mark [Mc] | VOWEL_DEPENDENT   | LEFT_POSITION                | Sign E                 |  _matrapre_   ⎫
        | U+102D | Mark [Mn] | VOWEL_DEPENDENT   | TOP_POSITION                 | Sign I                 |  _matraabove_ ⎟
        | U+102F | Mark [Mn] | VOWEL_DEPENDENT   | BOTTOM_POSITION              | Sign U                 |  _matrabelow_ ⎬ Vmain
        | U+1037 | Mark [Mn] | TONE_MARKER       | BOTTOM_POSITION              | Dot Below              |  _db_         ⎟
        | U+103A | Mark [Mn] | PURE_KILLER       | TOP_POSITION                 | Asat                   |  _asat_       ⎭
        | U+102C | Mark [Mc] | VOWEL_DEPENDENT   | RIGHT_POSITION               | Sign Aa                |  _matrapost_  ⎫
        | U+103E | Mark [Mn] | CONSONANT_MEDIAL  | BOTTOM_POSITION              | Sign Medial Ha         |  _mh_         ⎬ Vpost
        | U+102E | Mark [Mn] | VOWEL_DEPENDENT   | TOP_POSITION                 | Sign Ii                |  _matraabove_ ⎟
        | U+1037 | Mark [Mn] | TONE_MARKER       | BOTTOM_POSITION              | Dot Below              |  _db_         ⎭
        | U+1064 | Mark [Mc] | TONE_MARKER       | RIGHT_POSITION               | Tone Sgaw Karen Ke Pho |  _pt_         ⎫
        | U+1032 | Mark [Mn] | VOWEL_DEPENDENT   | TOP_POSITION                 | Sign Ai                |  _a_          ⎟
        | U+1036 | Mark [Mn] | BINDU             | TOP_POSITION                 | Anusvara               |  _a_          ⎬ Pwo
        | U+1037 | Mark [Mn] | TONE_MARKER       | BOTTOM_POSITION              | Dot Below              |  _db_         ⎟
        | U+1038 | Mark [Mc] | VISARGA           | RIGHT_POSITION               | Visarga                |  _v_          ⎭
        | U+108D | Mark [Mn] | TONE_MARKER       | BOTTOM_POSITION              | Sign Shan Council Emphatic Tone|
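To illustrate the effect of the proposed change on the tail of the expression only, here is a small sketch. It encodes two classes as single letters and compares `_v_*` against `(_v_ | _sm_)*`. The one-letter encoding and the helper name are hypothetical, and placing U+108D in `_sm_` is an assumption taken from the comment above rather than from the spec text.

```python
import re

# Hypothetical one-letter encoding of two classes from the thread:
# "v" stands for _v_ (Visarga, U+1038), "s" for _sm_ (assumed to hold U+108D).
CLASS_OF = {0x1038: "v", 0x108D: "s"}

TAIL_CURRENT = re.compile(r"v*")         # _v_*
TAIL_PROPOSED = re.compile(r"(?:v|s)*")  # (_v_ | _sm_)*


def matched_length(pattern, codepoints):
    """Return how many leading codepoints the tail pattern consumes."""
    encoded = "".join(CLASS_OF[cp] for cp in codepoints)
    return pattern.match(encoded).end()
```

For the last two codepoints of the example (U+1038, U+108D), the current tail stops after the Visarga, while the proposed tail consumes both.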

@n8willis
Owner

n8willis commented Jul 9, 2024

That does look correct; I am trying to untangle some other differences between the MS and HB regex categories, though. Sorry I've been less responsive here for a bit; just juggling some other things. Hope to have an update worth looking at shortly. I just don't want to mangle some of the changed bits without understanding why some of the other shapers are doing something different than they did when this was written.

@n8willis
Owner

Question: How does the Allsorts team view the combining of categories, in general? The fact that HarfBuzz does that is one of the reasons it can take a minute to get back up to speed when comparing its regular expressions to the MS script docs' (which don't do that).

It's certainly practical for implementers, no doubt. There might be some middle-of-the-road approach for documenting, like just combining classes that are purely sets of individual characters, but not combining sets of expressions unless they really simplify the final syllable/cluster-matching expressions.

@wezm
Author

wezm commented Jul 11, 2024

Question: How does the Allsorts team view the combining of categories, in general?

Do you mean things like this?

_consonant_ 	= `CONSONANT` | `CONSONANT_PLACEHOLDER` - _ra_

and 

C	= _consonant_ | _ra_

If so, I'm not sure there are strong feelings one way or another. It probably does help the implementation be a bit more readable.

@n8willis
Owner

Yeah, a more apropos phrasing would probably be just saying "if any of the category combining causes trip-ups or is confusing, please consider that a bug". In particular, here I was wondering about the subtraction of the _ra_ class from the _consonant_ class. I think that might be the only place I attempted a "difference" operator; coming back to it after some time had elapsed I can't recall why that seemed like a good idea.

@n8willis
Owner

So, there are going to be a couple of changes required for sure. One is that the regular Visarga codepoint actually matches both the _v_ and _sm_ sets, which is somewhat harmless but a bit confusing for the reader. So I would just drop _v_ and put in a category match in the _sm_ definition to capture the other VISARGA-class codepoint(s). So, in your example above, you could just use _sm_ and not worry about the OR.

The one place where I'm not sure that handling explicit Visarga with other _sm_ codepoints wouldn't cause problems is in Sanskrit, because the Vedic Extensions block has some overstruck visarga-related signs, and I don't know if those are classified right in the table. Although HarfBuzz seems not to worry about that....

HarfBuzz also updated its medial logic in response to w3c/font-text-cg#43 (comment) to separate Medial Mon La, which is currently grouped in with Medial Ha, but can behave differently. I think that would just be as simple as a _ml_ = U+1060 and changing Med to _my_? _asat_? _mr_? ( (_mw_ _mh_? _ml_? | _mh_ _ml_? | _ml_) _asat_?)?
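As a rough check of the proposed Med expression, the sketch below transliterates it into a Python regular expression over one-letter class codes. The letter encoding and the helper name are hypothetical; the codepoint-to-class assignments follow the table earlier in the thread, plus `_ml_ = U+1060` as suggested above.

```python
import re

# Hypothetical one-letter codes for the classes used in Med.
CLASS_OF = {
    0x103B: "y",  # _my_   Medial Ya
    0x103C: "r",  # _mr_   Medial Ra
    0x103D: "w",  # _mw_   Medial Wa
    0x103E: "h",  # _mh_   Medial Ha
    0x1060: "l",  # _ml_   Medial Mon La (proposed new class)
    0x103A: "a",  # _asat_
}

# _my_? _asat_? _mr_? ( (_mw_ _mh_? _ml_? | _mh_ _ml_? | _ml_) _asat_?)?
MED = re.compile(r"y?a?r?(?:(?:wh?l?|hl?|l)a?)?")


def is_med(codepoints):
    """Return True if the whole codepoint sequence matches the Med expression."""
    encoded = "".join(CLASS_OF[cp] for cp in codepoints)
    return MED.fullmatch(encoded) is not None
```

Under this transliteration, the Ya, Ra, Wa, Ha run from the earlier example matches, as does a lone Medial Mon La, while Ha followed by Wa does not.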

And I also think I should revisit the merging of some of the non-complex character sets; since _punc_ is unused that should be fixed etc. The MS docs have a few more things that are classified as PUNCTUATION here in with Symbols, but HarfBuzz considers them Generic Bases. Probably doesn't matter, but it might be easier reading with a little cleanup.

@wezm
Author

wezm commented Jul 15, 2024

here I was wondering about the subtraction of the _ra_ class from the _consonant_ class. I think that might be the only place I attempted a "difference" operator; coming back to it after some time had elapsed I can't recall why that seemed like a good idea.

I will admit that I missed the subtraction initially. Also it's a little curious that _consonant_ subtracts _ra_ but _consonant_ isn't used aside from in C, which adds _ra_ back.
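The round trip can be shown with plain sets. This is only a sketch: the members are a tiny illustrative sample, the placeholder codepoint is an assumption, and it assumes the difference applies to the whole union (the expression as written does not say which way the precedence goes).

```python
# Illustrative sample members only; the real classes are far larger.
CONSONANT = {0x1000, 0x1004}      # Ka, Nga (both CONSONANT in the table above)
CONSONANT_PLACEHOLDER = {0x25CC}  # assumed member, for illustration only
RA = {0x1004}                     # the table above tags Nga as _ra_

# Assumed reading: the difference applies to the whole union.
consonant = (CONSONANT | CONSONANT_PLACEHOLDER) - RA

# C adds _ra_ straight back, so the subtraction has no net effect on C.
C = consonant | RA
```

Since `_consonant_` is only used inside C, the net result is the plain union, which is why the subtraction is easy to miss.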

@n8willis
Owner

I will admit that I missed the subtraction initially. Also it's a little curious that _consonant_ subtracts _ra_ but _consonant_ isn't used aside from in C, which adds _ra_ back.

I got that initially from the Microsoft documentation and, at the time, HarfBuzz was following it perhaps a bit more closely.

I think that the intent was likely that you would need to have a different definition of "consonant" for the regular expressions than you would use within a consonant-based syllable as you're identifying the base in shaping-stage 1. But that might not be necessary anymore (and perhaps was not necessary then, either...). But I'm rereading it again now to see if I still get it. Since the Kinzi sequence is not ambiguous, the base-finding algorithm can just match it without needing different classes. The consonant-placeholders I'm not as sure about, though.

@n8willis
Owner

@wezm I pushed an update to the identification classes and regular expressions in stage-1, in PR #168. When you have a moment, please take a look and let me know. I retained the _v_ class for VISARGA and merged it in with the existing _sm_ class, rather than just adding visarga to the _sm_ group, because of the Vedic Extensions visargas.

I also added a merge class _G_ that lumps the punctuation class in with the generic bases and the digits; that's the route that HarfBuzz takes as well, and it's slightly simpler than treating the punctuation separately.... Since Microsoft has a different split of symbol vs punctuation, I figure if the simple approach works, it's worth a try.

@wezm
Author

wezm commented Sep 4, 2024

I haven't attempted an implementation yet but the changes look good.

@wezm
Author

wezm commented Sep 5, 2024

I'm working on updating my implementation. One thing I encountered (this isn't new) is in:

All of the left-side dependent-vowel (matra) signs matching this condition in Myanmar can be identified using the matrapre regular-expression class defined in stage 1.

_matrapre_ is defined as `_matrapre_ = MATRA & LEFT_POSITION`; however, MATRA (as a class) isn't defined anywhere. Does it equate to the `VOWEL_DEPENDENT` shaping class in the character tables? (That's what I've been doing so far.)
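For reference, the interpretation I've been using can be sketched as an intersection of the two table columns. This assumes MATRA means the `VOWEL_DEPENDENT` shaping class, which is exactly the open question; the table excerpt and function name are hypothetical stand-ins for the full character tables.

```python
# Small excerpt of (shaping class, mark-placement subclass) pairs, copied
# from the rows shown earlier in this thread.
CHARACTER_TABLE = {
    0x1031: ("VOWEL_DEPENDENT", "LEFT_POSITION"),    # Sign E
    0x102D: ("VOWEL_DEPENDENT", "TOP_POSITION"),     # Sign I
    0x102F: ("VOWEL_DEPENDENT", "BOTTOM_POSITION"),  # Sign U
    0x102C: ("VOWEL_DEPENDENT", "RIGHT_POSITION"),   # Sign Aa
    0x103C: ("CONSONANT_MEDIAL", "TOP_LEFT_AND_BOTTOM_POSITION"),  # Medial Ra
}


def is_matrapre(codepoint):
    """Assumed reading of _matrapre_: VOWEL_DEPENDENT and LEFT_POSITION."""
    shaping_class, placement = CHARACTER_TABLE.get(codepoint, (None, None))
    return shaping_class == "VOWEL_DEPENDENT" and placement == "LEFT_POSITION"
```

Of the rows in this excerpt, only Sign E (U+1031) satisfies both conditions.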

@wezm
Author

wezm commented Sep 11, 2024

Some more notes as I progress the implementation:

I noticed a difference between our output and Harfbuzz. This was due to Harfbuzz applying rlig by default. I note that in the default shaping model docs this is enabled as part of the default set. Perhaps it should be added to the Myanmar docs too.

_punc_ only includes Section and Little section. Should it also include the other characters in the Punctuation Unicode category listed in the Myanmar character tables? Such as:

  • ၌ Locative
  • ၍ Completed
  • ၎ Aforementioned
  • ၏ Genitive

@wezm
Author

wezm commented Sep 11, 2024

Should characters with shaping class NUMBER be able to be base?

Example: ႐ုံ

  • U+1090 // MYANMAR SHAN DIGIT ZERO
  • U+102F // MYANMAR VOWEL SIGN U, Mark, Bottom
  • U+1036 // MYANMAR SIGN ANUSVARA, Mark, Top

U+1090 is matched by (C | _vowel_ | G) as part of G. Given the position in the expression it seems like the intent is that all these can be the base consonant. However in the reordering step only characters with shaping class CONSONANT are considered when determining what to assign POS_BASE_CONSONANT to.
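The mismatch can be shown with a simplified base search. This is a sketch only: it picks the first character whose shaping class is allowed to be a base, which is not necessarily how the documented reordering step searches, and the function name and class sets are hypothetical.

```python
def find_base(cluster, base_classes):
    """Return the index of the first character allowed to act as base, or None.

    `cluster` is a list of (codepoint, shaping class) pairs.
    """
    for index, (_codepoint, shaping_class) in enumerate(cluster):
        if shaping_class in base_classes:
            return index
    return None


# The example cluster: Shan Digit Zero, Vowel Sign U, Anusvara.
EXAMPLE = [
    (0x1090, "NUMBER"),
    (0x102F, "VOWEL_DEPENDENT"),
    (0x1036, "BINDU"),
]
```

With only `CONSONANT` allowed, the example cluster has no base at all, even though the syllable expression accepted it; allowing `NUMBER` makes U+1090 the base.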

@n8willis
Owner

Some more notes as I progress the implementation:

I noticed a difference between our output and Harfbuzz. This was due to Harfbuzz applying rlig by default. I note that in the default shaping model docs this is enabled as part of the default set. Perhaps it should be added to the Myanmar docs too.

Yeah; if it's found in the text corpus then I'd concur that it ought to be documented (and for the other shapers, too). I think it feels off to document some of those other always-on-in-HB features as necessary, though, in particular the ones that deal with cursive styling. I guess the argument is that if the type designer puts a curs or calt feature in, then it's most likely necessary for getting the correct output back out.

Because rlig and rclt are on-by-default in the MS spec and not meant to be exposed in the UI, they pretty much have to be handled in the default path as non-optional. But it feels to me like it strays somewhat from the mission of describing how script-specific shaping is performed to simply drop in a list of features to be applied universally in just-in-case fashion. But maybe I'm overthinking it; adding a Note: to explain it might be all that's required.

@n8willis
Owner

_punc_ only includes Section and Little section. Should it also include the other characters in the Punctuation Unicode category listed in the Myanmar character tables? Such as:

* ၌ Locative

* ၍ Completed

* ၎ Aforementioned

* ၏ Genitive

I'm looking into this. The ၎ Aforementioned seems to be a different animal, at least in Burmese, but I'd like to get a bit of info about some of the other languages.

@n8willis
Owner

Should characters with shaping class NUMBER be able to be base?

Example: ႐ုံ

* U+1090 // MYANMAR SHAN DIGIT ZERO

* U+102F // MYANMAR VOWEL SIGN U, Mark, Bottom

* U+1036 // MYANMAR SIGN ANUSVARA, Mark, Top

U+1090 is matched by (C | _vowel_ | G) as part of G. Given the position in the expression it seems like the intent is that all these can be the base consonant. However in the reordering step only characters with shaping class CONSONANT are considered when determining what to assign POS_BASE_CONSONANT to.

This is a tricky one. I believe that the inclusion of numbers in the syllable expression is needed in order to handle ordinal-number syllables (1st, etc). But I wouldn't expect that a numeral would get medial or subjoined consonants attached.

@n8willis
Owner

Should characters with shaping class NUMBER be able to be base?

Okay; it seems like it is a definite yes that a NUMBER can serve as a base, in order to handle quasi-word clusters like numerical quantities with units attached.

Similarly, U+104C, U+104D, and U+104F need to be able to take some marks (for tones, primarily), and U+104E can take those but also is commonly (possibly always) found with Asat.

So U+104C/D/F should likely go into the _gb_ basic class. That ends up combining with _punc_ anyway, but I think that's because HarfBuzz is trying to be broad.

@wezm if you have the time, it would be nice to look at the corpus and pull out how many U+104E do or don't get an Asat right after them.
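A count like that could be gathered with something along these lines. It is a minimal sketch that takes "right after" literally (Asat, U+103A, as the very next codepoint); the function name is hypothetical and loading the corpus is left out.

```python
AFOREMENTIONED = "\u104E"
ASAT = "\u103A"


def count_aforementioned(text):
    """Count U+104E occurrences that are, or are not, directly followed by Asat.

    Returns a (with_asat, without_asat) tuple.
    """
    with_asat = 0
    without_asat = 0
    for index, ch in enumerate(text):
        if ch != AFOREMENTIONED:
            continue
        # Slicing avoids an IndexError when U+104E is the last character.
        if text[index + 1:index + 2] == ASAT:
            with_asat += 1
        else:
            without_asat += 1
    return with_asat, without_asat
```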

I'm told that U+104C/D/F are historic and mainly found in manuscripts, so there may not be a lot of examples to find. But if U+104F, Asat needs to be considered broken, it'd be marginally better to preserve the distinction between C/D/F and U+104E in case there's a clarification later.

A second question the corpus might shed light on is how many multi-digit numbers there are in "word-like" expressions, with full letters or marks appended. For me it's still an open question how a 1234 sequence with a unit abbreviation or a mark would be found in logical order.

So a second corpus-mining request would be any examples of multi-digit numbers that have letters or marks. The wiser minds I have been learning from say that U+102D is the most widespread, so if there are a ton of those in particular, weeding them out to just take a look at the other cases would be totally cool.
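One possible shape for that search, as a sketch: flag tokens where a run of two or more decimal digits is directly followed by a letter or mark, optionally skipping U+102D. The function name, the `skip` parameter, and the choice of Unicode general categories as the test are all assumptions, not anything defined in the docs.

```python
import unicodedata


def digit_runs_with_suffix(tokens, skip=frozenset({"\u102D"})):
    """Return tokens where 2+ digits are directly followed by a letter or mark.

    Characters listed in `skip` (U+102D by default) do not count as a hit.
    """
    hits = []
    for token in tokens:
        run = 0  # length of the current run of consecutive digits
        for ch in token:
            category = unicodedata.category(ch)
            if category == "Nd":
                run += 1
                continue
            if run >= 2 and category[0] in "LM" and ch not in skip:
                hits.append(token)
                break
            run = 0
    return hits
```

Passing an empty `skip` set would include the U+102D cases as well.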

I'm still leaning towards preserving _d_ as its own basic group, both because it's possible the rules for how number+abbreviations work gets clarified and because it might be handy for implementers doing other things for numbers.

@n8willis
Owner

So, a simple search of good.my in https://github.com/yeslogic/corpus/ turns up quite a few tokens that involve number+mark sequences, like ၉့ (Nine,DotBelow) and ၀ှ (Zero, Medial Ha).

I don't find those in the Google corpuscrawler wordcount file, though. And I'm still unsure if all of the "unit" cases are suffixes. Or, perhaps more precisely, if the "unit" cases attach to the least-significant digit.

Still, good to know that medial consonants and _db_ matter for this case.

@n8willis
Owner

I've also encountered a number of tokens that have Myanmar Digit Zero (U+1040) within a word, where it seems like there is a good chance that character might have been a typo for Wa (U+101D), considering the similarity in shape: ၀ vs ဝ.

I think Google Translate autocorrects some of those, because some "zero"-containing words have translations, but if there are any full word-pairs that differ only by the letter being ၀ or ဝ, that would be useful to know.
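Finding such pairs in a word list is straightforward. The sketch below replaces every Digit Zero in a word with Wa and checks whether the result is also in the list; the function name is hypothetical, and words that mix a genuine zero with a mistyped one would be missed by this all-or-nothing replacement.

```python
ZERO = "\u1040"  # MYANMAR DIGIT ZERO
WA = "\u101D"    # MYANMAR LETTER WA


def zero_wa_pairs(words):
    """Return (zero_form, wa_form) pairs where both spellings occur in `words`."""
    vocabulary = set(words)
    pairs = []
    for word in sorted(vocabulary):
        if ZERO in word:
            candidate = word.replace(ZERO, WA)
            if candidate in vocabulary:
                pairs.append((word, candidate))
    return pairs
```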

@n8willis
Owner

(clarifying the above: that's in the corpuscrawler count. The "my / Burmese" file in the list here (1007K, so not linking directly): https://github.com/google/corpuscrawler )

@n8willis
Owner

quite a few tokens that involve number+mark sequences, like ၉့ (Nine,DotBelow) and ၀ှ (Zero, Medial Ha).

Tangentially, it would also be informative to catalogue any sequences that are "Number, two or more Marks", which I don't think I've found in the frequency lists, and Number,Mark sequences that involve a Medial Ra (there certainly might not be any abbreviations/units with Medial Ra in them, but how the reordering and "enclosing" is handled could be illuminating).

@wezm
Author

wezm commented Oct 29, 2024

Just catching up on things. It's a bit unclear, what's currently needed from me?
