[Myanmar] Syllable matching and punctuation #164
Taking a look now! Thanks for the report & detail here; it's not a page that I think a lot of third-party readers have gone through yet.... |
So, just briefly, HarfBuzz merges the punc class in with the generic bases (gb), which would allow them to also match the more complex syllable expressions; it also merges U+104C-104F into a single syllable-modifier / bindu class that includes several things, like Shan tones, that are treated distinctly in the official MS / OTL docs. Those, therefore, match in expressions that are defined for the Shan tones and other modifiers, and don't match where the "symbol" class would (as standalone). It's not clear to me yet if there is a need for that, or if it got rolled in for simplification. There are several issue threads / discussions from c. 2022 where the original Myanmar shaper in HarfBuzz was getting refined to be more robust (it was originally based on the Indic2 shaper, AIUI) and some of that work involved trying to trim down the overall number of codepoint classes, which was high in comparison to some of its neighbors. I've found a few language sources to poke into if I can get my head around them, though. Because, to be honest, I started to wonder if Unicode really got it right with calling U+104C-F "punctuation" in the first place. HarfBuzz merging those in with syllable modifiers sounds more like a reasonable re-classification, rather than a "byte-saving optimization".... |
Thanks for looking into it |
Another query/thing I've run into. In the stage 2, initial reordering step some characters aren't being tagged such as those with the |
As a side note, would it be more useful to drop the usage of the term "syllable"? E.g., in favor of something more technically precise, like "cluster"? I probably went with syllable initially for reasons of new-reader-familiarity, but that does come at a cost.... |
I don't have strong feelings one way or another, but https://learn.microsoft.com/en-us/typography/script-development/myanmar#analyzing-the-characters uses "syllable clusters", "character clusters", and just plain "cluster", so perhaps cluster is the more consistent choice. |
I think there's an omission in the matching rules.
should be something like
For this example "င်္က္ကျြွှေို့်ာှီ့ၤဲံ့းႍ" this change would allow the last character to be matched, which it does not currently:
|
That does look correct; I am trying to untangle some other differences between the MS and HB regex categories, though. Sorry I've been less responsive here for a bit; just juggling some other things. Hope to have an update worth looking at shortly. I just don't want to mangle some of the changed bits without understanding why some of the other shapers are doing something different than they did when this was written. |
Question: How does the Allsorts team view the combining of categories, in general? The fact that HarfBuzz does that is one of the reasons it can take a minute to get back up to speed when comparing its regular expressions to the MS script docs' (which don't do that). It's certainly practical for implementers, no doubt. There might be some middle-of-the-road approach for documenting, like just combining classes that are purely sets of individual characters, but not combining sets of expressions unless they really simplify the final syllable/cluster-matching expressions. |
Do you mean things like this?
If so, I'm not sure there are strong feelings one way or another. It probably does help the implementation be a bit more readable. |
Yeah, a more apropos phrasing would probably be just saying "if any of the category combining causes trip-ups or is confusing, please consider that a bug". In particular, here I was wondering about the subtraction of the |
So, there are going to be a couple of changes required for sure. One is that the regular Visarga codepoint actually matches both the The one place where I'm not sure that handling explicit Visarga with other HarfBuzz also updated its medial logic in response to w3c/font-text-cg#43 (comment) to separate Medial Mon La, which is currently grouped in with Medial Ha, but can behave differently. I think that would just be as simple as a And I also think I should revisit the merging of some of the non-complex character sets; since |
I will admit that I missed the subtraction initially. Also it's a little curious that |
I got that initially from the Microsoft documentation and, at the time, HarfBuzz was following it perhaps a bit more closely. I think that the intent was likely that you would need to have a different definition of "consonant" for the regular expressions than you would use within a consonant-based syllable as you're identifying the base in shaping-stage 1. But that might not be necessary anymore (and perhaps was not necessary then, either...). But I'm rereading it again now to see if I still get it. Since the Kinzi sequence is not ambiguous, the base-finding algorithm can just match it without needing different classes. The consonant-placeholders I'm not as sure about, though. |
@wezm I pushed an update to the identification classes and regular expressions in stage-1, in PR #168. When you have a moment, please take a look and let me know. I retained the I also added a merge class |
I haven't attempted an implementation yet but the changes look good. |
I'm working on updating my implementation. One thing I encountered (this isn't new) is in:
|
Some more notes as I progress the implementation: I noticed a difference between our output and HarfBuzz. This was due to HarfBuzz applying
|
Should characters with shaping class Example: ႐ုံ
U+1090 is matched by |
Yeah; if it's found in the text corpus then I'd concur that it ought to be documented (and for the other shapers, too). I think it feels off to document some of those other always-on-in-HB features as necessary, though, in particular the ones that deal with cursive styling. I guess the argument is that if the type designer puts a Because |
I'm looking into this. The ၎ Aforementioned seems to be a different animal, at least in Burmese, but I'd like to get a bit of info about some of the other languages. |
This is a tricky one. I believe that the inclusion of numbers in the syllable expression is needed in order to handle ordinal-number syllables ( |
Okay; it seems like it is a definite yes that a Similarly, U+104C, U+104D, and U+104F need to be able to take some marks (for tones, primarily), and U+104E can take those but also is commonly (possibly always) found with Asat. So U+104C/D/F should likely go into the @wezm if you have the time, it would be nice to look at the corpus and pull out how many U+104E do or don't get an Asat right after them. I'm told that U+104C/D/F are historic and mainly found in manuscripts, so there may not be a lot of examples to find. But if A second question the corpus could shed light on is how many multi-digit numbers there are that exist in "word like" expressions, with full letters or marks appended. For me it's kind of still an open question how a So a second corpus-mining request would be any examples of multi-digit numbers that have letters or marks. The wiser minds I have been learning from say that I'm still leaning towards preserving |
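The first corpus-mining request (how many U+104E do or don't get an Asat right after them) could be sketched roughly as follows. This is only an illustration: the function name `count_aforementioned` is hypothetical, "right after" is read literally as "the very next code point is U+103A", and how the corpus text is loaded is left out.

```python
AFOREMENTIONED = "\u104E"  # MYANMAR SYMBOL AFOREMENTIONED
ASAT = "\u103A"            # MYANMAR SIGN ASAT


def count_aforementioned(text):
    """Return (with_asat, without_asat) counts for U+104E in text.

    Assumption: "with Asat" means the immediately following code point
    is U+103A; an Asat later in the same cluster is not counted.
    """
    with_asat = 0
    without_asat = 0
    for i, ch in enumerate(text):
        if ch != AFOREMENTIONED:
            continue
        # Slicing avoids an IndexError when U+104E is the last character.
        if text[i + 1:i + 2] == ASAT:
            with_asat += 1
        else:
            without_asat += 1
    return with_asat, without_asat
```

If the interesting case turns out to be an Asat anywhere in the same cluster rather than directly adjacent, the check on the next code point would need to be widened accordingly.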
So, a simple search of good.my in https://github.com/yeslogic/corpus/ turns up quite a few tokens that involve number+mark sequences, like ၉့ (Nine,DotBelow) and ၀ှ (Zero, Medial Ha). I don't find those in the Google corpuscrawler wordcount file, though. And I'm still unsure if all of the "unit" cases are suffixes. Or, perhaps more precisely, if the "unit" cases attach to the least-significant digit. Still, good to know that medial consonants and |
I've also encountered a number of tokens that have Myanmar Digit Zero ( I think Google Translate autocorrects some of those, because some "zero"-containing words have translations, but if there are any full word-pairs that differ only by the letter being ၀ or ဝ, that would be useful to know. |
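A rough way to look for such word-pairs in a word list is sketched below. The helper name `zero_wa_pairs` is hypothetical, the input is assumed to be an iterable of already-tokenized words, and the sketch swaps every ၀ (U+1040) in a word for ဝ (U+101D) at once, so it would miss pairs where only some of the zeros differ.

```python
ZERO = "\u1040"  # MYANMAR DIGIT ZERO
WA = "\u101D"    # MYANMAR LETTER WA


def zero_wa_pairs(words):
    """Return sorted (zero_form, wa_form) pairs that both occur in words."""
    vocab = set(words)
    pairs = set()
    for word in vocab:
        if ZERO not in word:
            continue
        # Candidate spelling with the digit replaced by the letter.
        alt = word.replace(ZERO, WA)
        if alt in vocab:
            pairs.add((word, alt))
    return sorted(pairs)
```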
(clarifying the above: that's in the corpuscrawler count. The "my / Burmese" file in the list here (1007K, so not linking directly): https://github.com/google/corpuscrawler ) |
Tangentially, it would also be informative to catalogue any sequences that are "Number, two or more Marks", which I don't think I've found in the frequency lists, and Number,Mark sequences that involve a Medial Ra (there certainly might not be any abbreviations/units with Medial Ra in them, but how the reordering and "enclosing" is handled could be illuminating). |
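One possible way to build that catalogue from a token list is sketched here. It is an approximation under stated assumptions: "Number" is taken to mean the Myanmar digits U+1040-U+1049, "Mark" is taken to mean any code point whose Unicode general category starts with `M` (which is not the same thing as the shaping classes in the script docs), and `digit_mark_sequences` is a hypothetical name.

```python
import unicodedata
from collections import Counter

DIGITS = {chr(cp) for cp in range(0x1040, 0x104A)}  # Myanmar digits 0-9
MEDIAL_RA = "\u103C"  # MYANMAR CONSONANT SIGN MEDIAL RA


def digit_mark_sequences(tokens):
    """Count digit+mark runs found in tokens.

    Returns two Counters keyed by the matched substring:
    runs with two or more marks, and runs containing Medial Ra.
    """
    multi = Counter()
    with_ra = Counter()
    for token in tokens:
        i = 0
        while i < len(token):
            if token[i] not in DIGITS:
                i += 1
                continue
            # Extend over every following code point in category Mn/Mc/Me.
            j = i + 1
            while j < len(token) and unicodedata.category(token[j]).startswith("M"):
                j += 1
            marks = token[i + 1:j]
            if len(marks) >= 2:
                multi[token[i:j]] += 1
            if MEDIAL_RA in marks:
                with_ra[token[i:j]] += 1
            i = j
    return multi, with_ra
```

Swapping the general-category test for a lookup against the shaper's own mark classes would make the results line up more closely with the syllable expressions.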
Just catching up on things. It's a bit unclear, what's currently needed from me? |
I'm working on Myanmar shaping in Allsorts and have a query about how punctuation should be handled in syllable splitting. There are these punctuation characters in the Myanmar character tables but they don't seem to be matched by any rules.
U+104A
U+104B
U+104C
U+104D
U+104F
I've run my implementation against this text "ပို၍စောစီးစွာပေးပါက" and ၍ is tripping it up. It has no shaping class/rules that match it in the syllable identification details.
There are these two notes though:
and
I'm wondering how these characters should be handled, since their use doesn't feel like a broken expression?
One other note: ။ and ၊ are referenced in the non-terminal
`_punc_ = "Little Section" | "Section"`; however,
`punc` does not appear to be used. I'm wondering if that's intended?
Edit: I see the following on the OpenType Myanmar page:
Which suggests ၍ and friends should be accepted as cluster by themselves.
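As a minimal sketch of that reading, the snippet below emits each code point in U+104A-U+104F as a cluster by itself and leaves everything else in unsplit runs. It is not Allsorts or HarfBuzz code; `split_standalone` and `STANDALONE` are hypothetical names, and it deliberately ignores the possibility, raised elsewhere in this thread, that some of these symbols may need to take marks.

```python
# Assumption: every code point in U+104A..U+104F forms its own cluster.
STANDALONE = {chr(cp) for cp in range(0x104A, 0x1050)}


def split_standalone(text):
    """Split text so that each standalone symbol is its own cluster.

    Runs of other characters are returned unsplit; real syllable
    matching would still have to be applied to those runs.
    """
    clusters = []
    run = ""
    for ch in text:
        if ch in STANDALONE:
            if run:
                clusters.append(run)
                run = ""
            clusters.append(ch)
        else:
            run += ch
    if run:
        clusters.append(run)
    return clusters
```

Applied to the test string above, this would put ၍ in a cluster of its own, with the text before and after it handed on to the regular syllable matching.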