[Myanmar] Syllable reordering #165
Comments
This is definitely something else that has needed a fresh look. The doc currently does not address the tone marks ... I evidently left a comment to that effect and had commented-out the Complicating that is that HarfBuzz treats all Myanmar tone marks as |
No worries. No doubt it's all quite complicated to work out.
I'm going to go ahead and push some SVG image changes that I was intermittently working on before; in theory they're better than PNGs for zooming in to see details, but mainly I don't want to attempt to fix the text then try to re-do the merge.
Okay; went on a little tangent about the NUKTA, apparently. There's a good rationale for the way HarfBuzz classifies U+1037 as NUKTA: UCD's DerivedCombiningClass.txt groups it there. So that's good. It just happens to complicate inspecting test fonts. (This issue against Noto Myanmar (Sans and Serif) looks to be related, but I think it is actually caused by Noto ligating U+102F,U+1037, which HarfBuzz calls unsafe ... and has some apparent workarounds (?) with inserting narrow spaces. Padauk doesn't do that, and seems like a better reference.) That being a little clearer to me now, I think that HarfBuzz is taking a different approach to the post-base tagging in step 2 (being more selective with what gets POS_AFTER_SUBJOINED, leaving more things POS_AFTER_MAIN) and maybe that's worth a try. With fewer things being _AFTER_SUB, there are fewer conflicts.
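For anyone who wants to double-check the combining-class point without opening DerivedCombiningClass.txt, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `is_ccc_nukta` is made up for illustration; the only thing taken from the UCD is that Canonical_Combining_Class 7 is the Nukta class.

```python
import unicodedata

# Canonical_Combining_Class 7 is the "Nukta" class in the UCD.
CCC_NUKTA = 7


def is_ccc_nukta(ch: str) -> bool:
    """Return True if the character's canonical combining class is 7 (Nukta).

    Hypothetical helper for illustration; this is not a HarfBuzz function.
    """
    return unicodedata.combining(ch) == CCC_NUKTA


if __name__ == "__main__":
    for cp in (0x102F, 0x1036, 0x1037):
        ch = chr(cp)
        print(f"U+{cp:04X}", unicodedata.name(ch), "ccc =", unicodedata.combining(ch))
```

Note that this only reports the combining class; it says nothing about how any particular shaper maps that class onto its own categories.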
I think I've got a way to untangle the reordering problem sorted out (initially, at least) in a way that makes sense to me. Please have a look at stage 2 in #168 and let me know both if it reads correctly and if it sounds like it applies. It's basically just streamlining the post-base logic. Admittedly, it can't quite work in isolation from fixing up the regular expressions for syllable-matching, but they kind of go together in this case, I think. Meaning that, because the syllable matching is so complex with the post-base clusters, valid syllables that match do not need as much reordering on the post-base subsequences. They'll match the GSUB/GPOS rules because they wouldn't get identified as syllables otherwise. The logic in the PR basically splits things at "has post-base matras" or "doesn't have post-base matras", in new steps 5 and 6, but it also adds in treatment of variation selectors, which I had not addressed previously. (Less importantly, this change also removes a reference to "final reordering" earlier on that was clearly an artifact from duplicating the Indic2 doc structure.)
I think the changes in #168 sound like they apply. Skimming over my code it seems to match up pretty well. Some small comments:
It would be good if this step used the shaping classes/mark-placement subclass/regex classes defined earlier to make it clear which things it should match.
Similar for this description. |
Sounds good. One thing I'm still a little less clear about is what's desirable if there are multiple below-base vowels in a row and one in the middle has an anusvara on it.... If the anusvaras get moved together, that sounds like something is getting lost, but I'm not sure if I understand how it would change pronunciation on a string of multiple vowels (or, indeed, if it actually happens). Is there tooling for searching the text corpus for something like that? |
This meaning "if a single anusvara would automatically apply to the entire vowel subsequence, then it's fine to move as long as it stays with the below-base vowels generally." |
Not that I know of but cobbled something together to try to see if I could find multiple below-base vowels in a row. As far as I can tell none of them had Anusvara in the middle, only after the last vowel. Such as:
|
I was a little surprised that Python's unicodedata doesn't include ISC and some of those other properties.... But it seems like the sort of thing that, if you were to volunteer to add it, you'd shortly afterward be saddled with maintaining the module in perpetuity.... I made an update to the text to reference those classes; I'm still looking at whether I need to do that with the |
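Since `unicodedata` doesn't expose the Indic syllabic or positional categories, a corpus search along these lines has to hardcode the character sets. The following is only a rough sketch of that idea, not the script used above: `BELOW_VOWELS` is an assumed, deliberately small set (U+102F and U+1030), and `find_runs` is a hypothetical helper name. It reports runs containing two or more below-base vowels and whether an anusvara (U+1036) sits between the first and last vowel of the run.

```python
import re

# Assumption: only U+102F (VOWEL SIGN U) and U+1030 (VOWEL SIGN UU) are
# treated as below-base vowels here; extend the set as needed.
BELOW_VOWELS = "\u102F\u1030"
ANUSVARA = "\u1036"

# A run is any unbroken sequence of below-base vowels and anusvaras.
RUN = re.compile(f"[{BELOW_VOWELS}{ANUSVARA}]+")


def find_runs(text):
    """Yield (run, anusvara_in_middle) for runs with >= 2 below-base vowels."""
    for match in RUN.finditer(text):
        run = match.group()
        vowel_idx = [i for i, ch in enumerate(run) if ch in BELOW_VOWELS]
        if len(vowel_idx) < 2:
            continue
        first, last = vowel_idx[0], vowel_idx[-1]
        # "Middle" means strictly between the first and last vowel of the run.
        in_middle = any(run[i] == ANUSVARA for i in range(first, last))
        yield run, in_middle


if __name__ == "__main__":
    sample = "\u1000\u102F\u1030\u1036 \u1000\u102F\u1036\u1030"
    for run, in_middle in find_runs(sample):
        codepoints = " ".join(f"U+{ord(ch):04X}" for ch in run)
        print(codepoints, "anusvara in middle:", in_middle)
```

Feeding it a corpus line by line would be enough to confirm or refute the "only after the last vowel" observation for whatever vowel set is chosen.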
I've encountered an issue with the reordering description that results in MYANMAR DOT BELOW being moved when it probably should stay where it is.
The problem is that "All right-side and above-base dependent-vowel (matra) signs are tagged POS_AFTER_SUBJOINED." but DOT BELOW ends up with POS_AFTER_MAIN, which comes before POS_AFTER_SUBJOINED in the sort order.
Perhaps DOT BELOW is supposed to be picked up in step 7 but then in a syllable like "မော့" (U+1019 U+1031 U+102C U+1037) it's unclear what that step would set the pos to.
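To make the problem concrete, here is a small sketch of a stable sort over position tags for that syllable. The enum values and the `PREBASE_MATRA` / `BASE_CONSONANT` names are assumptions for illustration; the only ordering taken from the description is that POS_AFTER_MAIN sorts before POS_AFTER_SUBJOINED. The tagging of U+1031 as pre-base is likewise assumed.

```python
from enum import IntEnum


class Pos(IntEnum):
    # Hypothetical numeric values; only the relative order of AFTER_MAIN
    # and AFTER_SUBJOINED comes from the reordering description.
    PREBASE_MATRA = 0
    BASE_CONSONANT = 1
    AFTER_MAIN = 2
    AFTER_SUBJOINED = 3


def reorder(tagged):
    """Stable-sort (codepoint, pos) pairs by pos and return the codepoints."""
    return [cp for cp, _pos in sorted(tagged, key=lambda item: item[1])]


# U+1019 U+1031 U+102C U+1037, tagged as the issue describes:
# the right-side vowel gets AFTER_SUBJOINED, DOT BELOW gets AFTER_MAIN.
syllable = [
    (0x1019, Pos.BASE_CONSONANT),
    (0x1031, Pos.PREBASE_MATRA),    # assumed tagging for the pre-base vowel
    (0x102C, Pos.AFTER_SUBJOINED),
    (0x1037, Pos.AFTER_MAIN),
]

if __name__ == "__main__":
    print(" ".join(f"U+{cp:04X}" for cp in reorder(syllable)))
```

Under these assumed tags the sort moves U+1037 ahead of U+102C, which is the unwanted move described above.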