Using DerivedCombiningClass.txt to determine width is inappropriate #10
Comments
Me neither. Any help is appreciated. If you would prefer such caveats documented in greater detail in the README, feel free to open a pull request that adds to the relevant section, https://github.com/jquast/wcwidth/blob/master/README.rst#todo, or renames it 'caveats'.
I think that checking for a general category of Mn or Me is closer to correct than using the canonical combining class, so I've started working on that change. One issue I've encountered so far is the behavior of returning -1 for combining characters. I think -1 is better reserved for format characters that have an undefined width. Combining characters should return their actual width (0 or 1). This would be consistent with the behavior of http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
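To make the proposed return-value convention concrete, here is a minimal sketch using Python's standard `unicodedata` module. The name `sketch_width` is hypothetical and this is not the wcwidth API. Which characters count as "undefined width" is also an assumption: this sketch returns -1 for control characters (category Cc), roughly as Markus Kuhn's wcwidth.c does.

```python
import unicodedata


def sketch_width(ch):
    """Hypothetical helper illustrating the proposed return values."""
    if ch == "\0":
        return 0  # NUL occupies no cells
    category = unicodedata.category(ch)
    if category == "Cc":
        return -1  # assumption: control characters have no defined width
    if category in ("Mn", "Me"):
        return 0  # nonspacing and enclosing marks
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters
    return 1  # everything else, including spacing marks (Mc)
```

Under these assumptions a combining acute accent (U+0301) gets width 0 rather than -1, while a spacing mark such as U+0903 gets width 1.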
Agreed about returning 0 instead of -1 (issue #1). -1 was only chosen to match the Linux and Mac OS X libc implementations until it could be corrected, which I think your PR does. Many thanks.
Wonderful work, not a single code error. Thank you for the contribution. The referenced PR is now included as-is and released to PyPI as 0.1.5.
DerivedCombiningClass.txt contains the Canonical_Combining_Class field from UnicodeData.txt (see http://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values). This field is intended for the canonical ordering of combining marks during normalization (which collation also relies on), not for determining display width.
wcwidth.py currently assumes that a character is a zero-width combining character if and only if it has a non-zero combining class. I think this is an invalid assumption. For example, enclosing marks (General Category = Me) all have a combining class of zero, but they are also zero-width combining characters.
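The mismatch can be checked with Python's standard `unicodedata` module. This is only an illustration; the helper name `describe` is made up for this example.

```python
import unicodedata


def describe(ch):
    """Return (General Category, Canonical_Combining_Class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


acute = describe("\u0301")             # COMBINING ACUTE ACCENT
enclosing_circle = describe("\u20dd")  # COMBINING ENCLOSING CIRCLE
```

Here `acute` is `('Mn', 230)` while `enclosing_circle` is `('Me', 0)`, so a test based only on a non-zero combining class misses the enclosing mark.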
I'm not sure what the standard way to determine zero width combining characters is. One possibility is to check for a General Category of Mn or Me, but I don't know if there are any exceptions to this. Also note that there are combining characters that do have a width (category Mc).
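A rough sketch of the General Category check, again using only the standard `unicodedata` module. The names `is_zero_width_mark` and `rule_disagreements` are hypothetical. The second function simply lists code points where the category rule and the combining-class rule give different answers, as a starting point for looking for exceptions; its output depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata

ZERO_WIDTH_CATEGORIES = ("Mn", "Me")


def is_zero_width_mark(ch):
    """True for nonspacing (Mn) and enclosing (Me) marks, False for Mc."""
    return unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES


def rule_disagreements():
    """Code points where 'ccc != 0' and 'category is Mn or Me' differ."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        by_combining_class = unicodedata.combining(ch) != 0
        if by_combining_class != is_zero_width_mark(ch):
            found.append(cp)
    return found
```

For instance, `is_zero_width_mark("\u0903")` is `False` because DEVANAGARI SIGN VISARGA is a spacing mark (Mc), and U+20DD appears in `rule_disagreements()` because it is an enclosing mark with a combining class of zero.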