Using DerivedCombiningClass.txt to determine width is inappropriate #10

Closed
philipc opened this issue Aug 20, 2015 · 5 comments

@philipc
Contributor

philipc commented Aug 20, 2015

DerivedCombiningClass.txt contains the Canonical_Combining_Class field from UnicodeData.txt (see http://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values). This field is intended for the Canonical Ordering Algorithm used in normalization, not for determining display width.

wcwidth.py currently assumes that a character is a zero-width combining character if and only if it has a non-zero combining class. I think this assumption is invalid. For example, enclosing marks (General Category = Me) all have a combining class of zero, but they are also zero-width combining characters.
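To make the mismatch concrete, here is a minimal sketch using Python's standard `unicodedata` module. The three sample characters and the `describe` helper are chosen for illustration only and are not taken from wcwidth.py.

```python
import unicodedata

# Illustrative samples: one character each from categories Mn, Me, and Mc.
samples = {
    "COMBINING ACUTE ACCENT": "\u0301",
    "COMBINING ENCLOSING CIRCLE": "\u20dd",
    "DEVANAGARI SIGN VISARGA": "\u0903",
}


def describe(ch):
    """Return (General Category, Canonical_Combining_Class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


for name, ch in samples.items():
    category, ccc = describe(ch)
    print(f"U+{ord(ch):04X} {name}: category={category}, ccc={ccc}")
```

The enclosing circle reports a combining class of 0 despite being a combining mark, so a test based only on a non-zero combining class would miss it.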

I'm not sure what the standard way to determine zero width combining characters is. One possibility is to check for a General Category of Mn or Me, but I don't know if there are any exceptions to this. Also note that there are combining characters that do have a width (category Mc).

@jquast
Owner

jquast commented Aug 27, 2015

> I'm not sure what the standard way to determine zero width combining characters is.

Me neither; any help is appreciated. If you would prefer such caveats to be documented in greater detail in the README, feel free to open a pull request that adds them to the TODO section, https://github.com/jquast/wcwidth/blob/master/README.rst#todo, or renames that section to 'caveats'.

@jquast
Owner

jquast commented Aug 27, 2015

Please see the related issue #2 and PR #5.

@philipc
Contributor Author

philipc commented Aug 28, 2015

I think that checking for a general category of Mn or Me is closer to correct than using the canonical combining class, so I've started working on that change.

One issue I've encountered so far is the behavior of returning -1 for combining characters. I think -1 is better reserved for format characters that have an undefined width. Combining characters should return their actual width (0 or 1). This would be consistent with the behavior of http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
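As a rough sketch of that return-value convention (not the code in wcwidth.py or in the eventual PR), the function below uses the standard `unicodedata` module. The name `sketch_width` is hypothetical, and reserving -1 for control characters (category Cc) is an assumption made here for illustration; the exact set of characters that should return -1 is not settled in this thread.

```python
import unicodedata


def sketch_width(ch):
    """Illustrative width for a single character: -1, 0, 1, or 2."""
    category = unicodedata.category(ch)
    # Assumption: only control characters are treated as "undefined width".
    if category == "Cc":
        return -1
    # Non-spacing (Mn) and enclosing (Me) marks occupy no cell of their own.
    if category in ("Mn", "Me"):
        return 0
    # Wide and fullwidth East Asian characters occupy two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else, including spacing combining marks (Mc), occupies one cell.
    return 1


for ch in ("a", "\u0301", "\u20dd", "\u0903", "\u4e2d", "\x07"):
    print(f"U+{ord(ch):04X}: {sketch_width(ch)}")
```

With this convention, a caller can sum the widths of a string's characters and treat any -1 as a signal that the string contains something unprintable.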

@jquast
Owner

jquast commented Sep 12, 2015

Agreed about returning 0 instead of -1 (issue #1).

-1 was chosen only to match the Linux and Mac OS X libc implementations until it could be corrected, which I think your PR does. Many thanks.

@jquast
Owner

jquast commented Sep 14, 2015

Wonderful work, not a single code error. Thank you for the contribution. The referenced PR has been released as-is on PyPI as 0.1.5.

jquast closed this as completed Sep 14, 2015