Using DerivedCombiningClass.txt to determine width is inappropriate #10

philipc · 2015-08-20T08:26:58Z

DerivedCombiningClass.txt contains the Canonical_Combining_Class field from UnicodeData.txt (see http://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values). This field is intended for the Canonical Ordering Algorithm used in normalization, not for determining display width.

wcwidth.py is currently assuming that characters are zero width combining characters if and only if they have a non-zero combining class. I think this is an invalid assumption. For example, characters that are enclosing marks (General Category = Me) all have a zero combining class, but they are also zero width combining characters.
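A minimal sketch of the mismatch, using Python's standard `unicodedata` module. The helper name `describe` and the two sample code points are chosen only for illustration; they are not taken from wcwidth.py.

```python
import unicodedata


def describe(ch):
    """Return (General Category, Canonical_Combining_Class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


# U+0301 COMBINING ACUTE ACCENT: nonspacing mark with a non-zero combining class.
print("U+0301", describe("\u0301"))

# U+20DD COMBINING ENCLOSING CIRCLE: enclosing mark whose combining class is zero,
# so a test based only on the combining class would not treat it as zero width.
print("U+20DD", describe("\u20dd"))
```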

I'm not sure what the standard way to determine zero width combining characters is. One possibility is to check for a General Category of Mn or Me, but I don't know if there are any exceptions to this. Also note that there are combining characters that do have a width (category Mc).
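The General Category check could be sketched as follows. This is only an illustration of the idea, not a vetted rule: the function name `is_zero_width_combining` is hypothetical, and whether Mn and Me cover every case (or have exceptions) is exactly the open question above.

```python
import unicodedata


def is_zero_width_combining(ch):
    """Hypothetical check: treat Mn (nonspacing) and Me (enclosing) marks as zero width.

    Mc (spacing combining marks) is deliberately excluded, since those
    characters occupy a column of their own.
    """
    return unicodedata.category(ch) in ("Mn", "Me")


print(is_zero_width_combining("\u0301"))  # COMBINING ACUTE ACCENT, category Mn
print(is_zero_width_combining("\u20dd"))  # COMBINING ENCLOSING CIRCLE, category Me
print(is_zero_width_combining("\u0903"))  # DEVANAGARI SIGN VISARGA, category Mc
```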

jquast · 2015-08-27T17:10:24Z

> I'm not sure what the standard way to determine zero width combining characters is.

Me neither; any help is appreciated. If you would prefer to document such caveats in greater detail in the README, feel free to open a pull request that adds to the TODO section, https://github.com/jquast/wcwidth/blob/master/README.rst#todo, or renames it 'caveats'.

jquast · 2015-08-27T17:13:02Z

Please see related tickets #2 and PR #5

philipc · 2015-08-28T08:11:22Z

I think that checking for a general category of Mn or Me is closer to correct than using the canonical combining class, so I've started working on that change.

One issue I've encountered so far is the behavior of returning -1 for combining characters. I think -1 is better reserved for format characters that have an undefined width. Combining characters should return their actual width (0 or 1). This would be consistent with the behavior of http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
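A rough sketch of the proposed return values, to make the difference concrete. Everything here is an assumption for illustration: the name `sketch_width` is hypothetical, the -1 case is limited to control characters (category Cc), and the wide-character rule uses `unicodedata.east_asian_width`. It is not the code in wcwidth.py or in the linked wcwidth.c.

```python
import unicodedata


def sketch_width(ch):
    """Illustrative width for a single character (not the wcwidth.py implementation)."""
    category = unicodedata.category(ch)
    if category == "Cc":
        # Assumption: control characters have no defined printable width.
        return -1
    if category in ("Mn", "Me"):
        # Nonspacing and enclosing marks report their actual width, 0, not -1.
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        # Wide and fullwidth characters take two columns.
        return 2
    # Everything else, including spacing combining marks (Mc), takes one column.
    return 1


for ch in ("a", "\u0301", "\u20dd", "\u0903", "\u4e2d", "\x07"):
    print("U+%04X" % ord(ch), sketch_width(ch))
```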

jquast · 2015-09-12T16:08:37Z

Agreed about returning 0 instead of -1 (issue #1).

-1 was chosen only to match the Linux and Mac OS X libc implementations until it could be corrected, which I think your PR does. Many thanks.

jquast · 2015-09-14T06:27:22Z

Wonderful work, not a single code error. Thank you for the contribution; the referenced PR is now released as-is on PyPI as 0.1.5.

jquast added the bug, enhancement, and needs-research labels and removed the enhancement label Aug 27, 2015

jquast mentioned this issue Aug 27, 2015

Are Combining characters handled correctly? #2

Closed

philipc mentioned this issue Sep 2, 2015

Improve handling of combining characters #11

Merged

jquast removed the needs-research label Sep 14, 2015

jquast closed this as completed Sep 14, 2015
