Add normalization to incoming Unicode data #45

SethMMorton · 2017-08-19T06:28:25Z

To minimize astonishment, Unicode data is now normalized to the 'NFD'
normalization form to make the sorting sane. Documentation has been updated
to discuss this where appropriate.

The user has the option of choosing 'NFKD'.

All Unicode input now gets 'NFD' normalization, which ensures that canonically equivalent characters (ones that look the same) are represented by the same code points. 'NFD' was chosen because it is the expanded (decomposed) form, which causes (for example) 'é' to be placed immediately after 'e' rather than after 'z'. Users can choose 'NFKD' with ns.COMPATIBILITYNORMALIZE (or ns.CN), which changes certain characters to their compatible (and often ASCII) representation. This may be useful to force numbers in odd representations to be transformed to ASCII, which will potentially give better sorting orders. This will close issue #44.
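The effect described above can be reproduced with the standard library alone. The snippet below is only an illustration using `unicodedata.normalize`; it does not call natsort, and the sample strings are chosen for the example rather than taken from the test suite.

```python
import unicodedata

# Two spellings of the same visible character.
composed = "\u00e9"      # 'é' as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Without normalization the two strings compare unequal.
raw_equal = composed == decomposed

# After NFD both become 'e' + combining accent.
nfd_equal = (
    unicodedata.normalize("NFD", composed)
    == unicodedata.normalize("NFD", decomposed)
)

# Sorting by raw code point puts 'é' after 'z';
# sorting on the NFD form puts it right after 'e'.
words = ["z", "e", "\u00e9", "f"]
raw_sorted = sorted(words)
nfd_sorted = sorted(words, key=lambda s: unicodedata.normalize("NFD", s))

# NFKD additionally maps compatibility characters to plain equivalents,
# e.g. CIRCLED DIGIT ONE and FULLWIDTH DIGIT TWO become ASCII digits.
odd_digits = "\u2460\uff12"
nfd_digits = unicodedata.normalize("NFD", odd_digits)    # unchanged
nfkd_digits = unicodedata.normalize("NFKD", odd_digits)  # ASCII digits
```

The last two lines show why 'NFKD' is opt-in: it rewrites characters that 'NFD' leaves alone, which helps numeric sorting but changes the input more aggressively.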

SethMMorton · 2017-08-19T06:29:08Z

This will address #44

The input normalization has been moved out of the "input_transform" function (which was called in the "parse_string" function) and is now the first step of the "parse_string" function. This is because the data needs to be normalized even if the "input_transform" function is skipped. Tests have been reworked to reflect this change.
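A rough sketch of the ordering described above, not the actual code in natsort/utils.py: the name `parse_string_sketch` and its `compat` parameter are hypothetical, and the real function does more work after these steps.

```python
import unicodedata


def parse_string_sketch(value, input_transform=None, compat=False):
    """Hypothetical illustration: normalize first, then optionally transform."""
    # Step 1: always normalize, regardless of what happens next.
    form = "NFKD" if compat else "NFD"
    value = unicodedata.normalize(form, value)

    # Step 2: the input transform is optional and may be skipped entirely.
    if input_transform is not None:
        value = input_transform(value)

    return value


# Normalization happens even though no transform is supplied.
no_transform = parse_string_sketch("\u00e9")

# With a transform, it receives already-normalized text.
with_transform = parse_string_sketch("\u00c9", input_transform=str.lower)
```

Placing the normalization ahead of the optional step means the skipped-transform path and the transformed path both operate on the same normalized form.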

codecov · 2017-08-19T07:36:30Z

Codecov Report

Merging #45 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #45      +/-   ##
==========================================
+ Coverage   99.61%   99.62%   +<.01%     
==========================================
  Files          10       10              
  Lines         522      535      +13     
==========================================
+ Hits          520      533      +13     
  Misses          2        2

Impacted Files	Coverage Δ
natsort/utils.py	`100% <100%> (ø)`	⬆️
natsort/ns_enum.py	`100% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2f4b5d...059de48. Read the comment docs.

SethMMorton added 2 commits August 18, 2017 23:24

Update documentation to discuss Unicode normalization.

06a67bf

SethMMorton added the feature label Aug 19, 2017

SethMMorton merged commit 11b7f8d into master Aug 19, 2017

SethMMorton deleted the unicode-normalization branch August 19, 2017 07:38
