Add normalization to incoming Unicode data #45

Merged 3 commits on Aug 19, 2017

SethMMorton
Owner

To minimize astonishment, Unicode data is now normalized to the 'NFD'
normalization form to make the sorting sane. Documentation has been updated
to discuss this where appropriate.

The user has the option of choosing 'NFKD'.

All Unicode input now gets 'NFD' normalization, which ensures that
all characters that look the same (canonically equivalent characters)
are represented by the same code points. 'NFD' was chosen because it is
the expanded (decomposed) form, which will cause (for example) 'é' to be
placed immediately after 'e' rather than after 'z'.
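
The effect described above can be reproduced with only the standard library. This is a minimal sketch of the idea, not natsort's own code; the word list is made up for illustration.

```python
import unicodedata

# Illustrative data: "\u00e9" is the precomposed form of 'é'.
words = ["z", "\u00e9", "e", "f"]

# Plain code-point ordering puts U+00E9 after every ASCII letter.
by_code_point = sorted(words)

# NFD splits 'é' into 'e' + U+0301 (combining acute accent),
# so it sorts right after plain 'e'.
by_nfd = sorted(words, key=lambda s: unicodedata.normalize("NFD", s))

# Two spellings that look identical but differ in code points...
precomposed = "\u00e9"
decomposed = "e\u0301"
# ...become equal once both are normalized.
same_after_nfd = (
    unicodedata.normalize("NFD", precomposed)
    == unicodedata.normalize("NFD", decomposed)
)

print(by_code_point)
print(by_nfd)
print(same_after_nfd)
```

Only the sort key is normalized here, so the original strings are returned unchanged.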

Users can choose 'NFKD' with ns.COMPATIBILITYNORMALIZE (or ns.CN), which
will change certain characters to their compatible (and often ASCII)
representation. This may be useful to force numbers in odd
representations to be transformed to ASCII, which will potentially give
better sorting orders.
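
To illustrate the difference between the two forms, the sketch below uses the standard library's unicodedata module directly rather than natsort; the sample characters are chosen for illustration only.

```python
import unicodedata

# Illustrative "odd" digit representations:
# circled digit one, superscript two, fullwidth digit two.
samples = ["\u2460", "\u00b2", "\uff12"]

# NFD only applies canonical decompositions, so these are left untouched.
nfd = [unicodedata.normalize("NFD", s) for s in samples]

# NFKD also applies compatibility decompositions, yielding ASCII digits.
nfkd = [unicodedata.normalize("NFKD", s) for s in samples]

print(nfd == samples)
print(nfkd)
```

With natsort itself, the equivalent choice would presumably be made by passing alg=ns.COMPATIBILITYNORMALIZE (or alg=ns.CN) to the sorting function, as described above.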

This will close issue #44.
@SethMMorton
Owner Author

This will address #44

The input normalization has been moved out of the "input_transform"
function (which was called in the "parse_string" function) and is now
the first step of the "parse_string" function. This is because
the data needs to be normalized even if the "input_transform" function
is skipped.
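
The ordering described above can be sketched as follows. This is a simplified, hypothetical stand-in and not the actual natsort implementation: the signature, the optional "input_transform" argument, and the ASCII-only number splitting are all assumptions made for illustration.

```python
import re
import unicodedata

def parse_string(s, normalization_form="NFD", input_transform=None):
    """Hypothetical sketch: normalize first, then optionally transform."""
    # Step 1: always normalize, regardless of what follows.
    s = unicodedata.normalize(normalization_form, s)

    # Step 2: the transform is optional and may be skipped entirely.
    if input_transform is not None:
        s = input_transform(s)

    # Step 3: split into text and ASCII-digit runs. With a capturing
    # group, re.split puts the digit runs at the odd indices.
    parts = re.split(r"([0-9]+)", s)
    return tuple(int(p) if i % 2 else p for i, p in enumerate(parts) if p)

# Normalization happens even though no transform is supplied.
print(parse_string("caf\u00e9 12"))
# With NFKD, a circled digit becomes a real number.
print(parse_string("file\u2460", normalization_form="NFKD"))
```

Because normalization no longer lives inside the transform, skipping the transform cannot leave the string unnormalized.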

Tests have been reworked to reflect this change.
@codecov
codecov bot commented Aug 19, 2017

Codecov Report

Merging #45 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #45      +/-   ##
==========================================
+ Coverage   99.61%   99.62%   +<.01%     
==========================================
  Files          10       10              
  Lines         522      535      +13     
==========================================
+ Hits          520      533      +13     
  Misses          2        2

Impacted Files        Coverage Δ
natsort/utils.py      100% <100%> (ø)
natsort/ns_enum.py    100% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c2f4b5d...059de48.

@SethMMorton SethMMorton merged commit 11b7f8d into master Aug 19, 2017
@SethMMorton SethMMorton deleted the unicode-normalization branch August 19, 2017 07:38