Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sorting arabic, farsi or hebrew numbers is not "natural" #52

Closed
hholzgra opened this issue Mar 30, 2018 · 8 comments · Fixed by #54
Closed

Sorting arabic, farsi or hebrew numbers is not "natural" #52

hholzgra opened this issue Mar 30, 2018 · 8 comments · Fixed by #54
Assignees
Labels

Comments

@hholzgra
Copy link
Contributor

Minimum, Complete, Verifiable Example

import locale
from natsort import natsorted,natsort_keygen, ns

locale.setlocale(locale.LC_ALL, 'fa_IR.UTF-8')

streets = [ 
        "1st street",
        "10th street",
        "2nd street",
        "2 street", 	
        "1 street",
        "1street",
        "12street",
        "11 street",
        "street 23",
        "street 2",
        "street 1",
        "Street 11",
        "۲ street",
        "۱ street",
        "۱street",
        "۱۲street",
        "۱۱ street",
        "street ۲",
        "street ۱",
        "street ۱",
        "street ۱۲",
        "street ۱۱",
]

sl = natsorted(streets, alg= ns.LOCALE | ns.IGNORECASE) 

for name in sl:
    print(name)

Error message, Traceback, Desired behavior, Suggestion, Request, or Question

Expected result:

1 street
۱ street
1st street
1street
۱street
2 street
۲ street
2nd street
10th street
11 street
۱۱ street
12street
۱۲street
street 1
street ۱
street ۱
street 2
street ۲
Street 11
street ۱۱
street ۱۲
street 23

Actual result:

۱۱ street
۱۲street
1street
۱street
1 street
۱ street
1st street
2nd street
2 street
۲ street
10th street
11 street
12street
street 1
street ۱
street ۱
street ۱۱
street ۱۲
street 2
street ۲
Street 11
street 23
@hholzgra
Copy link
Contributor Author

I can make this work for strings that start with a number by extending the regular expressions that check for digits to accept multiple unicode digits:

diff --git a/natsort/utils.py b/natsort/utils.py
index b6484b0..9055da8 100644
--- a/natsort/utils.py
+++ b/natsort/utils.py
@@ -96,9 +96,9 @@ _float_nosign_noexp_re = _float_nosign_noexp_re.format(_num, numeric)
 _float_nosign_noexp_re = re.compile(_float_nosign_noexp_re, flags=re.U)
 
 # Integer regexes - include Unicode digits.
-_int_nosign_re = r'([0-9]+|[{0}])'.format(digits)
+_int_nosign_re = r'([0-9]+|[{0}]+)'.format(digits)
 _int_nosign_re = re.compile(_int_nosign_re, flags=re.U)
-_int_sign_re = r'([-+]?[0-9]+|[{0}])'.format(digits)
+_int_sign_re = r'([-+]?[0-9]+|[{0}]+)'.format(digits)
 _int_sign_re = re.compile(_int_sign_re, flags=re.U)
 
 # This dict will help select the correct regex and number conversion function.

With numbers at the end of the stings things fail though when there's more than one unicode digit, like in "street ۱۲". This causes the same "TypeError: unorderable types: float() < str()" as in issue #7 again.

Turns out that the fake_fastnumbers implementation only identifies single digit unicode numbers as "int", too, otherwise returning "string". When fixing the fastnumbers version check (issue #51) so that the real fastnumbers functions are used things work fine as expected after all.

@hholzgra
Copy link
Contributor Author

My regex change makes one test fail when testing against Python 3.5. With Python 2.7 all tests pass:


=================================================================================== FAILURES ====================================================================================
_________________________________________________________ test_parse_string_factory_only_parses_digits_with_nosign_int __________________________________________________________

    @given(lists(elements=floats() | text().filter(whitespace_check) | integers(), min_size=1, max_size=10))
>   @example([10000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000])
    def test_parse_string_factory_only_parses_digits_with_nosign_int(x):

test_natsort/test_parse_string_function.py:82: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py35/lib/python3.5/site-packages/hypothesis/core.py:581: in execute
    result = self.test_runner(data, run)
.tox/py35/lib/python3.5/site-packages/hypothesis/executors.py:58: in default_new_style_executor
    return function(data)
.tox/py35/lib/python3.5/site-packages/hypothesis/core.py:573: in run
    return test(*args, **kwargs)
test_natsort/test_parse_string_function.py:82: in test_parse_string_factory_only_parses_digits_with_nosign_int
    @example([10000000000000000000000000000000000000000000000000000000000000000000000000,
.tox/py35/lib/python3.5/site-packages/hypothesis/core.py:520: in test
    result = self.test(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x = ['᧐᧐']

    @given(lists(elements=floats() | text().filter(whitespace_check) | integers(), min_size=1, max_size=10))
    @example([10000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000])
    def test_parse_string_factory_only_parses_digits_with_nosign_int(x):
        s = ''.join(repr(y) if type(y) in (float, long, int) else y for y in x)
>       assert _parse_string_factory(0, '', _int_nosign_re.split, no_op, fast_int, tuple2)(s) == int_splitter(s, False, '')
E       AssertionError: assert ('᧐᧐',) == ('', 0, '', 0)
E         At index 0 diff: '᧐᧐' != ''
E         Right contains more items, first extra item: 0
E         Use -v to get the full diff

test_natsort/test_parse_string_function.py:87: AssertionError
---------------------------------------------------------------------------------- Hypothesis -----------------------------------------------------------------------------------
Falsifying example: test_parse_string_factory_only_parses_digits_with_nosign_int(x=['᧐᧐'])


@SethMMorton
Copy link
Owner

SethMMorton commented Mar 31, 2018

The reason I chose to not attempt to combine non-ASCII digits into numbers as is done for ASCII numbers is that it was not clear to me if they could be treated the same way 100% of the time. For example, I know that ⅐ would need special care, but this is classified as a number in unicode and not a digit so perhaps I should not worry.

Should it be safe to treat any unicode digit as it's ASCII equivalent when converting to integers? For example, I am assuming that ۱۲ should be converted to 12? What about if there are decimal points - would ۱۲.۱۲ be the float 12.12? Would this be true for other character sets?

@hholzgra
Copy link
Contributor Author

I only care about integers at this point, not floats, the background being that I was looking for a more natural sorting of street names in the street indexes generated by MapOSMatic.

This looks good now for my two original test cases, New York City:

https://maposmatic.osm-baustelle.de/maps/16968

and a town in Iran:

https://maposmatic.osm-baustelle.de/maps/16961

For example, I know that ⅐ would need special care, but this is classified as a number in unicode and not a digit so perhaps I should not worry.

I now checked for cities using ½ in their street numbering scheme in the OpenStreetMap database and found

https://maposmatic.osm-baustelle.de/maps/1699

where I can see things like:

 79th Street
 80th Avenue West
 80th Street
 8½th Avenue
 8½th Street
 8½th Street Court
 8½th Street West
 81st Avenue West
 82nd Avenue West

Seeing 8½th between 80th and 81st is a bit odd, but as we only have some ~1300 roads worldwide that use ½ (and none using the other fractionals like 1/3 or 1/4) I can live with that for now.

Should it be safe to treat any unicode digit as it's ASCII equivalent when converting to integers? For example, I am assuming that ۱۲ should be converted to 12?

That should be safe. I could only check with native speakers from Iran (for Persian/Farsi) and Syria (for Arabic) so far, but I also have contacts from Malaysia, Israel, South Korea, and maybe Japan and China that I can ask to verify that street lists come out correctly (if there are examples of numbered streets in those countries at all).

What about if there are decimal points - would ۱۲.۱۲ be the float 12.12? Would this be true for other character sets?

That seems to be way more tricky, at least from looking at https://en.wikipedia.org/wiki/Decimal_separator ...

@SethMMorton
Copy link
Owner

It sounds like the validation you are using to assess if changing the natsort heuristic for identifying numbers is only within the context of street names. My concern is that natsort is intended to be general for all domains, so I want to make sure that any changes made will not break some other assumption in another domain.

Having said that, I do agree some update should be done, especially considering as you point out that fastnumbers handles this properly. I am actually surprised by that, because I did not code it to handle numbers like ۱۲, so I'll need to look into that.

I think that perhaps just updating the regex and also fake_fastnumbers should be sufficient, but I am a bit over-careful about things like this so I will do more investigation.

@SethMMorton
Copy link
Owner

I had forgotten that I added code in fastnumbers that translates any unicode decimal to the ASCII equivalent. Under the hood this is what python does for int or float.

Thanks for reporting this.

@SethMMorton
Copy link
Owner

@hholzgra Can you review the changes I made in PR #54 and see if those will support your use case?

@SethMMorton
Copy link
Owner

@hholzgra I think I am going to go forward with this release in the next few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants