Sorting arabic, farsi or hebrew numbers is not "natural" #52

hholzgra · 2018-03-30T19:24:17Z

Minimum, Complete, Verifiable Example

import locale
from natsort import natsorted,natsort_keygen, ns

locale.setlocale(locale.LC_ALL, 'fa_IR.UTF-8')

streets = [ 
        "1st street",
        "10th street",
        "2nd street",
        "2 street", 	
        "1 street",
        "1street",
        "12street",
        "11 street",
        "street 23",
        "street 2",
        "street 1",
        "Street 11",
        "۲ street",
        "۱ street",
        "۱street",
        "۱۲street",
        "۱۱ street",
        "street ۲",
        "street ۱",
        "street ۱",
        "street ۱۲",
        "street ۱۱",
]

sl = natsorted(streets, alg= ns.LOCALE | ns.IGNORECASE) 

for name in sl:
    print(name)

Error message, Traceback, Desired behavior, Suggestion, Request, or Question

Expected result:

1 street
۱ street
1st street
1street
۱street
2 street
۲ street
2nd street
10th street
11 street
۱۱ street
12street
۱۲street
street 1
street ۱
street ۱
street 2
street ۲
Street 11
street ۱۱
street ۱۲
street 23

Actual result:

۱۱ street
۱۲street
1street
۱street
1 street
۱ street
1st street
2nd street
2 street
۲ street
10th street
11 street
12street
street 1
street ۱
street ۱
street ۱۱
street ۱۲
street 2
street ۲
Street 11
street 23

The text was updated successfully, but these errors were encountered:

hholzgra · 2018-03-30T19:31:43Z

I can make this work for strings that start with a number by extending the regular expressions that check for digits to accept multiple unicode digits:

diff --git a/natsort/utils.py b/natsort/utils.py
index b6484b0..9055da8 100644
--- a/natsort/utils.py
+++ b/natsort/utils.py
@@ -96,9 +96,9 @@ _float_nosign_noexp_re = _float_nosign_noexp_re.format(_num, numeric)
 _float_nosign_noexp_re = re.compile(_float_nosign_noexp_re, flags=re.U)
 
 # Integer regexes - include Unicode digits.
-_int_nosign_re = r'([0-9]+|[{0}])'.format(digits)
+_int_nosign_re = r'([0-9]+|[{0}]+)'.format(digits)
 _int_nosign_re = re.compile(_int_nosign_re, flags=re.U)
-_int_sign_re = r'([-+]?[0-9]+|[{0}])'.format(digits)
+_int_sign_re = r'([-+]?[0-9]+|[{0}]+)'.format(digits)
 _int_sign_re = re.compile(_int_sign_re, flags=re.U)
 
 # This dict will help select the correct regex and number conversion function.

With numbers at the end of the stings things fail though when there's more than one unicode digit, like in "street ۱۲". This causes the same "TypeError: unorderable types: float() < str()" as in issue #7 again.

Turns out that the fake_fastnumbers implementation only identifies single digit unicode numbers as "int", too, otherwise returning "string". When fixing the fastnumbers version check (issue #51) so that the real fastnumbers functions are used things work fine as expected after all.

hholzgra · 2018-03-30T19:33:44Z

My regex change makes one test fail when testing against Python 3.5. With Python 2.7 all tests pass:


=================================================================================== FAILURES ====================================================================================
_________________________________________________________ test_parse_string_factory_only_parses_digits_with_nosign_int __________________________________________________________

    @given(lists(elements=floats() | text().filter(whitespace_check) | integers(), min_size=1, max_size=10))
>   @example([10000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000])
    def test_parse_string_factory_only_parses_digits_with_nosign_int(x):

test_natsort/test_parse_string_function.py:82: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py35/lib/python3.5/site-packages/hypothesis/core.py:581: in execute
    result = self.test_runner(data, run)
.tox/py35/lib/python3.5/site-packages/hypothesis/executors.py:58: in default_new_style_executor
    return function(data)
.tox/py35/lib/python3.5/site-packages/hypothesis/core.py:573: in run
    return test(*args, **kwargs)
test_natsort/test_parse_string_function.py:82: in test_parse_string_factory_only_parses_digits_with_nosign_int
    @example([10000000000000000000000000000000000000000000000000000000000000000000000000,
.tox/py35/lib/python3.5/site-packages/hypothesis/core.py:520: in test
    result = self.test(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x = ['᧐᧐']

    @given(lists(elements=floats() | text().filter(whitespace_check) | integers(), min_size=1, max_size=10))
    @example([10000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000,
              100000000000000000000000000000000000000000000000000000000000000000000000000])
    def test_parse_string_factory_only_parses_digits_with_nosign_int(x):
        s = ''.join(repr(y) if type(y) in (float, long, int) else y for y in x)
>       assert _parse_string_factory(0, '', _int_nosign_re.split, no_op, fast_int, tuple2)(s) == int_splitter(s, False, '')
E       AssertionError: assert ('᧐᧐',) == ('', 0, '', 0)
E         At index 0 diff: '᧐᧐' != ''
E         Right contains more items, first extra item: 0
E         Use -v to get the full diff

test_natsort/test_parse_string_function.py:87: AssertionError
---------------------------------------------------------------------------------- Hypothesis -----------------------------------------------------------------------------------
Falsifying example: test_parse_string_factory_only_parses_digits_with_nosign_int(x=['᧐᧐'])

SethMMorton · 2018-03-31T00:35:08Z

The reason I chose to not attempt to combine non-ASCII digits into numbers as is done for ASCII numbers is that it was not clear to me if they could be treated the same way 100% of the time. For example, I know that ⅐ would need special care, but this is classified as a number in unicode and not a digit so perhaps I should not worry.

Should it be safe to treat any unicode digit as it's ASCII equivalent when converting to integers? For example, I am assuming that ۱۲ should be converted to 12? What about if there are decimal points - would ۱۲.۱۲ be the float 12.12? Would this be true for other character sets?

hholzgra · 2018-03-31T08:42:10Z

I only care about integers at this point, not floats, the background being that I was looking for a more natural sorting of street names in the street indexes generated by MapOSMatic.

This looks good now for my two original test cases, New York City:

https://maposmatic.osm-baustelle.de/maps/16968

and a town in Iran:

https://maposmatic.osm-baustelle.de/maps/16961

For example, I know that ⅐ would need special care, but this is classified as a number in unicode and not a digit so perhaps I should not worry.

I now checked for cities using ½ in their street numbering scheme in the OpenStreetMap database and found

https://maposmatic.osm-baustelle.de/maps/1699

where I can see things like:

 79th Street
 80th Avenue West
 80th Street
 8½th Avenue
 8½th Street
 8½th Street Court
 8½th Street West
 81st Avenue West
 82nd Avenue West

Seeing 8½th between 80th and 81st is a bit odd, but as we only have some ~1300 roads worldwide that use ½ (and none using the other fractionals like 1/3 or 1/4) I can live with that for now.

Should it be safe to treat any unicode digit as it's ASCII equivalent when converting to integers? For example, I am assuming that ۱۲ should be converted to 12?

That should be safe. I could only check with native speakers from Iran (for Persian/Farsi) and Syria (for Arabic) so far, but I also have contacts from Malaysia, Israel, South Korea, and maybe Japan and China that I can ask to verify that street lists come out correctly (if there are examples of numbered streets in those countries at all).

What about if there are decimal points - would ۱۲.۱۲ be the float 12.12? Would this be true for other character sets?

That seems to be way more tricky, at least from looking at https://en.wikipedia.org/wiki/Decimal_separator ...

SethMMorton · 2018-04-05T01:31:47Z

It sounds like the validation you are using to assess if changing the natsort heuristic for identifying numbers is only within the context of street names. My concern is that natsort is intended to be general for all domains, so I want to make sure that any changes made will not break some other assumption in another domain.

Having said that, I do agree some update should be done, especially considering as you point out that fastnumbers handles this properly. I am actually surprised by that, because I did not code it to handle numbers like ۱۲, so I'll need to look into that.

I think that perhaps just updating the regex and also fake_fastnumbers should be sufficient, but I am a bit over-careful about things like this so I will do more investigation.

SethMMorton · 2018-04-05T02:37:12Z

I had forgotten that I added code in fastnumbers that translates any unicode decimal to the ASCII equivalent. Under the hood this is what python does for int or float.

Thanks for reporting this.

SethMMorton · 2018-04-11T05:38:57Z

@hholzgra Can you review the changes I made in PR #54 and see if those will support your use case?

SethMMorton · 2018-04-18T03:34:18Z

@hholzgra I think I am going to go forward with this release in the next few days.

SethMMorton added the bug label Apr 5, 2018

SethMMorton self-assigned this Apr 11, 2018

SethMMorton mentioned this issue Apr 11, 2018

Add support for ints and floats comprised of unicode decimal strings. #54

Merged

SethMMorton added a commit that referenced this issue Apr 12, 2018

Add explicit tests based on examples in issue #52.

5f0abd8

SethMMorton closed this as completed in #54 Apr 21, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sorting arabic, farsi or hebrew numbers is not "natural" #52

Sorting arabic, farsi or hebrew numbers is not "natural" #52

hholzgra commented Mar 30, 2018

hholzgra commented Mar 30, 2018

hholzgra commented Mar 30, 2018

SethMMorton commented Mar 31, 2018 •

edited

Loading

hholzgra commented Mar 31, 2018

SethMMorton commented Apr 5, 2018

SethMMorton commented Apr 5, 2018

SethMMorton commented Apr 11, 2018

SethMMorton commented Apr 18, 2018

Sorting arabic, farsi or hebrew numbers is not "natural" #52

Sorting arabic, farsi or hebrew numbers is not "natural" #52

Comments

hholzgra commented Mar 30, 2018

Minimum, Complete, Verifiable Example

Error message, Traceback, Desired behavior, Suggestion, Request, or Question

hholzgra commented Mar 30, 2018

hholzgra commented Mar 30, 2018

SethMMorton commented Mar 31, 2018 • edited Loading

hholzgra commented Mar 31, 2018

SethMMorton commented Apr 5, 2018

SethMMorton commented Apr 5, 2018

SethMMorton commented Apr 11, 2018

SethMMorton commented Apr 18, 2018

SethMMorton commented Mar 31, 2018 •

edited

Loading