-
Notifications
You must be signed in to change notification settings - Fork 224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clib: Fix the bug when passing multiple columns of strings with variable lengths to the GMT C API #2719
Conversation
I just tried different versions of @seismann's code example: import numpy as np
string_arrays01 = [np.array(["ABC", "DEF"]), np.array(["ABC", "DEFGHIJK"])]
test01 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays01)
test01
# array(['ABC ABC', 'DEF DEF'], dtype='<U7') -> cut after 7 signs
string_arrays02 = [np.array(["ABC", "DEFGHIJK"]), np.array(["ABC", "DEF"])]
test02 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays02)
test02
# array(['ABC ABC', 'DEFGHIJ'], dtype='<U7') -> cut after 7 signs
string_arrays03= [np.array(["ABC", "DEF"]), np.array(["DEFGHIJK", "ABC"])]
test03 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays03)
test03
# array(['ABC DEFGHIJK', 'DEF ABC'], dtype='<U12') -> NOT cut after 7 signs and now <U12 instead of <U7
string_arrays04 = [np.array(["DEFGHIJK", "ABC"]), np.array(["ABC", "DEF"])]
test04 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays04)
test04
# array(['DEFGHIJK ABC', 'ABC DEF'], dtype='<U12') -> NOT cut after 7 signs and now <U12 instead of <U7 Here, it looks like the length of the first concentrated string sets the maximum length of all other / following concentrated strings. This occurs identically when using string_arrays05 = [np.array(["ABC", "DEF"]), np.array(["ABC", "DEFGHIJK"])]
test05 = np.apply_along_axis(func1d=" ".join, axis=1, arr=string_arrays05)
test05
# array(['ABC DEF', 'ABC DEF'], dtype='<U7') -> cut after 7 signs
string_arrays06 = [np.array(["ABC", "DEFGHIJK"]), np.array(["ABC", "DEF"])]
test06 = np.apply_along_axis(func1d=" ".join, axis=1, arr=string_arrays06)
test06
# array(['ABC DEFGHIJK', 'ABC DEF'], dtype='<U12') -> NOT cut after 7 signs and now <U12 instead of <U7 |
If you do Google search "apply_along_axis string arrays", you will see many posts that have exactly the same error, and also the upstream issue report numpy/numpy#8352. |
>>> import numpy as np
>>> string_arrays = [np.array(["ABC", "DEF"]), np.array(["ABC", "DEFGHIJK"])]
>>> %timeit np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays)
37.3 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit np.array([" ".join(vals) for vals in zip(*string_arrays)])
3.02 µs ± 278 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each) The new & correct version is 10x faster than the old & incorrect version! |
Thanks @seisman for spotting the bug, fixing it and adding the benchmark!
Could you try to add a unit test to |
Modified the existing test With the main branch, the test fails:
Done in 80e500c. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified the existing test
test_virtualfile_from_vectors_two_string_or_object_columns
to catch this bug.
Brilliant! Wish I had used a different string length back in #520 (you actually asked me to make variable length strings in #520 (comment), but I didn't make sure that string1 + string2 were different lengths) 😆
I've revised the PR title to make it more descriptive as a changelog entry. |
Description of proposed changes
Found this bug when working on PR #2720.
The changed code is meant to element-wisely join two string arrays with a space, but it doesn't work if strings have different lengths.
Here is a simple example to reproduce the issue:
Obvisouly the result is incorrect, although I don't fully understand why it doesn't work.
The new code works as expected.
BTW: a test related to this bug will be added in PR #2720.