Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clib: Fix the bug when passing multiple columns of strings with variable lengths to the GMT C API #2719

Merged
merged 3 commits into from
Oct 9, 2023

Conversation

seisman
Copy link
Member

@seisman seisman commented Oct 7, 2023

Description of proposed changes

Found this bug when working on PR #2720.

The changed code is meant to element-wisely join two string arrays with a space, but it doesn't work if strings have different lengths.

Here is a simple example to reproduce the issue:

>>> import numpy as np
>>> string_arrays = [np.array(["ABC", "DEF"]), np.array(["ABC", "DEFGHIJK"])]
>>> np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays)
array(['ABC ABC', 'DEF DEF'], dtype='<U7')

Obvisouly the result is incorrect, although I don't fully understand why it doesn't work.

The new code works as expected.

>>> np.array([" ".join(vals) for vals in zip(*string_arrays)])
array(['ABC ABC', 'DEF DEFGHIJK'], dtype='<U12')

BTW: a test related to this bug will be added in PR #2720.

@seisman seisman added the bug Something isn't working label Oct 7, 2023
@seisman seisman added this to the 0.11.0 milestone Oct 7, 2023
@seisman seisman changed the title Fix the bug when concatenating strings arrays with spaces Fix the bug when element-wisely join string arrays with spaces Oct 7, 2023
@seisman seisman changed the title Fix the bug when element-wisely join string arrays with spaces Fix the bug of elementwisely joining string arrays with spaces Oct 7, 2023
@seisman seisman changed the title Fix the bug of elementwisely joining string arrays with spaces Fix the bug of elementwisely joining string arrays with a spce Oct 7, 2023
@seisman seisman changed the title Fix the bug of elementwisely joining string arrays with a spce Fix the bug of elementwisely joining string arrays with a space Oct 7, 2023
@seisman seisman added the needs review This PR has higher priority and needs review. label Oct 7, 2023
@yvonnefroehlich
Copy link
Member

I just tried different versions of @seismann's code example:

import numpy as np

string_arrays01 = [np.array(["ABC", "DEF"]), np.array(["ABC", "DEFGHIJK"])]
test01 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays01)
test01
# array(['ABC ABC', 'DEF DEF'], dtype='<U7')  -> cut after 7 signs

string_arrays02 = [np.array(["ABC", "DEFGHIJK"]), np.array(["ABC", "DEF"])]
test02 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays02)
test02
# array(['ABC ABC', 'DEFGHIJ'], dtype='<U7')  -> cut after 7 signs

string_arrays03= [np.array(["ABC", "DEF"]), np.array(["DEFGHIJK", "ABC"])]
test03 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays03)
test03
# array(['ABC DEFGHIJK', 'DEF ABC'], dtype='<U12')  -> NOT cut after 7 signs and now <U12 instead of <U7

string_arrays04 = [np.array(["DEFGHIJK", "ABC"]), np.array(["ABC", "DEF"])]
test04 = np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays04)
test04
# array(['DEFGHIJK ABC', 'ABC DEF'], dtype='<U12')  -> NOT cut after 7 signs and now <U12 instead of <U7

Here, it looks like the length of the first concentrated string sets the maximum length of all other / following concentrated strings.

This occurs identically when using axis=1 instead of axis=0:

string_arrays05 = [np.array(["ABC", "DEF"]), np.array(["ABC", "DEFGHIJK"])]
test05 = np.apply_along_axis(func1d=" ".join, axis=1, arr=string_arrays05)
test05 
# array(['ABC DEF', 'ABC DEF'], dtype='<U7')  -> cut after 7 signs

string_arrays06 = [np.array(["ABC", "DEFGHIJK"]), np.array(["ABC", "DEF"])]
test06 = np.apply_along_axis(func1d=" ".join, axis=1, arr=string_arrays06)
test06 
# array(['ABC DEFGHIJK', 'ABC DEF'], dtype='<U12')  -> NOT cut after 7 signs and now <U12 instead of <U7

@seisman
Copy link
Member Author

seisman commented Oct 8, 2023

If you do Google search "apply_along_axis string arrays", you will see many posts that have exactly the same error, and also the upstream issue report numpy/numpy#8352.

@seisman
Copy link
Member Author

seisman commented Oct 8, 2023

>>> import numpy as np
>>> string_arrays = [np.array(["ABC", "DEF"]), np.array(["ABC", "DEFGHIJK"])]
>>> %timeit np.apply_along_axis(func1d=" ".join, axis=0, arr=string_arrays)
37.3 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit np.array([" ".join(vals) for vals in zip(*string_arrays)])
3.02 µs ± 278 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

The new & correct version is 10x faster than the old & incorrect version!

@weiji14
Copy link
Member

weiji14 commented Oct 8, 2023

Thanks @seisman for spotting the bug, fixing it and adding the benchmark!

BTW: a test related to this bug will be added in PR #2720.

Could you try to add a unit test to test_clib.py under test_virtualfile_from_vectors_*? Want to make sure we're capturing this on the clib level and not just via fig.text.

@seisman
Copy link
Member Author

seisman commented Oct 9, 2023

Thanks @seisman for spotting the bug, fixing it and adding the benchmark!

BTW: a test related to this bug will be added in PR #2720.

Could you try to add a unit test to test_clib.py under test_virtualfile_from_vectors_*? Want to make sure we're capturing this on the clib level and not just via fig.text.

Modified the existing test test_virtualfile_from_vectors_two_string_or_object_columns to catch this bug.

With the main branch, the test fails:

E           AssertionError: assert '0\t5\ta pqrs...t9\tklmnolo\n' == '0\t5\ta pqrs...mnolooong $\n'
E               0	5	a pqrst
E               1	6	bc uvwx
E               2	7	def yz!
E               3	8	ghij @#
E             - 4	9	klmnolooong $
E             ?  	 	       ------
E             + 4	9	klmnolo

Done in 80e500c.

Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the existing test test_virtualfile_from_vectors_two_string_or_object_columns to catch this bug.

Brilliant! Wish I had used a different string length back in #520 (you actually asked me to make variable length strings in #520 (comment), but I didn't make sure that string1 + string2 were different lengths) 😆

@seisman seisman removed the needs review This PR has higher priority and needs review. label Oct 9, 2023
@seisman seisman changed the title Fix the bug of elementwisely joining string arrays with a space clib: Fix the bug when passing two columns of strings with variable lengths to the GMT C API Oct 9, 2023
@seisman seisman changed the title clib: Fix the bug when passing two columns of strings with variable lengths to the GMT C API clib: Fix the bug when passing multiple columns of strings with variable lengths to the GMT C API Oct 9, 2023
@seisman seisman added the final review call This PR requires final review and approval from a second reviewer label Oct 9, 2023
@seisman
Copy link
Member Author

seisman commented Oct 9, 2023

I've revised the PR title to make it more descriptive as a changelog entry.

@seisman seisman merged commit 3d401bb into main Oct 9, 2023
@seisman seisman deleted the put-string branch October 9, 2023 03:58
@seisman seisman removed the final review call This PR requires final review and approval from a second reviewer label Oct 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants