BUG (string dtype): convert dictionary input to materialized string array in ArrowStringArray constructor #59479


jorisvandenbossche
Member

#54074 allowed passing dictionary-encoded string data to the ArrowStringArray constructor, because such values can occur when reading partitioned datasets (it fixed #53951).

However, when a column has string dtype but is backed by a dictionary-encoded pyarrow array, our StringArray implementation is not set up for that, so calling string-specific functionality can raise errors. Example:

In [1]: import pandas as pd; import pyarrow as pa

In [2]: arr = pd.core.arrays.ArrowStringArray(pa.array(["a", "b", None, "a"], pa.large_string()).dictionary_encode())

In [3]: arr
Out[3]: 
<ArrowStringArray>
['a', 'b', <NA>, 'a']
Length: 4, dtype: string

In [4]: pd.Series(arr)
Out[4]: 
0       a
1       b
2    <NA>
3       a
dtype: string

In [5]: pd.Series(arr).str.len()
...
ArrowNotImplementedError: Function 'utf8_length' has no kernel matching input types (dictionary<values=large_string, indices=int32, ordered=0>)

Maybe at some point we could properly support storing dictionary-encoded string values under the hood. Until then, I think it is better to materialize dictionary-encoded input to a plain pyarrow string array, which is what this PR does.

  • Tests added and passed if fixing a bug or adding a new feature
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Aug 11, 2024
@mroeschke mroeschke added this to the 3.0 milestone Aug 12, 2024
@mroeschke mroeschke merged commit f0b8db4 into pandas-dev:main Aug 12, 2024
51 checks passed
@mroeschke
Member

Thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the string-dtype-arrow-disallow-dict-storage branch August 12, 2024 17:58
@jorisvandenbossche jorisvandenbossche modified the milestones: 3.0, 2.3 Aug 20, 2024
Labels
backported Strings String extension data type and string data