Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix column_arrays for array manager #45001

Merged
merged 6 commits into from
Dec 28, 2021
Merged

Conversation

Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Dec 21, 2021

Note - I was doing this in parallel with @phofl skipping the JSON tests for the array manager. So I undid those changes so that the JSON code tests with array manager are done.

@@ -794,7 +794,14 @@ def column_arrays(self) -> list[ArrayLike]:
"""
Used in the JSON C code to access column arrays.
"""
return self.arrays

def convert_array(arr: ArrayLike) -> ArrayLike:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract_array ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't work to convert the ExtensionArray to a numpy array:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.4.0.dev0+1440.g5ee2c0229b.dirty'

In [3]: from pandas.core.construction import extract_array

In [4]: a=pd.array([1,2,3])

In [5]: a
Out[5]:
<IntegerArray>
[1, 2, 3]
Length: 3, dtype: Int64

In [6]: extract_array(a, True)
Out[6]:
<IntegerArray>
[1, 2, 3]
Length: 3, dtype: Int64

That may be a bug in the implementation of extract_array() . Should I fix that and use it instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may be a bug in the implementation of extract_array() . Should I fix that and use it instead?

No, that is behaving as intended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may be a bug in the implementation of extract_array() . Should I fix that and use it instead?

No, that is behaving as intended.

OK, the docs aren't clear, because the parameter extract_numpy=True then only works for numpy backed extension arrays, otherwise it is ignored for other EA types (e.g., Categorical)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy backed extension arrays

yep, that is poor wording on the 'obj' arg; the 'extract_numpy' part of the docstring gets it right. ill write myself a note to fix the wording.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why isn't np.asarray the right answer here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why isn't np.asarray the right answer here?

That would work, yes.

@jreback jreback added the IO JSON read_json, to_json, json_normalize label Dec 21, 2021
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Dec 22, 2021

@jreback I think this is good to go - all failures are related to the distutils issue.

@jbrockmendel
Copy link
Member

@Dr-Irv can you merge master

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Dec 28, 2021

@Dr-Irv can you merge master

Done. One of the tests timed out - should I try again?

@jreback
Copy link
Contributor

jreback commented Dec 28, 2021

@Dr-Irv can you merge master

Done. One of the tests timed out - should I try again?

umm can you just use np.asarray?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Dec 28, 2021

umm can you just use np.asarray?

An ExtensionArray has the to_numpy() method, which also handles NA values correctly. The default implementation does use np.asarray(), but someone implementing an EA subclass has the option of overriding the implementation, so it seems we should use that instead.

@jreback
Copy link
Contributor

jreback commented Dec 28, 2021

ok can u see if we repeat this code elsewhere iirc o have seen this a few times and don't want to reinvent the wheel yet again

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Dec 28, 2021

ok can u see if we repeat this code elsewhere iirc o have seen this a few times and don't want to reinvent the wheel yet again

There is one place:

pandas/core/arrays/_mixins.py has class NDArrayBackedExtensionArray with method _validate_searchsorted_value() that is identical to convert_array() above. But the classes NDArrayBackedExtensionArray and ArrayManager are unrelated.

So a solution would be to move convert_array() to pandas/core/common.py, and change calls to _validate_search_sorted() to use that convert_array() instead. Let me know if I should do that.

@jreback
Copy link
Contributor

jreback commented Dec 28, 2021

ok can u see if we repeat this code elsewhere iirc o have seen this a few times and don't want to reinvent the wheel yet again

There is one place:

pandas/core/arrays/_mixins.py has class NDArrayBackedExtensionArray with method _validate_searchsorted_value() that is identical to convert_array() above. But the classes NDArrayBackedExtensionArray and ArrayManager are unrelated.

So a solution would be to move convert_array() to pandas/core/common.py, and change calls to _validate_search_sorted() to use that convert_array() instead. Let me know if I should do that.

yep let's do it, though I think you can leave the convert_array function in core/arrays/_mixins (or similar) @jbrockmendel could suggest a good location

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Dec 28, 2021

yep let's do it, though I think you can leave the convert_array function in core/arrays/_mixins (or similar) @jbrockmendel could suggest a good location

I will await the suggested location from @jbrockmendel before making that change

@jbrockmendel
Copy link
Member

ExtensionBlock.values_for_json just uses np.asarray(self.values), so i dont think the distinction with to_numpy makes a difference. DatetimeTZBlock.values_for_json does something more performant which you could reuse.

Just use np.asarray and be done with it.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Dec 28, 2021

Just use np.asarray and be done with it.

OK that is in commit 178b2dd

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Dec 28, 2021

@jbrockmendel @jreback this is now all green and just uses np.asarray()

@mroeschke mroeschke added this to the 1.4 milestone Dec 28, 2021
@jreback jreback merged commit 61bcdff into pandas-dev:master Dec 28, 2021
@jreback
Copy link
Contributor

jreback commented Dec 28, 2021

thanks @Dr-Irv

@Dr-Irv Dr-Irv deleted the issue44994 branch February 13, 2023 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: Test experimental data manager failing due to new JSON format for ExtensionArray dtypes
4 participants