[ArrowStringArray] fix test_astype_int, test_astype_float #41018

Merged
merged 4 commits into from
May 31, 2021

Conversation

simonjayhawkins
Member

No description provided.

@simonjayhawkins simonjayhawkins added the Strings String extension data type and string data label Apr 18, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3 milestone Apr 18, 2021
@simonjayhawkins simonjayhawkins marked this pull request as draft April 19, 2021 14:03
            return self

        elif hasattr(dtype, "__from_arrow__"):
            return dtype.__from_arrow__(self._data)
Member

I don't think we should use this method like this: __from_arrow__ is meant to convert data from arrow for that specific dtype, and doesn't necessarily need to include any casting logic (I think)

Member

Alternative logic could be something like:

elif isinstance(dtype, NumericDtype):
    data = self._data.cast(pa.from_numpy_dtype(dtype.numpy_dtype))
    return dtype.__from_arrow__(data)

that would specifically work for the numeric masked arrays, and not rely on __from_arrow__ to do the casting (but rely on pyarrow for that).

This would already support casting to nullable integer / float to get the tests passing.

An example of why we can't rely on __from_arrow__ in general: if we did string_array.astype(pd.PeriodDtype("D")) here, it would fail in __from_arrow__.

Member Author

An example of why we can't rely on __from_arrow__ in general: if we did string_array.astype(pd.PeriodDtype("D")) here, it would fail in __from_arrow__.

Indeed, we have a similar problem with StringArray as well (#40566), where

        elif isinstance(dtype, ExtensionDtype):
            cls = dtype.construct_array_type()
            return cls._from_sequence(self, dtype=dtype, copy=copy)

fails in many cases because _from_sequence, like __from_arrow__, often does not support casting.

@simonjayhawkins
Member Author

The changes to to_numpy are only tested implicitly (and the behavior differs from StringArray), so I will need to add tests that cover them explicitly along with the changes requested here. I will open separate PRs for to_numpy and astype and close this one to clear the queue.

@simonjayhawkins simonjayhawkins marked this pull request as ready for review May 31, 2021 14:46
Contributor

@jreback jreback left a comment

looks fine, some comments for followups

@@ -737,6 +745,24 @@ def value_counts(self, dropna: bool = True) -> Series:

        return Series(counts, index=index).astype("Int64")

    def astype(self, dtype, copy=True):
Contributor

Can you add type annotations here? (A follow-up is OK.)

expected = np.array([1, 2, 3], dtype="int64")
tm.assert_numpy_array_equal(result, expected)

arr = pd.array(["1", pd.NA, "3"], dtype=dtype)
Contributor

Probably best in a dedicated _errors test.

Member Author

sure. done in simonjayhawkins@e1577d4

will open a PR with other follow-ups

@jreback jreback merged commit b117ab5 into pandas-dev:master May 31, 2021
@jreback
Contributor

jreback commented May 31, 2021

merging, @simonjayhawkins if you can note the followups

@simonjayhawkins simonjayhawkins deleted the ArrowStringArray.astype branch May 31, 2021 17:11
@simonjayhawkins
Member Author

merging, @simonjayhawkins if you can note the followups

sure. The typing is included in #35169 (comment) and will be done after the work by @Dr-Irv sorting out the base EA types.

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request May 31, 2021
TLouf pushed a commit to TLouf/pandas that referenced this pull request Jun 1, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021