
ENH: EADtype._find_compatible_dtype #53106

Closed
wants to merge 18 commits

Conversation


@jbrockmendel jbrockmendel commented May 5, 2023

Tests copied from #52833, so this may close #52235.

Needs a lot of cleanup, in particular docs. The ArrowDtype._find_compatible_dtype implementation is quite kludgy, would appreciate help from someone more knowledgeable.

I think _find_compatible_dtype is the "correct" long-term non-kludge solution for setitem-with-expansion.

@jbrockmendel jbrockmendel added setitem-with-expansion ExtensionArray Extending pandas with custom dtypes or arrays. labels May 5, 2023

```python
def _maybe_promote(self, item: Any) -> tuple[DtypeObj, Any]:
    if isinstance(item, pa.Scalar):
        if not item.is_valid:
```
Member:

Shouldn't NA always be able to be inserted into ArrowExtensionArray?

Member Author:

I'm not sure. pyarrow nulls are typed, so we could plausibly want to disallow e.g. `<pyarrow.TimestampScalar: None>` in a pyarrow integer dtype.

```python
elif isna(value):
    new_dtype = None

new_dtype = None
if is_list_like(value):
```
Member:

Note that for ArrowDtype with a pa.list_ type, we would want to treat value like a scalar, e.g.:

```python
ser = pd.Series([[1, 1]], dtype=pd.ArrowDtype(pa.list_(pa.int64())))
ser[0] = [1, 2]
```

Member Author:

Yeah. Getting rid of this is_list_like check causes us to incorrectly raise in numpy non-object cases when using a list value (for which we don't have any tests). We can fix that in this PR or separately, as it is a bit more invasive.
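For context, the `is_list_like` check referenced here is pandas' standard helper, which counts lists and tuples as list-like but excludes strings and scalars:

```python
from pandas.api.types import is_list_like

assert is_list_like([1, 2])      # lists are list-like
assert is_list_like((1, 2))      # so are tuples
assert not is_list_like("ab")    # strings are deliberately excluded
assert not is_list_like(4)       # scalars are not list-like
```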

```python
    item = item.as_py()

elif item is None or item is libmissing.NA:
    # TODO: np.nan? use is_valid_na_for_dtype
```
Member:

Since pyarrow distinguishes nan from NA, we possibly want to allow nan if `pa.types.is_floating(self.pyarrow_dtype)`.

Member Author:

What to do here depends on making a decision about when/how to distinguish between np.nan and pd.NA (which I hope to finally nail down at the sprint). Doing this The Right Way would involve something like implementing EA._is_valid_na_for_dtype.

```python
# TODO: ask joris for help making these checks more robust
if item.type == self.pyarrow_dtype:
    return self, item.as_py()
if item.type.to_pandas_dtype() == np.int64 and self.kind == "i":
```
Member:

Why is this needed specifically?

Member Author:

This was just to get the tests copied from #52833 passing.


github-actions bot commented Jun 9, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 9, 2023
```python
def _maybe_promote(self, item: Any) -> tuple[DtypeObj, Any]:
    if isinstance(item, pa.Scalar):
        if not item.is_valid:
            # TODO: ask joris for help making these checks more robust
```
Member Author:

@jorisvandenbossche any thoughts here? (not time-sensitive)

@jbrockmendel (Member Author)

Updated with some added tests, and renamed EADtype._maybe_promote to _find_compatible_dtype.

There are some TODOs mostly for pyarrow and related to nan-vs-NA, but I don't think any of them are blockers.

@jbrockmendel jbrockmendel changed the title ENH: EADtype._maybe_promote ENH: EADtype._find_compatible_dtype Jul 28, 2023
@jbrockmendel (Member Author)

_find_compatible_dtype can be used to provide a default implementation of _is_valid_scalar_for, which in turn can be used to implement _from_scalars (both discussed on this week's dev call).
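A hypothetical sketch (not pandas' actual implementation) of how `_is_valid_scalar_for` could be layered on `_find_compatible_dtype`; `ToyDtype` and its behavior are illustrative assumptions:

```python
class ToyDtype:
    """Stand-in for an EADtype that holds plain ints."""

    def _find_compatible_dtype(self, item):
        # Return (dtype to use, coerced item); a different dtype (here None,
        # standing in for object) signals that the scalar does not fit.
        if isinstance(item, int):
            return self, item
        return None, item

    def _is_valid_scalar_for(self, item) -> bool:
        dtype, _ = self._find_compatible_dtype(item)
        return dtype is self

dt = ToyDtype()
assert dt._is_valid_scalar_for(3)
assert not dt._is_valid_scalar_for("a")
```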

@jbrockmendel jbrockmendel mentioned this pull request Jul 31, 2023
@@ -2326,3 +2347,29 @@

```python
def __from_arrow__(self, array: pa.Array | pa.ChunkedArray):
    array_class = self.construct_array_type()
    arr = array.cast(self.pyarrow_dtype, safe=True)
    return array_class(arr)

def _find_compatible_dtype(self, item: Any) -> tuple[DtypeObj, Any]:
    if isinstance(item, pa.Scalar):
```
Member:

What happens if `item` is null, i.e. the pyarrow null?

Member Author:

IIUC pyarrow nulls are now typed. I'd prefer to be strict about making these match, but I don't care that much. I'm hoping @jorisvandenbossche will weigh in.

```python
[
    (pa.scalar(4, type="int32"), 4, "int32[pyarrow]"),
    (pa.scalar(4, type="int64"), 4, "int32[pyarrow]"),
    # (pa.scalar(4.5, type="float64"), 4, "int32[pyarrow]"),
```
Member:

What happens here?

Also, what happens with an int64 scalar and an int32 dtype?

Member Author:

I'd want to follow the same logic we do for numpy dtypes, but I was punting here in expectation of doing it in a follow-up (likely involving Joris expressing an opinion).
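The numpy behavior being referenced can be seen with `np.result_type`: an integer-valued Python scalar combined with int32 keeps int32, while a fractional float forces an upcast to float64.

```python
import numpy as np

assert np.result_type(np.int32, 4) == np.dtype("int32")
assert np.result_type(np.int32, 4.5) == np.dtype("float64")
```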
