Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP [ArrayManager] API: setitem to set new columns / loc+iloc to update inplace #39578

Conversation

jorisvandenbossche
Copy link
Member

xref indexing work item of #39146

This actually started as trying to get the tests passing for tests/frame/indexing, but while doing that I needed to make decisions on what the behaviour should be for setitem / loc / iloc assignment (so related other ongoing discussions about this).

Very much a draft, but a good way to see what we actually want / the consequences of those choices.

cc @jbrockmendel

isinstance(arr, np.ndarray)
and is_datetime64_dtype(arr.dtype)
and is_scalar(value)
and isna(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls use is_valid_nat_for_dtype, otherwise this will incorrectly let through td64 nat

(i know i know, this is just draft and youre not looking for comments like this, but this could easily fall through the cracks)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is certainly a comment that is useful ;) (I was actually planning to comment on this and ask you, because I already realized the same would need to be done for timedelta, that just didn't occur in the tests)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in this case we explicitly want to include np.nan (which I think is_valid_nat_for_dtype doesn't do?), since we allow setting np.nan or None in a datetime column, but a datetime64[ns] numpy array doesn't allow this (and the same for timedelta64)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which I think is_valid_nat_for_dtype doesn't do?

incorrect

# TODO what is this used for?
# def setitem(self, indexer, value) -> ArrayManager:
# return self.apply_with_block("setitem", indexer=indexer, value=value)
def setitem(self, indexer, value, column_idx) -> ArrayManager:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block.setitem isn't always inplace, its a try-inplace-fallback-to-cast

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Feb 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but not in ArrayManager (I know I need to open an issue about that to discuss this in general -> #39584)

@@ -454,7 +471,7 @@ def is_mixed_type(self) -> bool:

@property
def is_numeric_mixed_type(self) -> bool:
return False
return all(is_numeric_dtype(t) for t in self.get_dtypes())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the "mixed" be meaningful here, i.e. what if we're homogeneous dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it will also return True, like it does for BlockManager (this basically only wants to know that all columns are numeric, while not necessarily being all the same dtype)

tm.assert_frame_equal(df, exp)

def test_loc_setitem_single_row_categorical(self):
if using_array_manager:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this desired behavior?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending outcome of the discussion in #39584, the "a" column has integer dtype and should preserve that, so string categorical cannot be set

@@ -579,6 +596,8 @@ def test_setitem_cast(self, float_frame):
float_frame["something"] = 2.5
assert float_frame["something"].dtype == np.float64

@td.skip_array_manager_invalid_test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you comment on why the test is invalid? (maybe skip_array_manager_invalid_test could be callable with a "reason" keyword)

assert df["timestamp"].dtype == np.object_
assert df.loc["b", "timestamp"] == iNaT
if not using_array_manager:
# TODO(ArrayManager) setting iNaT in DatetimeArray actually sets NaT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one way to avoid this is be using ensure_wrapped_if_datetimelike in ArrayManager.setitem to ensure you have a DTA and never a ndarray[dt64]

elif isinstance(indexer, (ABCSeries, ABCIndex, np.ndarray, list)):
if isinstance(indexer, list):
elif isinstance(indexer, (ABCSeries, ABCIndex, np.ndarray, list, range)):
if isinstance(indexer, (list, range)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the range case can be implemented separately + more efficiently, similar to the slice case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean that a range object should have been converted to a slice before calling length_of_indexer?

At the moment it's simply the _setitem_with_indexer_split_path case that calls that method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but it's probably because of calling self.iloc._setitem_with_indexer(..) in frame.py instead of self.iloc[..] = .. that the conversion doesn't happen anymore (iloc.__setitem__ calls indexer = self._get_setitem_indexer(key) before passing the indexer to _setitem_with_indexer and the _get_setitem_indexer converts a range object in a list.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was just thinking (indexer.start - indexer.stop) // indexer.step

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this not work?

"""
DataFrame with 3 level MultiIndex (year, month, day) covering
first 100 business days from 2000-01-01 with random data
"""
if using_array_manager:
# TODO(ArrayManager) groupby
pytest.skip("Not yet implemented for ArrayManager")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the decorator be used for this?

@@ -1797,7 +1799,7 @@ def _setitem_with_indexer_frame_value(self, indexer, value: DataFrame, name: str

self._setitem_single_column(loc, val, pi)

def _setitem_single_column(self, loc: int, value, plane_indexer):
def _setitem_single_column(self, loc: int, value, plane_indexer, overwrite=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

annotate

@@ -1728,7 +1730,7 @@ def _setitem_with_indexer_split_path(self, indexer, value, name: str):

# scalar value
for loc in ilocs:
self._setitem_single_column(loc, value, pi)
self._setitem_single_column(loc, value, pi, overwrite=name == "setitem")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe define overwrite once earlier on?

@@ -1684,7 +1684,9 @@ def _setitem_with_indexer_split_path(self, indexer, value, name: str):

elif len(ilocs) == 1 and lplane_indexer == len(value) and not is_scalar(pi):
# We are setting multiple rows in a single column.
self._setitem_single_column(ilocs[0], value, pi)
self._setitem_single_column(
ilocs[0], value, pi, overwrite=name == "setitem"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think we ever get name=="setitem" ATM, maybe update the docstring to describe what this indicates?

@jbrockmendel
Copy link
Member

This has some overlap with #39510; would that solve this issue?

@jbrockmendel
Copy link
Member

@jorisvandenbossche is #40380 helpful for this? if not im going to mothball it

@jorisvandenbossche
Copy link
Member Author

Sorry for the slow reply here. I need to look into it again, but in any case, a part of the test changes of this PR was broken off and is already in master (#40323, #40325).

Another part of the code changes here were related to making loc and iloc update in place (#39584)

I need to look again at #40380 to know if it would be helpful for this.

(I generally plan to look at the indexing code starting next week, for the copy-on-write exploration)

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 26, 2021
@jreback
Copy link
Contributor

jreback commented Nov 28, 2021

this is quite old, happen to reopen if actively worked on.

@jreback jreback closed this Nov 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants