WIP [ArrayManager] API: setitem to set new columns / loc+iloc to update inplace #39578

jorisvandenbossche · 2021-02-03T14:19:53Z

xref indexing work item of #39146

This actually started as an attempt to get the tests in tests/frame/indexing passing, but while doing that I needed to decide what the behaviour should be for setitem / loc / iloc assignment (so this is related to the other ongoing discussions about this).

Very much a draft, but a good way to see what we actually want / the consequences of those choices.

cc @jbrockmendel

jbrockmendel · 2021-02-03T15:26:33Z

pandas/core/internals/array_manager.py

+            isinstance(arr, np.ndarray)
+            and is_datetime64_dtype(arr.dtype)
+            and is_scalar(value)
+            and isna(value)


pls use is_valid_nat_for_dtype, otherwise this will incorrectly let through a td64 NaT

(I know, I know, this is just a draft and you're not looking for comments like this, but this could easily fall through the cracks)

No, this is certainly a comment that is useful ;) (I was actually planning to comment on this and ask you, because I already realized the same would need to be done for timedelta, that just didn't occur in the tests)

Actually, in this case we explicitly want to include np.nan (which I think is_valid_nat_for_dtype doesn't do?), since we allow setting np.nan or None in a datetime column, but a datetime64[ns] numpy array doesn't allow this (and the same for timedelta64)

> which I think is_valid_nat_for_dtype doesn't do?

incorrect
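
For readers following this thread, here is a minimal sketch of the distinction being discussed: a missing-value check for a datetime64 column should accept None, np.nan and a datetime64 NaT, but reject a timedelta64 NaT. The helper name `is_missing_for_datetime64` is hypothetical and is not the pandas `is_valid_nat_for_dtype` implementation; it uses only NumPy calls and deliberately does not handle `pd.NaT`.

```python
import numpy as np


def is_missing_for_datetime64(value) -> bool:
    """Hypothetical helper: is `value` an acceptable "missing" marker
    for a datetime64 column?"""
    if value is None:
        return True
    # A timedelta64 NaT is missing, but for the wrong dtype: reject it.
    if isinstance(value, np.timedelta64):
        return False
    if isinstance(value, np.datetime64):
        return bool(np.isnat(value))
    # np.nan (and np.float64 NaN) are floats that compare unequal to themselves.
    if isinstance(value, float):
        return value != value
    return False


for candidate in (None, np.nan, np.datetime64("NaT"), np.timedelta64("NaT")):
    print(repr(candidate), is_missing_for_datetime64(candidate))
```

A check written as "is a scalar and isna" would return True for all four candidates above, which is the gap the review comment points at.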

jbrockmendel · 2021-02-03T15:29:06Z

pandas/core/internals/array_manager.py

-    # TODO what is this used for?
-    # def setitem(self, indexer, value) -> ArrayManager:
-    #     return self.apply_with_block("setitem", indexer=indexer, value=value)
+    def setitem(self, indexer, value, column_idx) -> ArrayManager:


Block.setitem isn't always inplace, it's a try-inplace-fallback-to-cast

Yep, but not in ArrayManager (I know I need to open an issue about that to discuss this in general -> #39584)
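
To make the "try in place, fall back to cast" pattern concrete, here is a rough NumPy-only sketch. The function `setitem_with_fallback` is a hypothetical illustration, not the code of Block.setitem: it writes in place when the array's dtype can hold the value and otherwise returns a new array with a wider dtype.

```python
import numpy as np


def setitem_with_fallback(arr: np.ndarray, idx, value) -> np.ndarray:
    """Hypothetical helper: set arr[idx] = value in place if the dtype
    allows it, otherwise cast to a common dtype and set on the copy."""
    value_dtype = np.asarray(value).dtype
    target = np.result_type(arr.dtype, value_dtype)
    if target == arr.dtype:
        arr[idx] = value  # in place: the caller's array is mutated
        return arr
    out = arr.astype(target)  # fallback: new array with a wider dtype
    out[idx] = value
    return out


ints = np.array([1, 2, 3], dtype="int64")
same = setitem_with_fallback(ints, 0, 10)      # int fits: same object back
floats = setitem_with_fallback(ints, 1, 2.5)   # float does not fit: new array
print(same is ints, floats.dtype, ints.dtype)
```

An always-in-place setitem, as discussed for ArrayManager, would instead have to raise or coerce the value in the second call rather than return a new array.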

jbrockmendel · 2021-02-03T15:29:29Z

pandas/core/internals/array_manager.py

@@ -454,7 +471,7 @@ def is_mixed_type(self) -> bool:

    @property
    def is_numeric_mixed_type(self) -> bool:
-        return False
+        return all(is_numeric_dtype(t) for t in self.get_dtypes())


should the "mixed" be meaningful here, i.e. what if we're homogeneous dtype?

Then it will also return True, like it does for BlockManager (this basically only wants to know that all columns are numeric, while not necessarily being all the same dtype)
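
A small sketch of what the new property computes, using the public `pandas.api.types.is_numeric_dtype`. The standalone function `all_columns_numeric` and the dtype lists are made up for illustration; in the PR the dtypes come from `self.get_dtypes()`.

```python
import numpy as np
from pandas.api.types import is_numeric_dtype


def all_columns_numeric(dtypes) -> bool:
    """Hypothetical stand-in for the property body in the diff above."""
    return all(is_numeric_dtype(t) for t in dtypes)


homogeneous = [np.dtype("float64"), np.dtype("float64")]
mixed_numeric = [np.dtype("int64"), np.dtype("float64")]
with_object = [np.dtype("int64"), np.dtype("object")]

for name, dtypes in [
    ("homogeneous", homogeneous),
    ("mixed_numeric", mixed_numeric),
    ("with_object", with_object),
]:
    print(name, all_columns_numeric(dtypes))
```

Both the homogeneous and the mixed numeric cases pass the check, which matches the reply: "mixed" here only means "not necessarily one dtype".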

jbrockmendel · 2021-02-03T15:30:12Z

pandas/tests/frame/indexing/test_categorical.py

-        tm.assert_frame_equal(df, exp)
-
-    def test_loc_setitem_single_row_categorical(self):
+        if using_array_manager:


why is this desired behavior?

Pending outcome of the discussion in #39584, the "a" column has integer dtype and should preserve that, so string categorical cannot be set

jbrockmendel · 2021-02-03T15:31:01Z

pandas/tests/frame/indexing/test_indexing.py

@@ -579,6 +596,8 @@ def test_setitem_cast(self, float_frame):
        float_frame["something"] = 2.5
        assert float_frame["something"].dtype == np.float64

+    @td.skip_array_manager_invalid_test


can you comment on why the test is invalid? (maybe skip_array_manager_invalid_test could be callable with a "reason" keyword)
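
One possible shape for a callable decorator with a "reason" keyword is sketched below. Everything here is an assumption for illustration: the flag `USING_ARRAY_MANAGER`, the message format and the example reason are invented, and the sketch uses the standard library's `unittest.skipIf` as a stand-in so that it runs without extra dependencies; it is not how the pandas test decorators are implemented.

```python
import unittest

# Hypothetical flag; in a real test suite this would be derived from the
# active data-manager option rather than hard-coded.
USING_ARRAY_MANAGER = True


def skip_array_manager_invalid_test(reason: str):
    """Hypothetical callable variant that records why the test is invalid."""
    message = f"Test invalid for ArrayManager: {reason}"
    return unittest.skipIf(USING_ARRAY_MANAGER, message)


@skip_array_manager_invalid_test(reason="relies on block consolidation")
def test_setitem_example():
    raise AssertionError("body should not run when the test is skipped")


try:
    test_setitem_example()
except unittest.SkipTest as exc:
    print("skipped:", exc)
```

The reason string then shows up in the skip report, so a reader does not have to dig through the PR history to learn why the test does not apply.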

jbrockmendel · 2021-02-03T15:32:22Z

pandas/tests/frame/indexing/test_indexing.py

-        assert df["timestamp"].dtype == np.object_
-        assert df.loc["b", "timestamp"] == iNaT
+        if not using_array_manager:
+            # TODO(ArrayManager) setting iNaT in DatetimeArray actually sets NaT


one way to avoid this is by using ensure_wrapped_if_datetimelike in ArrayManager.setitem to ensure you have a DatetimeArray (DTA) and never an ndarray[dt64]

jbrockmendel · 2021-02-04T16:27:04Z

pandas/core/indexers.py

-    elif isinstance(indexer, (ABCSeries, ABCIndex, np.ndarray, list)):
-        if isinstance(indexer, list):
+    elif isinstance(indexer, (ABCSeries, ABCIndex, np.ndarray, list, range)):
+        if isinstance(indexer, (list, range)):


the range case can be implemented separately + more efficiently, similar to the slice case

You mean that a range object should have been converted to a slice before calling length_of_indexer?

At the moment it's simply the _setitem_with_indexer_split_path case that calls that method.

Ah, but it's probably because of calling self.iloc._setitem_with_indexer(..) in frame.py instead of self.iloc[..] = .. that the conversion doesn't happen anymore (iloc.__setitem__ calls indexer = self._get_setitem_indexer(key) before passing the indexer to _setitem_with_indexer, and _get_setitem_indexer converts a range object into a list).

I was just thinking (indexer.stop - indexer.start) // indexer.step

does this not work?
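
A quick way to check the arithmetic for the range case: plain floor division of the span by the step undercounts whenever the span is not an exact multiple of the step, so a closed form needs ceiling division and a clamp at zero. The helper `range_length` below is a hypothetical sketch, not code from the PR; note that `len()` on a range object already computes the same result without materializing a list.

```python
def range_length(start: int, stop: int, step: int) -> int:
    """Hypothetical helper: closed-form length of range(start, stop, step)."""
    if step == 0:
        raise ValueError("step must not be zero")
    if step > 0:
        # Ceiling division of (stop - start) by step, clamped at zero.
        return max(0, (stop - start + step - 1) // step)
    # Negative step: mirror the computation.
    return max(0, (start - stop - step - 1) // (-step))


cases = [(0, 5, 2), (0, 6, 2), (5, 0, -2), (10, 0, -3), (3, 3, 1), (5, 0, 1)]
for start, stop, step in cases:
    indexer = range(start, stop, step)
    naive = (indexer.stop - indexer.start) // indexer.step
    print(indexer, len(indexer), range_length(start, stop, step), naive)
```

For `range(0, 5, 2)` the naive floor division gives 2 while the range has 3 elements, so simply calling `len(indexer)` is probably the safest option.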

jbrockmendel · 2021-02-12T23:52:02Z

pandas/conftest.py

    """
    DataFrame with 3 level MultiIndex (year, month, day) covering
    first 100 business days from 2000-01-01 with random data
    """
+    if using_array_manager:
+        # TODO(ArrayManager) groupby
+        pytest.skip("Not yet implemented for ArrayManager")


can the decorator be used for this?

jbrockmendel · 2021-02-12T23:53:11Z

pandas/core/indexing.py

@@ -1797,7 +1799,7 @@ def _setitem_with_indexer_frame_value(self, indexer, value: DataFrame, name: str

                self._setitem_single_column(loc, val, pi)

-    def _setitem_single_column(self, loc: int, value, plane_indexer):
+    def _setitem_single_column(self, loc: int, value, plane_indexer, overwrite=True):


jbrockmendel · 2021-02-12T23:53:30Z

pandas/core/indexing.py

@@ -1728,7 +1730,7 @@ def _setitem_with_indexer_split_path(self, indexer, value, name: str):

            # scalar value
            for loc in ilocs:
-                self._setitem_single_column(loc, value, pi)
+                self._setitem_single_column(loc, value, pi, overwrite=name == "setitem")


maybe define overwrite once earlier on?

jbrockmendel · 2021-02-12T23:57:31Z

pandas/core/indexing.py

@@ -1684,7 +1684,9 @@ def _setitem_with_indexer_split_path(self, indexer, value, name: str):

            elif len(ilocs) == 1 and lplane_indexer == len(value) and not is_scalar(pi):
                # We are setting multiple rows in a single column.
-                self._setitem_single_column(ilocs[0], value, pi)
+                self._setitem_single_column(
+                    ilocs[0], value, pi, overwrite=name == "setitem"


I don't think we ever get name == "setitem" ATM; maybe update the docstring to describe what this indicates?

jbrockmendel · 2021-02-12T23:59:13Z

This has some overlap with #39510; would that solve this issue?

jbrockmendel · 2021-03-28T16:29:53Z

@jorisvandenbossche is #40380 helpful for this? If not, I'm going to mothball it.

jorisvandenbossche · 2021-04-22T21:32:53Z

Sorry for the slow reply here. I need to look into it again, but in any case, a part of the test changes of this PR was broken off and is already in master (#40323, #40325).

Another part of the code changes here was related to making loc and iloc update in place (#39584).

I need to look again at #40380 to know if it would be helpful for this.

(I generally plan to look at the indexing code starting next week, for the copy-on-write exploration)

github-actions · 2021-05-26T00:08:18Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2021-11-28T21:04:01Z

This is quite old; happy to reopen if actively worked on.

[ArrayManager] API: setitem to set new columns / loc+iloc to update inplace

3220c80

jbrockmendel reviewed Feb 3, 2021

get tests/indexing/ passing

803e60e

jbrockmendel reviewed Feb 4, 2021

Merge remote-tracking branch 'upstream/master' into am-indexing-setitem

c72240c

jorisvandenbossche mentioned this pull request Feb 8, 2021

[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539

Closed

Merge remote-tracking branch 'upstream/master' into am-indexing-setitem

69a6aad

jorisvandenbossche mentioned this pull request Feb 12, 2021

BUG: setitem with ArrayManager not preserving PeriodDtype #39763

Closed

jbrockmendel reviewed Feb 12, 2021

jbrockmendel mentioned this pull request Feb 18, 2021

REF/API: DataFrame.__setitem__ never operate in-place #39510

Merged

This was referenced Mar 9, 2021

[ArrayManager] TST: run (+fix/skip) pandas/tests/frame/indexing tests #40323

Merged

[ArrayManager] TST: run (+fix/skip) pandas/tests/indexing tests #40325

Merged

simonjayhawkins added the Indexing label (related to indexing on series/frames, not to indexes themselves) Mar 14, 2021

github-actions bot added the Stale label May 26, 2021

jreback closed this Nov 28, 2021
