Fix bug when loading few columns of a dataset with many primary indices #446

mlondschien · 2021-04-01T10:22:06Z

Description:

On master:

In [1]: import pandas as pd
   ...: import storefact
   ...: from functools import partial
   ...: from kartothek.io.eager import store_dataframes_as_dataset, read_table
   ...: 
   ...: path = "/tmp"
   ...: uuid = "dataset"
   ...: 
   ...: df = pd.DataFrame({"w": [-1, -2], "x": [1, 2], "y": ["a", "b"], "z": [0.0, 1.1]})
   ...: 
   ...: store = partial(storefact.get_store_from_url, f"hfs://{path}?create_if_missing=False")
   ...: dm = store_dataframes_as_dataset(
   ...:     dfs=[df],
   ...:     dataset_uuid=uuid,
   ...:     store=store,
   ...:     partition_on=["w", "x", "y"],
   ...:     overwrite=True,
   ...: )
   ...: 
   ...: df = read_table(dataset_uuid=uuid, store=store, columns=["y", "z"])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-a5c31589801c> in <module>
     18 )
     19 
---> 20 df = read_table(dataset_uuid=uuid, store=store, columns=["y", "z"])

~/code/kartothek/kartothek/io/eager.py in read_table(dataset_uuid, store, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates, factory)
    248         dataset_uuid=dataset_uuid, store=store, factory=factory,
    249     )
--> 250     partitions = read_dataset_as_dataframes(
    251         columns=columns,
    252         predicate_pushdown_to_io=predicate_pushdown_to_io,

~/code/kartothek/kartothek/io/eager.py in read_dataset_as_dataframes(dataset_uuid, store, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates, factory, dispatch_by)
    135     )
    136 
--> 137     mps = read_dataset_as_metapartitions(
    138         columns=columns,
    139         predicate_pushdown_to_io=predicate_pushdown_to_io,

~/code/kartothek/kartothek/io/eager.py in read_dataset_as_metapartitions(dataset_uuid, store, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates, factory, dispatch_by)
    202         dispatch_by=dispatch_by,
    203     )
--> 204     return list(ds_iter)
    205 
    206 

~/code/kartothek/kartothek/io/iter.py in read_dataset_as_metapartitions__iterator(dataset_uuid, store, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates, factory, dispatch_by)
     87         else:
     88             mp = cast(MetaPartition, mp)
---> 89             mp = mp.load_dataframes(
     90                 store=store,
     91                 columns=columns,

~/code/kartothek/kartothek/io_components/metapartition.py in _impl(self, *method_args, **method_kwargs)
    148         else:
    149             for mp in self:
--> 150                 method_return = method(mp, *method_args, **method_kwargs)
    151                 if not isinstance(method_return, MetaPartition):
    152                     raise ValueError(

~/code/kartothek/kartothek/io_components/metapartition.py in load_dataframes(self, store, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates)
    701         # Metadata version >=4 parse the index columns and add them back to the dataframe
    702 
--> 703         df = self._reconstruct_index_columns(
    704             df=df,
    705             key_indices=indices,

~/code/kartothek/kartothek/io_components/metapartition.py in _reconstruct_index_columns(self, df, key_indices, columns, categories, date_as_object)
    800                 if convert_to_date:
    801                     value = pd.Timestamp(value).to_pydatetime().date()
--> 802             df.insert(pos, primary_key, value)
    803 
    804         return df

~/anaconda3/envs/kartothek/lib/python3.9/site-packages/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
   3622         self._ensure_valid_index(value)
   3623         value = self._sanitize_column(column, value, broadcast=False)
-> 3624         self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
   3625 
   3626     def assign(self, **kwargs) -> "DataFrame":

~/anaconda3/envs/kartothek/lib/python3.9/site-packages/pandas/core/internals/managers.py in insert(self, loc, item, value, allow_duplicates)
   1204             self._blknos = np.append(self._blknos, len(self.blocks))
   1205         else:
-> 1206             self._blklocs = np.insert(self._blklocs, loc, 0)
   1207             self._blknos = np.insert(self._blknos, loc, len(self.blocks))
   1208 

<__array_function__ internals> in insert(*args, **kwargs)

~/anaconda3/envs/kartothek/lib/python3.9/site-packages/numpy/lib/function_base.py in insert(arr, obj, values, axis)
   4556         index = indices.item()
   4557         if index < -N or index > N:
-> 4558             raise IndexError(
   4559                 "index %i is out of bounds for axis %i with "
   4560                 "size %i" % (obj, axis, N))

IndexError: index 2 is out of bounds for axis 0 with size 1

In [2]:

This PR fixes this.

Changelog entry

fjetter

Can you please provide a test for this? From the code changes alone, I would argue that the behaviour is identical

mlondschien · 2021-04-01T14:34:26Z

The above code snipped could serve as a test. There are no tests for _reconstruct_index_columns, but I could add some.

fjetter · 2021-04-05T13:44:23Z

I'm just confused since I assume that the enumerate and the manual increase of the pos counter should be identical so I am not sure what this change is doing.

mlondschien · 2021-04-05T16:56:12Z

We don't increase pos if the condition in L772 holds true, as in the above example. One could also always insert at pos=0, since IIUC the column order gets shuffled later anyways.

mlondschien · 2021-04-13T15:18:40Z

@fjetter could you have another look at this?

fjetter · 2021-04-14T11:27:26Z

Sorry for the delay. Can you please rebase and ping when green?

mlondschien · 2021-04-14T12:37:53Z

Test failures are also on master.

Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (JDASoftwareGroup#466)" This reverts commit fdc9779. Revert "fix mistakes in documentation" This reverts commit 4e4b5e0. Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (JDASoftwareGroup#460)" This reverts commit d027ca2. Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (JDASoftwareGroup#461)" This reverts commit 97cd553. Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (JDASoftwareGroup#458)" This reverts commit e48d67a. Revert "Fix bug when loading few columns of a dataset with many primary indices (JDASoftwareGroup#446)" This reverts commit 90ee486. Revert "Prepare release 4.0.1" This reverts commit b278503. Revert "Fix tests for dask dataframe and delayed backends" This reverts commit 5520f74. Revert "Add end-to-end regression test" This reverts commit 8a3e6ae. Revert "Fix dataset corruption after updates (JDASoftwareGroup#445)" This reverts commit a26e840. Revert "Set release date for 4.0" This reverts commit 08a8094. Revert "Return dask scalar for store and update from ddf (JDASoftwareGroup#437)" This reverts commit 494732d. Revert "Add tests for non-default table (JDASoftwareGroup#440)" This reverts commit 3807a02. Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (JDASoftwareGroup#441)" This reverts commit f7615ec. Revert "Set default for dates_as_object to True (JDASoftwareGroup#436)" This reverts commit 75ffdb5. Revert "Remove inferred indices (JDASoftwareGroup#438)" This reverts commit b1e2535. Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (JDASoftwareGroup#422)" This reverts commit b349cee. Revert "Remove all deprecated arguments (JDASoftwareGroup#434)" This reverts commit 74f0790. Revert "Remove multi table feature (JDASoftwareGroup#431)" This reverts commit 032856a.

Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (#466)" This reverts commit fdc9779. Revert "fix mistakes in documentation" This reverts commit 4e4b5e0. Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (#460)" This reverts commit d027ca2. Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (#461)" This reverts commit 97cd553. Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (#458)" This reverts commit e48d67a. Revert "Fix bug when loading few columns of a dataset with many primary indices (#446)" This reverts commit 90ee486. Revert "Prepare release 4.0.1" This reverts commit b278503. Revert "Fix tests for dask dataframe and delayed backends" This reverts commit 5520f74. Revert "Add end-to-end regression test" This reverts commit 8a3e6ae. Revert "Fix dataset corruption after updates (#445)" This reverts commit a26e840. Revert "Set release date for 4.0" This reverts commit 08a8094. Revert "Return dask scalar for store and update from ddf (#437)" This reverts commit 494732d. Revert "Add tests for non-default table (#440)" This reverts commit 3807a02. Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (#441)" This reverts commit f7615ec. Revert "Set default for dates_as_object to True (#436)" This reverts commit 75ffdb5. Revert "Remove inferred indices (#438)" This reverts commit b1e2535. Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (#422)" This reverts commit b349cee. Revert "Remove all deprecated arguments (#434)" This reverts commit 74f0790. Revert "Remove multi table feature (#431)" This reverts commit 032856a.

mlondschien added 3 commits April 1, 2021 12:13

Only increase pos if necessary.

f6cf787

Changelog.

bfc79c5

Empty line in changelog.

400db3a

fjetter requested changes Apr 1, 2021

View reviewed changes

mlondschien added 2 commits April 1, 2021 18:14

Add test.

891d10a

Don't link to private function.

f74cefb

Merge master.

e71fa1b

mlondschien requested a review from fjetter April 14, 2021 12:37

fjetter approved these changes Apr 14, 2021

View reviewed changes

fjetter merged commit 90ee486 into JDASoftwareGroup:master Apr 14, 2021

mlondschien deleted the only-insert-relevant-columns branch April 14, 2021 14:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bug when loading few columns of a dataset with many primary indices #446

Fix bug when loading few columns of a dataset with many primary indices #446

mlondschien commented Apr 1, 2021

fjetter left a comment

mlondschien commented Apr 1, 2021

fjetter commented Apr 5, 2021

mlondschien commented Apr 5, 2021

mlondschien commented Apr 13, 2021

fjetter commented Apr 14, 2021 •

edited

Loading

mlondschien commented Apr 14, 2021

Fix bug when loading few columns of a dataset with many primary indices #446

Fix bug when loading few columns of a dataset with many primary indices #446

Conversation

mlondschien commented Apr 1, 2021

Description:

fjetter left a comment

Choose a reason for hiding this comment

mlondschien commented Apr 1, 2021

fjetter commented Apr 5, 2021

mlondschien commented Apr 5, 2021

mlondschien commented Apr 13, 2021

fjetter commented Apr 14, 2021 • edited Loading

mlondschien commented Apr 14, 2021

fjetter commented Apr 14, 2021 •

edited

Loading