
Ensure updates on neither cube nor plain datasets corrupt indices #398

Merged — fjetter merged 8 commits into master from cube_update_index_consistency on Feb 8, 2021

Conversation

fjetter (Collaborator) commented Feb 2, 2021

Description:

This fixes an issue where cubes could remove or blacklist index columns and then update a dataset/cube. The affected index would no longer be updated, leaving the dataset in a corrupt state. The only way to deal with this consistently would be to either remove the index entirely or keep updating it. For plain datasets, the behaviour so far was to update the index regardless of whether the input was omitted. I needed to change that logic slightly, but I think the behaviour is still OK, and this way the two interfaces are again a bit closer together.

The fix is very pragmatic, but everything else would require a rework of the interface and a lot of refactoring :/

jochen-ott-by (Contributor) commented:

The overall approach of using a modified cube data structure internally seems really hacky. It also means that the cube-related metadata written to the dataset is not necessarily the metadata that can be derived directly from the cube object that is passed in.
Wouldn't it make more sense to clearly distinguish between public API data structures (such as the cube) and internal data, possibly derived from the current dataset state, that is then used to make the actual decision which indices to build or update?
Also, I'd like to know what the intended behavior is in the different cases that can arise here (concerning the existence and declaration of indices), including what cube-related metadata will be stored in the dataset.

fjetter (Collaborator, Author) commented Feb 3, 2021

I agree that this approach is hacky, but imho this boils down to the fact that the Cube interface is not a nice API for datasets/cubes with an extended lifetime and multiple owners.
At the very least, indices should not be part of this interface, since we can never know all indices up front. The entire idea of a cube is that it can be extended with a new, possibly indexed, dataset, and this clearly breaks the idea of having one Cube instance describing the entire entity. Therefore, the only clean way for me would be to remove indices entirely from the Cube instance and to remove any automagical index creation, but that would be a rather major change.

Distinguishing between public and internal structures is a proposal, but I'm not willing to introduce yet another layer of indirection for a problem which should rather be solved by removing layers.

Also, I'd like to know what the intended behavior is in the different possible cases that can arise here (concerning existence and declaration of indices), including what cube-related metadata will be stored in the dataset.
From my POV this is rather straightforward. An update must never remove any indices, nor add any additional ones. Indices must be removed or added in a dedicated operation, never during an in-place update. At the very least, this is not supported at the moment (it could be if we wanted, but I'm not sure it's worth the effort).
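
For a plain dataset, such a dedicated operation is a separate index build, roughly like this (a minimal eager sketch; the store URL and dataset name are placeholders, and the exact signature should be double-checked against the API docs):

from storefact import get_store_from_url
from kartothek.io.eager import build_dataset_indices

store_factory = lambda: get_store_from_url("hfs:///tmp/ktk")

# Explicit, dedicated step: read the existing partitions and build the index
# for column "x"; subsequent updates then keep this index up to date.
build_dataset_indices(
    store=store_factory,
    dataset_uuid="my_dataset",
    columns=["x"],
)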

cube-related metadata will be stored in the dataset.

There is no index-related, cube-specific metadata. This is all handled by plain kartothek, and this change aligns the behaviour of cubes with plain datasets as much as possible.

fjetter (Collaborator, Author) commented Feb 3, 2021

FWIW, I tried two different approaches with the intention of not modifying the cube. Both escalated into either a major rewrite or required me to duplicate a lot of logic in many places. This was the "cleanest" solution that didn't break things.

jochen-ott-by (Contributor) commented:

There is no index-related, cube-specific metadata. This is all handled by plain kartothek, and this change aligns the behaviour of cubes with plain datasets as much as possible.

There is hardly any, yes, but the value of cube.suppress_index_on is saved in the metadata.

fjetter (Collaborator, Author) commented Feb 3, 2021

There is hardly any, yes, but the value of cube.suppress_index_on is saved in the metadata.

Well, that slipped my review. That's technically a change in our storage specification, and I don't think it's justified here. I would propose to remove this again.

jochen-ott-by (Contributor) commented:

I agree that this approach is hacky, but imho this boils down to the fact that the Cube interface is not a nice API for datasets/cubes with an extended lifetime and multiple owners.
At the very least, indices should not be part of this interface, since we can never know all indices up front. The entire idea of a cube is that it can be extended with a new, possibly indexed, dataset, and this clearly breaks the idea of having one Cube instance describing the entire entity. Therefore, the only clean way for me would be to remove indices entirely from the Cube instance and to remove any automagical index creation, but that would be a rather major change.

So one idea would be to create this "next version of the cube" already and use it internally everywhere. For now, we could keep the current Cube for the user-facing APIs in order to not break too much.

Distinguishing between public and internal structures is a proposal, but I'm not willing to introduce yet another layer of indirection for a problem which should rather be solved by removing layers.

This layer would serve the clear purpose of decoupling the public API from the internals. If done right, it will make the refactoring you proposed above a lot easier, so it's not some arbitrary layer of indirection as you make it sound...

jochen-ott-by (Contributor) commented:

There is hardly any, yes, but the value of cube.suppress_index_on is saved in the metadata.

Well, that slipped my review. That's technically a change in our storage specification, and I don't think it's justified here. I would propose to remove this again.

Yes, I think this makes sense.
A question I'd have, then, is whether using the "discovered" cube to extend the dataset would work as intended, i.e. whether it would work without error and without an attempt to create any additional indices. I guess with the changes in this PR it would, but again, it would be good to have this behavior clearly specified.

fjetter (Collaborator, Author) commented Feb 3, 2021

So one idea would be to create this "next version of the cube" already and use it internally everywhere.

I honestly do not have an easy proposal for a "next version" to move forward with here. I'm a bit worried about the Cube structure in general. While we probably all agree that the index columns are misplaced here, a more difficult, but not entirely decoupled, question is what to do about the implicit features coupled to the dimension columns, for instance.

The dimension columns currently enforce

  1. uniqueness
  2. sorting
  3. indexing
  4. their existence in another table
  5. ?? (there is most likely something else)

  1. Uniqueness is a nice property to have, but I was never really fond of it considering the distributed nature of our data. Uniqueness can only be enforced at the level of a single partition and therefore feels somewhat artificial. We even have code paths by now (I believe in the dask.dataframe interface) which circumvent it.
  2. Different sorting is required for different kinds of predicate pushdown scenarios, and this is an orthogonal topic to dimensions.
  3. As we've seen with the blacklisting, enforcing indexing on these columns is not feasible.
  4. We've received reports from users (and I believe have run into this ourselves as well) that this constraint is sometimes too restrictive and we should rethink it.

I believe some of these points were introduced to simplify the query planner, but I'm not entirely convinced they are necessary or feasible; imho this would require a somewhat more thorough analysis.

Since we're talking about the public interface, let's park the discussion about behaviour for a second. Putting all of this aside, quite frankly, I'd also be very fond of the public API just being a str, namely the prefix of a cube, similarly to how datasets are dealt with. That would also make everything much easier to share and wouldn't require sharing code.

I think it's clear that this is a longer discussion which is not necessarily connected to this patch. I believe it's an important discussion to have, but we should deal with it decoupled from this patch.

fjetter (Collaborator, Author) commented Feb 3, 2021

A question I'd have, then, is whether using the "discovered" cube to extend the dataset would work as intended, i.e. it would work without error and without an attempt to create any additional indices. I guess with the changes in this PR, it would work, but again it would be good to have this behavior clearly specified.

The intention is that the core kartothek spec is the single source of truth. Regardless of user input, it must never be possible to corrupt a dataset; I believe this is something we agree on.

The question for me is rather: should we raise if the user provides different input than expected, or infer it from what's really stored? We settled on the "let's infer from storage" behaviour back in the day to land somewhere between usability and consistency. I believe it is irrelevant whether the cube was discovered or not. The "cube metadata" is, for most kartothek operations, actually ordinary user metadata (something we might eventually want to change as well).

fjetter force-pushed the cube_update_index_consistency branch from 316d176 to d362dea on February 3, 2021 at 15:02
fjetter (Collaborator, Author) commented Feb 3, 2021

I'll change the reliance on the written metadata in a follow-up PR. I believe this can be done without breaking either forward or backward compatibility.
Also, I believe it is not a big deal after all. The way I understand it, the written metadata would not align with the user input, but it would always align with what was actually written, i.e. discover always returns the true state, and I believe that is what matters.

@@ -19,6 +19,8 @@
 except ImportError:
     from typing import Literal  # type: ignore

+# Literal false is sentinel, see function body of `_ensure_compatible_indices` for details
+INFERRED_INDICES = Union[Literal[False], List[str]]
Contributor (review comment on the diff above):

I think this should be CamelCase, to be consistent with how Python usually writes type aliases: InferredIndices
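
As a side note on why a clear name matters: every consumer has to branch on the sentinel explicitly, roughly like this (a hypothetical sketch with assumed semantics; the real handling lives in _ensure_compatible_indices):

from typing import List, Union

try:
    from typing import Literal
except ImportError:
    from typing_extensions import Literal  # type: ignore

InferredIndices = Union[Literal[False], List[str]]

def handle_indices(inferred: InferredIndices) -> List[str]:
    # The sentinel must be checked with `is False`: an empty list is also falsy,
    # but it means "a list of zero indices", not "nothing was inferred".
    if inferred is False:
        # assumed meaning: no index bookkeeping is required for this update
        return []
    # assumed meaning: keep exactly these existing indices up to date
    return sorted(inferred)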

jochen-ott-by (Contributor) commented:

The intention is that the core kartothek spec is the single source of truth. Regardless of user input, it must never be possible to corrupt a dataset; I believe this is something we agree on.

The question for me is rather: should we raise if the user provides different input than expected, or infer it from what's really stored? We settled on the "let's infer from storage" behaviour back in the day to land somewhere between usability and consistency.

I think it makes complete sense to never write inconsistent data. I think it's also fine to infer from the state rather than from user input, at least for indices (I'm not so sure about other cases, where raising might also be a good option).

Is this documented somewhere? At least for indices, I think this should be part of the user-facing documentation, e.g. here: https://kartothek.readthedocs.io/en/stable/_rst/kartothek.io.dask.dataframe.html#kartothek.io.dask.dataframe.update_dataset_from_ddf (a link to the more general docs would probably be best)

fjetter (Collaborator, Author) commented Feb 3, 2021

Is this documented somewhere?

Honestly, I do not know. I think the documentation (and the module structure, for that matter) is in a dismal state, and it's currently very difficult to find a single reliable source for anything, partially because a lot is duplicated (but not reliably) and partially because things are just omitted.

(Documentation builds are also broken, btw.)

fjetter (Collaborator, Author) commented Feb 3, 2021

I could envision a section in https://kartothek.readthedocs.io/en/latest/guide/mutating_datasets.html which is very high level, since this is a very general truth and not particular to any specific backend.

jochen-ott-by (Contributor) commented:

Is this documented somewhere?

Honestly, I do not know.

I think we should fix this at least for this specific case. I'm also asking because it seems the intended behavior concerning indices at update time is to simply keep all the current indices and not build any new ones (correct?). If that is the case, then there is no need for the secondary_indices argument to the update_* functions. This could also make the control flow within the update functions much more straightforward: there would be no function to somehow derive the "real" list of indices from the secondary_indices argument; instead, just read what's there.
If my assumption is not correct, I'd like to know why, also as part of the documentation.

fjetter (Collaborator, Author) commented Feb 4, 2021

If this is the case, then there is no need for the secondary_indices argument to the update_* functions.

This is what we had at the very beginning of kartothek, but it was a UX nightmare. Essentially every pipeline we built that used the update functionality looked like

if dataset.exists():
    # update
else:
    # create

where both function calls would look identical (except for one or two kwargs), so we merged the two into one. For the argument default_metadata_version it is straightforward how this mechanic works, and it is clearly documented there that it only takes effect if there is no existing dataset. For all other parameters, I agree that this is not straightforward to understand. I'll try to find a nice place to document this...
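
Today the merged call looks roughly like this (an eager sketch; the store URL, dataset name, and column names are placeholders, and the exact kwargs should be checked against the API docs):

import pandas as pd
from storefact import get_store_from_url
from kartothek.io.eager import update_dataset_from_dataframes

store_factory = lambda: get_store_from_url("hfs:///tmp/ktk")
df = pd.DataFrame({"part": ["a", "a"], "x": [1, 2]})

# One call covers both cases: if the dataset does not exist yet, it is created
# and the requested secondary index is built; if it already exists, the existing
# indices keep being updated (and, for plain datasets, asking for a brand-new
# index raises instead of silently skipping it).
update_dataset_from_dataframes(
    [df],
    store=store_factory,
    dataset_uuid="my_dataset",
    partition_on=["part"],
    secondary_indices=["x"],
)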

fjetter (Collaborator, Author) commented Feb 4, 2021

indices at update is to simply keep all the current indices and not build any new ones (correct?)

Indeed; building new ones would require us to read the entire dataset, which might be incredibly expensive. However, this is a case where we would raise, at least for plain datasets. For cubes we cannot raise, since we cannot reliably say whether the index is provided for a "future extension" dataset or whether it is actually supposed to exist already.
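
In pseudo-code, the decision is roughly the following (a hypothetical helper for illustration, not the actual code in this PR):

from typing import List, Set

def plan_indices(existing: Set[str], requested: Set[str], is_cube: bool) -> List[str]:
    # Sketch of the update-time index decision described above.
    new = requested - existing
    if new and not is_cube:
        # Plain datasets: building a new index would mean re-reading the whole
        # dataset, so the update raises instead of silently skipping it.
        raise ValueError(f"Cannot build new indices during an update: {sorted(new)}")
    # Cubes: the extra column may only be meant for a future extension dataset,
    # so the update proceeds and keeps maintaining exactly what already exists.
    return sorted(existing)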

fjetter force-pushed the cube_update_index_consistency branch from a2c4583 to 4cbd840 on February 8, 2021 at 09:53
Co-authored-by: jochen-ott-by <jochen.ott@blueyonder.com>
fjetter merged commit 9daf294 into master on Feb 8, 2021
fjetter deleted the cube_update_index_consistency branch on February 8, 2021 at 12:07