
Ensure updates on neither cube nor plain datasets corrupt indices #398

Merged — fjetter merged 8 commits into master from cube_update_index_consistency on Feb 8, 2021

Conversation

fjetter (Collaborator) commented Feb 2, 2021

Description:

This fixes an issue where cubes could remove or blacklist index columns and then update a dataset/cube. The affected index would no longer be updated, leaving the dataset in a corrupt state. The only way to deal with this consistently would be to either remove the index entirely or keep updating it. For plain datasets, the behaviour so far was to update the index regardless of whether the input was omitted. I needed to change that logic slightly, but I think the behaviour is still OK, and this way the two interfaces are again a bit closer together.

The fix is very pragmatic, but everything else would require a rework of the interface and a lot of refactoring :/

jochen-ott-by (Contributor) commented:

The overall approach of using a modified cube data structure internally seems really hacky. It also means that the cube-related metadata written to the dataset is not necessarily the metadata that can be derived directly from the cube object that is passed in.
Wouldn't it make more sense to clearly distinguish between public API data structures (such as the cube) and internal data, possibly derived from the current dataset state, that is then used to make the actual decision which indices to build or update?
Also, I'd like to know what the intended behavior is in the different cases that can arise here (concerning the existence and declaration of indices), including what cube-related metadata will be stored in the dataset.

fjetter (Collaborator, Author) commented Feb 3, 2021

I agree that this approach is hacky, but imho this boils down to the fact that the Cube interface is not a nice API for datasets/cubes with an extended lifetime and multiple owners.
At the very least, indices should not be part of this interface, since we can never know all indices up front. The entire idea of a cube is that it can be extended with a new, possibly indexed, dataset, and this clearly breaks the idea of having one Cube instance describing the entire entity. Therefore, the only clean way for me would be to remove indices entirely from the Cube instance and to remove any automagical index creation, but that would be a rather major change.

Distinguishing between public and internal structures is a proposal, but I'm not willing to introduce yet another layer of indirection for a problem which should rather be solved by removing layers.

Also, I'd like to know what the intended behavior is in the different possible cases that can arise here (concerning existence and declaration of indices), including what cube-related metadata will be stored in the dataset.
From my POV this is rather straightforward. An update must never remove any indices, nor add any additional ones. Indices must be removed or added in a dedicated operation, never during an in-place update. At the very least, this is not supported at the moment (it could be if we wanted, but I'm not sure it's worth the effort).
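
For a plain dataset, such a dedicated operation is a separate index build, roughly like this (a minimal eager sketch; the store URL and dataset name are placeholders, and the exact signature should be double-checked against the API docs):

from storefact import get_store_from_url
from kartothek.io.eager import build_dataset_indices

store_factory = lambda: get_store_from_url("hfs:///tmp/ktk")

# Explicit, dedicated step: read the existing partitions and build the index
# for column "x"; subsequent updates then keep this index up to date.
build_dataset_indices(
    store=store_factory,
    dataset_uuid="my_dataset",
    columns=["x"],
)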

cube-related metadata will be stored in the dataset.

There is no index-related, cube-specific metadata. This is all handled by plain kartothek, and this change aligns the behaviour of cubes with plain datasets as much as possible.

fjetter (Collaborator, Author) commented Feb 3, 2021

FWIW, I tried two different approaches with the intention of not modifying the cube. Both escalated into either a major rewrite or required me to duplicate a lot of logic in many places. This was the "cleanest" solution that didn't break things.

jochen-ott-by (Contributor) commented:

There is no index-related, cube-specific metadata. This is all handled by plain kartothek, and this change aligns the behaviour of cubes with plain datasets as much as possible.

There is hardly any, yes, but the value of cube.suppress_index_on is saved in the metadata.

fjetter (Collaborator, Author) commented Feb 3, 2021

There is hardly any, yes, but the value of cube.suppress_index_on is saved in the metadata.

Well, that slipped my review. That's technically a change in our storage specification, and I don't think it's justified here. I would propose to remove this again.

jochen-ott-by (Contributor) commented:

I agree that this approach is hacky, but imho this boils down to the fact that the Cube interface is not a nice API for datasets/cubes with an extended lifetime and multiple owners.
At the very least, indices should not be part of this interface, since we can never know all indices up front. The entire idea of a cube is that it can be extended with a new, possibly indexed, dataset, and this clearly breaks the idea of having one Cube instance describing the entire entity. Therefore, the only clean way for me would be to remove indices entirely from the Cube instance and to remove any automagical index creation, but that would be a rather major change.

So one idea would be to create this "next version of the cube" already and use it internally everywhere. For now, we could keep the current Cube for the user-facing APIs in order to not break too much.

Distinguishing between public and internal structures is a proposal, but I'm not willing to introduce yet another layer of indirection for a problem which should rather be solved by removing layers.

This layer would serve the clear purpose of decoupling the public API from the internals. If done right, it will make the refactoring you proposed above a lot easier, so it's not some arbitrary layer of indirection as you make it sound...

jochen-ott-by (Contributor) commented:

There is hardly any, yes, but the value of cube.suppress_index_on is saved in the metadata.

Well, that slipped my review. That's technically a change in our storage specification, and I don't think it's justified here. I would propose to remove this again.

Yes, I think this makes sense.
A question I'd have, then, is whether using the "discovered" cube to extend the dataset would work as intended, i.e. whether it would work without error and without an attempt to create any additional indices. I guess with the changes in this PR it would, but again, it would be good to have this behavior clearly specified.

fjetter (Collaborator, Author) commented Feb 3, 2021

So one idea would be to create this "next version of the cube" already and use it internally everywhere.

I honestly do not have an easy proposal for a "next version" to move forward with here. I'm a bit worried about the Cube structure in general. While we probably all agree that the index columns are misplaced here, a more difficult, but not entirely decoupled, question is what to do about the implicit features coupled to the dimension columns, for instance.

The dimension columns currently enforce

  1. uniqueness
  2. sorting
  3. indexing
  4. their existence in another table
  5. ?? (there is most likely something else)

  1. Uniqueness is a nice property to have, but I was never really fond of it considering the distributed nature of our data. Uniqueness can only be enforced at the level of a single partition and therefore feels somewhat artificial. We even have code paths by now (I believe in the dask.dataframe interface) which circumvent it.
  2. Different sorting is required for different kinds of predicate pushdown scenarios, and this is an orthogonal topic to dimensions.
  3. As we've seen with the blacklisting, enforcing indexing on these columns is not feasible.
  4. We've received reports from users (and I believe have run into this ourselves as well) that this constraint is sometimes too restrictive and we should rethink it.

I believe some of these points were introduced to simplify the query planner, but I'm not entirely convinced they are necessary or feasible; imho this would require a somewhat more thorough analysis.

Since we're talking about the public interface, let's park the discussion about behaviour for a second. Putting all of this aside, quite frankly, I'd also be very fond of the public API just being a str, namely the prefix of a cube, similarly to how datasets are dealt with. That would also make everything much easier to share and wouldn't require sharing code.

I think it's clear that this is a longer discussion which is not necessarily connected to this patch. I believe it's an important discussion to have, but we should deal with it decoupled from this patch.

fjetter (Collaborator, Author) commented Feb 3, 2021

A question I'd have, then, is whether using the "discovered" cube to extend the dataset would work as intended, i.e. it would work without error and without an attempt to create any additional indices. I guess with the changes in this PR, it would work, but again it would be good to have this behavior clearly specified.

The intention is that the core kartothek spec is the single source of truth. Regardless of user input, it must never be possible to corrupt a dataset; I believe this is something we agree on.

The question for me is rather: should we raise if the user provides different input than expected, or infer it from what's really stored? We settled on the "let's infer from storage" behaviour back in the day to land somewhere between usability and consistency. I believe it is irrelevant whether the cube was discovered or not. The "cube metadata" is, for most kartothek operations, actually ordinary user metadata (something we might eventually want to change as well).

fjetter force-pushed the cube_update_index_consistency branch from 316d176 to d362dea on February 3, 2021 at 15:02
fjetter (Collaborator, Author) commented Feb 3, 2021

I'll change the reliance on the written metadata in a follow-up PR. I believe this can be done without breaking either forward or backward compatibility.
Also, I believe it is not a big deal after all. The way I understand it, the written metadata would not align with the user input, but it would always align with what was actually written, i.e. discover always returns the true state, and I believe that is what matters.

@@ -19,6 +19,8 @@
 except ImportError:
     from typing import Literal  # type: ignore

+# Literal false is sentinel, see function body of `_ensure_compatible_indices` for details
+INFERRED_INDICES = Union[Literal[False], List[str]]
Contributor (review comment on the diff above):

I think this should be CamelCase, to be consistent with how Python usually writes type aliases: InferredIndices
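
As a side note on why a clear name matters: every consumer has to branch on the sentinel explicitly, roughly like this (a hypothetical sketch with assumed semantics; the real handling lives in _ensure_compatible_indices):

from typing import List, Union

try:
    from typing import Literal
except ImportError:
    from typing_extensions import Literal  # type: ignore

InferredIndices = Union[Literal[False], List[str]]

def handle_indices(inferred: InferredIndices) -> List[str]:
    # The sentinel must be checked with `is False`: an empty list is also falsy,
    # but it means "a list of zero indices", not "nothing was inferred".
    if inferred is False:
        # assumed meaning: no index bookkeeping is required for this update
        return []
    # assumed meaning: keep exactly these existing indices up to date
    return sorted(inferred)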

jochen-ott-by (Contributor) commented:

The intention is that the core kartothek spec is the single source of truth. Regardless of user input, it must never be possible to corrupt a dataset; I believe this is something we agree on.

The question for me is rather: should we raise if the user provides different input than expected, or infer it from what's really stored? We settled on the "let's infer from storage" behaviour back in the day to land somewhere between usability and consistency.

I think it makes complete sense to never write inconsistent data. I think it's also fine to infer from the state rather than from user input, at least for indices (I'm not so sure about other cases, where raising might also be a good option).

Is this documented somewhere? At least for indices, I think this should be part of the user-facing documentation, e.g. here: https://kartothek.readthedocs.io/en/stable/_rst/kartothek.io.dask.dataframe.html#kartothek.io.dask.dataframe.update_dataset_from_ddf (a link to the more general docs would probably be best)

fjetter (Collaborator, Author) commented Feb 3, 2021

Is this documented somewhere?

Honestly, I do not know. I think the documentation (and the module structure, for that matter) is in a dismal state, and it's currently very difficult to find a single reliable source for anything, partially because a lot is duplicated (but not reliably) and partially because things are just omitted.

(Documentation builds are also broken, btw.)

fjetter (Collaborator, Author) commented Feb 3, 2021

I could envision a section in https://kartothek.readthedocs.io/en/latest/guide/mutating_datasets.html which is very high level, since this is a very general truth and not particular to any specific backend.

jochen-ott-by (Contributor) commented:

Is this documented somewhere?

Honestly, I do not know.

I think we should fix this at least for this specific case. I'm also asking because it seems the intended behavior concerning indices at update time is to simply keep all the current indices and not build any new ones (correct?). If that is the case, then there is no need for the secondary_indices argument to the update_* functions. This could also make the control flow within the update functions much more straightforward: there would be no function to somehow derive the "real" list of indices from the secondary_indices argument; instead, just read what's there.
If my assumption is not correct, I'd like to know why, also as part of the documentation.

fjetter (Collaborator, Author) commented Feb 4, 2021

If this is the case, then there is no need for the secondary_indices argument to the update_* functions.

This is what we had at the very beginning of kartothek, but it was a UX nightmare. Essentially every pipeline we built that used the update functionality looked like

if dataset.exists():
    # update
else:
    # create

where both function calls would look identical (except for one or two kwargs), so we merged the two into one. For the argument default_metadata_version it is straightforward how this mechanic works, and it is clearly documented there that it only takes effect if there is no existing dataset. For all other parameters, I agree that this is not straightforward to understand. I'll try to find a nice place to document this...
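
Today the merged call looks roughly like this (an eager sketch; the store URL, dataset name, and column names are placeholders, and the exact kwargs should be checked against the API docs):

import pandas as pd
from storefact import get_store_from_url
from kartothek.io.eager import update_dataset_from_dataframes

store_factory = lambda: get_store_from_url("hfs:///tmp/ktk")
df = pd.DataFrame({"part": ["a", "a"], "x": [1, 2]})

# One call covers both cases: if the dataset does not exist yet, it is created
# and the requested secondary index is built; if it already exists, the existing
# indices keep being updated (and, for plain datasets, asking for a brand-new
# index raises instead of silently skipping it).
update_dataset_from_dataframes(
    [df],
    store=store_factory,
    dataset_uuid="my_dataset",
    partition_on=["part"],
    secondary_indices=["x"],
)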

fjetter (Collaborator, Author) commented Feb 4, 2021

indices at update is to simply keep all the current indices and not build any new ones (correct?)

Indeed; building new ones would require us to read the entire dataset, which might be incredibly expensive. However, this is a case where we would raise, at least for plain datasets. For cubes we cannot raise, since we cannot reliably say whether the index is provided for a "future extension" dataset or whether it is actually supposed to exist already.
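
In pseudo-code, the decision is roughly the following (a hypothetical helper for illustration, not the actual code in this PR):

from typing import List, Set

def plan_indices(existing: Set[str], requested: Set[str], is_cube: bool) -> List[str]:
    # Sketch of the update-time index decision described above.
    new = requested - existing
    if new and not is_cube:
        # Plain datasets: building a new index would mean re-reading the whole
        # dataset, so the update raises instead of silently skipping it.
        raise ValueError(f"Cannot build new indices during an update: {sorted(new)}")
    # Cubes: the extra column may only be meant for a future extension dataset,
    # so the update proceeds and keeps maintaining exactly what already exists.
    return sorted(existing)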

fjetter force-pushed the cube_update_index_consistency branch from a2c4583 to 4cbd840 on February 8, 2021 at 09:53
Co-authored-by: jochen-ott-by <jochen.ott@blueyonder.com>
fjetter merged commit 9daf294 into master on Feb 8, 2021
fjetter deleted the cube_update_index_consistency branch on February 8, 2021 at 12:07