Ensure updates on neither cube nor plain datasets corrupt indices #398
Conversation
The overall approach of using a modified …
I agree with the hacky nature of this approach, but imho this boils down to the following: "Distinguishing between public and internal structure" is a proposal, but I'm not willing to introduce yet another layer of indirection for a problem which should rather be solved by removing layers.
There is no index-related, cube-specific metadata; this is all handled by plain kartothek, and this change aligns the behaviour of cubes with plain datasets as much as possible.
FWIW, I tried two different approaches with the intention of not modifying the cube. Both escalated into either a major rewrite or required me to duplicate a lot of logic in many places. This was the "cleanest" solution without breaking stuff.
There is hardly any, yes, but the value of `cube.suppress_index_on` is saved in the metadata.
Well, that slipped my review. That's technically a change in our storage specification, and I don't think that's justified for this. I would propose to remove this again.
So one idea would be to create this "next version of the cube" already and use it internally everywhere. For now, we could keep the current …
This layer would serve a clear purpose of decoupling the public API from the internals. If done right, it will make refactoring as you proposed above a lot easier. So it's not some arbitrary layer of indirection as you make it sound ...
Yes, I think this makes sense.
I honestly do not have an easy proposal for a "next version" to move forward here. I'm a bit worried about the constraints the dimension columns currently enforce …
I believe some of these points were introduced to simplify the query planner, but I'm not entirely convinced they are necessary or feasible, and imho this would require a bit more thorough analysis. Since we're talking about the public interface, let's park the discussion about behaviour for a second. If we put all of this aside, quite frankly, I'm also very fond of the public API just being a str, namely the prefix of a cube, similarly to how datasets are dealt with. This also makes everything much easier to share and doesn't require sharing code. I think it's clear that this is a longer discussion which is not necessarily connected to this patch. It's an important discussion to have, but I think we should deal with it decoupled from this patch.
The intention is that the core kartothek spec is the single source of truth. Regardless of user input, it must never be possible to corrupt a dataset; I believe this is what we agree on. The question for me is rather: should we raise if the user provides different input than expected, or infer it from what's really stored? We settled on the "let's infer from storage" behaviour back in the days to land somewhere between usability and consistency. I believe it is irrelevant whether the cube was discovered or not. The "cube metadata" is, for most kartothek operations, actually ordinary user metadata (something we might want to eventually change as well).
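To make that concrete, here is a minimal sketch of the two options being weighed; the function name and the `strict` flag are illustrative, not kartothek's actual API:

```python
from typing import Iterable, List


def effective_indices(
    stored: Iterable[str], provided: Iterable[str], strict: bool = False
) -> List[str]:
    # Illustrative sketch only - names and the `strict` flag are made up.
    if strict and set(provided) != set(stored):
        # Option 1 ("raise"): reject user input that disagrees with storage.
        raise ValueError("Provided indices do not match the stored dataset")
    # Option 2 ("infer from storage", the behaviour kartothek settled on):
    # the stored metadata wins, so an update keeps maintaining exactly
    # the indices that already exist, regardless of user input.
    return sorted(stored)
```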
I'll change the reliance on the written metadata in a follow-up PR. I believe this can be done without breaking either forward or backward compatibility.
kartothek/io_components/utils.py (outdated):

@@ -19,6 +19,8 @@
 except ImportError:
     from typing import Literal  # type: ignore

+# Literal False is a sentinel, see function body of `_ensure_compatible_indices` for details
+INFERRED_INDICES = Union[Literal[False], List[str]]
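For context, a rough sketch of how a `Literal[False]` sentinel like this is typically consumed; the function below is illustrative and is not kartothek's actual `_ensure_compatible_indices`:

```python
from typing import List, Optional, Union

try:
    from typing_extensions import Literal
except ImportError:
    from typing import Literal  # type: ignore

# False is a sentinel: it is distinguishable from [], which is a
# valid "explicitly no indices" result.
INFERRED_INDICES = Union[Literal[False], List[str]]


def infer_indices(
    stored: Optional[List[str]], requested: Optional[List[str]]
) -> INFERRED_INDICES:
    # Illustrative only, not the real kartothek function.
    if stored is not None:
        # Infer from storage: the stored indices win.
        return sorted(stored)
    if requested is None:
        return False  # sentinel: the caller decides on a default
    return sorted(requested)
```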
I think this should be CamelCase, to be consistent with how Python usually writes types: `InferredIndices`.
I think it makes complete sense to never write inconsistent data. I think it's also fine to infer from the state rather than from user input, at least for indices (I'm not so sure for other cases, in which raising might also be a good option). Is this documented somewhere? I think at least for indices this should be part of the user-facing documentation, e.g. here: https://kartothek.readthedocs.io/en/stable/_rst/kartothek.io.dask.dataframe.html#kartothek.io.dask.dataframe.update_dataset_from_ddf (a link to the more general docs would probably be best).
Honestly, I do not know. I think the documentation (and module structure, for that matter) is in a dismal state, and it's currently very difficult to find a single reliable source for anything: partially because a lot is duplicated, but not reliably, and partially because it's just omitted. (Documentation builds are also broken, btw.)
I could envision a section in https://kartothek.readthedocs.io/en/latest/guide/mutating_datasets.html which is very high level, since this is a very general truth and not particular to any specific backend.
I think we should fix this at least for this specific case. I'm also asking because it seems the intended behavior concerning indices at update is to simply keep all the current indices and not build any new ones (correct?). If this is the case, then there is no need for the …
This is what we had at the very beginning of kartothek, but it was a UX nightmare. Essentially every pipeline we built which used the update functionality looked like the sketch below:
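(A hedged reconstruction of the elided snippet: the eager-API function names exist in kartothek, the surrounding boilerplate is illustrative.)

```python
from kartothek.io.eager import (
    store_dataframes_as_dataset,
    update_dataset_from_dataframes,
)


def write(dfs, store_factory, dataset_uuid, dataset_exists):
    # Every caller had to branch on existence, with two nearly
    # identical calls - exactly the UX problem described above.
    if dataset_exists:
        update_dataset_from_dataframes(
            dfs, store=store_factory, dataset_uuid=dataset_uuid
        )
    else:
        store_dataframes_as_dataset(
            store=store_factory, dataset_uuid=dataset_uuid, dfs=dfs
        )
```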
where both function calls would look identical (except for one or two kwargs), so we merged the two into one. For the argument …
Indeed, building new ones would require us to read the entire dataset, which might be incredibly expensive. However, this is a case where we'd raise, at least for plain datasets. For cubes we cannot raise, since we cannot reliably say whether the index is provided for a "future extension" or whether it is actually supposed to be there.
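Spelled out as a sketch (the function and error message are illustrative, not the actual code path):

```python
def check_new_indices(stored, requested, is_cube):
    # Illustrative logic only, not kartothek's actual code path.
    missing = set(requested) - set(stored)
    if missing and not is_cube:
        # Plain datasets: raise, since building these retroactively
        # would mean reading back the entire dataset.
        raise ValueError(
            f"Cannot build new indices {sorted(missing)} during an update"
        )
    # Cubes: never raise here, because the extra column may legitimately
    # be declared for a "future extension" of the cube.
```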
Co-authored-by: jochen-ott-by <jochen.ott@blueyonder.com>
Description:
This fixes an issue where cubes could remove or blacklist index columns and then update a dataset/cube. The update would then no longer maintain said index, leaving the dataset in a corrupt state. The only way to deal with this consistently is to either remove the index entirely or to update it. The behaviour for plain datasets, so far, was to update the index regardless if the input was omitted. I needed to change that logic slightly, but I think the behaviour is still ok. This way the two interfaces are again a bit closer together.
The fix is very pragmatic, but everything else would require a rework of the interface and refactoring of a lot of code :/
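For illustration, a hypothetical sketch of the scenario being fixed; the cube API names (`Cube`, `build_cube`, `suppress_index_on`) are kartothek's, while the store setup and data are assumptions:

```python
import tempfile
from functools import partial

import pandas as pd
from storefact import get_store_from_url

from kartothek.core.cube.cube import Cube
from kartothek.io.eager_cube import build_cube

store_factory = partial(get_store_from_url, f"hfs://{tempfile.mkdtemp()}")

# A cube that explicitly suppresses the index on a dimension column.
cube = Cube(
    uuid_prefix="my_cube",
    dimension_columns=["item"],
    partition_columns=["day"],
    suppress_index_on=["item"],
)
df = pd.DataFrame({"item": ["a"], "day": [1], "value": [1.0]})
build_cube(data=df, cube=cube, store=store_factory)

# Before this fix, a later update that suppressed or omitted an index
# column would simply skip maintaining that index, leaving stale
# entries behind, i.e. a corrupt dataset. The fix infers the indices
# to maintain from what is actually stored instead.
```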