Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove multi table feature #431

Merged
merged 8 commits into from
Mar 16, 2021

Conversation

fjetter
Copy link
Collaborator

@fjetter fjetter commented Mar 10, 2021

Description:

As the title state, this will remove any API references to the dataset internal multi table functionality. It is effectively no longer possible to write these.

It goes without saying. This change is in terms of API as breaking as it gets. This is a new major release and I intend to drop many of the deprecated features after merging this as well (effectively everything which raises a warning at the moment)

@fjetter fjetter force-pushed the remove_multitable branch 6 times, most recently from 1408d55 to 2c647e2 Compare March 10, 2021 15:09
@fjetter fjetter changed the title WIP Remove multi table feature Remove multi table feature Mar 10, 2021
@mlondschien
Copy link
Contributor

mlondschien commented Mar 11, 2021

Not sure how close this is to merging, but I thought I would let you know what this breaks for us:

  • read_table no longer accepts table as a keyword argument.
  • DatasetMetadata.table_meta["table"] and DatasetFactory.table_meta["table"] raise a TypeError: an integer is required. IIUC this is superseeded by .schema.
  • read_dataset_as_dataframes__iterator no longer takes a columns = {"table": list-like} argument and no longer returns {"table": df} data frames.

Are you planning on making a deprecating release in between 3.19 and 4.0? The table kwarg is no problem since "table" is already a default value. However it would be nice to have schema available and be able to pass columns=list-like to get a data frame before 4.0 or be able to use the current API until 4.1.

@fjetter
Copy link
Collaborator Author

fjetter commented Mar 11, 2021

If the removal of the table is actually a problem we can keep it without effect but I would prefer the clean removal since we've been dragging unused arguments around for years.

All the changes you mentioned were anticipated. We can make the table_meta change backwards compatible. I actually intended to remove it altogether but I don't have a strong opinion on this.

Regarding the columns input, this is actually not that easy to achieve and I'm not sure if it is worth the trouble. The problem here is (and also the reason why this must go) that all pipelines are implementing this somewhat differently

bottom line is that all of these changes feel like breaking but easily dealt with and I'm hesitant to introduce a lot of glue code to ease this migration. I believe these problems are quickly addressed by going over the code base once everywhere.

DatasetMetadata.table_meta["table"] and DatasetFactory.table_meta["table"] raise a TypeError: an integer is required.

This, I will fix. Cryptic error message are always a problem, even for intended changes.

@mlondschien
Copy link
Contributor

Would it be an option to add

@property
def schema(self):
    return self.table_meta["table"]

to DatasetMetadata and DatasetFactory before 4.0?

I guess we will be able to live with the change in read_dataset_as_dataframes__iterator API :).

@fjetter
Copy link
Collaborator Author

fjetter commented Mar 11, 2021

yes, we can add the schema pre 4.X. We have one or two unreleased bugfixes on master which probable makes sense to release anyhow before that

@fjetter fjetter force-pushed the remove_multitable branch from 1f3fccb to f05e948 Compare March 11, 2021 10:32
@fjetter fjetter mentioned this pull request Mar 11, 2021
@fjetter
Copy link
Collaborator Author

fjetter commented Mar 11, 2021

Lazy consensus merge by Monday 15-03-2021 9:00 UTC

@fjetter fjetter mentioned this pull request Mar 11, 2021
@fjetter fjetter force-pushed the remove_multitable branch from 6738a32 to 8a6840a Compare March 15, 2021 09:08
kartothek/core/dataset.py Show resolved Hide resolved
)
combined_indices = self.indices.copy()
combined_indices.update(indices)
return self.copy(indices=combined_indices)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I do miss it , but as I currently understand if self.primary_indices_loaded is True we return the object itself. If not we load the indices and return a copy. So self.primary_indices_loaded is not updated from what I see, so I wonder why not?
Why even the copy?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copy methods were introduced at the very beginning of kartothek and we perceived it as a "clean coding" by not mutating any objects inplace but always return a new instance. This is what the copy methods are doing. By now I consider this unnecessary boilerplate but I didn't want to break any patterns here. Therefore if nothing is to be done, the same instance is returned, otherwise a new instance with the mutated state is returned

FYI this method is unchanged but was moved for some reason. I'll move it back to have cleaner diffs / blames. still valid question, of course :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay got it. I understand the arguments of preventing mutation but also that not everything can and should be folloewed 100% and here I might argue against it but as it is pre-change code I am totally fine :)

kartothek/core/dataset.py Outdated Show resolved Hide resolved
kartothek/io_components/write.py Show resolved Hide resolved
@bjoern-meier-by
Copy link

Thanks @fjetter for the cleanup and refactoring. 🙇

@fjetter
Copy link
Collaborator Author

fjetter commented Mar 15, 2021

@bjoern-meier-by thanks for reviewing!

@fjetter fjetter force-pushed the remove_multitable branch from a360662 to 1a0258d Compare March 15, 2021 15:29
@fjetter fjetter merged commit 032856a into JDASoftwareGroup:master Mar 16, 2021
This was referenced Mar 16, 2021
ilia-zaitcev-by added a commit to ilia-zaitcev-by/kartothek that referenced this pull request May 26, 2021
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (JDASoftwareGroup#466)"
This reverts commit fdc9779.

Revert "fix mistakes in documentation"
This reverts commit 4e4b5e0.

Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (JDASoftwareGroup#460)"
This reverts commit d027ca2.

Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (JDASoftwareGroup#461)"
This reverts commit 97cd553.

Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (JDASoftwareGroup#458)"
This reverts commit e48d67a.

Revert "Fix bug when loading few columns of a dataset with many primary indices (JDASoftwareGroup#446)"
This reverts commit 90ee486.

Revert "Prepare release 4.0.1"
This reverts commit b278503.

Revert "Fix tests for dask dataframe and delayed backends"
This reverts commit 5520f74.

Revert "Add end-to-end regression test"
This reverts commit 8a3e6ae.

Revert "Fix dataset corruption after updates (JDASoftwareGroup#445)"
This reverts commit a26e840.

Revert "Set release date for 4.0"
This reverts commit 08a8094.

Revert "Return dask scalar for store and update from ddf (JDASoftwareGroup#437)"
This reverts commit 494732d.

Revert "Add tests for non-default table (JDASoftwareGroup#440)"
This reverts commit 3807a02.

Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (JDASoftwareGroup#441)"
This reverts commit f7615ec.

Revert "Set default for dates_as_object to True (JDASoftwareGroup#436)"
This reverts commit 75ffdb5.

Revert "Remove inferred indices (JDASoftwareGroup#438)"
This reverts commit b1e2535.

Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (JDASoftwareGroup#422)"
This reverts commit b349cee.

Revert "Remove all deprecated arguments (JDASoftwareGroup#434)"
This reverts commit 74f0790.

Revert "Remove multi table feature (JDASoftwareGroup#431)"
This reverts commit 032856a.
ilia-zaitcev-by added a commit to ilia-zaitcev-by/kartothek that referenced this pull request Jun 11, 2021
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (JDASoftwareGroup#466)"
This reverts commit fdc9779.

Revert "fix mistakes in documentation"
This reverts commit 4e4b5e0.

Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (JDASoftwareGroup#460)"
This reverts commit d027ca2.

Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (JDASoftwareGroup#461)"
This reverts commit 97cd553.

Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (JDASoftwareGroup#458)"
This reverts commit e48d67a.

Revert "Fix bug when loading few columns of a dataset with many primary indices (JDASoftwareGroup#446)"
This reverts commit 90ee486.

Revert "Prepare release 4.0.1"
This reverts commit b278503.

Revert "Fix tests for dask dataframe and delayed backends"
This reverts commit 5520f74.

Revert "Add end-to-end regression test"
This reverts commit 8a3e6ae.

Revert "Fix dataset corruption after updates (JDASoftwareGroup#445)"
This reverts commit a26e840.

Revert "Set release date for 4.0"
This reverts commit 08a8094.

Revert "Return dask scalar for store and update from ddf (JDASoftwareGroup#437)"
This reverts commit 494732d.

Revert "Add tests for non-default table (JDASoftwareGroup#440)"
This reverts commit 3807a02.

Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (JDASoftwareGroup#441)"
This reverts commit f7615ec.

Revert "Set default for dates_as_object to True (JDASoftwareGroup#436)"
This reverts commit 75ffdb5.

Revert "Remove inferred indices (JDASoftwareGroup#438)"
This reverts commit b1e2535.

Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (JDASoftwareGroup#422)"
This reverts commit b349cee.

Revert "Remove all deprecated arguments (JDASoftwareGroup#434)"
This reverts commit 74f0790.

Revert "Remove multi table feature (JDASoftwareGroup#431)"
This reverts commit 032856a.
steffen-schroeder-by pushed a commit that referenced this pull request Jun 11, 2021
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (#466)"
This reverts commit fdc9779.

Revert "fix mistakes in documentation"
This reverts commit 4e4b5e0.

Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (#460)"
This reverts commit d027ca2.

Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (#461)"
This reverts commit 97cd553.

Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (#458)"
This reverts commit e48d67a.

Revert "Fix bug when loading few columns of a dataset with many primary indices (#446)"
This reverts commit 90ee486.

Revert "Prepare release 4.0.1"
This reverts commit b278503.

Revert "Fix tests for dask dataframe and delayed backends"
This reverts commit 5520f74.

Revert "Add end-to-end regression test"
This reverts commit 8a3e6ae.

Revert "Fix dataset corruption after updates (#445)"
This reverts commit a26e840.

Revert "Set release date for 4.0"
This reverts commit 08a8094.

Revert "Return dask scalar for store and update from ddf (#437)"
This reverts commit 494732d.

Revert "Add tests for non-default table (#440)"
This reverts commit 3807a02.

Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (#441)"
This reverts commit f7615ec.

Revert "Set default for dates_as_object to True (#436)"
This reverts commit 75ffdb5.

Revert "Remove inferred indices (#438)"
This reverts commit b1e2535.

Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (#422)"
This reverts commit b349cee.

Revert "Remove all deprecated arguments (#434)"
This reverts commit 74f0790.

Revert "Remove multi table feature (#431)"
This reverts commit 032856a.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Removal of complex input types
3 participants