Announcement: Create release 5.0 which is compatible with 3.20.0 but incompatible with 4.0 #471

steffen-schroeder-by · 2021-05-18T14:42:06Z

Background

Kartothek 4.0 is a great release with many extremely valuable simplifications. But during internal migrations from 3.20.0 to 4.0, it turned out that for many of those well thought changes an easy migration could not be taken with reasonable effort.
Also some of the simplifications caused customer facing incidents. Thus we are currently stuck with 3.20.0.

To give this important library the deserved attention and speed we are working with a bunch of colleagues on getting out a stable version of kartothek again, where we can simply migrate to and accept community feedback and pull requetst.

To move on, we need to bring a current stable version (3.20.0) to a version which has the simplification of 4.0.
We could have continued with development of 3.x branch and slowly converging against 4.0.
We have decided against this for the following reasons:

some of the changes are on syntax level breaking and would require a new major version itself (between 3.x and 4.0)
pull requests and issues from the community are usually reported against the latest release. By that, we would on the one hand create a moving migration target and on the other hand slow down accepting those PR, since the main effort would go into 3.x first
this would also include dependency updates which would usually only be accepted for master, opening another dimension where 3.x and 4.0 are diverging.

Planned next steps

To overcome this we decided on the following way forward.

We will create a release 5.0 based on 3.20.0. All changes which happened after 3.20.0 will be discarded. There will be no migration path from 4.0.1 to 5.0
After having that release in place, we will again accept PRs, issues and dependency updates.
From the learning of the 3.20.0 to 4.0 migration, we will gradually add those incompatible changes back (like removal of the multi-table-feature). Where applicable, we will offer a straight migration path to a 6.0 version. Where this is not applicable (like when return types changed from Dict[str,List[T]] to List[T]) we might go a step in between. Communication on that will follow.
Finally there will be a release 6.0 which will have most of the simplification.

We know that especially the incompatibility between 4.0 and 5.0 will be an issue for some consumers. We that's why encourage you to use the very stable kartothek version 3.20.0 and not version 4.x

The text was updated successfully, but these errors were encountered:

xhochy · 2021-05-19T08:18:26Z

We will create a release 5.0 based on 3.20.0. All changes which happened after 3.20.0 will be discarded. There will be no migration path from 4.0.1 to 5.0

With sadly no documentation on what went wrong and having this strict decision that 4.x will be dropped, I don't really feel comfortable advocating/using Kartothek anymore. Can you give any insights what are design decisions with 4.x that break the usage of it? This transparency would be useful to understand so that the same issues don't occur with 5.x again.

steffen-schroeder-by · 2021-05-19T09:52:14Z

Thanks for reaching out. I'll compile list of 4.0 changes and our intended way of treating them later today.

fjetter · 2021-05-19T09:55:02Z

The only reported issue I am aware of is #452 . I haven't investigated it thoroughly but from my first impression this looks like a problem which is easy enough to be fixed and doesn't warrant this dramatic change.

I would appreciate more openness about these things if kartothek shall remain a shared open source library since I also do not feel comfortable with advocating anymore if this is the way the library is being developed.

steffen-schroeder-by · 2021-05-19T14:40:58Z

Let me get a bit more into detail about the reasoning. First it's correct that the only public reported issue is #452 and it's predecessors. That one was specially bad because the issue surface not when kartothek writes out the data but when it reads it the next time. Because of some incidents which were triggered that that we rolled back to kartothek 3.20 immediately and stopped the rollout of Kartothek 4.0. But this is only a bug. Bugs can be fixed.

Concrete Changes

I want to be as open and transparent about the reasoning as possible.
The main reason for our way forward is that actual migration of all our services and internal tooling became a huge effort.

Quoting from an internal high level analysis (slightly changed by myself for public internet):

Right now, smoothly migrating is not possible, because kartothek does so many changes. It also has gaps in the documentation (e.g. missing documentation for changes in date_as_object). It also has issues, as the behavior of reading cubes is changed (dates are now objects), but there is no way to get the old behavior, which is the preferred of many consumers.

A generic recipe for a good migration path would be

first add support for the new feature/behavior in kartothek, without enforcing it. Warn users about the deprecation of the old behavior
release a version that supports all of the new, but also the old behavior. This is the intermediate version we want to use during the migration.
after there are no deprecation warnings (e.g. in tests that have good coverage; maybe even look at prod for a week), we can update to the new version of kartothek that drops all support for the code paths that emitted deprecation warnings.
A good example how this could work is the move to kartothek.api, where this uipdate schema works. Another example is the removal of DatasetMetadata.table_meta in favour of DatasetMetadata.schema: The new .schema is supported already in kartothek 3.20 and using .table_meta emits a deprecation warning.

Counterexamples that currently don't work well is:

removing the table_name argument completely

changing the default data type for dates. This is particularly nasty, as we typically don't have good test coverage to detect such changes, esp. for dataset transforming libs that "read-process-write" a whole dataset like the ServiceA or ServiceB: these don't contain tests that check the output data type; they usually just write what they read. If they now read different types, they might want to write different types inadvertently.

changing the structure of return values (from list of dicts to list of dataframes), as this does not allow incremental migration

table_name changes were poorly tested and caused a huge incident

We then went on to examine each and every change between 3.2.0 and 4.0
Not every item (as taken from the changelog) is decided, yet, but the idea is to keep the following changes and addDeprecationWarnings were needed:

Removal of kartothek.io.merge module
class :class:~kartothek.core.dataset.DatasetMetadata now has an attribute called schema which replaces the previous attribute table_meta and returns only a single schema
All read pipelines will now automatically infer the table to read such that it is no longer necessary to provide table or table_name as an input argument
All pipelines which previously accepted an argument tables to select the subset of tables to load no longer accept this keyword. Instead the to-be-loaded table will be inferred
Trying to read a multi-tabled dataset will now cause an exception telling users that this is no longer supported with kartothek 4.0
Remove the following list of deprecated arguments for io pipelines * label_filter * central_partition_metadata * load_dynamic_metadata * load_dataset_metadata * concat_partitions_on_primary_index
Remove output_dataset_uuid and df_serializer from :func:kartothek.io.eager.commit_dataset since these arguments didn't have any effect
Remove metadata, df_serializer, overwrite, metadata_merger from :func:kartothek.io.eager.write_single_partition
No longer allow to pass delete_scope as a delayed object to :func:~kartothek.io.dask.dataframe.update_dataset_from_ddf
Remove argument table_name from :func:~kartothek.io.dask.dataframe.collect_dataset_metadata

We are in favor of dropping:

All outputs which previously returned a sequence of dictionaries where each key-value pair would correspond to a table-data pair now returns only one :class:pandas.DataFrame
Default value for argument date_as_object is now universally set to True.

Everything else is to be decided, yet.

For the Removal of multi table feature we have an intention to keep the removal, but it depends if we can come up with a proper migration

Kartothek 4.0 dropping reasons

As shown above, there is a proper migration path needed.
There is a need on adding changes to the old release and probably changes to the new release
To do on feature level a proper migration, I see 2 ways (and yes, it's biased):

.	Pros	Cons
Develop on 4.0, converge 3.x to 4.x	No effort for 4.0 consumers	through ongoing work on 4.0 3.x might hardly catch up, BY respource are bound to 3.x and might even reject changes that make migration harder, 3.x does not get dependabot changes, 4.0 changes might need to be backported to 3.x
Make 3.20 a new 5.0, drop 4.0	Simpler Migration. 3.2 -> 5.0 is a noop. Learning from 4.0 can be applied, everyone works on master (synergies), pr are very welcome and not seen as danger moving forward	no migration path from 4.0 to 5.0

Thanks for the feedback @xhochy and @fjetter - your input is always valued and appriciated. I know this move is drastic and I would be more then happy if we would not be in this situation. But we are and need to move forward.
Hope this openness helped to give you some background. Let me know your thoughts.

lr4d · 2021-05-21T17:48:27Z

tbqh I think this is terrible idea.

If the migration from 3.x to 4.x is an issue, I suggest time be invested in making scripts that take care of updating the old datasets to 4.x as well as the code.

steffen-schroeder-by · 2021-05-25T09:06:59Z

Hi @xhochy, Hi @fjetter with my latest explanation is it more understandable why we want to go in this direction?
I know it's all but ideal, but from our point of view the best way forward (or the least worst, depending on perspective).

steffen-schroeder-by · 2021-05-25T09:13:25Z

If the migration from 3.x to 4.x is an issue, I suggest time be invested in making scripts that take care of updating the old datasets to 4.x as well as the code.

Making a script is some effort but doable. The problem is tracking the (potential unknown) consumers who need to run it.
And this also might leave us with unknown unknowns because we have seen that we are missing test coverage for some of the incompatible changes (like #445 slipped through)

fjetter · 2021-05-25T09:38:34Z

tbh, I am still struggling to understand. What I do understand is

update_dataset_from_ddf corrupts datasets if table names diverge on create -> update #452 cause severe problems. Got it. This needs to be fixed and was definitely not intentional.
The change of the default for date_as_object is inconvenient for some users. This is unfortunate but there are good reasons why this default changed and there are also very good reasons why this flag should be removed entirely in the mid to long-term future. Apart from incredible code complexity in the area of predicate pushdown, schema validation and schema evolution, we have bugs which cannot be fixed properly while keeping this flag. If this is an issue about cube consumers, an intermediate migration path would be simply to allow them to specify this toggle themselves

imho, these two points do not warrant any rollback but there are sane and much cheaper ways to move forward. For all other listed issue, I am struggling to see the relevance since they are dealing with rather fringe use cases.

The entire intention of the 4.0 release was to get rid of legacy code in an affordable way. The only intended change in behaviour (other than dropping entire, unused features, of course) was the date_as_object change.

imho, rolling back and implementing a proper deprecation lifecycle for everything is not worth it since the work involved for the API migrations should be much cheaper than the rollback + deprecation path. On top of this, the intended changes for the 4.0 release have been in a lengthy BY-internal review process before implementation even started. Rolling back because months after there are other opinions is not the way I want to develop a library.

That all said, the phrasing of this issue and its title already indicate that you are willing to pull super-admin privileges to get your way and this doesn't feel like an open discussion. I'm not willing to put more time into this issue if the outcome is predetermined, anyhow.

Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (JDASoftwareGroup#466)" This reverts commit fdc9779. Revert "fix mistakes in documentation" This reverts commit 4e4b5e0. Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (JDASoftwareGroup#460)" This reverts commit d027ca2. Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (JDASoftwareGroup#461)" This reverts commit 97cd553. Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (JDASoftwareGroup#458)" This reverts commit e48d67a. Revert "Fix bug when loading few columns of a dataset with many primary indices (JDASoftwareGroup#446)" This reverts commit 90ee486. Revert "Prepare release 4.0.1" This reverts commit b278503. Revert "Fix tests for dask dataframe and delayed backends" This reverts commit 5520f74. Revert "Add end-to-end regression test" This reverts commit 8a3e6ae. Revert "Fix dataset corruption after updates (JDASoftwareGroup#445)" This reverts commit a26e840. Revert "Set release date for 4.0" This reverts commit 08a8094. Revert "Return dask scalar for store and update from ddf (JDASoftwareGroup#437)" This reverts commit 494732d. Revert "Add tests for non-default table (JDASoftwareGroup#440)" This reverts commit 3807a02. Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (JDASoftwareGroup#441)" This reverts commit f7615ec. Revert "Set default for dates_as_object to True (JDASoftwareGroup#436)" This reverts commit 75ffdb5. Revert "Remove inferred indices (JDASoftwareGroup#438)" This reverts commit b1e2535. Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (JDASoftwareGroup#422)" This reverts commit b349cee. Revert "Remove all deprecated arguments (JDASoftwareGroup#434)" This reverts commit 74f0790. Revert "Remove multi table feature (JDASoftwareGroup#431)" This reverts commit 032856a.

xhochy · 2021-05-26T15:13:33Z

That all said, the phrasing of this issue and its title already indicate that you are willing to pull super-admin privileges to get your way and this doesn't feel like an open discussion. I'm not willing to put more time into this issue if the outcome is predetermined, anyhow.

I can very much second @fjetter's statement here. I'm happy to work with you on fixing the issues that came with the 4.0 release. Apart from the date_as_object change, they should be fixable without the proposed route of "forgetting that 4.0 ever happened". The date_as_object is something we had in mind years ago, I wouldn't see that as a great inconvenience but just as something that has to be done anyways. The wording of the issue sounds though like "this is not up for discussion".

To make this more productive and less prone to misunderstandings, I'm happy to also have a virtual coffee chat on a 1:1 basis to clear things up. This is probably less tense than the plain text communication here.

steffen-schroeder-by · 2021-05-26T18:36:39Z

@xhochy I sent you a mail with proposals for a virtual coffee. Thanks for taking the time.

steffen-schroeder-by · 2021-06-03T13:01:01Z

Just want to give a small update on this.
I had a very constructive call with @xhochy . I explained the reasoning about the intended actions. He helped me to understand the community and kartothek user's point of view. I admit that we should have communicated earlier and more transparent the challenges the adoption of kartothek 4.0 gave us.
Also the message as I have written it did not express enough a way for 4.0 onwards. It was never intended to pretend that version has never existed. Personally I'm still convinced that its simplification are needed.
So let me explain the target state. There should be a version 6.0 which has most (if not all) simplifications version 4.0 introduced. There should be migrations paths for each incompatible change from 4.0 AND 5.0 to 6.0. So the investment which was made for migration from 3.20 to 4.0 is till valid. 4.0 is not supposed to be a dead end.

The next steps are (aligned with @xhochy):

A BlueYonder associate will work on a migration path from 4.0 to 6.0. Once the results are ready, we will publish that as a separate Github Issue.
I will create maintainance branches based on 3.20 and 4.0.1 (HEAD) so critical fixes can be applied on those branches. Even dependabot updates can be targeted on those branches
On Monday 2021-06-07 I will merge PR 5.x: rollback to the 3.20.0 state (#471) #472
We will create a 5.0rc from that and test that internally
If those tests are successful we will publish a 5.0
When we begin the work on deciding on a migration path from 5.0 to 6.0 with the learnings which we had with the 3.20 to 4.0 migration
Edit: there are unreleased changes on master branch which should go into a 4.0.2. Will create a release for this as well.

Let me know your thoughts.

Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (JDASoftwareGroup#466)" This reverts commit fdc9779. Revert "fix mistakes in documentation" This reverts commit 4e4b5e0. Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (JDASoftwareGroup#460)" This reverts commit d027ca2. Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (JDASoftwareGroup#461)" This reverts commit 97cd553. Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (JDASoftwareGroup#458)" This reverts commit e48d67a. Revert "Fix bug when loading few columns of a dataset with many primary indices (JDASoftwareGroup#446)" This reverts commit 90ee486. Revert "Prepare release 4.0.1" This reverts commit b278503. Revert "Fix tests for dask dataframe and delayed backends" This reverts commit 5520f74. Revert "Add end-to-end regression test" This reverts commit 8a3e6ae. Revert "Fix dataset corruption after updates (JDASoftwareGroup#445)" This reverts commit a26e840. Revert "Set release date for 4.0" This reverts commit 08a8094. Revert "Return dask scalar for store and update from ddf (JDASoftwareGroup#437)" This reverts commit 494732d. Revert "Add tests for non-default table (JDASoftwareGroup#440)" This reverts commit 3807a02. Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (JDASoftwareGroup#441)" This reverts commit f7615ec. Revert "Set default for dates_as_object to True (JDASoftwareGroup#436)" This reverts commit 75ffdb5. Revert "Remove inferred indices (JDASoftwareGroup#438)" This reverts commit b1e2535. Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (JDASoftwareGroup#422)" This reverts commit b349cee. Revert "Remove all deprecated arguments (JDASoftwareGroup#434)" This reverts commit 74f0790. Revert "Remove multi table feature (JDASoftwareGroup#431)" This reverts commit 032856a.

Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (#466)" This reverts commit fdc9779. Revert "fix mistakes in documentation" This reverts commit 4e4b5e0. Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (#460)" This reverts commit d027ca2. Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (#461)" This reverts commit 97cd553. Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (#458)" This reverts commit e48d67a. Revert "Fix bug when loading few columns of a dataset with many primary indices (#446)" This reverts commit 90ee486. Revert "Prepare release 4.0.1" This reverts commit b278503. Revert "Fix tests for dask dataframe and delayed backends" This reverts commit 5520f74. Revert "Add end-to-end regression test" This reverts commit 8a3e6ae. Revert "Fix dataset corruption after updates (#445)" This reverts commit a26e840. Revert "Set release date for 4.0" This reverts commit 08a8094. Revert "Return dask scalar for store and update from ddf (#437)" This reverts commit 494732d. Revert "Add tests for non-default table (#440)" This reverts commit 3807a02. Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (#441)" This reverts commit f7615ec. Revert "Set default for dates_as_object to True (#436)" This reverts commit 75ffdb5. Revert "Remove inferred indices (#438)" This reverts commit b1e2535. Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (#422)" This reverts commit b349cee. Revert "Remove all deprecated arguments (#434)" This reverts commit 74f0790. Revert "Remove multi table feature (#431)" This reverts commit 032856a.

steffen-schroeder-by · 2021-06-11T19:29:57Z

We merged @ilia-zaitcev-by's PR. Next steps are releasing a 5.0.0rc1, extensive testing and then preparing the 5.0.
We are currently internally building a plan for 6.0 with migration plans from 4.0 and from 5.0. With the learnings from the 3.20 to 4.0 upgrade we wanna make sure that this is as smooth as possible for all stakeholders.
Furthermore, I created https://github.com/JDASoftwareGroup/kartothek/tree/maintainance_3.20 and https://github.com/JDASoftwareGroup/kartothek/tree/maintainance_4.0 so we are able to continue maintenance on those releases of needed.
Thanks again to @ilia-zaitcev-by for preparing the PR and to @xhochy for the constructive feedback and attitude of finding a way forward.
Closing this issues for now.

xhochy · 2021-06-20T19:09:26Z

Thanks @steffen-schroeder-by for the communication and constructive discussion!

steffen-schroeder-by pinned this issue May 18, 2021

ilia-zaitcev-by mentioned this issue May 26, 2021

5.x: rollback to the 3.20.0 state (#471) #472

Merged

steffen-schroeder-by closed this as completed Jun 11, 2021

steffen-schroeder-by unpinned this issue Jul 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Announcement: Create release 5.0 which is compatible with 3.20.0 but incompatible with 4.0 #471

Announcement: Create release 5.0 which is compatible with 3.20.0 but incompatible with 4.0 #471

steffen-schroeder-by commented May 18, 2021

xhochy commented May 19, 2021

steffen-schroeder-by commented May 19, 2021

fjetter commented May 19, 2021

steffen-schroeder-by commented May 19, 2021

lr4d commented May 21, 2021

steffen-schroeder-by commented May 25, 2021

steffen-schroeder-by commented May 25, 2021

fjetter commented May 25, 2021

xhochy commented May 26, 2021

steffen-schroeder-by commented May 26, 2021

steffen-schroeder-by commented Jun 3, 2021 •

edited

Loading

steffen-schroeder-by commented Jun 11, 2021

xhochy commented Jun 20, 2021

Announcement: Create release 5.0 which is compatible with 3.20.0 but incompatible with 4.0 #471

Announcement: Create release 5.0 which is compatible with 3.20.0 but incompatible with 4.0 #471

Comments

steffen-schroeder-by commented May 18, 2021

Background

Planned next steps

xhochy commented May 19, 2021

steffen-schroeder-by commented May 19, 2021

fjetter commented May 19, 2021

steffen-schroeder-by commented May 19, 2021

Concrete Changes

Kartothek 4.0 dropping reasons

lr4d commented May 21, 2021

steffen-schroeder-by commented May 25, 2021

steffen-schroeder-by commented May 25, 2021

fjetter commented May 25, 2021

xhochy commented May 26, 2021

steffen-schroeder-by commented May 26, 2021

steffen-schroeder-by commented Jun 3, 2021 • edited Loading

steffen-schroeder-by commented Jun 11, 2021

xhochy commented Jun 20, 2021

steffen-schroeder-by commented Jun 3, 2021 •

edited

Loading