Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Announcement: Create release 5.0 which is compatible with 3.20.0 but incompatible with 4.0 #471

Closed
steffen-schroeder-by opened this issue May 18, 2021 · 13 comments

Comments

@steffen-schroeder-by
Copy link
Contributor

Background

Kartothek 4.0 is a great release with many extremely valuable simplifications. But during internal migrations from 3.20.0 to 4.0, it turned out that for many of those well thought changes an easy migration could not be taken with reasonable effort.
Also some of the simplifications caused customer facing incidents. Thus we are currently stuck with 3.20.0.

To give this important library the deserved attention and speed we are working with a bunch of colleagues on getting out a stable version of kartothek again, where we can simply migrate to and accept community feedback and pull requetst.

To move on, we need to bring a current stable version (3.20.0) to a version which has the simplification of 4.0.
We could have continued with development of 3.x branch and slowly converging against 4.0.
We have decided against this for the following reasons:

  1. some of the changes are on syntax level breaking and would require a new major version itself (between 3.x and 4.0)
  2. pull requests and issues from the community are usually reported against the latest release. By that, we would on the one hand create a moving migration target and on the other hand slow down accepting those PR, since the main effort would go into 3.x first
  3. this would also include dependency updates which would usually only be accepted for master, opening another dimension where 3.x and 4.0 are diverging.

Planned next steps

To overcome this we decided on the following way forward.

  1. We will create a release 5.0 based on 3.20.0. All changes which happened after 3.20.0 will be discarded. There will be no migration path from 4.0.1 to 5.0
  2. After having that release in place, we will again accept PRs, issues and dependency updates.
  3. From the learning of the 3.20.0 to 4.0 migration, we will gradually add those incompatible changes back (like removal of the multi-table-feature). Where applicable, we will offer a straight migration path to a 6.0 version. Where this is not applicable (like when return types changed from Dict[str,List[T]] to List[T]) we might go a step in between. Communication on that will follow.
  4. Finally there will be a release 6.0 which will have most of the simplification.

We know that especially the incompatibility between 4.0 and 5.0 will be an issue for some consumers. We that's why encourage you to use the very stable kartothek version 3.20.0 and not version 4.x

@xhochy
Copy link
Contributor

xhochy commented May 19, 2021

We will create a release 5.0 based on 3.20.0. All changes which happened after 3.20.0 will be discarded. There will be no migration path from 4.0.1 to 5.0

With sadly no documentation on what went wrong and having this strict decision that 4.x will be dropped, I don't really feel comfortable advocating/using Kartothek anymore. Can you give any insights what are design decisions with 4.x that break the usage of it? This transparency would be useful to understand so that the same issues don't occur with 5.x again.

@steffen-schroeder-by
Copy link
Contributor Author

Thanks for reaching out. I'll compile list of 4.0 changes and our intended way of treating them later today.

@fjetter
Copy link
Collaborator

fjetter commented May 19, 2021

The only reported issue I am aware of is #452 . I haven't investigated it thoroughly but from my first impression this looks like a problem which is easy enough to be fixed and doesn't warrant this dramatic change.

I would appreciate more openness about these things if kartothek shall remain a shared open source library since I also do not feel comfortable with advocating anymore if this is the way the library is being developed.

@steffen-schroeder-by
Copy link
Contributor Author

Let me get a bit more into detail about the reasoning. First it's correct that the only public reported issue is #452 and it's predecessors. That one was specially bad because the issue surface not when kartothek writes out the data but when it reads it the next time. Because of some incidents which were triggered that that we rolled back to kartothek 3.20 immediately and stopped the rollout of Kartothek 4.0. But this is only a bug. Bugs can be fixed.

Concrete Changes

I want to be as open and transparent about the reasoning as possible.
The main reason for our way forward is that actual migration of all our services and internal tooling became a huge effort.

Quoting from an internal high level analysis (slightly changed by myself for public internet):

Right now, smoothly migrating is not possible, because kartothek does so many changes. It also has gaps in the documentation (e.g. missing documentation for changes in date_as_object). It also has issues, as the behavior of reading cubes is changed (dates are now objects), but there is no way to get the old behavior, which is the preferred of many consumers.

A generic recipe for a good migration path would be

first add support for the new feature/behavior in kartothek, without enforcing it. Warn users about the deprecation of the old behavior
release a version that supports all of the new, but also the old behavior. This is the intermediate version we want to use during the migration.
after there are no deprecation warnings (e.g. in tests that have good coverage; maybe even look at prod for a week), we can update to the new version of kartothek that drops all support for the code paths that emitted deprecation warnings.
A good example how this could work is the move to kartothek.api, where this uipdate schema works. Another example is the removal of DatasetMetadata.table_meta in favour of DatasetMetadata.schema: The new .schema is supported already in kartothek 3.20 and using .table_meta emits a deprecation warning.

Counterexamples that currently don't work well is:

  • removing the table_name argument completely
  • changing the default data type for dates. This is particularly nasty, as we typically don't have good test coverage to detect such changes, esp. for dataset transforming libs that "read-process-write" a whole dataset like the ServiceA or ServiceB: these don't contain tests that check the output data type; they usually just write what they read. If they now read different types, they might want to write different types inadvertently.
  • changing the structure of return values (from list of dicts to list of dataframes), as this does not allow incremental migration
  • table_name changes were poorly tested and caused a huge incident

We then went on to examine each and every change between 3.2.0 and 4.0
Not every item (as taken from the changelog) is decided, yet, but the idea is to keep the following changes and addDeprecationWarnings were needed:

  • Removal of kartothek.io.merge module
  • class :class:~kartothek.core.dataset.DatasetMetadata now has an attribute called schema which replaces the previous attribute table_meta and returns only a single schema
  • All read pipelines will now automatically infer the table to read such that it is no longer necessary to provide table or table_name as an input argument
  • All pipelines which previously accepted an argument tables to select the subset of tables to load no longer accept this keyword. Instead the to-be-loaded table will be inferred
  • Trying to read a multi-tabled dataset will now cause an exception telling users that this is no longer supported with kartothek 4.0
  • Remove the following list of deprecated arguments for io pipelines * label_filter * central_partition_metadata * load_dynamic_metadata * load_dataset_metadata * concat_partitions_on_primary_index
  • Remove output_dataset_uuid and df_serializer from :func:kartothek.io.eager.commit_dataset since these arguments didn't have any effect
  • Remove metadata, df_serializer, overwrite, metadata_merger from :func:kartothek.io.eager.write_single_partition
  • No longer allow to pass delete_scope as a delayed object to :func:~kartothek.io.dask.dataframe.update_dataset_from_ddf
  • Remove argument table_name from :func:~kartothek.io.dask.dataframe.collect_dataset_metadata

We are in favor of dropping:

  • All outputs which previously returned a sequence of dictionaries where each key-value pair would correspond to a table-data pair now returns only one :class:pandas.DataFrame
  • Default value for argument date_as_object is now universally set to True.

Everything else is to be decided, yet.

For the Removal of multi table feature we have an intention to keep the removal, but it depends if we can come up with a proper migration

Kartothek 4.0 dropping reasons

As shown above, there is a proper migration path needed.
There is a need on adding changes to the old release and probably changes to the new release
To do on feature level a proper migration, I see 2 ways (and yes, it's biased):

. Pros Cons
Develop on 4.0, converge 3.x to 4.x No effort for 4.0 consumers through ongoing work on 4.0 3.x might hardly catch up, BY respource are bound to 3.x and might even reject changes that make migration harder, 3.x does not get dependabot changes, 4.0 changes might need to be backported to 3.x
Make 3.20 a new 5.0, drop 4.0 Simpler Migration. 3.2 -> 5.0 is a noop. Learning from 4.0 can be applied, everyone works on master (synergies), pr are very welcome and not seen as danger moving forward no migration path from 4.0 to 5.0

Thanks for the feedback @xhochy and @fjetter - your input is always valued and appriciated. I know this move is drastic and I would be more then happy if we would not be in this situation. But we are and need to move forward.
Hope this openness helped to give you some background. Let me know your thoughts.

@lr4d
Copy link
Collaborator

lr4d commented May 21, 2021

tbqh I think this is terrible idea.

If the migration from 3.x to 4.x is an issue, I suggest time be invested in making scripts that take care of updating the old datasets to 4.x as well as the code.

@steffen-schroeder-by
Copy link
Contributor Author

Hi @xhochy, Hi @fjetter with my latest explanation is it more understandable why we want to go in this direction?
I know it's all but ideal, but from our point of view the best way forward (or the least worst, depending on perspective).

@steffen-schroeder-by
Copy link
Contributor Author

If the migration from 3.x to 4.x is an issue, I suggest time be invested in making scripts that take care of updating the old datasets to 4.x as well as the code.

Making a script is some effort but doable. The problem is tracking the (potential unknown) consumers who need to run it.
And this also might leave us with unknown unknowns because we have seen that we are missing test coverage for some of the incompatible changes (like #445 slipped through)

@fjetter
Copy link
Collaborator

fjetter commented May 25, 2021

tbh, I am still struggling to understand. What I do understand is

  1. update_dataset_from_ddf corrupts datasets if table names diverge on create -> update #452 cause severe problems. Got it. This needs to be fixed and was definitely not intentional.
  2. The change of the default for date_as_object is inconvenient for some users. This is unfortunate but there are good reasons why this default changed and there are also very good reasons why this flag should be removed entirely in the mid to long-term future. Apart from incredible code complexity in the area of predicate pushdown, schema validation and schema evolution, we have bugs which cannot be fixed properly while keeping this flag. If this is an issue about cube consumers, an intermediate migration path would be simply to allow them to specify this toggle themselves

imho, these two points do not warrant any rollback but there are sane and much cheaper ways to move forward. For all other listed issue, I am struggling to see the relevance since they are dealing with rather fringe use cases.

The entire intention of the 4.0 release was to get rid of legacy code in an affordable way. The only intended change in behaviour (other than dropping entire, unused features, of course) was the date_as_object change.

imho, rolling back and implementing a proper deprecation lifecycle for everything is not worth it since the work involved for the API migrations should be much cheaper than the rollback + deprecation path. On top of this, the intended changes for the 4.0 release have been in a lengthy BY-internal review process before implementation even started. Rolling back because months after there are other opinions is not the way I want to develop a library.


That all said, the phrasing of this issue and its title already indicate that you are willing to pull super-admin privileges to get your way and this doesn't feel like an open discussion. I'm not willing to put more time into this issue if the outcome is predetermined, anyhow.

ilia-zaitcev-by added a commit to ilia-zaitcev-by/kartothek that referenced this issue May 26, 2021
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (JDASoftwareGroup#466)"
This reverts commit fdc9779.

Revert "fix mistakes in documentation"
This reverts commit 4e4b5e0.

Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (JDASoftwareGroup#460)"
This reverts commit d027ca2.

Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (JDASoftwareGroup#461)"
This reverts commit 97cd553.

Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (JDASoftwareGroup#458)"
This reverts commit e48d67a.

Revert "Fix bug when loading few columns of a dataset with many primary indices (JDASoftwareGroup#446)"
This reverts commit 90ee486.

Revert "Prepare release 4.0.1"
This reverts commit b278503.

Revert "Fix tests for dask dataframe and delayed backends"
This reverts commit 5520f74.

Revert "Add end-to-end regression test"
This reverts commit 8a3e6ae.

Revert "Fix dataset corruption after updates (JDASoftwareGroup#445)"
This reverts commit a26e840.

Revert "Set release date for 4.0"
This reverts commit 08a8094.

Revert "Return dask scalar for store and update from ddf (JDASoftwareGroup#437)"
This reverts commit 494732d.

Revert "Add tests for non-default table (JDASoftwareGroup#440)"
This reverts commit 3807a02.

Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (JDASoftwareGroup#441)"
This reverts commit f7615ec.

Revert "Set default for dates_as_object to True (JDASoftwareGroup#436)"
This reverts commit 75ffdb5.

Revert "Remove inferred indices (JDASoftwareGroup#438)"
This reverts commit b1e2535.

Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (JDASoftwareGroup#422)"
This reverts commit b349cee.

Revert "Remove all deprecated arguments (JDASoftwareGroup#434)"
This reverts commit 74f0790.

Revert "Remove multi table feature (JDASoftwareGroup#431)"
This reverts commit 032856a.
@xhochy
Copy link
Contributor

xhochy commented May 26, 2021

That all said, the phrasing of this issue and its title already indicate that you are willing to pull super-admin privileges to get your way and this doesn't feel like an open discussion. I'm not willing to put more time into this issue if the outcome is predetermined, anyhow.

I can very much second @fjetter's statement here. I'm happy to work with you on fixing the issues that came with the 4.0 release. Apart from the date_as_object change, they should be fixable without the proposed route of "forgetting that 4.0 ever happened". The date_as_object is something we had in mind years ago, I wouldn't see that as a great inconvenience but just as something that has to be done anyways. The wording of the issue sounds though like "this is not up for discussion".

To make this more productive and less prone to misunderstandings, I'm happy to also have a virtual coffee chat on a 1:1 basis to clear things up. This is probably less tense than the plain text communication here.

@steffen-schroeder-by
Copy link
Contributor Author

@xhochy I sent you a mail with proposals for a virtual coffee. Thanks for taking the time.

@steffen-schroeder-by
Copy link
Contributor Author

steffen-schroeder-by commented Jun 3, 2021

Just want to give a small update on this.
I had a very constructive call with @xhochy . I explained the reasoning about the intended actions. He helped me to understand the community and kartothek user's point of view. I admit that we should have communicated earlier and more transparent the challenges the adoption of kartothek 4.0 gave us.
Also the message as I have written it did not express enough a way for 4.0 onwards. It was never intended to pretend that version has never existed. Personally I'm still convinced that its simplification are needed.
So let me explain the target state. There should be a version 6.0 which has most (if not all) simplifications version 4.0 introduced. There should be migrations paths for each incompatible change from 4.0 AND 5.0 to 6.0. So the investment which was made for migration from 3.20 to 4.0 is till valid. 4.0 is not supposed to be a dead end.

The next steps are (aligned with @xhochy):

  • A BlueYonder associate will work on a migration path from 4.0 to 6.0. Once the results are ready, we will publish that as a separate Github Issue.
  • I will create maintainance branches based on 3.20 and 4.0.1 (HEAD) so critical fixes can be applied on those branches. Even dependabot updates can be targeted on those branches
  • On Monday 2021-06-07 I will merge PR 5.x: rollback to the 3.20.0 state (#471) #472
  • We will create a 5.0rc from that and test that internally
  • If those tests are successful we will publish a 5.0
  • When we begin the work on deciding on a migration path from 5.0 to 6.0 with the learnings which we had with the 3.20 to 4.0 migration
  • Edit: there are unreleased changes on master branch which should go into a 4.0.2. Will create a release for this as well.

Let me know your thoughts.

ilia-zaitcev-by added a commit to ilia-zaitcev-by/kartothek that referenced this issue Jun 11, 2021
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (JDASoftwareGroup#466)"
This reverts commit fdc9779.

Revert "fix mistakes in documentation"
This reverts commit 4e4b5e0.

Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (JDASoftwareGroup#460)"
This reverts commit d027ca2.

Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (JDASoftwareGroup#461)"
This reverts commit 97cd553.

Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (JDASoftwareGroup#458)"
This reverts commit e48d67a.

Revert "Fix bug when loading few columns of a dataset with many primary indices (JDASoftwareGroup#446)"
This reverts commit 90ee486.

Revert "Prepare release 4.0.1"
This reverts commit b278503.

Revert "Fix tests for dask dataframe and delayed backends"
This reverts commit 5520f74.

Revert "Add end-to-end regression test"
This reverts commit 8a3e6ae.

Revert "Fix dataset corruption after updates (JDASoftwareGroup#445)"
This reverts commit a26e840.

Revert "Set release date for 4.0"
This reverts commit 08a8094.

Revert "Return dask scalar for store and update from ddf (JDASoftwareGroup#437)"
This reverts commit 494732d.

Revert "Add tests for non-default table (JDASoftwareGroup#440)"
This reverts commit 3807a02.

Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (JDASoftwareGroup#441)"
This reverts commit f7615ec.

Revert "Set default for dates_as_object to True (JDASoftwareGroup#436)"
This reverts commit 75ffdb5.

Revert "Remove inferred indices (JDASoftwareGroup#438)"
This reverts commit b1e2535.

Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (JDASoftwareGroup#422)"
This reverts commit b349cee.

Revert "Remove all deprecated arguments (JDASoftwareGroup#434)"
This reverts commit 74f0790.

Revert "Remove multi table feature (JDASoftwareGroup#431)"
This reverts commit 032856a.
steffen-schroeder-by pushed a commit that referenced this issue Jun 11, 2021
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (#466)"
This reverts commit fdc9779.

Revert "fix mistakes in documentation"
This reverts commit 4e4b5e0.

Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (#460)"
This reverts commit d027ca2.

Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (#461)"
This reverts commit 97cd553.

Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (#458)"
This reverts commit e48d67a.

Revert "Fix bug when loading few columns of a dataset with many primary indices (#446)"
This reverts commit 90ee486.

Revert "Prepare release 4.0.1"
This reverts commit b278503.

Revert "Fix tests for dask dataframe and delayed backends"
This reverts commit 5520f74.

Revert "Add end-to-end regression test"
This reverts commit 8a3e6ae.

Revert "Fix dataset corruption after updates (#445)"
This reverts commit a26e840.

Revert "Set release date for 4.0"
This reverts commit 08a8094.

Revert "Return dask scalar for store and update from ddf (#437)"
This reverts commit 494732d.

Revert "Add tests for non-default table (#440)"
This reverts commit 3807a02.

Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (#441)"
This reverts commit f7615ec.

Revert "Set default for dates_as_object to True (#436)"
This reverts commit 75ffdb5.

Revert "Remove inferred indices (#438)"
This reverts commit b1e2535.

Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (#422)"
This reverts commit b349cee.

Revert "Remove all deprecated arguments (#434)"
This reverts commit 74f0790.

Revert "Remove multi table feature (#431)"
This reverts commit 032856a.
@steffen-schroeder-by
Copy link
Contributor Author

We merged @ilia-zaitcev-by's PR. Next steps are releasing a 5.0.0rc1, extensive testing and then preparing the 5.0.
We are currently internally building a plan for 6.0 with migration plans from 4.0 and from 5.0. With the learnings from the 3.20 to 4.0 upgrade we wanna make sure that this is as smooth as possible for all stakeholders.
Furthermore, I created https://github.com/JDASoftwareGroup/kartothek/tree/maintainance_3.20 and https://github.com/JDASoftwareGroup/kartothek/tree/maintainance_4.0 so we are able to continue maintenance on those releases of needed.
Thanks again to @ilia-zaitcev-by for preparing the PR and to @xhochy for the constructive feedback and attitude of finding a way forward.
Closing this issues for now.

@xhochy
Copy link
Contributor

xhochy commented Jun 20, 2021

Thanks @steffen-schroeder-by for the communication and constructive discussion!

@steffen-schroeder-by steffen-schroeder-by unpinned this issue Jul 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

4 participants