Fix dataset corruption after updates (#445) #451

stephan-hesselmann-by · 2021-04-08T23:45:42Z

When updating a dataset with a table name other than 'table', an additional table named
'table' is erroneously created. This corrupts the dataset. The issue was introduced after
deprecating the table name feature in the 4.0.0 release. The root cause is not passing the
table name as an argument within `partition_on` and `add_metapartition`, which leads to the
default table name "table" being used.
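
The failure mode described above can be reproduced in miniature. The sketch below is not kartothek code: `write_partition`, `update_buggy`, and `update_fixed` are hypothetical stand-ins, and the dict-backed store is made up. It only illustrates how a keyword argument that is accepted but not forwarded silently falls back to the callee's default.

```python
DEFAULT_TABLE = "table"


def write_partition(store, rows, table=DEFAULT_TABLE):
    """Append rows to the given table in a dict-backed toy store."""
    store.setdefault(table, []).extend(rows)


def update_buggy(store, rows, table=DEFAULT_TABLE):
    # Bug: `table` is accepted but not forwarded, so the callee's default wins.
    write_partition(store, rows)


def update_fixed(store, rows, table=DEFAULT_TABLE):
    # Fix: pass the table name through explicitly.
    write_partition(store, rows, table=table)


buggy_store = {"predictions": [1, 2]}
update_buggy(buggy_store, [3], table="predictions")

fixed_store = {"predictions": [1, 2]}
update_fixed(fixed_store, [3], table="predictions")

print(sorted(buggy_store))  # a stray "table" entry appears next to "predictions"
print(sorted(fixed_store))  # only "predictions"
```

In the buggy variant the new rows land in a second table, which mirrors the corruption reported in #445.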


fjetter · 2021-04-09T08:55:43Z

Can we put an end-to-end test in as well? Create a dataset with a custom table name, update it, and ensure it is not corrupt. From this unit test it is not immediately clear that this fixes the problem.
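
One way such an end-to-end check could be phrased is to list the store keys after the update and assert that only the expected table appears. The sketch below assumes keys of the form `<dataset_uuid>/<table>/...`, which is an assumption about the storage layout and is not stated in this thread. `tables_in_store` is a hypothetical helper, and the key lists are invented for illustration.

```python
def tables_in_store(keys, dataset_uuid):
    """Return the set of table names found under a dataset's key prefix."""
    prefix = dataset_uuid + "/"
    tables = set()
    for key in keys:
        if not key.startswith(prefix):
            continue  # skip keys outside the dataset prefix, e.g. top-level metadata
        parts = key[len(prefix):].split("/")
        if len(parts) > 1:
            tables.add(parts[0])  # first path segment is assumed to be the table
    return tables


# Invented key listings: one after a buggy update, one after a correct update.
corrupt_keys = [
    "ds.by-dataset-metadata.json",
    "ds/predictions/date=2021-04-08/part-0.parquet",
    "ds/table/date=2021-04-09/part-1.parquet",
]
healthy_keys = [
    "ds.by-dataset-metadata.json",
    "ds/predictions/date=2021-04-08/part-0.parquet",
    "ds/predictions/date=2021-04-09/part-1.parquet",
]

print(tables_in_store(corrupt_keys, "ds"))
print(tables_in_store(healthy_keys, "ds"))
```

A regression test built on this idea would create the dataset with a custom table name, update it, and then assert that the resulting set contains exactly that one name.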

stephan-hesselmann-by · 2021-04-09T14:26:04Z

I believe the issue described in #445 should be fixed with these changes (scenario where the table name is non-default but equal for create and update). However, I foresee problems when the table name diverges between creation and update, i.e.

    # Create with table name "predictions"
    delayed = update_dataset_from_ddf(ddf, store, dataset_uuid, table='predictions', partition_on=['date'])

    # Update with default table name
    delayed = update_dataset_from_ddf(ddf, store, dataset_uuid, partition_on=['date'])

What would the expected behavior be in such a case? Should the table name be inferred as "predictions"? I believe that will require more modifications to the code.
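
To make the question concrete, here is a minimal sketch of one possible policy: infer the table name from the existing dataset when none is requested, and reject an explicit name that conflicts with it. This is not what kartothek does and not a proposal from this PR. `resolve_table_name` is hypothetical, and using `None` to mean "no table name given" is an assumption, since a default of "table" cannot be told apart from an explicit "table".

```python
DEFAULT_TABLE = "table"


def resolve_table_name(existing_tables, requested=None):
    """Pick the table name for an update, given the tables already stored."""
    if not existing_tables:
        # Nothing stored yet: honor the request or fall back to the default.
        return requested if requested is not None else DEFAULT_TABLE
    if len(existing_tables) > 1:
        raise ValueError(f"dataset has several tables: {sorted(existing_tables)}")
    (existing,) = existing_tables
    if requested is None:
        # No name given on update: infer it from the existing dataset.
        return existing
    if requested != existing:
        raise ValueError(
            f"table {requested!r} does not match existing table {existing!r}"
        )
    return requested


print(resolve_table_name({"predictions"}))                 # inferred from the dataset
print(resolve_table_name(set()))                           # falls back to the default
print(resolve_table_name({"predictions"}, "predictions"))  # explicit and consistent
```

Raising on a mismatch rather than silently writing a second table keeps the dataset from ending up in the corrupted state described in #445.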

stehessel · 2021-04-12T08:04:09Z

> What would the expected behavior be in such a case? Should the table name be inferred as "predictions"? I believe that will require more modifications to the code.

I will create a follow-up issue for that. Merging this for now, as the build failure is already on master and unrelated to these changes.

bjoern-meier-by approved these changes Apr 9, 2021

Add end-to-end regression test

3764ddb

hoffmann self-requested a review April 9, 2021 13:44

hoffmann approved these changes Apr 9, 2021

Fix tests for dask dataframe and delayed backends

9a4de1e

stephan-hesselmann-by merged commit 5520f74 into JDASoftwareGroup:master Apr 12, 2021

This was referenced Apr 12, 2021

kartothek 4.0 update_dataset_from_ddf corrupts datasets if a different table name is used #445

Closed

update_dataset_from_ddf corrupts datasets if table names diverge on create -> update #452

Open
