Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dataset corruption after updates (#445) #451

Conversation

stephan-hesselmann-by
Copy link
Collaborator

When updating a dataset with a table name other than 'table', an additional table named
'table' is erroneously created. This corrupts the dataset. The issue was introduced after
deprecating the table name feature in the 4.0.0 release. The root cause is not passing the
table name as an argument within partition_on and add_metapartition, which leads to the
default table name "table" being used.

Description:

Briefly describe the change of behavior

  • Closes #xxxx
  • Changelog entry

When updating a dataset with a table name other than 'table', an additional table named
'table' is erroneously created. This corrupts the dataset. The issue was introduced after
deprecating the table name feature in the 4.0.0 release. The root cause is not passing the
table name as an argument within `partition_on` and `add_metapartition`, which leads to the
default table name "table" being used.
@fjetter
Copy link
Collaborator

fjetter commented Apr 9, 2021

Can we put an end-to-end test in, as well? create a dataset w/ custom table name, update it, ensure it is not corrupt? from this unit test it is not immediately clear that this fixes the problem

@hoffmann hoffmann self-requested a review April 9, 2021 13:44
@stephan-hesselmann-by
Copy link
Collaborator Author

stephan-hesselmann-by commented Apr 9, 2021

I believe the issue described in #445 should be fixed with these changes (scenario where the table name is non-default but equal for create and update). However, I foresee problems when the table name diverges between creation and update, i.e.

    # Create with table name "predictions"
    delayed = update_dataset_from_ddf(ddf, store, dataset_uuid, table='predictions', partition_on=['date'])

    # Update with default table name
    delayed = update_dataset_from_ddf(ddf, store, dataset_uuid, partition_on=['date'])

What whould the expected behavior be in such a case? Should the table name be inferred as "predictions"? I believe that will require more modifications to the code.

@stehessel
Copy link

What whould the expected behavior be in such a case? Should the table name be inferred as "predictions"? I believe that will require more modifications to the code.

I will create a follow up issue for that. Merging this for now as the build failure is already on master and unrelated to these changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants