Fix dataset corruption after updates (JDASoftwareGroup#445)
When updating a dataset with a table name other than 'table', an additional table named
'table' is erroneously created. This corrupts the dataset. The issue was introduced after
deprecating the table name feature in the 4.0.0 release. The root cause is not passing the
table name as an argument within `partition_on` and `add_metapartition`, which leads to the
default table name "table" being used.
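The failure mode described above can be sketched in isolation. The following is a minimal illustration, not kartothek's implementation: `ToyPartition`, `split_buggy`, and `split_fixed` are hypothetical names, and the default value "table" is taken from the commit message. It shows how a constructor call that omits `table_name` silently falls back to the default, and how forwarding the argument preserves the original name.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed default, as described in the commit message.
DEFAULT_TABLE_NAME = "table"


@dataclass
class ToyPartition:
    """Hypothetical stand-in for a metapartition; not kartothek's class."""

    label: str
    rows: List[dict] = field(default_factory=list)
    table_name: str = DEFAULT_TABLE_NAME

    def _group(self, key: str) -> Dict[object, List[dict]]:
        # Group rows by the value found in the partitioning column.
        groups: Dict[object, List[dict]] = {}
        for row in self.rows:
            groups.setdefault(row[key], []).append(row)
        return groups

    def split_buggy(self, key: str) -> List["ToyPartition"]:
        # Bug pattern: table_name is not forwarded, so the default is used.
        return [
            ToyPartition(label=f"{key}={value}/{self.label}", rows=rows)
            for value, rows in self._group(key).items()
        ]

    def split_fixed(self, key: str) -> List["ToyPartition"]:
        # Fix pattern: pass the table name through explicitly.
        return [
            ToyPartition(
                label=f"{key}={value}/{self.label}",
                rows=rows,
                table_name=self.table_name,
            )
            for value, rows in self._group(key).items()
        ]


part = ToyPartition(
    label="label_1",
    rows=[{"P": 1, "L": 1}, {"P": 2, "L": 1}],
    table_name="non-default-name",
)
buggy_names = {p.table_name for p in part.split_buggy("P")}
fixed_names = {p.table_name for p in part.split_fixed("P")}
```

In this sketch, `buggy_names` contains only the default name while `fixed_names` contains only the original one, which mirrors why an unexpected extra table could appear after an update.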
stephan-hesselmann-by committed Apr 8, 2021
1 parent 08a8094 commit c347cf8
Showing 3 changed files with 18 additions and 0 deletions.
6 changes: 6 additions & 0 deletions CHANGES.rst
@@ -2,6 +2,12 @@
Changelog
=========

Kartothek 4.0.1 (2021-04-XX)
============================

* Fixed dataset corruption after updates when table names other than "table" are used (#445).


Kartothek 4.0.0 (2021-03-17)
============================

2 changes: 2 additions & 0 deletions kartothek/io_components/metapartition.py
@@ -471,6 +471,7 @@ def add_metapartition(
schema=schema,
partition_keys=metapartition.partition_keys or None,
logical_conjunction=metapartition.logical_conjunction or None,
table_name=metapartition.table_name,
)

# Add metapartition information to the new object
@@ -1109,6 +1110,7 @@ def partition_on(self, partition_on: Union[str, Sequence[str]]):
f"{label}"
),
partition_keys=partition_on,
table_name=self.table_name,
)
new_mp = new_mp.add_metapartition(tmp_mp, schema_validation=False)
if self.indices:
10 changes: 10 additions & 0 deletions tests/io_components/test_metapartition.py
@@ -1331,3 +1331,13 @@ def test_get_parquet_metadata_row_group_size(store):
}
)
pd.testing.assert_frame_equal(actual, expected)


def test_partition_on_keeps_table_name():
    mp = MetaPartition(
        label="label_1",
        data=pd.DataFrame({"P": [1, 2, 1, 2], "L": [1, 1, 2, 2]}),
        table_name="non-default-name",
    )
    repartitioned_mp = mp.partition_on(["P"])
    assert repartitioned_mp.table_name == "non-default-name"
