kartothek 4.0 update_dataset_from_ddf corrupts datasets if a different table name is used #445

Closed
peter-hoffmann-by opened this issue Mar 31, 2021 · 2 comments

Comments


peter-hoffmann-by commented Mar 31, 2021

Problem description

Kartothek 4.0 breaks updating existing datasets that use a table name other than the default.

Expected Behaviour

The table name keyword is respected when updating datasets.

Example code

import dask
import dask.dataframe as dd
from storefact import get_store_from_url
from kartothek.io.dask.dataframe import update_dataset_from_ddf

dask.config.set(scheduler='synchronous')

ddf = dask.datasets.timeseries()
ddf = ddf.reset_index()
print(ddf.head())
ddf['date'] = ddf.timestamp.dt.date
ddf = ddf.loc[:10]
store_url = "hfs://testdata"

store = get_store_from_url(store_url)
# table='predictions' is the non-default table name that should be used on disk
delayed = update_dataset_from_ddf(ddf, store, "testdata", table='predictions', partition_on=['date'], shuffle=True, num_buckets=5)
res = delayed.compute()

Output:

$ tree testdata/testdata 
testdata/testdata 
└── table 
    ├── _common_metadata 
    ├── date=2000-01-01 
    │   ├── 414bb87213014db0b5d34a68e4a03edb.parquet 
    │   ├── 76ab5551c3dc4929b2676b04f2f972ac.parquet
    │   ├── b0614814e3814aecb76705a0f8e5b238.parquet 
    │   ├── eb430028a14644ac845c036b619fff47.parquet 
    │   └── f886ae72fa824827ad850f4af941b8b0.parquet 

Expected Output:

testdata/testdata 
  └── predictions 

The example above is just a minimal example to show that the table name is not used in update_dataset_from_ddf. The dataset is corrupted when you update an existing dataset (written with kartothek before 4.0) with the new release: the dataset suddenly contains two table names ('predictions' and 'table').

import dask
import pandas as pd
import dask.dataframe as dd
from datetime import date
from storefact import get_store_from_url
from kartothek.io.dask.dataframe import update_dataset_from_ddf, read_dataset_as_ddf

dask.config.set(scheduler='synchronous')
store_url = "hfs://testdata"
store = get_store_from_url(store_url)
dataset_uuid = "testdata"  # shared by create(), update() and validate()


def create():
    # Initial write: partitions 2021-01-01 to 2021-01-05 in table 'predictions'
    df = pd.DataFrame({"date": [date(2021,1,x) for x in range(1,6)], 'value': range(5)})
    ddf = dask.dataframe.from_pandas(df, npartitions=1)
    delayed = update_dataset_from_ddf(ddf, store, dataset_uuid, table='predictions', partition_on=['date'])
    res = delayed.compute()

def update():
    # Update: partitions 2021-01-06 to 2021-01-10 (five dates to match the five values)
    df = pd.DataFrame({"date": [date(2021,1,x) for x in range(6,11)], 'value': range(5)})
    ddf = dask.dataframe.from_pandas(df, npartitions=1)
    delayed = update_dataset_from_ddf(ddf, store, dataset_uuid, table='predictions', partition_on=['date'])
    res = delayed.compute()

def validate():
    # Read the dataset back using the non-default table name
    ddf = read_dataset_as_ddf(dataset_uuid, store, "predictions")
    df = ddf.compute()
    print(df)

if __name__ == '__main__':
    #first run with kartothek3
    create()

    #then run with kartothek4
    #update()
    #validate()

Running update() and validate() with kartothek 4, after create() was run with kartothek 3, raises the following exception:

Traceback (most recent call last):
  File "test_kartothek_bug.py", line 48, in <module>
    validate()
  File "test_kartothek_bug.py", line 42, in validate
    ddf = read_dataset_as_ddf(dataset_uuid, store, "predictions")
  File "<decorator-gen-7>", line 2, in read_dataset_as_ddf
  File "python3.8/site-packages/kartothek/io_components/utils.py", line 277, in normalize_args
    return _wrapper(*args, **kwargs)
  File "python3.8/site-packages/kartothek/io_components/utils.py", line 275, in _wrapper
    return function(*args, **kwargs)
  File "python3.8/site-packages/kartothek/io/dask/dataframe.py", line 113, in read_dataset_as_ddf
    delayed_partitions = read_dataset_as_delayed(
  File "python3.8/site-packages/kartothek/io/dask/delayed.py", line 239, in read_dataset_as_delayed
    mps = read_dataset_as_delayed_metapartitions(
  File "<decorator-gen-5>", line 2, in read_dataset_as_delayed_metapartitions
  File "python3.8/site-packages/kartothek/io_components/utils.py", line 277, in normalize_args
    return _wrapper(*args, **kwargs)
  File "python3.8/site-packages/kartothek/io_components/utils.py", line 275, in _wrapper
    return function(*args, **kwargs)
  File "/Users/phoffmann/code/macenv/venv/lib/python3.8/site-packages/kartothek/io/dask/delayed.py", line 217, in read_dataset_as_delayed_metapartitions
    return list(mps)
  File "python3.8/site-packages/kartothek/io_components/read.py", line 102, in dispatch_metapartitions_from_factory
    yield MetaPartition.from_partition(
  File "python3.8/site-packages/kartothek/io_components/metapartition.py", line 426, in from_partition
    file=partition.files[table_name],
KeyError: 'predictions'

Dataset layout

$ tree testdata/testdata
testdata/testdata
├── predictions
│   ├── _common_metadata
│   ├── date=2021-01-01
│   │   └── 160182cebd724a8899f1656bdc6d7628.parquet
│   ├── date=2021-01-02
│   │   └── 2859042309f3498f84efdcf3626a8c93.parquet
│   ├── date=2021-01-03
│   │   └── abca32a940e044fc962da03b1fe55bd5.parquet
│   ├── date=2021-01-04
│   │   └── e133c297ef6c44b4bdcabb92af6567a5.parquet
│   └── date=2021-01-05
│       └── d80e42f70fe2425aa623cc342fb9b1e9.parquet
└── table
    ├── date=2021-01-06
    │   └── e9c0d16f75f14454bfdba8377c34df74.parquet
    ├── date=2021-01-07
    │   └── 25c8c53f0d79410290c9b5427a07da09.parquet
    ├── date=2021-01-08
    │   └── 05c191f27e444244b8275296d1302c32.parquet
    ├── date=2021-01-09
    │   └── 42e38bcd6ab54bae83cffe2dd1548e26.parquet
    └── date=2021-01-10
        └── 40f8305134b74b5f85b356417b027959.parquet
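
A quick way to spot this situation on disk is to list the directories directly below the dataset directory. The following is a minimal sketch, not part of kartothek: it assumes a filesystem-backed store laid out as in the tree above (dataset directory, then one directory per table), and the helper names `find_table_dirs` and `has_unexpected_tables` are made up for illustration.

```python
from pathlib import Path


def find_table_dirs(dataset_dir):
    """Return the sorted names of the directories directly below dataset_dir.

    Assumption: in a layout like the tree above, every directory directly
    below the dataset directory corresponds to one table.
    """
    root = Path(dataset_dir)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def has_unexpected_tables(dataset_dir, expected_table):
    """Return True if any table directory other than expected_table exists."""
    return any(name != expected_table for name in find_table_dirs(dataset_dir))
```

For the layout shown above, `find_table_dirs("testdata/testdata")` would list both `predictions` and `table`, so `has_unexpected_tables("testdata/testdata", "predictions")` would flag the dataset.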

Used versions

kartothek==4.0.0

@hoffmann hoffmann pinned this issue Apr 6, 2021
@peter-hoffmann-by
Author

The bug is at https://github.com/JDASoftwareGroup/kartothek/blob/master/kartothek/io_components/metapartition.py#L468, which silently drops the table_name and falls back to the default. The constructor call there should also pass

        table_name=metapartition.table_name
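
The failure pattern can be illustrated with a toy model. This is only a sketch of the general mechanism (a keyword with a default that is not forwarded when an object is rebuilt); `ToyPartition`, `copy_buggy`, `copy_fixed` and `file_key` are hypothetical names and do not reflect kartothek's actual MetaPartition code.

```python
from dataclasses import dataclass

DEFAULT_TABLE = "table"  # the default table name mentioned in this issue


@dataclass
class ToyPartition:
    """Hypothetical stand-in for a partition object, NOT kartothek's class."""
    label: str
    table_name: str = DEFAULT_TABLE


def copy_buggy(part):
    # Bug pattern: table_name is not forwarded, so the default is used silently.
    return ToyPartition(label=part.label)


def copy_fixed(part):
    # Fix pattern: forward the table name explicitly.
    return ToyPartition(label=part.label, table_name=part.table_name)


def file_key(part):
    """Key prefix a writer might derive from the partition (assumed layout)."""
    return f"{part.table_name}/{part.label}"


original = ToyPartition(label="date=2021-01-06", table_name="predictions")
print(file_key(copy_buggy(original)))  # table/date=2021-01-06
print(file_key(copy_fixed(original)))  # predictions/date=2021-01-06
```

In this toy model the buggy copy ends up under `table/`, which mirrors the second table directory in the layout above; a later lookup by 'predictions' then finds nothing for those partitions.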

stephan-hesselmann-by added a commit to stephan-hesselmann-by/kartothek that referenced this issue Apr 8, 2021
When updating a dataset with a table name other than 'table', an additional table named
'table' is erroneously created. This corrupts the dataset. The issue was introduced after
deprecating the table name feature in the 4.0.0 release. The root cause is not passing the
table name as an argument within `partition_on` and `add_metapartition`, which leads to the
default table name "table" being used.
stephan-hesselmann-by added a commit that referenced this issue Apr 12, 2021
stephan-hesselmann-by added a commit to stephan-hesselmann-by/kartothek that referenced this issue Apr 12, 2021
@stephan-hesselmann-by
Collaborator

Fixed by #451

@stephan-hesselmann-by stephan-hesselmann-by unpinned this issue Apr 13, 2021
ilia-zaitcev-by added a commit to ilia-zaitcev-by/kartothek that referenced this issue May 26, 2021
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (JDASoftwareGroup#466)"
This reverts commit fdc9779.

Revert "fix mistakes in documentation"
This reverts commit 4e4b5e0.

Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (JDASoftwareGroup#460)"
This reverts commit d027ca2.

Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (JDASoftwareGroup#461)"
This reverts commit 97cd553.

Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (JDASoftwareGroup#458)"
This reverts commit e48d67a.

Revert "Fix bug when loading few columns of a dataset with many primary indices (JDASoftwareGroup#446)"
This reverts commit 90ee486.

Revert "Prepare release 4.0.1"
This reverts commit b278503.

Revert "Fix tests for dask dataframe and delayed backends"
This reverts commit 5520f74.

Revert "Add end-to-end regression test"
This reverts commit 8a3e6ae.

Revert "Fix dataset corruption after updates (JDASoftwareGroup#445)"
This reverts commit a26e840.

Revert "Set release date for 4.0"
This reverts commit 08a8094.

Revert "Return dask scalar for store and update from ddf (JDASoftwareGroup#437)"
This reverts commit 494732d.

Revert "Add tests for non-default table (JDASoftwareGroup#440)"
This reverts commit 3807a02.

Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (JDASoftwareGroup#441)"
This reverts commit f7615ec.

Revert "Set default for dates_as_object to True (JDASoftwareGroup#436)"
This reverts commit 75ffdb5.

Revert "Remove inferred indices (JDASoftwareGroup#438)"
This reverts commit b1e2535.

Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (JDASoftwareGroup#422)"
This reverts commit b349cee.

Revert "Remove all deprecated arguments (JDASoftwareGroup#434)"
This reverts commit 74f0790.

Revert "Remove multi table feature (JDASoftwareGroup#431)"
This reverts commit 032856a.
ilia-zaitcev-by added a commit to ilia-zaitcev-by/kartothek that referenced this issue Jun 11, 2021
steffen-schroeder-by pushed a commit that referenced this issue Jun 11, 2021