[BUG] pyorc does not read string column statistics of cuDF generated files #9313

devavret · 2021-09-27T11:01:54Z

When reading the statistics of an ORC file written by cuDF, the sum statistic of the string column is wrong when read with cuDF and absent when read with pyorc.

In [1]: import cudf

In [2]: import pyorc

In [3]: gdf = cudf.DataFrame({'b':[1,7], 'a':['Badam khao', 'roz']})

In [4]: gdf.to_orc("temp.orc")

In [5]: cudf.io.orc.read_orc_statistics(["temp.orc"])
Out[5]: 
([{'col0': {'number_of_values': 2},
   'b': {'number_of_values': 2, 'minimum': 1, 'maximum': 7, 'sum': 8},
   'a': {'number_of_values': 2,
    'minimum': 'Badam khao',
    'maximum': 'roz',
    'sum': -7}}],
 [{'col0': {'number_of_values': 2},
   'b': {'number_of_values': 2, 'minimum': 1, 'maximum': 7, 'sum': 8},
   'a': {'number_of_values': 2,
    'minimum': 'Badam khao',
    'maximum': 'roz',
    'sum': -7}}])

In [6]: f = open("temp.orc", 'rb')

In [7]: r = pyorc.Reader(f)

In [8]: r[1].statistics
Out[8]: 
{'has_null': False,
 'number_of_values': 2,
 'minimum': 1,
 'maximum': 7,
 'sum': 8,
 'kind': <TypeKind.LONG: 4>}

In [9]: r[2].statistics
Out[9]: {'has_null': False, 'number_of_values': 2, 'kind': <TypeKind.STRING: 7>}

Expected result

For a string column, the sum statistic contains the sum of the lengths of all the strings in the column. libcudf computes this value correctly, so it should be present when reading with pyorc and correct when reading with cuDF.
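For reference, the value expected for column 'a' in the session above can be worked out by hand. This is a minimal sketch, assuming lengths are counted in UTF-8 bytes; both strings here are ASCII, so byte and character counts coincide.

```python
# The two strings written to column 'a' in the reproducer above.
strings = ["Badam khao", "roz"]

# Assumption: the string "sum" statistic is the total length in bytes.
expected_sum = sum(len(s.encode("utf-8")) for s in strings)

print(expected_sum)  # 10 + 3 = 13
```

The reported value of -7 therefore cannot be a total length.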

There are two issues here:

1. String sum statistics are encoded incorrectly (will be fixed by Fix ORC string sum statistics #11740)
2. pyorc does not read cuDF-written ORC string statistics

github-actions · 2021-11-15T21:03:21Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions · 2022-02-13T22:03:08Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vuule · 2022-08-02T20:43:00Z

This is a correctness issue, prioritizing for 22.10.

Issue #9313

The root cause is that the sum value was encoded as an unsigned int. The ORC specification shows that the value should be encoded as signed. Because both encode and decode were assuming unsigned encoding, the existing C++ test (OrcStatisticsTest, Basic) was passing even without this fix. Added a Python test that uses a different decode method, so it fails without the fix.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Tobias Ribizel (https://github.com/upsj)
- David Wendt (https://github.com/davidwendt)
- Bradley Dice (https://github.com/bdice)

URL: #11740
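The signed/unsigned mismatch described above is consistent with the -7 seen in the original report. The sketch below is an illustration, not the libcudf code: it assumes the sum field is a protobuf sint64 (zigzag-encoded), that the writer stored the raw length 13 without zigzag encoding, and that the Python reader applied zigzag decoding. The helper names are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to the unsigned value stored for a sint64 field."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(u: int) -> int:
    """Inverse of zigzag_encode: recover the signed value from the stored one."""
    return (u >> 1) ^ -(u & 1)


true_sum = 13  # len("Badam khao") + len("roz") from the reproducer

# Assumed buggy path: the raw value is written as if the field were unsigned,
# then a spec-following reader zigzag-decodes it.
read_back_buggy = zigzag_decode(true_sum)

# Spec-following path: zigzag-encode on write, zigzag-decode on read.
read_back_fixed = zigzag_decode(zigzag_encode(true_sum))

print(read_back_buggy, read_back_fixed)
```

Under these assumptions the buggy path yields -7 and the fixed path yields 13, which also shows why a round trip through a matching unsigned encoder and decoder would hide the problem.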

vuule · 2022-09-26T21:00:37Z

Additional info related to the pyorc part of the issue: Spark is able to read ORC string column statistics, and uses them for predicate based filtering.

devavret added the bug (Something isn't working) and cuIO (cuIO issue) labels Sep 27, 2021

github-actions bot added the inactive-30d label Nov 15, 2021

github-actions bot added the inactive-90d label Feb 13, 2022

GregoryKimball added the good first issue (Good for newcomers) label Jun 28, 2022

vuule self-assigned this Aug 3, 2022

vuule mentioned this issue Sep 22, 2022 in Fix ORC string sum statistics #11740 (merged)

vuule removed the inactive-30d and good first issue (Good for newcomers) labels Sep 26, 2022

vuule removed their assignment Sep 27, 2022

vuule changed the title ~~[BUG] ORC string sum statistics are wrong~~ [BUG] pyorc does not read string column statistics of cuDF generated files Sep 27, 2022

GregoryKimball added the libcudf (Affects libcudf (C++/CUDA) code) label Apr 2, 2023
