[BUG] pyorc does not read string column statistics of cuDF generated files #9313
Comments
This is a correctness issue, prioritizing for 22.10.
Issue #9313

The root cause is that the sum value was encoded as an unsigned int. The ORC specs show that the value should be encoded as signed. Because both encode and decode were assuming unsigned encoding, the existing C++ test (OrcStatisticsTest, Basic) was passing even without this fix. Added a Python test that uses a different decode method, so it fails without the fix.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Tobias Ribizel (https://github.com/upsj)
- David Wendt (https://github.com/davidwendt)
- Bradley Dice (https://github.com/bdice)

URL: #11740
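The unsigned-versus-signed mismatch described in the fix can be illustrated with a small sketch. It assumes the sum field is a protobuf `sint64` (ZigZag-encoded varint), which is how I understand the ORC specification to declare it. The helper names below are hypothetical and are not taken from libcudf or pyorc.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("varint payload must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single base-128 varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one (sint64 convention)."""
    return (n << 1) ^ (n >> 63)


def zigzag_decode(n: int) -> int:
    """Inverse of zigzag_encode."""
    return (n >> 1) ^ -(n & 1)


string_lengths_sum = 11  # illustrative value, e.g. total length of "hello" + "world!"

# Writer bug (assumed shape): the value is written as a plain unsigned varint.
buggy_bytes = encode_varint(string_lengths_sum)
# Fixed writer: ZigZag first, then varint.
fixed_bytes = encode_varint(zigzag_encode(string_lengths_sum))

# A reader that makes the same unsigned assumption round-trips the buggy bytes,
# which is why a test using the same encode/decode pair cannot catch the bug.
assert decode_varint(buggy_bytes) == 11

# A reader that follows the signed convention sees a different value.
print(zigzag_decode(decode_varint(buggy_bytes)))  # -6
print(zigzag_decode(decode_varint(fixed_bytes)))  # 11
```

This only demonstrates why a symmetric encode/decode pair hides the error; it does not model how pyorc decides whether to report the statistic.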
Additional info related to the pyorc part of the issue: Spark is able to read ORC string column statistics, and uses them for predicate-based filtering.
When reading the statistics for an ORC file written by cuDF, the result for sum is wrong when read using cuDF and absent when using pyorc.
Expected result
The sum statistic contains the sum of the lengths of all the strings in the column. libcudf computes this correctly, so it should be present when reading with pyorc and correct when reading with cuDF.
There are two issues here: