deanm0000 opened this issue on Sep 23, 2024 · 1 comment
Labels
A-io (Area: reading and writing data); bug (Something isn't working); P-medium (Priority: medium); performance (Performance issues or improvements); python (Related to Python Polars); rust (Related to Rust Polars)
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
import pyarrow.parquet as pq

pl.DataFrame([
    pl.Series('a', ['a', 'b', 'c'], pl.Categorical)
]).write_parquet('cat.parquet', row_group_size=1)

pq.ParquetFile('cat.parquet').metadata.row_group(0).column(0).statistics
<pyarrow._parquet.Statistics object at 0x7ff7545f4680>
has_min_max: True
min: a
max: c  # This should be a
null_count: 0
distinct_count: None
num_values: 1  # only 1 value, as expected
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
Log output
no stderr
Issue description
When writing a Parquet file with a Categorical or Enum column, the row group statistics describe the overall set of categories, not just the values present in that row group. Pyarrow writes the row group statistics correctly.
Expected behavior
Polars's Parquet writer should compute Categorical/Enum statistics from the values actually present in each row group, not from the full set of categories. Without this, future predicate pushdown won't be effective.
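To make the pushdown concern concrete, here is a minimal plain-Python sketch of min/max row-group pruning. It is a model of the idea only, not Polars or pyarrow code: `row_group_stats` and `can_skip` are hypothetical helpers, and the data mirrors the three single-row row groups from the reproducible example.

```python
# Illustrative model only: none of these names are Polars or pyarrow APIs.

def row_group_stats(values):
    """Return (min, max) for the values given."""
    return min(values), max(values)

def can_skip(stats, target):
    """A reader may skip a row group when target lies outside [min, max]."""
    lo, hi = stats
    return target < lo or target > hi

row_groups = [["a"], ["b"], ["c"]]  # row_group_size=1, as in the example
categories = ["a", "b", "c"]        # the full set of categories

# As reported here: every row group carries the range of all categories.
reported = [row_group_stats(categories) for _ in row_groups]
# As expected: each row group describes only its own values.
expected = [row_group_stats(rg) for rg in row_groups]

target = "b"  # models an equality filter such as a == "b"
skipped_reported = sum(can_skip(s, target) for s in reported)
skipped_expected = sum(can_skip(s, target) for s in expected)
print(skipped_reported, skipped_expected)  # prints: 0 2
```

With the category-wide range on every row group, no row group can be ruled out, so a reader would have to scan all three. With per-row-group ranges, two of the three can be skipped.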
Installed versions
<pyarrow._parquet.Statistics object at 0x7ff7545b0a90>
has_min_max: True
min: a
max: a
null_count: 0
distinct_count: None
num_values: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8