Parquet writer statistics for categorical/enum values have overall min/max instead of row_group min/max #18867

Open
2 tasks done
deanm0000 opened this issue Sep 23, 2024 · 1 comment
Labels
A-io Area: reading and writing data bug Something isn't working P-medium Priority: medium performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars

Comments

@deanm0000
Collaborator

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow.parquet as pq

# One row per row group, so each row group's min and max should be equal.
pl.DataFrame(
    [pl.Series('a', ['a', 'b', 'c'], dtype=pl.Categorical)]
).write_parquet('cat.parquet', row_group_size=1)
pq.ParquetFile('cat.parquet').metadata.row_group(0).column(0).statistics
<pyarrow._parquet.Statistics object at 0x7ff7545f4680>
  has_min_max: True
  min: a
  max: c # This should be a, the only value in this row group
  null_count: 0
  distinct_count: None
  num_values: 1 # only 1 value, as expected
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

Log output

no stderr

Issue description

When Polars writes a Parquet file containing a Categorical or Enum column, each row group's statistics hold the overall min/max rather than the min/max of the values in that row group. PyArrow writes the row group statistics correctly, as shown here with use_pyarrow=True:

pl.DataFrame(
    [pl.Series('a', ['a', 'b', 'c'], dtype=pl.Categorical)]
).write_parquet('catpa.parquet', row_group_size=1, use_pyarrow=True)
pq.ParquetFile('catpa.parquet').metadata.row_group(0).column(0).statistics
<pyarrow._parquet.Statistics object at 0x7ff7545b0a90>
  has_min_max: True
  min: a
  max: a # note a here
  null_count: 0
  distinct_count: None
  num_values: 1
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8
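
The comparison above only inspects row group 0. A minimal sketch of looping the same check over every row group, assuming a flat schema and a column of non-null strings that has statistics written; `row_group_summaries` and `loose_row_groups` are hypothetical helper names, not part of Polars or pyarrow:

```python
def row_group_summaries(path, column=0):
    """Yield (stat_min, stat_max, actual_values) for each row group of a Parquet file."""
    # Imported lazily so the pure comparison below also works without pyarrow.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile(path)
    for i in range(pf.metadata.num_row_groups):
        # Statistics as stored in the file footer for this row group.
        stats = pf.metadata.row_group(i).column(column).statistics
        # Values actually stored in this row group.
        values = pf.read_row_group(i).column(column).to_pylist()
        yield stats.min, stats.max, values


def loose_row_groups(summaries):
    """Return indices of row groups whose stored min/max differ from the actual min/max."""
    loose = []
    for i, (stat_min, stat_max, values) in enumerate(summaries):
        if (stat_min, stat_max) != (min(values), max(values)):
            loose.append(i)
    return loose
```

Calling `loose_row_groups(row_group_summaries('cat.parquet'))` should return an empty list when the statistics are tight; any index it returns points at a row group whose stored range is wider than its contents.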

Expected behavior

Polars's Parquet writer should compute Categorical/Enum statistics from the values actually present in each row group, not from the categorical as a whole. Without this, future predicate pushdown on these columns cannot skip row groups and so won't be effective.
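
To illustrate why this matters for predicate pushdown, here is a rough pure-Python sketch of min/max-based row group skipping. It is not how Polars implements pushdown; `row_groups_to_scan` is a hypothetical helper, and `loose_stats` assumes that every row group carries the same overall min/max (the issue only shows this for row group 0):

```python
def row_groups_to_scan(stats, target):
    """Return indices of row groups whose [min, max] range could contain `target`."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= target <= hi]


# One value per row group, as in the reproducible example.
tight_stats = [("a", "a"), ("b", "b"), ("c", "c")]  # per-row-group min/max
loose_stats = [("a", "c"), ("a", "c"), ("a", "c")]  # overall min/max repeated (assumed)

print(row_groups_to_scan(tight_stats, "b"))  # [1]
print(row_groups_to_scan(loose_stats, "b"))  # [0, 1, 2]
```

With tight statistics a filter such as `a == 'b'` needs to read only one row group; with the overall min/max in every row group, none can be ruled out.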

Installed versions

<pyarrow._parquet.Statistics object at 0x7ff7545b0a90>
  has_min_max: True
  min: a
  max: a
  null_count: 0
  distinct_count: None
  num_values: 1
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8
@deanm0000 deanm0000 added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer labels Sep 23, 2024
@deanm0000 deanm0000 added performance Performance issues or improvements P-medium Priority: medium A-io Area: reading and writing data and removed needs triage Awaiting prioritization by a maintainer labels Sep 23, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Sep 23, 2024
@deanm0000 deanm0000 added the rust Related to Rust Polars label Sep 23, 2024
@kszlim
Contributor

kszlim commented Sep 23, 2024

Would be interesting to see if the Parquet PageIndex is being written appropriately too.
