
Parquet scanner doesn't do predicate pushdown for categoricals/enums #18868

Open · deanm0000 opened this issue Sep 23, 2024 · 2 comments
Labels
A-io-parquet (Area: reading/writing Parquet files), bug (Something isn't working), P-medium (Priority: medium), performance (Performance issues or improvements), python (Related to Python Polars), rust (Related to Rust Polars)

Comments

@deanm0000 (Collaborator)

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.DataFrame([
    pl.Series('a', ['a', 'b', 'c'], pl.Categorical)
]).write_parquet('catpa.parquet', row_group_size=1, use_pyarrow=True)

with pl.Config(verbose=True):
    pl.scan_parquet("catpa.parquet").filter(pl.col('a') == 'a').collect()

Log output

parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.

Issue description

This is separate from but highly related to #18867. Even when using a file written by pyarrow where the statistics are correct, predicate pushdown doesn't work.

If I try to explicitly make the rhs a Categorical, then I simply don't get a verbose message at all, so I'm not sure whether pushdown is silently working or not.

with pl.Config(verbose=True):
    print(pl.scan_parquet("catpa.parquet").filter(pl.col('a')==pl.lit('a',pl.Categorical)).collect())

Even with a StringCache active, there is still no verbose output.

with pl.Config(verbose=True), pl.StringCache():
    print(pl.scan_parquet("catpa.parquet").filter(pl.col('a')==pl.lit('a',pl.Categorical)).collect())
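
One way to check whether the predicate actually reaches the scan (rather than being applied after the read) is to print the optimized plan; this is just a quick sketch using the same file as above, and the exact plan wording may differ between versions:

# If pushdown works, the predicate typically shows up inside the Parquet
# scan node of the optimized plan instead of as a separate FILTER node.
print(pl.scan_parquet("catpa.parquet").filter(pl.col('a') == pl.lit('a', pl.Categorical)).explain())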

Expected behavior

Row group / partition pruning should work for Categorical and Enum columns based on the Parquet statistics, just as it does for plain String columns.
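
For comparison, here is a sketch of the behaviour I would expect, using the same data written as a plain String column (the file name strpa.parquet is just illustrative, and the exact verbose wording may differ):

import polars as pl

# Same data, written as a plain String column instead of Categorical.
pl.DataFrame([
    pl.Series('a', ['a', 'b', 'c'], pl.String)
]).write_parquet('strpa.parquet', row_group_size=1, use_pyarrow=True)

with pl.Config(verbose=True):
    # With usable String statistics, the scanner should be able to skip the
    # row groups whose min/max range cannot contain 'a'.
    pl.scan_parquet("strpa.parquet").filter(pl.col('a') == 'a').collect()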

Installed versions

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.11.9 (main, Apr  6 2024, 17:59:24) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           0.3.2
deltalake            0.18.2
fastexcel            <not installed>
fsspec               2024.3.1
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@deanm0000 added the bug, python, and needs triage labels on Sep 23, 2024
@aberres (Contributor) commented Sep 23, 2024

The same happens with enum columns.
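
A minimal repro sketch for the Enum case, following the same pattern as above (the file name catpa_enum.parquet is just illustrative):

import polars as pl

# Same repro, but with an Enum dtype instead of Categorical.
pl.DataFrame([
    pl.Series('a', ['a', 'b', 'c'], pl.Enum(['a', 'b', 'c']))
]).write_parquet('catpa_enum.parquet', row_group_size=1, use_pyarrow=True)

with pl.Config(verbose=True):
    # Expected to show the same "statistics not sufficient for predicate"
    # messages as the Categorical case.
    pl.scan_parquet("catpa_enum.parquet").filter(pl.col('a') == 'a').collect()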

@deanm0000 changed the title from "Parquet scanner doesn't do predicate pushdown for categoricals" to "Parquet scanner doesn't do predicate pushdown for categoricals/enums" on Sep 23, 2024
@deanm0000 added the P-medium and A-io labels and removed the needs triage label on Sep 23, 2024
@github-project-automation bot moved this to Ready in Backlog on Sep 23, 2024
@deanm0000 added the performance and rust labels on Sep 23, 2024
@aberres (Contributor) commented Nov 29, 2024

Some numbers from a real-life example:

[Screenshot: query timings comparing the string-column and categorical-column runs]

What I am reading is a ~100 MB Parquet file representing a ~30 GB in-memory frame. The filter matches no rows.

In the first case, the "Assumption" column has been written as a string column; in the second case, as a categorical column. The data resides in a Google Cloud bucket in Singapore, so I/O is costly.

As we can see, the missing pushdown has a dramatic effect.

@alexander-beedie added the A-io-parquet label and removed the A-io label on Nov 29, 2024