
Parquet scanner doesn't do predicate pushdown for categoricals/enums #18868

Open · deanm0000 opened this issue Sep 23, 2024 · 2 comments
Labels
A-io-parquet (Area: reading/writing Parquet files), bug (Something isn't working), P-medium (Priority: medium), performance (Performance issues or improvements), python (Related to Python Polars), rust (Related to Rust Polars)

Comments

@deanm0000 (Collaborator)

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.DataFrame([
    pl.Series('a', ['a', 'b', 'c'], pl.Categorical)
]).write_parquet('catpa.parquet', row_group_size=1, use_pyarrow=True)

with pl.Config(verbose=True):
    pl.scan_parquet("catpa.parquet").filter(pl.col('a') == 'a').collect()

Log output

parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.

Issue description

This is separate from but highly related to #18867. Even when using a file written by pyarrow where the statistics are correct, predicate pushdown doesn't work.

If I try to explicitly make the rhs a Categorical, then I simply don't get a verbose message at all, so I'm not sure whether pushdown is silently working or not.

with pl.Config(verbose=True):
    print(pl.scan_parquet("catpa.parquet").filter(pl.col('a')==pl.lit('a',pl.Categorical)).collect())

Even with a StringCache active, there is still no verbose output.

with pl.Config(verbose=True), pl.StringCache():
    print(pl.scan_parquet("catpa.parquet").filter(pl.col('a')==pl.lit('a',pl.Categorical)).collect())
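
One way to check whether the predicate actually reaches the scan (rather than being applied after the read) is to print the optimized plan; this is just a quick sketch using the same file as above, and the exact plan wording may differ between versions:

# If pushdown works, the predicate typically shows up inside the Parquet
# scan node of the optimized plan instead of as a separate FILTER node.
print(pl.scan_parquet("catpa.parquet").filter(pl.col('a') == pl.lit('a', pl.Categorical)).explain())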

Expected behavior

Row group / partition pruning should work for Categorical and Enum columns based on the Parquet statistics, just as it does for plain String columns.
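
For comparison, here is a sketch of the behaviour I would expect, using the same data written as a plain String column (the file name strpa.parquet is just illustrative, and the exact verbose wording may differ):

import polars as pl

# Same data, written as a plain String column instead of Categorical.
pl.DataFrame([
    pl.Series('a', ['a', 'b', 'c'], pl.String)
]).write_parquet('strpa.parquet', row_group_size=1, use_pyarrow=True)

with pl.Config(verbose=True):
    # With usable String statistics, the scanner should be able to skip the
    # row groups whose min/max range cannot contain 'a'.
    pl.scan_parquet("strpa.parquet").filter(pl.col('a') == 'a').collect()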

Installed versions

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.11.9 (main, Apr  6 2024, 17:59:24) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           0.3.2
deltalake            0.18.2
fastexcel            <not installed>
fsspec               2024.3.1
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@deanm0000 added the bug, python, and needs triage labels on Sep 23, 2024
@aberres (Contributor) commented Sep 23, 2024

The same happens with enum columns.
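
A minimal repro sketch for the Enum case, following the same pattern as above (the file name catpa_enum.parquet is just illustrative):

import polars as pl

# Same repro, but with an Enum dtype instead of Categorical.
pl.DataFrame([
    pl.Series('a', ['a', 'b', 'c'], pl.Enum(['a', 'b', 'c']))
]).write_parquet('catpa_enum.parquet', row_group_size=1, use_pyarrow=True)

with pl.Config(verbose=True):
    # Expected to show the same "statistics not sufficient for predicate"
    # messages as the Categorical case.
    pl.scan_parquet("catpa_enum.parquet").filter(pl.col('a') == 'a').collect()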

@deanm0000 changed the title from "Parquet scanner doesn't do predicate pushdown for categoricals" to "Parquet scanner doesn't do predicate pushdown for categoricals/enums" on Sep 23, 2024
@deanm0000 added the P-medium and A-io labels and removed the needs triage label on Sep 23, 2024
@github-project-automation bot moved this to Ready in Backlog on Sep 23, 2024
@deanm0000 added the performance and rust labels on Sep 23, 2024
@aberres (Contributor) commented Nov 29, 2024

Some numbers from a real-life example:

[Screenshot: query timings comparing the string-column and categorical-column runs]

What I am reading is a ~100 MB Parquet file representing a ~30 GB in-memory frame. The filter matches no rows.

In the first case, the "Assumption" column has been written as a string column; in the second case, as a categorical column. The data resides in a Google Cloud bucket in Singapore, so I/O is costly.

As we can see, the missing pushdown has a dramatic effect.

@alexander-beedie added the A-io-parquet label and removed the A-io label on Nov 29, 2024