Inconsistency between pl.all_horizontal() and pl.Expr.and_() when reading parquet written by LazyFrame.sink_parquet() #21204

Closed
2 tasks done
jkc1 opened this issue Feb 11, 2025 · 3 comments · Fixed by #21310
Labels: A-io-parquet (Area: reading/writing Parquet files) · accepted (Ready for implementation) · bug (Something isn't working) · P-medium (Priority: medium) · python (Related to Python Polars)

Comments


jkc1 commented Feb 11, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"dataset": ["train"]*100 + ["valid"]*100})
df.lazy().sink_parquet("sink.parquet")
df.write_parquet("write.parquet")
lf_sink = pl.scan_parquet("sink.parquet")
lf_write = pl.scan_parquet("write.parquet")

sink_all = lf_sink.filter(pl.all_horizontal(pl.col("dataset")=="train")).collect()
sink_lit = lf_sink.filter(pl.lit(True).and_(pl.col("dataset")=="train")).collect()

write_all = lf_write.filter(pl.all_horizontal(pl.col("dataset")=="train")).collect() 
write_lit = lf_write.filter(pl.lit(True).and_(pl.col("dataset")=="train")).collect()

print(f"{sink_all=}")
print(f"{sink_lit=}")
print(f"{write_all=}")
print(f"{write_lit=}")

Log output

$ POLARS_VERBOSE=1 python repro.py 
try_get_writeable: local: sink.parquet (canonicalize: Ok("/Users/.../Desktop/polars_bug/sink.parquet"))
RUN STREAMING PIPELINE
[df -> parquet_sink]
try_get_writeable: local: write.parquet (canonicalize: Ok("/Users/.../Desktop/polars_bug/write.parquet"))
parquet scan with parallel = Prefiltered
parquet live columns = 1, dead columns = 0
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet row group must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
parquet row group must be read, statistics not sufficient for predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet scan with parallel = Prefiltered
parquet live columns = 1, dead columns = 0
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet row group can be skipped, the statistics were sufficient to apply the predicate.
parquet scan with parallel = None
parquet row group must be read, statistics not sufficient for predicate.
parquet scan with parallel = None
parquet row group must be read, statistics not sufficient for predicate.
sink_all=shape: (100, 1)
┌─────────┐
│ dataset │
│ ---     │
│ str     │
╞═════════╡
│ train   │
│ train   │
│ train   │
│ train   │
│ train   │
│ …       │
│ train   │
│ train   │
│ train   │
│ train   │
│ train   │
└─────────┘
sink_lit=shape: (0, 1)
┌─────────┐
│ dataset │
│ ---     │
│ str     │
╞═════════╡
└─────────┘
write_all=shape: (100, 1)
┌─────────┐
│ dataset │
│ ---     │
│ str     │
╞═════════╡
│ train   │
│ train   │
│ train   │
│ train   │
│ train   │
│ …       │
│ train   │
│ train   │
│ train   │
│ train   │
│ train   │
└─────────┘
write_lit=shape: (100, 1)
┌─────────┐
│ dataset │
│ ---     │
│ str     │
╞═════════╡
│ train   │
│ train   │
│ train   │
│ train   │
│ train   │
│ …       │
│ train   │
│ train   │
│ train   │
│ train   │
│ train   │
└─────────┘

Issue description

Logically, filtering with pl.all_horizontal(*exprs) and the more awkward pl.lit(True).and_(*exprs) should produce the same result.

After much debugging, I've found that they do not, specifically when the underlying data is a LazyFrame scanned from a parquet file that was previously written with LazyFrame.sink_parquet().

Expected behavior

The behavior for all four cases should be the same.

Installed versions

--------Version info---------
Polars:              1.22.0
Index type:          UInt32
Platform:            macOS-15.3-arm64-arm-64bit
Python:              3.12.8 (main, Dec  4 2024, 09:49:16) [Clang 16.0.0 (clang-1600.0.26.4)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                <not installed>
openpyxl             <not installed>
pandas               <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>

@jkc1 jkc1 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 11, 2025
cmdlineluser (Contributor) commented Feb 11, 2025

Can reproduce.

Just some notes in case they are of relevance:

It seems to be non-deterministic for me.

lf_sink.filter(pl.lit(True).and_(pl.col("dataset")=="train")).collect()
# shape: (0, 1)
# ┌─────────┐
# │ dataset │
# │ ---     │
# │ str     │
# ╞═════════╡
# └─────────┘
lf_sink.filter(pl.lit(True).and_(pl.col("dataset")=="train")).collect()
# shape: (100, 1)
# ┌─────────┐
# │ dataset │
# │ ---     │
# │ str     │
# ╞═════════╡
# │ train   │
# │ train   │
# │ train   │
# │ train   │
# │ train   │
# │ …       │

The execution plan seems to change randomly.

lf_sink.filter(pl.lit(True).and_(pl.col("dataset")=="train")).explain()
# 'Parquet SCAN [sink.parquet]\nPROJECT */1 COLUMNS\nSELECTION: [(col("dataset")) == (String(train))]'
lf_sink.filter(pl.lit(True).and_(pl.col("dataset")=="train")).explain()
# 'Parquet SCAN [sink.parquet]\nPROJECT */1 COLUMNS\nSELECTION: [(true) & ([(col("dataset")) == (String(train))])]'

.collect(predicate_pushdown=False) always produces the correct result.

@coastalwhite coastalwhite added A-io-parquet Area: reading/writing Parquet files P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Feb 12, 2025
jkc1 (Author) commented Feb 12, 2025

Thanks for confirming. I can also verify that this is new as of 1.22.0, as I cannot reproduce it on 1.21.0.

coastalwhite (Collaborator) commented Feb 18, 2025

Here is an even smaller MRE (minimal reproducible example). It doesn't have anything to do with the streaming engine.

import polars as pl
from polars.testing import assert_frame_equal
import io

f = io.BytesIO()

df = pl.DataFrame({ 'a': [1] })
df.write_parquet(f)
f.seek(0)
lf = pl.scan_parquet(f).filter(pl.lit(True))
assert_frame_equal(lf.collect(), df)
