scan_parquet filter/select optimisation error #20175
Comments
I experienced this issue too, with the same pattern. As a workaround I moved the
import polars as pl
lf = pl.scan_parquet("/tmp/files/data*.parquet", rechunk=True)
lf = lf.select(
pl.col(["c0", "c9", "c3"]).filter(
pl.col("c0") >= pl.date(2014, 1, 15),
pl.col("c0") <= pl.date(2014, 1, 20),
)
)
print(lf.collect())
# shape: (8, 3)
# ┌────────────┬─────────┬─────┐
# │ c0 ┆ c9 ┆ c3 │
# │ --- ┆ --- ┆ --- │
# │ date ┆ str ┆ i64 │
# ╞════════════╪═════════╪═════╡
# │ 2014-01-16 ┆ OZWC9XE ┆ 0 │
# │ 2014-01-15 ┆ ODJ8EAE ┆ 411 │
# │ 2014-01-16 ┆ OGDLJJE ┆ 411 │
# │ 2014-01-15 ┆ PXLUJGE ┆ 53 │
# │ 2014-01-15 ┆ S893KLE ┆ 559 │
# │ 2014-01-20 ┆ UL3ZEUE ┆ 557 │
# │ 2014-01-17 ┆ W3JZ3TE ┆ 411 │
# │ 2014-01-15 ┆ WF3ZMKE ┆ 559 │
# └────────────┴─────────┴─────┘
Can anybody else repro @peterbuecker-form3's MWE #19944 (comment)? It still errors for me after the fix.
import polars as pl
n = 5_000_000
col1_type = pl.Int8
col2_type = pl.Int8
data = {
'col1': [0] * n,
'col2': [0] * n
}
df1 = pl.DataFrame(data, schema={
'col1': col1_type,
'col2': col2_type,
})
df2 = pl.DataFrame(data, schema={
'col2': col2_type,
'col1': col1_type,
})
df1.write_parquet('df1.parquet')
df2.write_parquet('df2.parquet')
df1 = pl.scan_parquet('df1.parquet')
df2 = pl.scan_parquet('df2.parquet')
df = pl.concat([df1, df2], how='diagonal_relaxed')
df.filter(pl.col('col1') >= 0).collect()
# ShapeError: unable to vstack, column names don't match: "col1" and "col2"
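To make the error message concrete, here is a toy model in pure Python of a row-wise stack that matches columns by position. It is only an illustration under stated assumptions: the `vstack` and `align` helpers and the `ShapeError` class below are hypothetical stand-ins written for this note, not Polars's actual implementation.

```python
class ShapeError(Exception):
    """Stand-in for polars.exceptions.ShapeError (toy model only)."""


def vstack(top, bottom):
    """Stack two toy frames row-wise.

    A toy frame is a list of (column_name, values) pairs. Columns are
    matched by position, so the names must agree position by position.
    """
    if len(top) != len(bottom):
        raise ShapeError("column counts differ")
    stacked = []
    for (name_a, vals_a), (name_b, vals_b) in zip(top, bottom):
        if name_a != name_b:
            # Same wording as the error reported in this thread.
            raise ShapeError(
                f'unable to vstack, column names don\'t match: "{name_a}" and "{name_b}"'
            )
        stacked.append((name_a, vals_a + vals_b))
    return stacked


def align(frame, names):
    """Reorder a toy frame's columns to the requested name order."""
    lookup = dict(frame)
    return [(name, lookup[name]) for name in names]


chunk_a = [("col1", [0]), ("col2", [1])]
chunk_b = [("col2", [2]), ("col1", [3])]  # same columns, different order

try:
    vstack(chunk_a, chunk_b)
except ShapeError as exc:
    print(exc)

# Aligning by name before stacking avoids the positional mismatch.
print(vstack(chunk_a, align(chunk_b, ["col1", "col2"])))
```

In this model the failure only depends on column order, which matches the symptom in the MWE above: both files hold the same columns, written in a different order.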
@cmdlineluser I've just compiled from abbad69, and the MWE results in:
$ python3 repro.py
--------Version info---------
Polars: 1.16.0
Index type: UInt32
Platform: macOS-15.1.1-arm64-arm-64bit
Python: 3.12.6 (main, Sep 6 2024, 19:03:47) [Clang 15.0.0 (clang-1500.3.9.4)]
LTS CPU: False
----Optional dependencies----
adbc_driver_manager 1.3.0
altair 5.5.0
boto3 1.35.36
cloudpickle 3.1.0
connectorx 0.4.0
deltalake 0.22.3
fastexcel 0.12.0
fsspec 2024.10.0
gevent 24.11.1
google.auth 2.36.0
great_tables 0.14.0
matplotlib 3.9.3
nest_asyncio 1.6.0
numpy 2.0.2
openpyxl 3.1.5
pandas 2.2.3
pyarrow 18.1.0
pydantic 2.10.3
pyiceberg <not installed>
sqlalchemy 2.0.36
torch <not installed>
xlsx2csv 0.8.4
xlsxwriter 3.2.0
Traceback (most recent call last):
File "/Users/me/go/src/github.com/pola-rs/polars/repro.py", line 31, in <module>
df.filter(pl.col('col1') >= 0).collect()
File "/Users/me/go/src/github.com/pola-rs/polars/py-polars/polars/lazyframe/frame.py", line 2030, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ShapeError: unable to vstack, column names don't match: "col1" and "col2"
Taking a look.
Doesn't reproduce for me anymore.
@peterbuecker-form3 It no longer reproduces for me after #20189
All of the examples mentioned in both issues now run for me without error. Thanks all <3
Thanks a lot @coastalwhite @ritchie46 @nameexhaustion @cmdlineluser, that was a very quick fix ❤️ Can confirm the issue is solved in v1.17.0 🥇!
Reproducible example
Unzip the attached sample files; there are three ("data0.parquet", "data1.parquet", "data2.parquet"), each with only 8 rows:
parquet_test_files.zip
Update the directory in the below scan_parquet call to point to the unzipped parquet:
Log output
Issue description
Applying the same query plan to the files individually shows the following results:
Note that the requested select column order is correct for the first/last files (which return no results), but is incorrect for the middle file (which contains all of the matching rows). This points to the cause of the ShapeError, as the internal vstack call receives misaligned columns.
Modifying the filter in any way (omitting it, removing one of the two conditions, or replacing it with an is_between) causes the query to succeed (though using is_between still fails on the real, much larger data).
Omitting the select or disabling predicate_pushdown also causes the query to succeed.
Files were written by Polars, using sink_parquet.
Expected behavior
Return the results without raising a ShapeError.
Note
I think this represents a clean MWE of #19944.
Installed versions
Latest compiled head branch.