Getting panic when calling LazyFrame.group_by().map_groups and intermittent panic when calling LazyFrame.columns
#16385
Comments
Seems like it was introduced sometime between 0.20.22 -> 0.20.23.
Is this the cause of the issue?
I managed to reproduce it with:

```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import tempfile
import pandas as pd
import os

# Parameters
num_records = 1000
num_ids = 10

# Generate random data
data = {
    'some_id': np.random.randint(0, num_ids, num_records),
    'a': np.random.rand(num_records),
    'b': np.random.rand(num_records),
    'c': np.random.rand(num_records)
}

# Convert to a Pandas DataFrame
df = pd.DataFrame(data)

# Convert Pandas DataFrame to a PyArrow Table
table = pa.Table.from_pandas(df)

# Use a temporary directory for output
with tempfile.TemporaryDirectory() as output_dir:
    # Write table to Parquet files partitioned by 'some_id'
    pq.write_to_dataset(
        table,
        root_path=output_dir,
        partition_cols=['some_id']
    )
    print(f"Data generation and partitioning complete. Files are stored in {output_dir}")
    print(os.listdir(output_dir))

    # Scan the hive-partitioned dataset, filter on the partition column,
    # then group_by().map_groups() and collect
    ldf = pl.scan_parquet(f"{output_dir}/**/*.parquet")
    df = ldf.filter(pl.col("some_id").is_in([0, 1, 2, 3])).group_by("some_id").map_groups(
        lambda df: df,
        schema=None
    ).collect()
    print(df)
```

Not sure if the repro is exactly the same as the columns issue, but I'm guessing it's likely related.
I can replicate the error. Out of interest, I removed the pandas/numpy stuff from your example just to rule them out as potential issues:

```python
import tempfile

import polars as pl
import pyarrow.parquet as pq

with tempfile.TemporaryDirectory() as output_dir:
    pq.write_to_dataset(
        pl.DataFrame({"some_id": 0, "a": 1}).to_arrow(),
        root_path=output_dir,
        partition_cols=["some_id"]
    )
    ldf = pl.scan_parquet(f"{output_dir}/**/*.parquet")
    (ldf.filter(pl.col("a").is_in(0))
        .group_by("a")
        .map_groups(lambda df: df, schema=None))

# thread '<unnamed>' panicked at crates/polars-plan/src/logical_plan/optimizer/predicate_pushdown/mod.rs:356:69:
# called `Option::unwrap()` on a `None` value
```
I've tested your example with this little change and it works. It's just a one-line check that hive_partition_eval is indeed Some, getting rid of the unwrap call. But I don't dare to open a pull request since I have no clue what the root cause of this is.
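For illustration only, here is a minimal sketch of the kind of guard being described, assuming the panicking spot boils down to unconditionally unwrapping an optional hive_partition_eval callback; the function and types below are placeholders, not the actual polars-plan code:

```rust
// Sketch of the pattern: instead of `hive_partition_eval.unwrap()(...)`,
// which panics when the option is `None` (as in the reported
// predicate_pushdown panic), only call the evaluator when it exists.
// `fn(i32) -> bool` is a stand-in type, not the real signature.
fn apply_hive_eval(hive_partition_eval: Option<fn(i32) -> bool>, value: i32) -> bool {
    if let Some(eval) = hive_partition_eval {
        // Hive partition information is available: use it as before.
        eval(value)
    } else {
        // No hive partition evaluator: keep the predicate instead of panicking.
        true
    }
}

fn is_even(v: i32) -> bool {
    v % 2 == 0
}

fn main() {
    // Without hive partition info, the guarded version no longer panics.
    assert!(apply_hive_eval(None, 42));
    // With an evaluator present, behaviour is unchanged.
    assert!(apply_hive_eval(Some(is_even), 42));
}
```

Whether simply skipping the evaluation is the correct behaviour, rather than masking missing hive partition info, is exactly the open question in the replies below.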
I'm guessing your fix just entirely ignores the hive partitioning? I.e. if it's None, it's just not considered at all? @ritchie46 might know why it's happening; I'm 60% sure it's related to what I posted earlier.
Not 100% sure if this is the case, but I believe this gets fixed by #16549 (notably the removal of Default::default() for the hive partition info). I've compiled the latest main, the repro no longer panics, and my full repro case seems to print appropriately.
Fixed by #16549.
Checks
Reproducible example
n/a
Log output
```
--------Version info---------
Polars: 0.20.27
Index type: UInt32
Platform: Linux-5.10.216-182.855.x86_64-x86_64-with-glibc2.26
Python: 3.11.7 (main, Dec 5 2023, 22:00:36) [GCC 7.3.1 20180712 (Red Hat 7.3.1-17)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle: 3.0.0
connectorx:
deltalake:
fastexcel:
fsspec: 2024.5.0
gevent:
hvplot:
matplotlib: 3.8.4
nest_asyncio: 1.6.0
numpy: 1.26.4
openpyxl:
pandas: 2.2.2
pyarrow: 16.1.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```