Panic on polars.scan_parquet().filter().columns #16147

Closed
2 tasks done
jmakov opened this issue May 10, 2024 · 9 comments
Labels
bug (Something isn't working), python (Related to Python Polars)

Comments

jmakov commented May 10, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars

lf = polars.scan_parquet(path)  # path: glob over a partitioned parquet dataset
lf.columns  # works
lf.filter(polars.col("col1") > 123).columns  # panics

Log output

thread 'python' panicked at crates/polars-plan/src/logical_plan/optimizer/predicate_pushdown/mod.rs:359:69:
called `Option::unwrap()` on a `None` value

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[10], line 1
----> 1 _lf.filter(polars.col("ts_event")>123).columns

File ~/mambaforge-pypy3/envs/quantlab/lib/python3.11/site-packages/polars/lazyframe/frame.py:411, in LazyFrame.columns(self)
    394 @property
    395 def columns(self) -> list[str]:
    396     """
    397     Get column names.
    398 
   (...)
    409     ['foo', 'bar']
    410     """
--> 411     return self._ldf.columns()

PanicException: called `Option::unwrap()` on a `None` value

Issue description

Panic on polars.scan_parquet().filter().columns. Also why are you calling unwrap() in production code?

Expected behavior

Printout of columns

Installed versions

Polars:               0.20.25
Index type:           UInt32
Platform:             Linux-6.6.25-1-MANJARO-x86_64-with-glibc2.39
Python:               3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           0.3.2
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               24.2.1
hvplot:               0.10.0
matplotlib:           3.7.5
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              14.0.2
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.30
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
jmakov added the bug, needs triage, and python labels on May 10, 2024
@ritchie46
Member

Have you got a repro with a dummy file?

@jmakov
Author

jmakov commented May 10, 2024

No. I tried the following, but it works, so I'd have to dig a bit deeper to find what triggers the problem:

# this works
import polars

df = polars.DataFrame({"col1": range(10)})
df.write_parquet("test.parquet")

lf = polars.scan_parquet("test.parquet")
lf.filter(polars.col("col1") > 3).columns

The original parquet dataset is partitioned. And this fails:

import os

# parquet dataset structured as /mnt/data/schema_name/partition1/partition2/year/month.parquet
# var1 and var2 hold the partition values being selected
lf = polars.scan_parquet(os.sep.join(["/mnt/data",
                                      "some_schema",
                                      f"partition1={var1}",
                                      f"partition2={var2}",
                                      "*", "*.parquet"]))
lf.filter(polars.col("col1") > 123).columns  # panics
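The directory layout described above follows the hive convention of `key=value` path segments. The sketch below shows, with the standard library only, how such a glob pattern can be assembled and which segments carry partition keys. It does not reproduce the panic and does not call Polars. The helper names `build_partition_glob` and `parse_hive_keys` and the partition values `"a"` and `"b"` are hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath


def build_partition_glob(root, schema, partitions):
    """Join root/schema/key=value/.../*/*.parquet into one glob pattern."""
    segments = [f"{key}={value}" for key, value in partitions.items()]
    return str(PurePosixPath(root, schema, *segments, "*", "*.parquet"))


def parse_hive_keys(path):
    """Return the key=value directory segments of a path as a dict."""
    keys = {}
    for segment in PurePosixPath(path).parts:
        if "=" in segment:
            key, _, value = segment.partition("=")
            keys[key] = value
    return keys


# Hypothetical partition values standing in for var1 and var2.
pattern = build_partition_glob(
    "/mnt/data", "some_schema", {"partition1": "a", "partition2": "b"}
)
print(pattern)
print(parse_hive_keys(pattern))
```

The year directory and the month file name contain no `=`, so they are not treated as partition keys here; only the two `key=value` segments are.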

ritchie46 added the incomplete label and removed the needs triage label on May 14, 2024
@ATL2001

ATL2001 commented May 15, 2024

@jmakov any chance you could try polars 0.20.19 and see if your parquet file works with that version? I recently upgraded from that version to 0.20.25 and am now hitting the same issue you're seeing (using the same parquet files that worked with 0.20.19). Unfortunately, I'm having a hard time creating an MRE.

@jmakov
Author

jmakov commented May 15, 2024

@ATL2001 thanks for the tip. You're right, 0.20.19 works. I also had a hard time investigating and creating an MRE, and I don't have enough time for that. But at least we now know it's a regression. Thanks!

@jmakov
Author

jmakov commented May 24, 2024

Still present in version 0.20.29

@cmdlineluser
Contributor

There is a minimal repro here for a different issue:

But it is also about partitioned datasets, and the same error.

It may be the same underlying problem as described here.

@kszlim
Contributor

kszlim commented May 28, 2024

My repro for this seems to have been fixed on main.

Not 100% sure if this is the case, but I believe this gets fixed by #16549 (notably the removal of Default::default() for the hive partition info).

I have integration tests in my code which encountered this exact bug and it seems to have been fixed when I compiled main too.

@ritchie46
Member

> Not 100% sure if this is the case, but I believe this gets fixed by #16549 (notably the removal of Default::default() for the hive partition info).

Yes, that's the case.

@ATL2001

ATL2001 commented Jun 5, 2024

Thanks everyone! I just upgraded to 0.20.31, and the panic is gone! 😀
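The version reports scattered through this thread can be collected into a small lookup, sketched below. Only the four versions that commenters actually tested are recorded; every other version is returned as "unknown" rather than guessed. The names `REPORTS`, `parse_version`, `reported_status`, and `first_reported_fixed` are hypothetical helpers, not part of Polars.

```python
# Statuses as reported by commenters in this thread; nothing else is assumed.
REPORTS = {
    "0.20.19": "works",
    "0.20.25": "panics",
    "0.20.29": "panics",
    "0.20.31": "works",
}


def parse_version(text):
    """Turn '0.20.25' into (0, 20, 25) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def reported_status(version):
    """Return the status reported for a version, or 'unknown' if untested."""
    return REPORTS.get(version, "unknown")


def first_reported_fixed(reports):
    """Return the lowest working version above every panicking one, if any."""
    affected = [parse_version(v) for v, s in reports.items() if s == "panics"]
    if not affected:
        return None
    newest_affected = max(affected)
    fixed = [
        parse_version(v)
        for v, s in reports.items()
        if s == "works" and parse_version(v) > newest_affected
    ]
    return min(fixed) if fixed else None


print(reported_status("0.20.25"))
print(reported_status("0.20.30"))
print(first_reported_fixed(REPORTS))
```

Comparing integer tuples instead of raw strings avoids mis-ordering versions such as 0.20.9 and 0.20.31. The result is only the first release reported as fixed here; the release that actually shipped the fix may be earlier.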

5 participants