LazyFrame properties now require significant compute - update API to reflect this #16328

Closed · 2 tasks done
dmeekpmg opened this issue May 20, 2024 · 12 comments · Fixed by #16964
Labels: A-api (changes to the public API) · accepted (Ready for implementation) · python (Related to Python Polars)

@dmeekpmg commented May 20, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

#%%
import polars as pl
df = pl.LazyFrame({'a': [1, 2, 3]})
#%%
%%timeit
df.columns
  • v0.20.21: 115 ns ± 1.61 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
  • v0.20.23: 1 µs ± 8.75 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
  • v0.20.26: 1.04 µs ± 39.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Log output

No response

Issue description

LazyFrame.columns is much slower as of version 0.20.23; in the simple example above, it is nearly 10x slower than on 0.20.21.
In my more realistic data pipeline, where I use this property to determine the common columns between two dataframes, the calculation accounts for 92% of my runtime on v0.20.23 versus only 0.3% on v0.20.21.

Performance of DataFrame.columns appears unchanged.
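
For reference, a minimal sketch of the common-columns pattern described above (the frames here are illustrative stand-ins for the real pipeline):

import polars as pl

lf_a = pl.LazyFrame({"a": [1], "b": [2], "c": [3]})
lf_b = pl.LazyFrame({"b": [4], "c": [5], "d": [6]})

# Every access to .columns resolves the plan's schema, so each column
# list is read once and reused rather than touching the property
# repeatedly inside the pipeline.
cols_b = set(lf_b.columns)
common = [name for name in lf_a.columns if name in cols_b]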

Expected behavior

Performance of .columns is unchanged

Installed versions

--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.12.2 (tags/v3.12.2:6abddd9, Feb  6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           0.3.3
deltalake:            <not installed>
fastexcel:            0.10.4
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               1.0.10
sqlalchemy:           2.0.30
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
dmeekpmg added the bug, needs triage, and python labels on May 20, 2024
alexander-beedie added the performance label on May 20, 2024
@ritchie46 (Member)

Yes, it is.

columns needs to resolve the logical plan, which is actually quite expensive. I think we should deprecate columns and schema as properties and add them as methods to make it clearer that they do non-trivial compute.

ritchie46 removed the bug and needs triage labels on May 20, 2024
stinodego added this to the 1.0.0 milestone on May 20, 2024
stinodego added the A-api and accepted labels and removed the performance label on May 20, 2024
github-project-automation bot moved this to Ready in Backlog on May 23, 2024
@stinodego (Member) commented May 23, 2024

All LazyFrame properties now require significant compute. They will be replaced by methods to reflect this:

Before             | After
-------------------|----------------------------
LazyFrame.columns  | ...
LazyFrame.dtypes   | ...
LazyFrame.schema   | LazyFrame.collect_schema()
LazyFrame.width    | ...

The LazyFrame properties will be deprecated.

To facilitate writing code that is generic over DataFrames and LazyFrames, the same methods will be added to DataFrame.
The DataFrame properties will not be deprecated, as these do not require significant compute.
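
For a sense of the generic code this is meant to enable, a rough sketch assuming a collect_schema() method with a names() accessor is available on both frame types (the shared_columns helper is just an example):

import polars as pl

def shared_columns(
    left: pl.DataFrame | pl.LazyFrame,
    right: pl.DataFrame | pl.LazyFrame,
) -> list[str]:
    # collect_schema() exists on both frame types, so the same code
    # path handles eager and lazy inputs.
    right_names = set(right.collect_schema().names())
    return [name for name in left.collect_schema().names() if name in right_names]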

We will include documentation in the deprecation message / docstrings to explain the reason for this change.

stinodego changed the title from "Degradation in performance of LazyFrame.columns" to "LazyFrame properties now require significant compute - update API to reflect this" on May 23, 2024
stinodego self-assigned this on May 24, 2024
stinodego moved this from Ready to Next in Backlog on May 26, 2024
stinodego added the needs decision and accepted labels and removed the accepted and needs decision labels on May 27, 2024
@stinodego (Member) commented May 27, 2024

We have decided to add a .collect_schema() method to both DataFrame and LazyFrame.

The LazyFrame.schema property will remain, but it will throw a PerformanceWarning and tell you to use .collect_schema.
The DataFrame.schema property will remain and throw no warnings.

With regards to the other properties: these will be handled in a second step (most likely post-1.0.0).

collect_schema will be updated to return a proper Schema object with accessors for the column names, dtypes, and width. The LazyFrame properties will throw a PerformanceWarning and point users to the corresponding accessors on the Schema object.
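
A sketch of resolving the schema once and reusing it, rather than resolving the plan on every property access (accessor names here follow the Schema object as released: names(), dtypes(), len()):

import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

schema = lf.collect_schema()  # resolves the plan's schema once
schema.names()                # ['a', 'b']
schema.dtypes()               # [Int64, String]
schema.len()                  # 2 (the width)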

@deanm0000 (Collaborator)

Just stumbled on this, could you add in .collect_schema_async too please?

@eitsupi (Contributor) commented Jun 12, 2024

Just to confirm, this is not as expensive as collect (or fetch), right?

@alexander-beedie (Collaborator)

> Just to confirm, this is not as expensive as collect (or fetch), right?

It is not; this is the part of the compute required to establish the query plan and track schema evolution through it - however, it will not execute that plan 👍
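
In other words (a minimal sketch; the query below is only an example), collect_schema() walks the plan to determine the output columns and dtypes, while only collect() runs it:

import polars as pl

lf = (
    pl.LazyFrame({"a": [1, 2, 3]})
    .with_columns(b=pl.col("a") * 2)
    .filter(pl.col("a") > 1)
    .group_by("b")
    .agg(pl.len())
)

print(lf.collect_schema())  # resolves {'b': Int64, 'len': UInt32} without running the query
result = lf.collect()       # only this executes the plan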

@stinodego (Member)

> Just stumbled on this, could you add in .collect_schema_async too please?

Please make a separate issue for that request. I will not add it as part of the initial feature.

@etiennebacher (Contributor)

Maybe it's a bit late to raise this concern but the name sounds weird for DataFrame. For LazyFrame we can clearly see the collect() / collect_schema() / collect_all() similarity but DataFrame doesn't have collect() so having a collect_schema() is strange.

Isn't get_schema() better?

@stinodego (Member) commented Jun 13, 2024

It sounds a bit strange indeed, but it has to be added to DataFrame to facilitate writing generic code that works for both DataFrame/LazyFrame.

There is a difference, because for DataFrame the schema just exists, you only have to get it, while for LazyFrame there is compute involved. Whichever name you come up with is going to be a poor fit for one of the two.

For DataFrame it sounds a bit strange, but you don't generally have to use it. Just use .schema.

@mcrumiller (Contributor)

> writing generic code that works for both DataFrame/LazyFrame

Does this mean a DataFrame.collect() no-op should exist?

@stinodego (Member)

> Does this mean a DataFrame.collect() no-op should exist?

So far we have decided against that:
#7882 (comment)

Not sure what that would mean for a DataFrame.collect_schema.
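
As an aside, one pattern that already works without a DataFrame.collect() no-op is normalizing to lazy via .lazy(); the helper below is only an illustrative sketch, not an endorsed convention.

import polars as pl

def row_count(frame: pl.DataFrame | pl.LazyFrame) -> int:
    # .lazy() is a cheap conversion on a DataFrame and essentially a
    # no-op on a LazyFrame, so the rest of the function can be written
    # purely against the lazy API.
    return frame.lazy().select(pl.len()).collect().item()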

@JacobSantry

This performance regression has significantly impacted my use case.

I’ve observed the regression starting between versions 0.20.21 and 0.20.22, and it persists through the latest 1.8.2 when using the .collect_schema().names() syntax.

Could you help me understand why the performance cost of resolving the logical plan for columns only became an issue after this update? Previously the evaluation was presumably just as necessary, yet it still performed well.

While I agree with updating the API to reflect this cost, I’m curious about the reasons behind the increased cost. Is it possible to cache the schema, provided the logical plan has not changed?
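
For illustration, a user-side cache of the resolved names looks roughly like this (the frame and names here are just stand-ins):

import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Resolve the schema once; as long as `lf` itself is not rebuilt, the
# cached names stay valid and no further plan resolution is needed.
names = lf.collect_schema().names()

subset = lf.select([c for c in names if c != "b"])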
