LazyFrame properties now require significant compute - update API to reflect this #16328
Comments
Yes. It is.
All LazyFrame properties now require significant compute. They will be replaced by methods to reflect this:
The LazyFrame properties will be deprecated. To facilitate writing code that is generic for DataFrames and LazyFrames, DataFrames will have the same methods added. We will include documentation in the deprecation message / docstrings to explain the reason for this change.
We have decided to add a The With regards to the other properties: these will be handled in a second step (most likely post-1.0.0).
Just stumbled on this, could you add in
Just to confirm, this is not as expensive as
It is not. This is only the compute required to establish the query plan and track schema evolution through it; it will not execute that plan 👍
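To make the distinction concrete, here is a toy model of schema resolution. It is not how polars implements its planner; the node classes (`Scan`, `WithColumn`, `Select`) and `resolve_schema` are hypothetical names invented for this sketch. The point it illustrates is that the output column names can be derived by walking the plan alone, so the cost grows with the size of the plan rather than with the size of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """Leaf node: a data source that declares its column names."""
    columns: tuple


@dataclass(frozen=True)
class WithColumn:
    """Adds (or overwrites) one column on top of its input."""
    source: object
    name: str


@dataclass(frozen=True)
class Select:
    """Keeps only the listed columns of its input."""
    source: object
    keep: tuple


def resolve_schema(plan):
    """Return the output column names of `plan` without reading any rows."""
    if isinstance(plan, Scan):
        return list(plan.columns)
    if isinstance(plan, WithColumn):
        names = resolve_schema(plan.source)
        return names if plan.name in names else names + [plan.name]
    if isinstance(plan, Select):
        available = resolve_schema(plan.source)
        missing = [c for c in plan.keep if c not in available]
        if missing:
            raise KeyError(f"columns not found: {missing}")
        return list(plan.keep)
    raise TypeError(f"unknown plan node: {plan!r}")


# Hypothetical three-step plan: scan (a, b), add c, keep (a, c).
plan = Select(WithColumn(Scan(("a", "b")), "c"), ("a", "c"))
print(resolve_schema(plan))  # ['a', 'c']
```

Every call repeats the whole walk, which is why asking for the schema repeatedly on a deep plan adds up even though no data is ever processed.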
Please make a separate issue for that request. I will not add it as part of the initial feature.
Maybe it's a bit late to raise this concern but the name sounds weird for Isn't
It sounds a bit strange indeed, but it has to be added to DataFrame to facilitate writing generic code that works for both DataFrame/LazyFrame. There is a difference, because for DataFrame the schema just exists, you only have to get it. For LazyFrame there is compute involved. Whichever name you come up with is going to be inappropriate for either of the two. For DataFrame it sounds a bit strange, but you don't generally have to use it. Just use
Does this mean a
So far we have decided against that: Not sure what that would mean for a
This performance regression has significantly impacted my use case. I’ve observed the regression between versions Could you help me understand why the performance cost of resolving the logical plan for columns only became an issue after this update? Previously, it seemed that the evaluation was necessary but still performed as expected. While I agree with updating the API to reflect this cost, I’m curious about the reasons behind the increased cost. Is it possible to cache the schema, provided the logical plan has not changed?
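On the caching question, a user-side workaround can be sketched as follows. This is not something the library is stated to do in this thread; `CachedSchema` is a hypothetical helper, and it assumes polars 1.0 or later, where frames expose `collect_schema()` and the returned schema has a `names()` method. It also assumes the wrapped frame object is never swapped out: since each transformation returns a new frame, one wrapper corresponds to one unchanged plan.

```python
class CachedSchema:
    """Wrap one frame and resolve its column names at most once.

    Hypothetical helper: relies only on `frame.collect_schema().names()`.
    """

    def __init__(self, frame):
        self.frame = frame
        self._names = None  # filled on first access

    def names(self):
        if self._names is None:
            # The only place the (potentially expensive) resolution happens.
            self._names = list(self.frame.collect_schema().names())
        return self._names
```

After transforming the frame, build a new `CachedSchema` around the result rather than reusing the old one, because the cached names describe the earlier plan.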
Checks
Reproducible example
Log output
No response
Issue description
LazyFrame.columns is much slower as of version 0.20.23. In my simple example above, 0.20.23 is nearly 10x slower than 0.20.21.
In my more realistic data pipeline, where I use this property to determine the common columns between two dataframes, this calculation accounts for 92% of my runtime in v0.20.23 and only 0.3% in v0.20.21.
Performance of DataFrame.columns appears unchanged.
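For the common-columns use case, one way to keep the cost down is to resolve each schema exactly once and do the comparison on plain Python lists. The sketch below is an assumption-laden illustration, not the reporter's code: `common_columns` is a hypothetical helper, and it assumes polars 1.0 or later, where both DataFrame and LazyFrame provide `collect_schema()`.

```python
def common_columns(left, right):
    """Return the column names present in both frames, in `left`'s order.

    Works with any object exposing `collect_schema().names()`.
    """
    # Resolve each schema once; repeated resolution on a lazy frame
    # would walk the query plan again every time.
    left_names = left.collect_schema().names()
    right_names = set(right.collect_schema().names())
    return [name for name in left_names if name in right_names]
```

With polars installed, `common_columns(pl.LazyFrame({"a": [1], "b": [2]}), pl.LazyFrame({"b": [3], "c": [4]}))` would return `["b"]`, and the result can be passed to `select` on either frame.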
Expected behavior
Performance of .columns is unchanged
Installed versions