Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow .collect() to be used on non-lazy DataFrames (as a no-op) #7882

Closed
rodonn opened this issue Mar 30, 2023 · 8 comments
Closed

Allow .collect() to be used on non-lazy DataFrames (as a no-op) #7882

rodonn opened this issue Mar 30, 2023 · 8 comments
Assignees
Labels
enhancement New feature or an improvement of an existing feature reference Reference issue for recurring topics

Comments

@rodonn
Copy link

rodonn commented Mar 30, 2023

Problem description

I often have Polars pipelines that I would like to be able to easily switch between taking a DataFrame vs. a LazyFrame as an input.

This would be easier to reuse the same code if DataFrame had a .collect() method that does not do anything. This way I could easily make a function to implement a pipeline that works with both LazyFrame and DataFrames.

e.g.

def pipeline(df: Union[pl.DataFrame, pl.LazyFrame]) -> pl.DataFrame:
    return (df
        .groupby("group_id")
        .agg([
            pl.col("volume").sum().alias("volume"),
        ])
        .sort(pl.col("volume").desc())
        .collect()
    )

I'd be happy to submit a PR for this, but wanted to check if there are any concerns or objections first.

@rodonn rodonn added the enhancement New feature or an improvement of an existing feature label Mar 30, 2023
@mcrumiller
Copy link
Contributor

Ahh I'd love this. I have a lot of:

if isinstance(df, pl.LazyFrame):
   df = df.collect()

scattered throughout my codebase.

@s-banach
Copy link
Contributor

This came up before.
The suggestion is to write df.lazy().collect().
It seems to be a code smell to not know if a frame is lazy or not, because you’re liable to collect the same result more than once. And if the performance impact of that doesn’t concern you, then you don’t need lazy frames in the first place.

@StefanBRas
Copy link
Contributor

It's a duplicate of #6846.

This came up before. The suggestion is to write df.lazy().collect(). It seems to be a code smell to not know if a frame is lazy or not, because you’re liable to collect the same result more than once. And if the performance impact of that doesn’t concern you, then you don’t need lazy frames in the first place.

I didn't understand the df.lazy().collect() to be the suggested way to do it, just something that can be used.
I have many cases where I explicitly know that the frame can be either, especially when I write test assertion functions.

@mcrumiller
Copy link
Contributor

@StefanBRas that suggestion just moves the no-op from one place to another. Calling .collect() on an eager frame being a no-op versus calling .lazy() on a lazy frame being a no-op.

@StefanBRas
Copy link
Contributor

@StefanBRas that suggestion just moves the no-op from one place to another. Calling .collect() on an eager frame being a no-op versus calling .lazy() on a lazy frame being a no-op.

@mcrumiller I'm not sure I follow exactly what you're replying to. Just to be clear, I would like to have eager_frame.collect(). The df.lazy.collect() is not my suggestion, but something that was mentioned in the linked issue.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Apr 2, 2023

I wonder if having pl.collect and pl.lazy functions would be cleaner than polluting the DataFrame/LazyFrame namespaces with no-ops 🤔

Each would work with DataFrame/LazyFrame input, but ignore the inapplicable class and return as-is; allows for easy construction of utility functions/pipelines that can work with both (which is the goal) but keeps the actual classmethods focused, and everyone who wants/needs this behaviour doesn't have to roll their own version...

@stinodego
Copy link
Member

After spending some time grinding out a lot of data processing pipelines in Polars, I am convinced this is a good idea.

A lot of my code is agnostic about eager/lazy, but there are parts where I want to enforce eager computation.

DataFrame.collect seems fine with me - having a single no-op isn't too bad, right? Makes things more streamlined than having multiple ways to collect something (lf.collect() and pl.collect(lf) - now you'll get disagreements on the "right" way to collect).

@stinodego
Copy link
Member

stinodego commented Jul 2, 2023

For now, we have decided not to implement this. See discussion in the linked PR #9676. Quoting Ritchie:

What I want is to fail early if the user had the assumption that it ran in lazy mode, but it was actually in eager mode.

There is an aysmmetry in that. Assuming you were in eager, but you were in lazy is generally less expensive.

We would now give up strictness in production to make debugging easier where it is very easy for a user to circumvent. They could add a no-op collect for debugging purposes. I think types should be respected as much as possible and rewriting some code during debugging is to be expected. I love rust because of this. Now we cannot fail on compile time in polars, but we should fail before we run the query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature reference Reference issue for recurring topics
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants