Allow .collect() to be used on non-lazy DataFrames (as a no-op) #7882

rodonn · 2023-03-30T17:37:13Z

Problem description

I often have Polars pipelines that I would like to be able to easily switch between taking a DataFrame vs. a LazyFrame as an input.

This would be easier to reuse the same code if DataFrame had a .collect() method that does not do anything. This way I could easily make a function to implement a pipeline that works with both LazyFrame and DataFrames.

e.g.

def pipeline(df: Union[pl.DataFrame, pl.LazyFrame]) -> pl.DataFrame:
    return (df
        .groupby("group_id")
        .agg([
            pl.col("volume").sum().alias("volume"),
        ])
        .sort(pl.col("volume").desc())
        .collect()
    )

I'd be happy to submit a PR for this, but wanted to check if there are any concerns or objections first.

The text was updated successfully, but these errors were encountered:

mcrumiller · 2023-03-30T21:16:55Z

Ahh I'd love this. I have a lot of:

if isinstance(df, pl.LazyFrame):
   df = df.collect()

scattered throughout my codebase.

s-banach · 2023-03-31T01:52:22Z

This came up before.
The suggestion is to write df.lazy().collect().
It seems to be a code smell to not know if a frame is lazy or not, because you’re liable to collect the same result more than once. And if the performance impact of that doesn’t concern you, then you don’t need lazy frames in the first place.

StefanBRas · 2023-03-31T08:13:26Z

It's a duplicate of #6846.

This came up before. The suggestion is to write df.lazy().collect(). It seems to be a code smell to not know if a frame is lazy or not, because you’re liable to collect the same result more than once. And if the performance impact of that doesn’t concern you, then you don’t need lazy frames in the first place.

I didn't understand the df.lazy().collect() to be the suggested way to do it, just something that can be used.
I have many cases where I explicitly know that the frame can be either, especially when I write test assertion functions.

mcrumiller · 2023-03-31T20:14:41Z

@StefanBRas that suggestion just moves the no-op from one place to another. Calling .collect() on an eager frame being a no-op versus calling .lazy() on a lazy frame being a no-op.

StefanBRas · 2023-04-02T10:30:50Z

@StefanBRas that suggestion just moves the no-op from one place to another. Calling .collect() on an eager frame being a no-op versus calling .lazy() on a lazy frame being a no-op.

@mcrumiller I'm not sure I follow exactly what you're replying to. Just to be clear, I would like to have eager_frame.collect(). The df.lazy.collect() is not my suggestion, but something that was mentioned in the linked issue.

alexander-beedie · 2023-04-02T12:03:45Z

I wonder if having pl.collect and pl.lazy functions would be cleaner than polluting the DataFrame/LazyFrame namespaces with no-ops 🤔

Each would work with DataFrame/LazyFrame input, but ignore the inapplicable class and return as-is; allows for easy construction of utility functions/pipelines that can work with both (which is the goal) but keeps the actual classmethods focused, and everyone who wants/needs this behaviour doesn't have to roll their own version...

stinodego · 2023-04-07T14:47:28Z

After spending some time grinding out a lot of data processing pipelines in Polars, I am convinced this is a good idea.

A lot of my code is agnostic about eager/lazy, but there are parts where I want to enforce eager computation.

DataFrame.collect seems fine with me - having a single no-op isn't too bad, right? Makes things more streamlined than having multiple ways to collect something (lf.collect() and pl.collect(lf) - now you'll get disagreements on the "right" way to collect).

stinodego · 2023-07-02T14:33:48Z

For now, we have decided not to implement this. See discussion in the linked PR #9676. Quoting Ritchie:

What I want is to fail early if the user had the assumption that it ran in lazy mode, but it was actually in eager mode.

There is an aysmmetry in that. Assuming you were in eager, but you were in lazy is generally less expensive.

We would now give up strictness in production to make debugging easier where it is very easy for a user to circumvent. They could add a no-op collect for debugging purposes. I think types should be respected as much as possible and rewriting some code during debugging is to be expected. I love rust because of this. Now we cannot fail on compile time in polars, but we should fail before we run the query.

rodonn added the enhancement New feature or an improvement of an existing feature label Mar 30, 2023

stinodego self-assigned this Apr 7, 2023

This was referenced Jul 2, 2023

Implement pl.DataFrame.collect similarly to pl.LazyFrame.lazy #6846

Closed

feat(python): add DataFrame.collect (no-op) #9676

Closed

stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Jul 2, 2023

MarcoGorelli mentioned this issue Jan 9, 2024

Implement .collect() as alias for .clone() for DataFrame #13549

Closed

stinodego added the reference Reference issue for recurring topics label Jun 13, 2024

stinodego mentioned this issue Jun 13, 2024

LazyFrame properties now require significant compute - update API to reflect this #16328

Closed

2 tasks

cmdlineluser mentioned this issue Oct 31, 2024

Feature request: Add collect() to DataFrame #19548

Open

cmdlineluser mentioned this issue Nov 26, 2024

Can we create a .collect method for polars.DataFrame? #19995

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow .collect() to be used on non-lazy DataFrames (as a no-op) #7882

Allow .collect() to be used on non-lazy DataFrames (as a no-op) #7882

rodonn commented Mar 30, 2023

mcrumiller commented Mar 30, 2023

s-banach commented Mar 31, 2023

StefanBRas commented Mar 31, 2023

mcrumiller commented Mar 31, 2023

StefanBRas commented Apr 2, 2023

alexander-beedie commented Apr 2, 2023 •

edited

Loading

stinodego commented Apr 7, 2023

stinodego commented Jul 2, 2023 •

edited

Loading

Allow .collect() to be used on non-lazy DataFrames (as a no-op) #7882

Allow .collect() to be used on non-lazy DataFrames (as a no-op) #7882

Comments

rodonn commented Mar 30, 2023

Problem description

mcrumiller commented Mar 30, 2023

s-banach commented Mar 31, 2023

StefanBRas commented Mar 31, 2023

mcrumiller commented Mar 31, 2023

StefanBRas commented Apr 2, 2023

alexander-beedie commented Apr 2, 2023 • edited Loading

stinodego commented Apr 7, 2023

stinodego commented Jul 2, 2023 • edited Loading

alexander-beedie commented Apr 2, 2023 •

edited

Loading

stinodego commented Jul 2, 2023 •

edited

Loading