Performance: Calculation 200x+ slower without rechunking or sorting first #17637

Closed · Julian-J-S opened this issue Jul 15, 2024 · 4 comments · Fixed by #17650
Labels: accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

Julian-J-S (Contributor) commented Jul 15, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df_sales = pl.read_parquet(PATH_SALES)  # simple df of shape (10_000_000, 5)

df_sales.with_columns(
    time_between_sales=pl.col("created_on")
    .diff()
    .over(
        partition_by="customer_id",
        order_by="created_on",
    ),
)

Log output

No response

Issue description

I created a simple sales_df with Polars and saved it as Parquet.
When I read the data back in and run a simple window calculation, performance is very slow (~70 sec).

However, if I first

  • .rechunk() or
  • .sort("created_on", "customer_id")

the calculation is 200x+ faster (~0.3 sec); see the sketch below.

possibly related: #17562
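A minimal sketch of the two workarounds described above, reusing the placeholder PATH_SALES and the column names from the reproducible example (illustrative only, not taken verbatim from the original report):

import polars as pl

df_sales = pl.read_parquet(PATH_SALES)  # PATH_SALES is the same placeholder as above

# Window expression from the example: time since the previous sale per customer.
time_between = pl.col("created_on").diff().over(
    partition_by="customer_id",
    order_by="created_on",
)

# Workaround 1: consolidate the many small chunks produced by the Parquet reader.
df_sales.rechunk().with_columns(time_between_sales=time_between)

# Workaround 2: sorting first also leaves the data contiguous before the window runs.
df_sales.sort("created_on", "customer_id").with_columns(time_between_sales=time_between)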

Expected behavior

Great performance as usual 😄

When I read back a df that I just saved with Polars, I expect good performance without needing to rechunk explicitly myself.

Installed versions

1.1.0
Julian-J-S added the bug, needs triage, and python labels on Jul 15, 2024
Julian-J-S (Contributor, Author) commented:

Reproducible dataset:

On my PC this is:

  • fast with N=1_000_000
  • slow with N=10_000_000

import numpy as np
import polars as pl

N = 10_000_000
CUSTOMER_IDS = np.random.randint(10_000, 1_000_000, N)
SALES_DATES = np.random.randint(0, 1000, N)

df = (
    pl.DataFrame({"customer_id": CUSTOMER_IDS, "sales_date": SALES_DATES})
    .sort("sales_date")
    .with_columns(pl.col("sales_date").cast(pl.Date))
)

Before Parquet (fast, ~2 sec)

df.with_columns(
    time_between_sales=pl.col("sales_date")
    .diff()
    .over(
        partition_by="customer_id",
        order_by="sales_date",
    ),
)

Write Parquet & Read again

df.write_parquet("dummy.parquet")

df_dummy = pl.read_parquet("dummy.parquet")

NEW df from parquet (sloooooooow)

df_dummy.with_columns(
    time_between_sales=pl.col("sales_date")
    .diff()
    .over(
        partition_by="customer_id",
        order_by="sales_date",
    ),
)

NEW df from parquet with rechunk (fast again)

df_dummy.rechunk().with_columns( # >>>>>> rechunk
    time_between_sales=pl.col("sales_date")
    .diff()
    .over(
        partition_by="customer_id",
        order_by="sales_date",
    ),
)

deanm0000 (Collaborator) commented:

This is somewhat intentional since #16427.

If you read_parquet(file, rechunk=True) then you'll get what you want,
or write_parquet(file, row_group_size=10_000_000).
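Both options exist in the Polars API (rechunk is a parameter of pl.read_parquet, row_group_size of DataFrame.write_parquet). A minimal, self-contained sketch; the tiny frame here is only a stand-in for the 10M-row reproduction:

import polars as pl

df = pl.DataFrame({"customer_id": [1, 2, 1, 2], "sales_date": [1, 2, 3, 4]})

# Option 2 from above: write larger row groups so the file is read back in fewer chunks.
df.write_parquet("dummy.parquet", row_group_size=10_000_000)

# Option 1 from above: ask the reader to consolidate everything into a single chunk.
df_dummy = pl.read_parquet("dummy.parquet", rechunk=True)
print(df_dummy.n_chunks())  # 1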

ritchie46 (Member) commented:

That we don't consolidate chunks by default is expected. That we have such slowdowns isn't. We should know where to rechunk internally. These problems have come to light lately now that we don't rechunk by default anymore. Will fix it.
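For context (an editorial addition, not from the thread): the slowdown stems from the frame arriving in many small chunks, roughly one per Parquet row group, since reads no longer rechunk by default. You can inspect the fragmentation with DataFrame.n_chunks(); a quick sketch, assuming default row-group sizing so the writer emits several row groups:

import polars as pl

df = pl.DataFrame({"x": range(1_000_000)})
df.write_parquet("dummy.parquet")

df_read = pl.read_parquet("dummy.parquet")
print(df_read.n_chunks())            # typically > 1: roughly one chunk per row group
print(df_read.rechunk().n_chunks())  # 1 after consolidating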

ritchie46 (Member) commented:

Got a fix. The single-chunk case is now also much faster.
