
.groupby_slices() #9467

Closed
cmdlineluser opened this issue Jun 20, 2023 · 20 comments
Labels: enhancement (New feature or an improvement of an existing feature)

Comments

@cmdlineluser
Contributor

Problem description

Inspired by https://stackoverflow.com/questions/76488314/polars-count-unique-values-over-a-time-period

Essentially I'm wondering if it would be possible to do something like:

(df.groupby_slices('start', 'end')
   .agg(arrival = 'arrival_time', departure = 'departure_time', truck_id = 'ID')
)

The original question was about counting within time windows, in which case the approach was to get the start/end slices.
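With that, the counting for the original question would then just be something like (hypothetical, of course, since .groupby_slices() doesn't exist - this only mirrors the proposed API above):

(df.groupby_slices('start', 'end')
   .agg(docked_trucks = pl.col('ID').n_unique())
)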

df = pl.from_repr("""
┌───────┬─────┬──────────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┐
│ start ┆ end ┆ event_id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        │
│ ---   ┆ --- ┆ ---      ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 │
│ u32   ┆ u32 ┆ u32      ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        │
╞═══════╪═════╪══════════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╡
│ 0     ┆ 2   ┆ 0        ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:22:47 ┆ 2023-01-01 06:26:08 │
│ 0     ┆ 3   ┆ 1        ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:25:42 ┆ 2023-01-01 06:29:02 │
│ 1     ┆ 7   ┆ 2        ┆ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  ┆ 2023-01-01 06:29:20 ┆ 2023-01-01 06:36:01 │
│ 1     ┆ 7   ┆ 3        ┆ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  ┆ 2023-01-01 06:31:06 ┆ 2023-01-01 06:34:48 │
│ 1     ┆ 8   ┆ 4        ┆ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  ┆ 2023-01-01 06:32:09 ┆ 2023-01-01 06:37:01 │
│ 1     ┆ 9   ┆ 5        ┆ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ 2023-01-01 06:33:08 ┆ 2023-01-01 06:40:49 │
│ 3     ┆ 9   ┆ 6        ┆ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ 2023-01-01 06:35:40 ┆ 2023-01-01 06:39:34 │
│ 4     ┆ 9   ┆ 7        ┆ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ 2023-01-01 06:36:43 ┆ 2023-01-01 06:41:48 │
│ 5     ┆ 9   ┆ 8        ┆ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ 2023-01-01 06:38:48 ┆ 2023-01-01 06:47:10 │
└───────┴─────┴──────────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┘
""")

Expand via .arange(start, end) and split into columns with list.to_struct()

(df.with_columns(pl.arange('start', 'end').list.to_struct(n_field_strategy='max_width'))
   .unnest('arange')
)

# shape: (9, 16)
# ┌───────┬─────┬──────────┬─────────────────────┬───┬─────────┬─────────┬─────────┬─────────┐
# │ start ┆ end ┆ event_id ┆ arrival_time        ┆ … ┆ field_4 ┆ field_5 ┆ field_6 ┆ field_7 │
# │ ---   ┆ --- ┆ ---      ┆ ---                 ┆   ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
# │ u32   ┆ u32 ┆ u32      ┆ datetime[μs]        ┆   ┆ i64     ┆ i64     ┆ i64     ┆ i64     │
# ╞═══════╪═════╪══════════╪═════════════════════╪═══╪═════════╪═════════╪═════════╪═════════╡
# │ 0     ┆ 2   ┆ 0        ┆ 2023-01-01 06:23:47 ┆ … ┆ null    ┆ null    ┆ null    ┆ null    │
# │ 0     ┆ 3   ┆ 1        ┆ 2023-01-01 06:26:42 ┆ … ┆ null    ┆ null    ┆ null    ┆ null    │
# │ 1     ┆ 7   ┆ 2        ┆ 2023-01-01 06:30:20 ┆ … ┆ 5       ┆ 6       ┆ null    ┆ null    │
# │ 1     ┆ 7   ┆ 3        ┆ 2023-01-01 06:32:06 ┆ … ┆ 5       ┆ 6       ┆ null    ┆ null    │
# │ 1     ┆ 8   ┆ 4        ┆ 2023-01-01 06:33:09 ┆ … ┆ 5       ┆ 6       ┆ 7       ┆ null    │
# │ 1     ┆ 9   ┆ 5        ┆ 2023-01-01 06:34:08 ┆ … ┆ 5       ┆ 6       ┆ 7       ┆ 8       │
# │ 3     ┆ 9   ┆ 6        ┆ 2023-01-01 06:36:40 ┆ … ┆ 7       ┆ 8       ┆ null    ┆ null    │
# │ 4     ┆ 9   ┆ 7        ┆ 2023-01-01 06:37:43 ┆ … ┆ 8       ┆ null    ┆ null    ┆ null    │
# │ 5     ┆ 9   ┆ 8        ┆ 2023-01-01 06:39:48 ┆ … ┆ null    ┆ null    ┆ null    ┆ null    │
# └───────┴─────┴──────────┴─────────────────────┴───┴─────────┴─────────┴─────────┴─────────┘

And use .take('^field_\d+$') to get the values

(df.with_columns(pl.arange('start', 'end').list.to_struct(n_field_strategy='max_width'))
   .unnest('arange')
   .with_columns(
      arrival   = pl.concat_list(pl.col('arrival_time').take('^field_\d+$')),
      departure = pl.concat_list(pl.col('departure_time').take('^field_\d+$')),
      truck_id  = pl.concat_list(pl.col('ID').take('^field_\d+$'))
   )
   .select(
      pl.exclude('start', 'end', '^field_\d+$')
   )
)

# shape: (9, 9)
# ┌──────────┬─────────────────────┬─────────────────────┬─────┬───┬─────────────────────┬───────────────────────────────────┬───────────────────────────────────┬──────────────────────┐
# │ event_id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ … ┆ window_close        ┆ arrival                           ┆ departure                         ┆ truck_id             │
# │ ---      ┆ ---                 ┆ ---                 ┆ --- ┆   ┆ ---                 ┆ ---                               ┆ ---                               ┆ ---                  │
# │ u32      ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆   ┆ datetime[μs]        ┆ list[datetime[μs]]                ┆ list[datetime[μs]]                ┆ list[str]            │
# ╞══════════╪═════════════════════╪═════════════════════╪═════╪═══╪═════════════════════╪═══════════════════════════════════╪═══════════════════════════════════╪══════════════════════╡
# │ 0        ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ … ┆ 2023-01-01 06:26:08 ┆ [2023-01-01 06:23:47, 2023-01-01… ┆ [2023-01-01 06:25:08, 2023-01-01… ┆ ["A1", "A1", … null] │
# │ 1        ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ … ┆ 2023-01-01 06:29:02 ┆ [2023-01-01 06:23:47, 2023-01-01… ┆ [2023-01-01 06:25:08, 2023-01-01… ┆ ["A1", "A1", … null] │
# │ 2        ┆ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  ┆ … ┆ 2023-01-01 06:36:01 ┆ [2023-01-01 06:26:42, 2023-01-01… ┆ [2023-01-01 06:28:02, 2023-01-01… ┆ ["A1", "A5", … null] │
# │ 3        ┆ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  ┆ … ┆ 2023-01-01 06:34:48 ┆ [2023-01-01 06:26:42, 2023-01-01… ┆ [2023-01-01 06:28:02, 2023-01-01… ┆ ["A1", "A5", … null] │
# │ 4        ┆ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  ┆ … ┆ 2023-01-01 06:37:01 ┆ [2023-01-01 06:26:42, 2023-01-01… ┆ [2023-01-01 06:28:02, 2023-01-01… ┆ ["A1", "A5", … null] │
# │ 5        ┆ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ … ┆ 2023-01-01 06:40:49 ┆ [2023-01-01 06:26:42, 2023-01-01… ┆ [2023-01-01 06:28:02, 2023-01-01… ┆ ["A1", "A5", … "A6"] │
# │ 6        ┆ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ … ┆ 2023-01-01 06:39:34 ┆ [2023-01-01 06:32:06, 2023-01-01… ┆ [2023-01-01 06:33:48, 2023-01-01… ┆ ["A6", "B3", … null] │
# │ 7        ┆ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ … ┆ 2023-01-01 06:41:48 ┆ [2023-01-01 06:33:09, 2023-01-01… ┆ [2023-01-01 06:36:01, 2023-01-01… ┆ ["B3", "C3", … null] │
# │ 8        ┆ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ … ┆ 2023-01-01 06:47:10 ┆ [2023-01-01 06:34:08, 2023-01-01… ┆ [2023-01-01 06:39:49, 2023-01-01… ┆ ["C3", "A6", … null] │
# └──────────┴─────────────────────┴─────────────────────┴─────┴───┴─────────────────────┴───────────────────────────────────┴───────────────────────────────────┴──────────────────────┘

Which seems to work okay and is decently fast (although it probably starts to suffer once the number of field_ columns gets sufficiently large).

I'm not sure if there are any other potential use-cases for being able to specify slices?

It's also possible there are better ways to solve this type of problem?

Thanks all.

@cmdlineluser added the enhancement label on Jun 20, 2023
@avimallu
Contributor

It's also possible there are better ways to solve this type of problem?

Are you restricted to using Polars exclusively? This looks like a non-equi join problem in SQL, where I would approach it with a self-join and two inequality conditions checking whether each row falls within +/- 1 minute of another.

@erinov1

erinov1 commented Jun 20, 2023

Related issues (in addition to various requests for non-equi joins): #5891, #7211.

For what it's worth, in my own applications I've found that if I just want aggregate statistics for each slice, then it is best to use polars to find the numerical indices for each slice, convert everything to numpy, then use numba to parallel-iterate (using prange) to compute statistics for each slice. This has been orders of magnitude faster than any pure-polars solution I've found or using non-equi joins in duckdb.

@cmdlineluser
Contributor Author

Thanks @avimallu - there are no restrictions, I was mostly just curious about doing it in polars without a cross join.

Thanks @erinov1 - I should have remembered #5891 as I commented in there. Very interesting about the numba approach - do you have a code example handy? No worries if it's too much trouble - I'll experiment with it.

@erinov1

erinov1 commented Jun 22, 2023

@cmdlineluser I have no idea if this is efficient for finding unique elements in a string column (I've been using it to compute statistics for numerical columns), but something like

import numpy as np
from numba import njit, prange

slices = df.select("start", "end").to_numpy()
ids = df.get_column("ID").to_numpy().astype(np.unicode_)

@njit(parallel=True)
def get_unique_count(slices, ids):
    unique = np.empty((slices.shape[0],))
    for n in prange(slices.shape[0]):
        start, end = slices[n]
        unique[n] = len(np.unique(ids[start:end]))
    return unique

get_unique_count(slices, ids)
# array([1., 2., 5., 5., 5., 5., 4., 4., 3.])

@cmdlineluser
Contributor Author

Thanks a lot @erinov1

Yeah, doing anything with strings seems to slow it down quite a bit.

What I did do was .rank() the string column and use a "manual" n_unique implementation as np.unique() was quite slow - which gave massive speedups as you describe.
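
Roughly along these lines (just a sketch to show the idea - the helper name and the exact ranking step here are illustrative rather than the code I actually ran):

import numpy as np
import polars as pl
from numba import njit, prange

@njit(parallel=True)
def rank_n_unique(slices, ranks, n_ranks):
    out = np.empty(slices.shape[0], dtype=np.int64)
    for n in prange(slices.shape[0]):
        start, end = slices[n]
        seen = np.zeros(n_ranks, dtype=np.uint8)  # scratch table keyed by rank
        count = 0
        for r in ranks[start:end]:
            if seen[r] == 0:
                seen[r] = 1
                count += 1
        out[n] = count
    return out

# dense-rank the strings once in polars so each distinct ID becomes a small integer
ranks = (df.get_column('ID').rank('dense') - 1).cast(pl.Int64).to_numpy()
slices = df.select('start', 'end').to_numpy()

rank_n_unique(slices, ranks, ranks.max() + 1)
# array([1, 2, 5, 5, 5, 5, 4, 4, 3])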

I had a look at that GroupsProxy you mentioned in the other issue and I modified the https://github.com/pola-rs/pyo3-polars example.

I have little rust knowledge, so this is just me taking things apart to try to get them to run:

use polars::prelude::*;

pub(super) fn groupby_slices(df: DataFrame, start: DataFrame, end: DataFrame) -> PolarsResult<DataFrame> {
    let start = start.column("start")?;
    let end = end.column("end")?;

    let sa = start.i64()?;
    let sb = end.i64()?;

    let groups = sa.into_iter().zip(sb.into_iter()).map(|(a, b)| {
        match (a, b) {
            (Some(a), Some(b)) => {
                let start = a as IdxSize;
                let end = b as IdxSize;
                let len = end - start as IdxSize;
                [start, len]
            },
            _ => [0, 0] // This is wrong. What to do here?
        }
    }).collect();

    let groups = GroupsProxy::Slice { groups, rolling: false };

    let out = GroupBy::new(&df, vec![], groups, None);
    //Ok(out.first()?)
    Ok(out.agg_list()?)
}

This pretty much returns the groups instantly; however, it seems this GroupBy::new() + .agg_*() approach is deprecated.

I found some further clues in: https://github.com/pola-rs/polars/blob/main/polars/polars-lazy/src/dsl/list.rs#L113

I can't seem to figure out how to assign the groups without actually running an aggregation, so that just .lazy().groupby() could be returned and the .agg() then called from Python.

I'm a bit lost with this Arena/LogicalPlan stuff.

@avimallu
Contributor

avimallu commented Jun 22, 2023

Thanks @avimallu - there are no restrictions, I was mostly just curious about doing it in polars without a cross join.

A very elegant (if I say so myself) solution in SQL using non-equi joins (eagerly waiting for them to land in Polars, more so after this question):

import polars as pl
data = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")

Followed by

import duckdb as db
db.query("""
    SELECT
        A.arrival_time
        ,A.departure_time
        ,A.window_open
        ,A.window_close
        ,LIST_UNIQUE(LIST(B.ID))    AS docked_trucks
    FROM (
        SELECT
            *
            ,arrival_time   - (INTERVAL 1 MINUTE) AS window_open
            ,departure_time + (INTERVAL 1 MINUTE) AS window_close
        FROM data
    ) A
    LEFT JOIN (
        SELECT
            *
            ,DATEDIFF('seconds', arrival_time, departure_time) AS duration
        FROM data
    ) B
    ON (
        (B.arrival_time <= A.window_open AND (B.arrival_time   + TO_SECONDS(B.duration)) >= A.window_open) OR
        (B.arrival_time >= A.window_open AND                           B.departure_time  <= A.window_close) OR
        (B.arrival_time >= A.window_open AND (B.departure_time - TO_SECONDS(B.duration)) <= A.window_close)
    )
    GROUP BY 1, 2, 3, 4
""")
gives:
┌─────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────────┐
│    arrival_time     │   departure_time    │     window_open     │    window_close     │ docked_trucks │
│      timestamp      │      timestamp      │      timestamp      │      timestamp      │    uint64     │
├─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼───────────────┤
│ 2023-01-01 06:23:47 │ 2023-01-01 06:25:08 │ 2023-01-01 06:22:47 │ 2023-01-01 06:26:08 │             1 │
│ 2023-01-01 06:26:42 │ 2023-01-01 06:28:02 │ 2023-01-01 06:25:42 │ 2023-01-01 06:29:02 │             1 │
│ 2023-01-01 06:30:20 │ 2023-01-01 06:35:01 │ 2023-01-01 06:29:20 │ 2023-01-01 06:36:01 │             4 │
│ 2023-01-01 06:32:06 │ 2023-01-01 06:33:48 │ 2023-01-01 06:31:06 │ 2023-01-01 06:34:48 │             4 │
│ 2023-01-01 06:33:09 │ 2023-01-01 06:36:01 │ 2023-01-01 06:32:09 │ 2023-01-01 06:37:01 │             4 │
│ 2023-01-01 06:34:08 │ 2023-01-01 06:39:49 │ 2023-01-01 06:33:08 │ 2023-01-01 06:40:49 │             4 │
│ 2023-01-01 06:36:40 │ 2023-01-01 06:38:34 │ 2023-01-01 06:35:40 │ 2023-01-01 06:39:34 │             4 │
│ 2023-01-01 06:37:43 │ 2023-01-01 06:40:48 │ 2023-01-01 06:36:43 │ 2023-01-01 06:41:48 │             3 │
│ 2023-01-01 06:39:48 │ 2023-01-01 06:46:10 │ 2023-01-01 06:38:48 │ 2023-01-01 06:47:10 │             3 │
└─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────────┘

The three join conditions can be better understood visually:

overlap window        ================
case 1 (o)     ====
case 2 (x)         ======
case 3 (x)                 =======
case 4 (x)                         =======
case 5 (o)                                  =====

We only care about the three cases marked with x that overlap with the window we are looking to confirm against, so only three conditions out of the five need to be specified in the join.

This will work well with any Polars based pipeline because of the zero-copy integration that DuckDB has with it.

@cmdlineluser
Contributor Author

cmdlineluser commented Jun 22, 2023

Thank you for the code example @avimallu \o/

I'm not sure if it's my system or not, but duckdb seems to take a bit of time with this operation.

If I increase the data a bit, it takes 12s

df = pl.concat(
   df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m')) 
   for N in range(1, 3000)
)

The polars approach from above takes 0.5s

import polars as pl
from   codetiming import Timer

df = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")

df = pl.concat(
   df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m')) 
   for N in range(1, 3000)
)

with Timer():
    df = (
       df
         .with_columns(   
            window_open  = pl.col("arrival_time").dt.offset_by("-1m"),
            window_close = pl.col("departure_time").dt.offset_by("1m")
         )
        .sort("arrival_time")
        .with_columns(event_id = pl.arange(0, pl.count()))
    )

    start = (
       df.sort("window_open") 
         .join_asof(
            df.sort("departure_time").with_row_count("start"), 
            left_on  = "window_open",  
            right_on = "departure_time"
         )
         .select(pl.col("start").backward_fill())
    )

    end = (
       df.sort("window_close")
         .join_asof(
            df.sort("arrival_time").with_row_count("end"),   
            left_on  = "window_close", 
            right_on = "arrival_time", 
            strategy = "forward"
         )
         .select(pl.col("end").forward_fill() + 1)
    )

    (pl.concat([df, start, end], how="horizontal")
       .with_columns(
          pl.arange("start", "end").list.to_struct(n_field_strategy="max_width")
       )
       .unnest("arange")
       .with_columns(
          arrival   = pl.concat_list(pl.col("arrival_time").take("^field_\d+$")),
          departure = pl.concat_list(pl.col("departure_time").take("^field_\d+$")),
          truck_id  = pl.concat_list(pl.col("ID").take("^field_\d+$"))
       )
       .explode("arrival", "departure", "truck_id")
       .filter(
          pl.all(
             pl.col("window_open")  <= pl.col("departure"), 
             pl.col("window_close") >= pl.col("arrival")
          )
       )
       .groupby("window_open", "window_close")
       .agg(
          pl.col("event_id", "arrival_time", "departure_time", "ID").first(), 
          docked_trucks = pl.col("truck_id").n_unique()
       )
       .sort("event_id") # sort and re-order columns for output
       .select( 
           "event_id",
           "arrival_time",
           "departure_time",
           "ID",
           "window_open",
           "window_close",
           "docked_trucks",
       )
    )     

Going from 3000 to 6000, the difference increases: 45s vs. 0.6s

@avimallu
Contributor

avimallu commented Jun 22, 2023

Hmm, odd. It takes 0.4505s for me. I'm not able to run your code for Polars. The problem starts here:

pl.concat([df, start, end], how="horizontal")

What are start and end? I'll test on my system (M2 Mac, not Pro) and get back if you can update it. The code I ran for DuckDB's evaluation is:

import polars as pl
import duckdb as db
from   codetiming import Timer

df = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")

df = pl.concat(
   df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m')) 
   for N in range(1, 3000)
)
with Timer():
    db.query("""
        SELECT
            A.arrival_time
            ,A.departure_time
            ,A.window_open
            ,A.window_close
            ,LIST_UNIQUE(LIST(B.ID))    AS docked_trucks
        FROM (
            SELECT
                *
                ,arrival_time   - (INTERVAL 1 MINUTE) AS window_open
                ,departure_time + (INTERVAL 1 MINUTE) AS window_close
            FROM df
        ) A
        LEFT JOIN (
            SELECT
                *
                ,DATEDIFF('seconds', arrival_time, departure_time) AS duration
            FROM df
        ) B
        ON (
            (B.arrival_time <= A.window_open AND (B.arrival_time   + TO_SECONDS(B.duration)) >= A.window_open) OR
            (B.arrival_time >= A.window_open AND                           B.departure_time  <= A.window_close) OR
            (B.arrival_time >= A.window_open AND (B.departure_time - TO_SECONDS(B.duration)) <= A.window_close)
        )
        GROUP BY 1, 2, 3, 4
    """)

Am I missing something in the DuckDB code? By the way, DuckDB caches the query so it'll run much faster the second time - all my timings are therefore for the first time. My DuckDB version is 0.7.2-dev1898. Maybe that makes the difference?

EDIT

The newer version (0.8.1) is even faster: Elapsed time: 0.1335 seconds?!

@cmdlineluser
Contributor Author

cmdlineluser commented Jun 22, 2023

I edited in start/end.

Thanks for the timings. I tried a reboot, thinking it may have been due to my system being overloaded, but I get the same results - I am also using 0.8.1.

I'm using an older laptop, so that may have something to do with it - looking at htop, the duckdb version seems to be bouncing around 4 of 8 cores, but only using 1 at a time.

@avimallu
Contributor

avimallu commented Jun 22, 2023

Thanks for the edit; you should probably raise an issue with the DuckDB team given the slower speed on your system. Here are my timings now (there is a significant variance in the timings per run).

DuckDB: Elapsed time: 0.1050 seconds
Polars: Elapsed time: 0.3392 seconds

Given the easier query, I'd probably stick to the former for now. Pretty impressive that Polars can reach that speed even without dedicated non-equi joins. Really excited for when they land 🤯.

@cmdlineluser
Contributor Author

cmdlineluser commented Jun 22, 2023

I was in an interactive session. The slowdown seems to be related to the duckdb progress bar somehow.

Running the code in a standalone script gives the same results as you.

Oddly enough, if I add pragma disable_progress_bar; to the script it goes back to 12-13 seconds.

At least I have proper timings now - no more testing in an interactive session, it seems. Thank you.

@cmdlineluser
Contributor Author

I realized that instead of going sideways by adding columns, it makes more sense to go lengthways.

df = df.lazy()
start = start.lazy()
end = end.lazy()

(
   df.select('event_id').with_context([start, end])
     .select('event_id', pl.arange('start', 'end'))
     .explode('arange')
     .with_context(df)
     .select(pl.all().take(pl.col('arange')), group = 'event_id')
     .collect()
)

# shape: (47, 8)
# ┌──────────┬────────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┐
# │ event_id ┆ arange ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        ┆ group │
# │ ---      ┆ ---    ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 ┆ ---   │
# │ i64      ┆ i64    ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        ┆ i64   │
# ╞══════════╪════════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╡
# │ 0        ┆ 0      ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:22:47 ┆ 2023-01-01 06:26:08 ┆ 0     │
# │ 0        ┆ 1      ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:25:42 ┆ 2023-01-01 06:29:02 ┆ 0     │
# │ 0        ┆ 0      ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:22:47 ┆ 2023-01-01 06:26:08 ┆ 1     │
# │ 0        ┆ 1      ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:25:42 ┆ 2023-01-01 06:29:02 ┆ 1     │
# │ …        ┆ …      ┆ …                   ┆ …                   ┆ …   ┆ …                   ┆ …                   ┆ …     │
# │ 2        ┆ 1      ┆ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ 2023-01-01 06:33:08 ┆ 2023-01-01 06:40:49 ┆ 8     │
# │ 2        ┆ 2      ┆ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ 2023-01-01 06:35:40 ┆ 2023-01-01 06:39:34 ┆ 8     │
# │ 2        ┆ 3      ┆ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ 2023-01-01 06:36:43 ┆ 2023-01-01 06:41:48 ┆ 8     │
# │ 2        ┆ 4      ┆ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ 2023-01-01 06:38:48 ┆ 2023-01-01 06:47:10 ┆ 8     │
# └──────────┴────────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┴───────┘

So you actually create the rows for each group, which you can then join.

df = df.lazy()

start = start.lazy()
end = end.lazy()

groups = (
   df.select('event_id').with_context([start, end])
     .select('event_id', pl.arange('start', 'end'))
     .explode('arange')
     .with_context(df)
     .select(pl.all().take(pl.col('arange')), group = 'event_id')
)

(df.join(groups, left_on='event_id', right_on='group', how='left')
   .filter(pl.all(
      pl.col("window_open")  <= pl.col("departure_time_right"),
      pl.col("window_close") >= pl.col("arrival_time_right")
   ))
   .groupby('window_open', 'window_close', maintain_order=True)
   .agg(
      pl.col('event_id', 'arrival_time', 'departure_time', 'ID').first(),
      docked_trucks = pl.n_unique('ID_right')
   )
   .collect()
)

@avimallu
Contributor

avimallu commented Jun 28, 2023

I was able to brainstorm with a few folks and find a simpler approach on the SQL front:

import polars as pl
import duckdb as db
data = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")

db.query("""
SELECT
    A.arrival_time
    ,A.departure_time
    ,A.arrival_time - (INTERVAL 1 MINUTE)   AS window_open
    ,A.departure_time + (INTERVAL 1 MINUTE) AS window_close
    ,LIST_DISTINCT(LIST(B.ID)) AS docked_trucks
    ,LIST_UNIQUE(LIST(B.ID))   AS docked_truck_count
FROM  data A, data B
WHERE B.arrival_time   <= window_close
AND   B.departure_time >= window_open
GROUP BY 1, 2, 3, 4
""").pl()

This reduces the number of conditions to just two, and the SQL optimizer engine automatically converts the cross join to an interval join, so this is probably the most readable query while still being just as fast (or faster).

@cmdlineluser
Contributor Author

Thank you for the update @avimallu - that does seem like a nice improvement.

Some interesting things I discovered from experimenting with your example:

  • duckdb has GROUP BY ALL
  • duckdb has UFCS, so you can write: B.id.list().list_unique()

Pretty neat stuff.
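
Combining the two, it looks like the query could be trimmed further to something like this (untested sketch - I haven't re-run it, so the exact spelling may be off):

import duckdb as db

db.query("""
SELECT
    A.arrival_time
    ,A.departure_time
    ,A.arrival_time - (INTERVAL 1 MINUTE)   AS window_open
    ,A.departure_time + (INTERVAL 1 MINUTE) AS window_close
    ,B.ID.list().list_distinct() AS docked_trucks
    ,B.ID.list().list_unique()   AS docked_truck_count
FROM  data A, data B
WHERE B.arrival_time   <= window_close
AND   B.departure_time >= window_open
GROUP BY ALL
""").pl()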

I think I spotted where the real issue with the timing mismatch came from.

When getting 0.1s for the larger dataset with duckdb I had copied your code from: #9467 (comment)

It just has db.query() and doesn't do anything with the result so I don't think duckdb actually runs the full query.

Adding a print() or .pl() makes the query run and it then takes ~12 seconds for me.

Perhaps you can confirm this.

@avimallu
Contributor

avimallu commented Jun 30, 2023

AFAIK, the query and sql methods execute the query without the need for a pl or pd call, with minor differences.

duckdb has UFCS, so you can write: B.id.list().list_unique()

This is very neat indeed. I have never noticed it being spoken about before. Crazy!

@cmdlineluser
Contributor Author

@avimallu I've received access to an M1 Mac Pro and just installed a fresh polars and duckdb.

Using the code from #9467 (comment)

Elapsed time: 0.1485 seconds

Adding a print() around db.query()

Elapsed time: 4.9586 seconds

I've also tested writing the output from within duckdb e.g. COPY (...) TO 'out.csv' and it's taking 5 seconds.

It definitely looks like the 0.1s time is false and the query is not running fully in that case.

@avimallu
Contributor

Good point. I think something happens with Timer that causes it not to evaluate fully with just a db.query, because running the query outside of Timer does reduce the time taken. This code seems to take about 1.1s:

import polars as pl
import duckdb as db
from   codetiming import Timer

df = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")
df = pl.concat(
   df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m')) 
   for N in range(1, 3000)
)

with Timer():
    db.query("""
    SELECT
        A.arrival_time
        ,A.departure_time
        ,A.arrival_time - (INTERVAL 1 MINUTE)   AS window_open
        ,A.departure_time + (INTERVAL 1 MINUTE) AS window_close
        ,LIST_DISTINCT(LIST(B.ID)) AS docked_trucks
        ,LIST_UNIQUE(LIST(B.ID))   AS docked_truck_count
    FROM  df A, df B
    WHERE B.arrival_time   <= window_close
    AND   B.departure_time >= window_open
    GROUP BY 1, 2, 3, 4
    """).pl()

@cmdlineluser
Contributor Author

Yeah, very odd - thank you for confirming.

Using the updated version, I get 1.3s

@cmdlineluser
Contributor Author

I just realized that .groupby_slices wouldn't help with the current approach due to the .filter()

The above approach finds the "outermost windows" and then filters them.

It seems that to do it properly you would need to find the "inner windows" and do it separately for arrivals and departures.

There also seems to be the issue of "non-crossing" windows (there's definitely a technical term for this: overlapping/intersecting?):

df = pl.from_repr("""
shape: (12, 6)
┌─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┐
│ _id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        │
│ --- ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 │
│ u32 ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        │
╞═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╡
│ 0   ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1  ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 │
│ 1   ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1  ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 │
│ 2   ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5  ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 │
│ 3   ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6  ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 │
│ 4   ┆ 2023-01-01 06:38:09 ┆ 2023-01-01 06:41:01 ┆ B3  ┆ 2023-01-01 06:37:09 ┆ 2023-01-01 06:42:01 │
│ 5   ┆ 2023-01-01 06:39:08 ┆ 2023-01-01 06:44:49 ┆ C3  ┆ 2023-01-01 06:38:08 ┆ 2023-01-01 06:45:49 │
│ 6   ┆ 2023-01-01 06:41:40 ┆ 2023-01-01 06:43:34 ┆ A6  ┆ 2023-01-01 06:40:40 ┆ 2023-01-01 06:44:34 │
│ 7   ┆ 2023-01-01 06:42:43 ┆ 2023-01-01 06:45:48 ┆ A5  ┆ 2023-01-01 06:41:43 ┆ 2023-01-01 06:46:48 │
│ 8   ┆ 2023-01-01 06:44:48 ┆ 2023-01-01 06:51:10 ┆ A6  ┆ 2023-01-01 06:43:48 ┆ 2023-01-01 06:52:10 │
│ 9   ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1  ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 │
│ 10  ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1  ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 │
│ 11  ┆ 2023-01-01 06:40:20 ┆ 2023-01-01 06:45:01 ┆ A5  ┆ 2023-01-01 06:39:20 ┆ 2023-01-01 06:46:01 │
└─────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┘
""")

arrive = (
   df.sort('arrival_time')
      .with_columns(
         start = pl.col('arrival_time').search_sorted('window_open'),
         end = pl.col('arrival_time').search_sorted('window_close'),
      )
      .with_row_count()
)

# shape: (12, 9)
# ┌────────┬─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┬─────┐
# │ row_nr ┆ _id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        ┆ start ┆ end │
# │ ---    ┆ --- ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 ┆ ---   ┆ --- │
# │ u32    ┆ u32 ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32   ┆ u32 │
# ╞════════╪═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╪═════╡
# │ 0      ┆ 0   ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1  ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 ┆ 0     ┆ 1   │
# │ 1      ┆ 1   ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1  ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 ┆ 1     ┆ 3   │
# │ 2      ┆ 9   ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1  ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 ┆ 2     ┆ 4   │
# │ 3      ┆ 2   ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5  ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 ┆ 3     ┆ 10  │
# │ 4      ┆ 10  ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1  ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 ┆ 4     ┆ 8   │ # <-
# │ 5      ┆ 3   ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6  ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 ┆ 4     ┆ 9   │ # <-

Rows 4 and 5 here sort into position 4; however, their windows are both fully contained inside the window of row 3 (but don't "touch").

This means:

  • row 3's window finds 4 and 5,
  • row 4 finds 5
  • row 5 finds nothing

But it should be 3, 4, 5 for each of them.

So it seems that grouping by the window this way cannot detect "parent windows" - it only works when their boundaries actually "touch" (cross/intersect/overlap?).

It appears we can look back and check if we have a parent window:

previous = pl.when(pl.col('start') == 0).then(1).otherwise(pl.col('start')) - 1

is_sub_window = pl.all(
   pl.col('arrival_time', 'departure_time') <= pl.col('departure_time').take(previous)
)

arrive = (
   arrive.with_columns(
      parent_id = pl.when(is_sub_window).then(pl.col('_id').take(previous)),
      parent_departure_time =
         pl.when(is_sub_window).then(pl.col('departure_time').take(previous))
   )
)

# shape: (12, 11)
# ┌────────┬─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┬─────┬───────────┬───────────────────────┐
# │ row_nr ┆ _id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        ┆ start ┆ end ┆ parent_id ┆ parent_departure_time │
# │ ---    ┆ --- ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 ┆ ---   ┆ --- ┆ ---       ┆ ---                   │
# │ u32    ┆ u32 ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32   ┆ u32 ┆ u32       ┆ datetime[μs]          │
# ╞════════╪═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╪═════╪═══════════╪═══════════════════════╡
# │ 0      ┆ 0   ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1  ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 ┆ 0     ┆ 1   ┆ 0         ┆ 2023-01-01 06:30:08   │
# │ 1      ┆ 1   ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1  ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 ┆ 1     ┆ 3   ┆ null      ┆ null                  │
# │ 2      ┆ 9   ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1  ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 ┆ 2     ┆ 4   ┆ null      ┆ null                  │
# │ 3      ┆ 2   ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5  ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 ┆ 3     ┆ 9   ┆ null      ┆ null                  │
# │ 4      ┆ 10  ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1  ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 ┆ 4     ┆ 7   ┆ 2         ┆ 2023-01-01 06:40:01   │ # <-
# │ 5      ┆ 3   ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6  ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 ┆ 4     ┆ 8   ┆ 2         ┆ 2023-01-01 06:40:01   │ # <-

We can then use the parent's departure time to ensure the windows "touch".

depart = (
   df.sort('departure_time')
     .with_columns(
        start = pl.col('departure_time').search_sorted('window_open'),
        end = pl.col('departure_time').search_sorted('window_close')
     )
    .with_row_count()
)

depart = (
   depart.join(
      arrive.drop_nulls().select('_id', 'parent_id', 'parent_departure_time'),
      on = '_id',
      how = 'left'
   )
   .with_columns(parent_end = 
      pl.col('departure_time').search_sorted('parent_departure_time', side='right'))
   .with_columns(end =
      pl.when(pl.col('parent_id').is_not_null())
        .then(pl.col('parent_end'))
        .otherwise(pl.col('end')),
      old_end = 'end'
   )
)

# shape: (12, 13)
# ┌────────┬─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┬─────┬───────────┬───────────────────────┬────────────┬─────────┐
# │ row_nr ┆ _id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        ┆ start ┆ end ┆ parent_id ┆ parent_departure_time ┆ parent_end ┆ old_end │
# │ ---    ┆ --- ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 ┆ ---   ┆ --- ┆ ---       ┆ ---                   ┆ ---        ┆ ---     │
# │ u32    ┆ u32 ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32   ┆ u32 ┆ u32       ┆ datetime[μs]          ┆ u32        ┆ u32     │
# ╞════════╪═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╪═════╪═══════════╪═══════════════════════╪════════════╪═════════╡
# │ 0      ┆ 0   ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1  ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 ┆ 0     ┆ 1   ┆ 0         ┆ 2023-01-01 06:30:08   ┆ 1          ┆ 1       │
# │ 1      ┆ 1   ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1  ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 ┆ 1     ┆ 2   ┆ null      ┆ null                  ┆ 0          ┆ 2       │
# │ 2      ┆ 9   ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1  ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 ┆ 1     ┆ 3   ┆ null      ┆ null                  ┆ 0          ┆ 3       │
# │ 3      ┆ 10  ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1  ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 ┆ 3     ┆ 6   ┆ 2         ┆ 2023-01-01 06:40:01   ┆ 6          ┆ 5       │ # <-
# │ 4      ┆ 3   ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6  ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 ┆ 3     ┆ 6   ┆ 2         ┆ 2023-01-01 06:40:01   ┆ 6          ┆ 5       │ # <-
# │ 5      ┆ 2   ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5  ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 ┆ 2     ┆ 6   ┆ null      ┆ null                  ┆ 0          ┆ 6       │
# │ 6      ┆ 4   ┆ 2023-01-01 06:38:09 ┆ 2023-01-01 06:41:01 ┆ B3  ┆ 2023-01-01 06:37:09 ┆ 2023-01-01 06:42:01 ┆ 3     ┆ 7   ┆ null      ┆ null                  ┆ 0          ┆ 7       │
# │ 7      ┆ 6   ┆ 2023-01-01 06:41:40 ┆ 2023-01-01 06:43:34 ┆ A6  ┆ 2023-01-01 06:40:40 ┆ 2023-01-01 06:44:34 ┆ 6     ┆ 10  ┆ 11        ┆ 2023-01-01 06:45:01   ┆ 10         ┆ 8       │ # <-
# │ 8      ┆ 5   ┆ 2023-01-01 06:39:08 ┆ 2023-01-01 06:44:49 ┆ C3  ┆ 2023-01-01 06:38:08 ┆ 2023-01-01 06:45:49 ┆ 4     ┆ 11  ┆ null      ┆ null                  ┆ 0          ┆ 11      │
# │ 9      ┆ 11  ┆ 2023-01-01 06:40:20 ┆ 2023-01-01 06:45:01 ┆ A5  ┆ 2023-01-01 06:39:20 ┆ 2023-01-01 06:46:01 ┆ 5     ┆ 11  ┆ null      ┆ null                  ┆ 0          ┆ 11      │
# │ 10     ┆ 7   ┆ 2023-01-01 06:42:43 ┆ 2023-01-01 06:45:48 ┆ A5  ┆ 2023-01-01 06:41:43 ┆ 2023-01-01 06:46:48 ┆ 7     ┆ 11  ┆ null      ┆ null                  ┆ 0          ┆ 11      │
# │ 11     ┆ 8   ┆ 2023-01-01 06:44:48 ┆ 2023-01-01 06:51:10 ┆ A6  ┆ 2023-01-01 06:43:48 ┆ 2023-01-01 06:52:10 ┆ 8     ┆ 12  ┆ null      ┆ null                  ┆ 0          ┆ 12      │
# └────────┴─────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┴───────┴─────┴───────────┴───────────────────────┴────────────┴─────────┘

These adjusted start/end positions now give the exact group slice.
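
Which is exactly the shape a .groupby_slices() could consume. In the meantime, as a rough check (reusing @erinov1's get_unique_count helper from above, and assuming the slices really are the exact groups and that depart keeps its departure_time-sorted order), the per-row counts can be pulled out with:

import numpy as np

slices = depart.select('start', 'end').to_numpy()
ids = depart.get_column('ID').to_numpy().astype(np.unicode_)

get_unique_count(slices, ids)  # docked truck count per row, in departure_time order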

Testing it out with a few different inputs generates the same output as the earlier polars and duckdb attempts so it seems to be equivalent (unless I've missed something that doesn't show up in this specific example data).

Although @areeh's approach of using actual times instead of slicing row numbers makes much more sense #9691

@cmdlineluser
Contributor Author

Closing in favour of #9691 - however, it seems this will all be superseded by non-equi joins #10068
