`.groupby_slices()` #9467

cmdlineluser · 2023-06-20T15:04:01Z

Problem description

Inspired by https://stackoverflow.com/questions/76488314/polars-count-unique-values-over-a-time-period

Essentially I'm wondering if it would be possible to do something like:

(df.groupby_slices('start', 'end')
   .agg(arrival = 'arrival_time', departure = 'departure_time', truck_id = 'ID')
)

The original question was about counting within time windows, in which case the approach was to get the start/end slices.

df = pl.from_repr("""
┌───────┬─────┬──────────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┐
│ start ┆ end ┆ event_id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        │
│ ---   ┆ --- ┆ ---      ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 │
│ u32   ┆ u32 ┆ u32      ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        │
╞═══════╪═════╪══════════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╡
│ 0     ┆ 2   ┆ 0        ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:22:47 ┆ 2023-01-01 06:26:08 │
│ 0     ┆ 3   ┆ 1        ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:25:42 ┆ 2023-01-01 06:29:02 │
│ 1     ┆ 7   ┆ 2        ┆ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  ┆ 2023-01-01 06:29:20 ┆ 2023-01-01 06:36:01 │
│ 1     ┆ 7   ┆ 3        ┆ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  ┆ 2023-01-01 06:31:06 ┆ 2023-01-01 06:34:48 │
│ 1     ┆ 8   ┆ 4        ┆ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  ┆ 2023-01-01 06:32:09 ┆ 2023-01-01 06:37:01 │
│ 1     ┆ 9   ┆ 5        ┆ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ 2023-01-01 06:33:08 ┆ 2023-01-01 06:40:49 │
│ 3     ┆ 9   ┆ 6        ┆ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ 2023-01-01 06:35:40 ┆ 2023-01-01 06:39:34 │
│ 4     ┆ 9   ┆ 7        ┆ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ 2023-01-01 06:36:43 ┆ 2023-01-01 06:41:48 │
│ 5     ┆ 9   ┆ 8        ┆ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ 2023-01-01 06:38:48 ┆ 2023-01-01 06:47:10 │
└───────┴─────┴──────────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┘
""")

Expand via .arange(start, end) and into columns with list.to_struct()

(df.with_columns(pl.arange('start', 'end').list.to_struct(n_field_strategy='max_width'))
   .unnest('arange')
)

# shape: (9, 16)
# ┌───────┬─────┬──────────┬─────────────────────┬───┬─────────┬─────────┬─────────┬─────────┐
# │ start ┆ end ┆ event_id ┆ arrival_time        ┆ … ┆ field_4 ┆ field_5 ┆ field_6 ┆ field_7 │
# │ ---   ┆ --- ┆ ---      ┆ ---                 ┆   ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
# │ u32   ┆ u32 ┆ u32      ┆ datetime[μs]        ┆   ┆ i64     ┆ i64     ┆ i64     ┆ i64     │
# ╞═══════╪═════╪══════════╪═════════════════════╪═══╪═════════╪═════════╪═════════╪═════════╡
# │ 0     ┆ 2   ┆ 0        ┆ 2023-01-01 06:23:47 ┆ … ┆ null    ┆ null    ┆ null    ┆ null    │
# │ 0     ┆ 3   ┆ 1        ┆ 2023-01-01 06:26:42 ┆ … ┆ null    ┆ null    ┆ null    ┆ null    │
# │ 1     ┆ 7   ┆ 2        ┆ 2023-01-01 06:30:20 ┆ … ┆ 5       ┆ 6       ┆ null    ┆ null    │
# │ 1     ┆ 7   ┆ 3        ┆ 2023-01-01 06:32:06 ┆ … ┆ 5       ┆ 6       ┆ null    ┆ null    │
# │ 1     ┆ 8   ┆ 4        ┆ 2023-01-01 06:33:09 ┆ … ┆ 5       ┆ 6       ┆ 7       ┆ null    │
# │ 1     ┆ 9   ┆ 5        ┆ 2023-01-01 06:34:08 ┆ … ┆ 5       ┆ 6       ┆ 7       ┆ 8       │
# │ 3     ┆ 9   ┆ 6        ┆ 2023-01-01 06:36:40 ┆ … ┆ 7       ┆ 8       ┆ null    ┆ null    │
# │ 4     ┆ 9   ┆ 7        ┆ 2023-01-01 06:37:43 ┆ … ┆ 8       ┆ null    ┆ null    ┆ null    │
# │ 5     ┆ 9   ┆ 8        ┆ 2023-01-01 06:39:48 ┆ … ┆ null    ┆ null    ┆ null    ┆ null    │
# └───────┴─────┴──────────┴─────────────────────┴───┴─────────┴─────────┴─────────┴─────────┘

And use .take('^field_\d+$') to get the values

(df.with_columns(pl.arange('start', 'end').list.to_struct(n_field_strategy='max_width'))
   .unnest('arange')
   .with_columns(
      arrival   = pl.concat_list(pl.col('arrival_time').take('^field_\d+$')),
      departure = pl.concat_list(pl.col('departure_time').take('^field_\d+$')),
      truck_id  = pl.concat_list(pl.col('ID').take('^field_\d+$'))
   )
   .select(
      pl.exclude('start', 'end', '^field_\d+$')
   )
)

# shape: (9, 9)
# ┌──────────┬─────────────────────┬─────────────────────┬─────┬───┬─────────────────────┬───────────────────────────────────┬───────────────────────────────────┬──────────────────────┐
# │ event_id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ … ┆ window_close        ┆ arrival                           ┆ departure                         ┆ truck_id             │
# │ ---      ┆ ---                 ┆ ---                 ┆ --- ┆   ┆ ---                 ┆ ---                               ┆ ---                               ┆ ---                  │
# │ u32      ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆   ┆ datetime[μs]        ┆ list[datetime[μs]]                ┆ list[datetime[μs]]                ┆ list[str]            │
# ╞══════════╪═════════════════════╪═════════════════════╪═════╪═══╪═════════════════════╪═══════════════════════════════════╪═══════════════════════════════════╪══════════════════════╡
# │ 0        ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ … ┆ 2023-01-01 06:26:08 ┆ [2023-01-01 06:23:47, 2023-01-01… ┆ [2023-01-01 06:25:08, 2023-01-01… ┆ ["A1", "A1", … null] │
# │ 1        ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ … ┆ 2023-01-01 06:29:02 ┆ [2023-01-01 06:23:47, 2023-01-01… ┆ [2023-01-01 06:25:08, 2023-01-01… ┆ ["A1", "A1", … null] │
# │ 2        ┆ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  ┆ … ┆ 2023-01-01 06:36:01 ┆ [2023-01-01 06:26:42, 2023-01-01… ┆ [2023-01-01 06:28:02, 2023-01-01… ┆ ["A1", "A5", … null] │
# │ 3        ┆ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  ┆ … ┆ 2023-01-01 06:34:48 ┆ [2023-01-01 06:26:42, 2023-01-01… ┆ [2023-01-01 06:28:02, 2023-01-01… ┆ ["A1", "A5", … null] │
# │ 4        ┆ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  ┆ … ┆ 2023-01-01 06:37:01 ┆ [2023-01-01 06:26:42, 2023-01-01… ┆ [2023-01-01 06:28:02, 2023-01-01… ┆ ["A1", "A5", … null] │
# │ 5        ┆ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ … ┆ 2023-01-01 06:40:49 ┆ [2023-01-01 06:26:42, 2023-01-01… ┆ [2023-01-01 06:28:02, 2023-01-01… ┆ ["A1", "A5", … "A6"] │
# │ 6        ┆ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ … ┆ 2023-01-01 06:39:34 ┆ [2023-01-01 06:32:06, 2023-01-01… ┆ [2023-01-01 06:33:48, 2023-01-01… ┆ ["A6", "B3", … null] │
# │ 7        ┆ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ … ┆ 2023-01-01 06:41:48 ┆ [2023-01-01 06:33:09, 2023-01-01… ┆ [2023-01-01 06:36:01, 2023-01-01… ┆ ["B3", "C3", … null] │
# │ 8        ┆ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ … ┆ 2023-01-01 06:47:10 ┆ [2023-01-01 06:34:08, 2023-01-01… ┆ [2023-01-01 06:39:49, 2023-01-01… ┆ ["C3", "A6", … null] │
# └──────────┴─────────────────────┴─────────────────────┴─────┴───┴─────────────────────┴───────────────────────────────────┴───────────────────────────────────┴──────────────────────┘

Which seems to work okay and is decently fast (although it probably starts to suffer when the number of field_ columns starts to get sufficiently large.)

I'm not sure if there are any other potential use-cases for being able to specify slices?

It's also possible there are better ways to solve this type of problem?

Thanks all.

The text was updated successfully, but these errors were encountered:

avimallu · 2023-06-20T16:33:37Z

It's also possible there are better ways to solve this type of problem?

Are you restricted to using Polars exclusively? This looks like a non-equi join problem in SQL, where I would approach it with a self-join and two inequality conditions for row being within +/- 1 minute of itself.

erinov1 · 2023-06-20T21:33:18Z

Related issues (in addition to various requsts for non-equi joins): #5891, #7211.

For what it's worth, in my own applications I've found that if I just want aggregate statistics for each slice, then it is best to use polars to find the numerical indices for each slice, convert everything to numpy, then use numba to parallel-iterate (using prange) to compute statistics for each slice. This has been orders of magnitude faster than any pure-polars solution I've found or using non-equi joins in duckdb.

cmdlineluser · 2023-06-21T16:03:04Z

Thanks @avimallu - there are no restrictions, I was mostly just curious about doing it in polars without a cross join.

Thanks @erinov1 - I should have remembered #5891 as I commented in there. Very interesting about the numba approach - do you have a code example handy? No worries if it's too much trouble - i'll experiment with it.

erinov1 · 2023-06-22T14:14:51Z

@cmdlineluser I have no idea if this is efficient for finding unique elements in a string column (I've been using it to compute statistics for numerical columns), but something like

import numpy as np
from numba import njit, prange

slices = df.select("start", "end").to_numpy()
ids = df.get_column("ID").to_numpy().astype(np.unicode_)

@njit(parallel=True)
def get_unique_count(slices, ids):
    unique = np.empty((slices.shape[0],))
    for n in prange(slices.shape[0]):
        start, end = slices[n]
        unique[n] = len(np.unique(ids[start:end]))
    return unique

get_unique_count(slices, ids)
# array([1., 2., 5., 5., 5., 5., 4., 4., 3.])

cmdlineluser · 2023-06-22T16:26:55Z

Thanks a lot @erinov1

Yeah, doing anything with strings seems to slow it down quite a bit.

What I did do was .rank() the string column and use a "manual" n_unique implementation as np.unique() was quite slow - which gave massive speedups as you describe.

I had a look at that GroupsProxy you mentioned in the other issue and I modified the https://github.com/pola-rs/pyo3-polars example.

I have little rust knowledge, so this is just me taking things apart to try to get them to run:

use polars::prelude::*;

pub(super) fn groupby_slices(df: DataFrame, start: DataFrame, end: DataFrame) -> PolarsResult<DataFrame> {
    let start = start.column("start")?;
    let end = end.column("end")?;

    let sa = start.i64()?;
    let sb = end.i64()?;

    let groups = sa.into_iter().zip(sb.into_iter()).map(|(a, b)| {
        match (a, b) {
            (Some(a), Some(b)) => {
                let start = a as IdxSize;
                let end = b as IdxSize;
                let len = end - start as IdxSize;
                [start, len]
            },
            _ => [0, 0] // This is wrong. What to do here?
        }
    }).collect();

    let groups = GroupsProxy::Slice { groups, rolling: false };

    let out = GroupBy::new(&df, vec![], groups, None);
    //Ok(out.first()?)
    Ok(out.agg_list()?)
}

This pretty much returns the groups instantly, however it seems this Groupby::new() + .aggfunc() are deprecated.

I found some further clues in: https://github.com/pola-rs/polars/blob/main/polars/polars-lazy/src/dsl/list.rs#L113

I can't seem to figure out how to assign the groups without actually running an aggregation so just .lazy().groupby() can be returned back and then the .agg() can be called from python.

I'm a bit lost with this Arena/LogicalPlan stuff.

avimallu · 2023-06-22T17:47:57Z

Thanks @avimallu - there are no restrictions, I was mostly just curious about doing it in polars without a cross join.

A very elegant (if I say so myself) solution in SQL using non-equi joins (eagerly waiting for them to land in Polars, more so after this question):

import polars as pl
data = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")

Followed by

See code

import duckdb as db
db.query("""
    SELECT
        A.arrival_time
        ,A.departure_time
        ,A.window_open
        ,A.window_close
        ,LIST_UNIQUE(LIST(B.ID))    AS docked_trucks
    FROM (
        SELECT
            *
            ,arrival_time   - (INTERVAL 1 MINUTE) AS window_open
            ,departure_time + (INTERVAL 1 MINUTE) AS window_close
        FROM data
    ) A
    LEFT JOIN (
        SELECT
            *
            ,DATEDIFF('seconds', arrival_time, departure_time) AS duration
        FROM data
    ) B
    ON (
        (B.arrival_time <= A.window_open AND (B.arrival_time   + TO_SECONDS(B.duration)) >= A.window_open) OR
        (B.arrival_time >= A.window_open AND                           B.departure_time  <= A.window_close) OR
        (B.arrival_time >= A.window_open AND (B.departure_time - TO_SECONDS(B.duration)) <= A.window_close)
    )
    GROUP BY 1, 2, 3, 4
""")

gives:

┌─────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────────┐
│    arrival_time     │   departure_time    │     window_open     │    window_close     │ docked_trucks │
│      timestamp      │      timestamp      │      timestamp      │      timestamp      │    uint64     │
├─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼───────────────┤
│ 2023-01-01 06:23:47 │ 2023-01-01 06:25:08 │ 2023-01-01 06:22:47 │ 2023-01-01 06:26:08 │             1 │
│ 2023-01-01 06:26:42 │ 2023-01-01 06:28:02 │ 2023-01-01 06:25:42 │ 2023-01-01 06:29:02 │             1 │
│ 2023-01-01 06:30:20 │ 2023-01-01 06:35:01 │ 2023-01-01 06:29:20 │ 2023-01-01 06:36:01 │             4 │
│ 2023-01-01 06:32:06 │ 2023-01-01 06:33:48 │ 2023-01-01 06:31:06 │ 2023-01-01 06:34:48 │             4 │
│ 2023-01-01 06:33:09 │ 2023-01-01 06:36:01 │ 2023-01-01 06:32:09 │ 2023-01-01 06:37:01 │             4 │
│ 2023-01-01 06:34:08 │ 2023-01-01 06:39:49 │ 2023-01-01 06:33:08 │ 2023-01-01 06:40:49 │             4 │
│ 2023-01-01 06:36:40 │ 2023-01-01 06:38:34 │ 2023-01-01 06:35:40 │ 2023-01-01 06:39:34 │             4 │
│ 2023-01-01 06:37:43 │ 2023-01-01 06:40:48 │ 2023-01-01 06:36:43 │ 2023-01-01 06:41:48 │             3 │
│ 2023-01-01 06:39:48 │ 2023-01-01 06:46:10 │ 2023-01-01 06:38:48 │ 2023-01-01 06:47:10 │             3 │
└─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────────┘

The three join conditions can be more understood visually:

See code

overlap window        ================
case 1 (o)     ====
case 2 (x)         ======
case 3 (x)                 =======
case 4 (x)                         =======
case 5 (o)                                  =====

We only care about the three cases marked with x that overlap with the window we are looking to confirm against, so only three conditions out of the five need to be specified in the join.

This will work well with any Polars based pipeline because of the zero-copy integration that DuckDB has with it.

cmdlineluser · 2023-06-22T18:54:04Z

Thank you for the code example @avimallu \o/

I'm not sure if it's my system or not, but duckdb seems to take a bit of time with this operation.

If I increase the data a bit, it takes 12s

df = pl.concat(
   df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m')) 
   for N in range(1, 3000)
)

The polars approach from above takes 0.5s

import polars as pl
from   codetiming import Timer

df = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")

df = pl.concat(
   df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m')) 
   for N in range(1, 3000)
)

with Timer():
    df = (
       df
         .with_columns(   
            window_open  = pl.col("arrival_time").dt.offset_by("-1m"),
            window_close = pl.col("departure_time").dt.offset_by("1m")
         )
        .sort("arrival_time")
        .with_columns(event_id = pl.arange(0, pl.count()))
    )

    start = (
       df.sort("window_open") 
         .join_asof(
            df.sort("departure_time").with_row_count("start"), 
            left_on  = "window_open",  
            right_on = "departure_time"
         )
         .select(pl.col("start").backward_fill())
    )

    end = (
       df.sort("window_close")
         .join_asof(
            df.sort("arrival_time").with_row_count("end"),   
            left_on  = "window_close", 
            right_on = "arrival_time", 
            strategy = "forward"
         )
         .select(pl.col("end").forward_fill() + 1)
    )

    (pl.concat([df, start, end], how="horizontal")
       .with_columns(
          pl.arange("start", "end").list.to_struct(n_field_strategy="max_width")
       )
       .unnest("arange")
       .with_columns(
          arrival   = pl.concat_list(pl.col("arrival_time").take("^field_\d+$")),
          departure = pl.concat_list(pl.col("departure_time").take("^field_\d+$")),
          truck_id  = pl.concat_list(pl.col("ID").take("^field_\d+$"))
       )
       .explode("arrival", "departure", "truck_id")
       .filter(
          pl.all(
             pl.col("window_open")  <= pl.col("departure"), 
             pl.col("window_close") >= pl.col("arrival")
          )
       )
       .groupby("window_open", "window_close")
       .agg(
          pl.col("event_id", "arrival_time", "departure_time", "ID").first(), 
          docked_trucks = pl.col("truck_id").n_unique()
       )
       .sort("event_id") # sort and re-order columns for output
       .select( 
           "event_id",
           "arrival_time",
           "departure_time",
           "ID",
           "window_open",
           "window_close",
           "docked_trucks",
       )
    )

Going from 3000 to 6000 the difference increases 45s vs. 0.6s

avimallu · 2023-06-22T19:01:49Z

Hmm, odd. It takes 0.4505s for me. I'm not able to run your code for Polars. The problem starts here:

pl.concat([df, start, end], how="horizontal")

What are start and end? I'll test on my system (M2 Mac, not Pro) and get back if you can update it. The code I ran for DuckDB's evaluation is:

See code

import polars as pl
import duckdb as db
from   codetiming import Timer

df = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")

df = pl.concat(
   df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m')) 
   for N in range(1, 3000)
)
with Timer():
    db.query("""
        SELECT
            A.arrival_time
            ,A.departure_time
            ,A.window_open
            ,A.window_close
            ,LIST_UNIQUE(LIST(B.ID))    AS docked_trucks
        FROM (
            SELECT
                *
                ,arrival_time   - (INTERVAL 1 MINUTE) AS window_open
                ,departure_time + (INTERVAL 1 MINUTE) AS window_close
            FROM df
        ) A
        LEFT JOIN (
            SELECT
                *
                ,DATEDIFF('seconds', arrival_time, departure_time) AS duration
            FROM df
        ) B
        ON (
            (B.arrival_time <= A.window_open AND (B.arrival_time   + TO_SECONDS(B.duration)) >= A.window_open) OR
            (B.arrival_time >= A.window_open AND                           B.departure_time  <= A.window_close) OR
            (B.arrival_time >= A.window_open AND (B.departure_time - TO_SECONDS(B.duration)) <= A.window_close)
        )
        GROUP BY 1, 2, 3, 4
    """)

Am I missing something in the DuckDB code? By the way, DuckDB caches the query so it'll run much faster the second time - all my timings are therefore for the first time. My DuckDB version is 0.7.2-dev1898. Maybe that makes the difference?

EDIT

The newer version (0.8.1) is even faster: Elapsed time: 0.1335 seconds?!

cmdlineluser · 2023-06-22T19:25:30Z

I edited in start/end.

Thanks for the timings, I tried a reboot, I thought it may be due to my system being overloaded but I get the same results - I am also using 0.8.1.

I'm using an older laptop, so that may have something to with it - looking athtop - the duckdb version seems to be bouncing around 4 of 8 cores, but only using 1 at a time.

avimallu · 2023-06-22T19:31:42Z

Thanks for the edit; you should probably raise an issue with the DuckDB team given the slower speed on your system. Here are my timings now (there is a significant variance in the timings per run).

DuckDB: Elapsed time: 0.1050 seconds
Polars: Elapsed time: 0.3392 seconds

Given the easier query, I'd probably stick to the former for now. Pretty impressive that Polars can reach that speed even without dedicated non-equi joins. Really excited for when they land 🤯.

cmdlineluser · 2023-06-22T20:47:18Z

I was in an interactive session. The slowdown seems to be related to the duckdb progress bar somehow.

Running the code in a standalone script gives the same results as you.

Oddly enough, if I add pragma disable_progress_bar; to the script it goes back to 12-13 seconds.

At least I have proper timings now, no more testing in an interactive session it seems, thank you.

cmdlineluser · 2023-06-23T16:29:52Z

I realized that instead of going sideways by adding columns, it makes more sense to go lengthways.

df = df.lazy()
start = start.lazy()
end = end.lazy()

(
   df.select('event_id').with_context([start, end])
     .select('event_id', pl.arange('start', 'end'))
     .explode('arange')
     .with_context(df)
     .select(pl.all().take(pl.col('arange')), group = 'event_id')
     .collect()
)

# shape: (47, 8)
# ┌──────────┬────────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┐
# │ event_id ┆ arange ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        ┆ group │
# │ ---      ┆ ---    ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 ┆ ---   │
# │ i64      ┆ i64    ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        ┆ i64   │
# ╞══════════╪════════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╡
# │ 0        ┆ 0      ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:22:47 ┆ 2023-01-01 06:26:08 ┆ 0     │
# │ 0        ┆ 1      ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:25:42 ┆ 2023-01-01 06:29:02 ┆ 0     │
# │ 0        ┆ 0      ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  ┆ 2023-01-01 06:22:47 ┆ 2023-01-01 06:26:08 ┆ 1     │
# │ 0        ┆ 1      ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  ┆ 2023-01-01 06:25:42 ┆ 2023-01-01 06:29:02 ┆ 1     │
# │ …        ┆ …      ┆ …                   ┆ …                   ┆ …   ┆ …                   ┆ …                   ┆ …     │
# │ 2        ┆ 1      ┆ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  ┆ 2023-01-01 06:33:08 ┆ 2023-01-01 06:40:49 ┆ 8     │
# │ 2        ┆ 2      ┆ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  ┆ 2023-01-01 06:35:40 ┆ 2023-01-01 06:39:34 ┆ 8     │
# │ 2        ┆ 3      ┆ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  ┆ 2023-01-01 06:36:43 ┆ 2023-01-01 06:41:48 ┆ 8     │
# │ 2        ┆ 4      ┆ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  ┆ 2023-01-01 06:38:48 ┆ 2023-01-01 06:47:10 ┆ 8     │
# └──────────┴────────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┴───────┘

So you actually create the rows for each group which you can then join.

df = df.lazy()

start = start.lazy()
end = end.lazy()

groups = (
   df.select('event_id').with_context([start, end])
     .select('event_id', pl.arange('start', 'end'))
     .explode('arange')
     .with_context(df)
     .select(pl.all().take(pl.col('arange')), group = 'event_id')
)

(df.join(groups, left_on='event_id', right_on='group', how='left')
   .filter(pl.all(
      pl.col("window_open")  <= pl.col("departure_time_right"),
      pl.col("window_close") >= pl.col("arrival_time_right")
   ))
   .groupby('window_open', 'window_close', maintain_order=True)
   .agg(
      pl.col('event_id', 'arrival_time', 'departure_time', 'ID').first(),
      docked_trucks = pl.n_unique('ID_right')
   )
   .collect()
)

avimallu · 2023-06-28T20:40:17Z

I was able to brainstorm with a few folks and find a simpler approach on the SQL front:

See code

import polars as pl
import duckdb as db
data = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")

db.query("""
SELECT
    A.arrival_time
    ,A.departure_time
    ,A.arrival_time - (INTERVAL 1 MINUTE)   AS window_open
    ,A.departure_time + (INTERVAL 1 MINUTE) AS window_close
    ,LIST_DISTINCT(LIST(B.ID)) AS docked_trucks
    ,LIST_UNIQUE(LIST(B.ID))   AS docked_truck_count
FROM  data A, data B
WHERE B.arrival_time   <= window_close
AND   B.departure_time >= window_open
GROUP BY 1, 2, 3, 4
""").pl()

This reduces the number of conditions to just two, and the SQL optimizer engine automatically converts the cross join to an interval join, so this is probably the most readable query while still being just as fast (or faster).

cmdlineluser · 2023-06-30T06:46:07Z

Thank you for the update @avimallu - that does seem like a nice improvement.

Some interesting things I discovered from experimenting with your example:

duckdb has GROUP BY ALL
duckdb has UFCS, so you can write: B.id.list().list_unique()

Pretty neat stuff.

I think I spotted where the real issue with the timing mismatch came from.

When getting 0.1s for the larger dataset with duckdb I had copied your code from: #9467 (comment)

It just has db.query() and doesn't do anything with the result so I don't think duckdb actually runs the full query.

Adding a print() or .pl() makes the query run and it then takes ~12 seconds for me.

Perhaps you can confirm this.

avimallu · 2023-06-30T13:42:45Z

AFAIK, the query and sql methods execute the query without the need for a pl or pd call, with minor differences.

duckdb has UFCS, so you can write: B.id.list().list_unique()

This is very neat indeed. I have never noticed it being spoken about before. Crazy!

cmdlineluser · 2023-06-30T20:46:55Z

@avimallu I've received access to an M1 Mac Pro and just installed a fresh polars and duckdb.

Using the code from #9467 (comment)

Elapsed time: 0.1485 seconds

Adding a print() around db.query()

Elapsed time: 4.9586 seconds

I've also tested writing the output from within duckdb e.g. COPY (...) TO 'out.csv' and it's taking 5 seconds.

It definitely looks like the 0.1s time is false and the query is not running fully in that case.

avimallu · 2023-06-30T21:00:51Z

Good point. I think something happens with Timer that causes it to not evaluate normally with just a db.query, because running it outside of it does reduce the time taken. This code seems to take about 1.1s:

See code

import polars as pl
import duckdb as db
from   codetiming import Timer

df = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time        ┆ departure_time      ┆ ID  │
│ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1  │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1  │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5  │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6  │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3  │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3  │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6  │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5  │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6  │
└─────────────────────┴─────────────────────┴─────┘
""")
df = pl.concat(
   df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m')) 
   for N in range(1, 3000)
)

with Timer():
    db.query("""
    SELECT
        A.arrival_time
        ,A.departure_time
        ,A.arrival_time - (INTERVAL 1 MINUTE)   AS window_open
        ,A.departure_time + (INTERVAL 1 MINUTE) AS window_close
        ,LIST_DISTINCT(LIST(B.ID)) AS docked_trucks
        ,LIST_UNIQUE(LIST(B.ID))   AS docked_truck_count
    FROM  df A, df B
    WHERE B.arrival_time   <= window_close
    AND   B.departure_time >= window_open
    GROUP BY 1, 2, 3, 4
    """).pl()

cmdlineluser · 2023-06-30T21:55:27Z

Yeah, very odd - thank you for confirming.

Using the updated version, I get 1.3s

cmdlineluser · 2023-07-06T03:06:04Z

I just realized that .groupby_slices wouldn't help with the current approach due to the .filter()

The above approach finds the "outermost windows" and then filters them.

It seems that to do it properly you would need to find the "inner windows" and do it separately for arrivals and departures.

There is also seems to be the issue of "non-crossing" windows: (there's definitely a technical term for this: overlapping/intersecting?)

df = pl.from_repr("""
shape: (12, 6)
┌─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┐
│ _id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        │
│ --- ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 │
│ u32 ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        │
╞═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╡
│ 0   ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1  ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 │
│ 1   ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1  ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 │
│ 2   ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5  ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 │
│ 3   ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6  ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 │
│ 4   ┆ 2023-01-01 06:38:09 ┆ 2023-01-01 06:41:01 ┆ B3  ┆ 2023-01-01 06:37:09 ┆ 2023-01-01 06:42:01 │
│ 5   ┆ 2023-01-01 06:39:08 ┆ 2023-01-01 06:44:49 ┆ C3  ┆ 2023-01-01 06:38:08 ┆ 2023-01-01 06:45:49 │
│ 6   ┆ 2023-01-01 06:41:40 ┆ 2023-01-01 06:43:34 ┆ A6  ┆ 2023-01-01 06:40:40 ┆ 2023-01-01 06:44:34 │
│ 7   ┆ 2023-01-01 06:42:43 ┆ 2023-01-01 06:45:48 ┆ A5  ┆ 2023-01-01 06:41:43 ┆ 2023-01-01 06:46:48 │
│ 8   ┆ 2023-01-01 06:44:48 ┆ 2023-01-01 06:51:10 ┆ A6  ┆ 2023-01-01 06:43:48 ┆ 2023-01-01 06:52:10 │
│ 9   ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1  ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 │
│ 10  ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1  ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 │
│ 11  ┆ 2023-01-01 06:40:20 ┆ 2023-01-01 06:45:01 ┆ A5  ┆ 2023-01-01 06:39:20 ┆ 2023-01-01 06:46:01 │
└─────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┘
""")

arrive = (
   df.sort('arrival_time')
      .with_columns(
         start = pl.col('arrival_time').search_sorted('window_open'),
         end = pl.col('arrival_time').search_sorted('window_close'),
      )
      .with_row_count()
)

# shape: (12, 9)
# ┌────────┬─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┬─────┐
# │ row_nr ┆ _id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        ┆ start ┆ end │
# │ ---    ┆ --- ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 ┆ ---   ┆ --- │
# │ u32    ┆ u32 ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32   ┆ u32 │
# ╞════════╪═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╪═════╡
# │ 0      ┆ 0   ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1  ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 ┆ 0     ┆ 1   │
# │ 1      ┆ 1   ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1  ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 ┆ 1     ┆ 3   │
# │ 2      ┆ 9   ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1  ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 ┆ 2     ┆ 4   │
# │ 3      ┆ 2   ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5  ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 ┆ 3     ┆ 10  │
# │ 4      ┆ 10  ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1  ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 ┆ 4     ┆ 8   │ # <-
# │ 5      ┆ 3   ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6  ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 ┆ 4     ┆ 9   │ # <-

Rows 4 and 5 here sort into position 4, however their windows are both fully contained inside the window of row 3 (but don't "touch")

This means:

row 3's window finds 4 and 5,
row 4 finds 5
row 5 finds nothing

But it should be 3, 4, 5 for each of them.

So it seems that grouping by the window this way cannot detect "parent windows", only when their boundaries actually "touch". (cross/intersect/overlap?)

It appears we can look back and check if we have a parent window:

previous = pl.when(pl.col('start') == 0).then(1).otherwise(pl.col('start')) - 1

is_sub_window = pl.all(
   pl.col('arrival_time', 'departure_time') <= pl.col('departure_time').take(previous)
)

arrive = (
   arrive.with_columns(
      parent_id = pl.when(is_sub_window).then(pl.col('_id').take(previous)),
      parent_departure_time =
         pl.when(is_sub_window).then(pl.col('departure_time').take(previous))
   )
)

# shape: (12, 11)
# ┌────────┬─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┬─────┬───────────┬───────────────────────┐
# │ row_nr ┆ _id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        ┆ start ┆ end ┆ parent_id ┆ parent_departure_time │
# │ ---    ┆ --- ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 ┆ ---   ┆ --- ┆ ---       ┆ ---                   │
# │ u32    ┆ u32 ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32   ┆ u32 ┆ u32       ┆ datetime[μs]          │
# ╞════════╪═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╪═════╪═══════════╪═══════════════════════╡
# │ 0      ┆ 0   ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1  ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 ┆ 0     ┆ 1   ┆ 0         ┆ 2023-01-01 06:30:08   │
# │ 1      ┆ 1   ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1  ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 ┆ 1     ┆ 3   ┆ null      ┆ null                  │
# │ 2      ┆ 9   ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1  ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 ┆ 2     ┆ 4   ┆ null      ┆ null                  │
# │ 3      ┆ 2   ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5  ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 ┆ 3     ┆ 9   ┆ null      ┆ null                  │
# │ 4      ┆ 10  ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1  ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 ┆ 4     ┆ 7   ┆ 2         ┆ 2023-01-01 06:40:01   │ # <-
# │ 5      ┆ 3   ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6  ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 ┆ 4     ┆ 8   ┆ 2         ┆ 2023-01-01 06:40:01   │ # <-

We can then use the parent's departure time to ensure the windows "touch".

depart = (
   df.sort('departure_time')
     .with_columns(
        start = pl.col('departure_time').search_sorted('window_open'),
        end = pl.col('departure_time').search_sorted('window_close')
     )
    .with_row_count()
)

depart = (
   depart.join(
      arrive.drop_nulls().select('_id', 'parent_id', 'parent_departure_time'),
      on = '_id',
      how = 'left'
   )
   .with_columns(parent_end = 
      pl.col('departure_time').search_sorted('parent_departure_time', side='right'))
   .with_columns(end =
      pl.when(pl.col('parent_id').is_not_null())
        .then(pl.col('parent_end'))
        .otherwise(pl.col('end')),
      old_end = 'end'
   )
)

# shape: (12, 13)
# ┌────────┬─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┬─────┬───────────┬───────────────────────┬────────────┬─────────┐
# │ row_nr ┆ _id ┆ arrival_time        ┆ departure_time      ┆ ID  ┆ window_open         ┆ window_close        ┆ start ┆ end ┆ parent_id ┆ parent_departure_time ┆ parent_end ┆ old_end │
# │ ---    ┆ --- ┆ ---                 ┆ ---                 ┆ --- ┆ ---                 ┆ ---                 ┆ ---   ┆ --- ┆ ---       ┆ ---                   ┆ ---        ┆ ---     │
# │ u32    ┆ u32 ┆ datetime[μs]        ┆ datetime[μs]        ┆ str ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32   ┆ u32 ┆ u32       ┆ datetime[μs]          ┆ u32        ┆ u32     │
# ╞════════╪═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╪═════╪═══════════╪═══════════════════════╪════════════╪═════════╡
# │ 0      ┆ 0   ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1  ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 ┆ 0     ┆ 1   ┆ 0         ┆ 2023-01-01 06:30:08   ┆ 1          ┆ 1       │
# │ 1      ┆ 1   ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1  ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 ┆ 1     ┆ 2   ┆ null      ┆ null                  ┆ 0          ┆ 2       │
# │ 2      ┆ 9   ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1  ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 ┆ 1     ┆ 3   ┆ null      ┆ null                  ┆ 0          ┆ 3       │
# │ 3      ┆ 10  ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1  ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 ┆ 3     ┆ 6   ┆ 2         ┆ 2023-01-01 06:40:01   ┆ 6          ┆ 5       │ # <-
# │ 4      ┆ 3   ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6  ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 ┆ 3     ┆ 6   ┆ 2         ┆ 2023-01-01 06:40:01   ┆ 6          ┆ 5       │ # <-
# │ 5      ┆ 2   ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5  ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 ┆ 2     ┆ 6   ┆ null      ┆ null                  ┆ 0          ┆ 6       │
# │ 6      ┆ 4   ┆ 2023-01-01 06:38:09 ┆ 2023-01-01 06:41:01 ┆ B3  ┆ 2023-01-01 06:37:09 ┆ 2023-01-01 06:42:01 ┆ 3     ┆ 7   ┆ null      ┆ null                  ┆ 0          ┆ 7       │
# │ 7      ┆ 6   ┆ 2023-01-01 06:41:40 ┆ 2023-01-01 06:43:34 ┆ A6  ┆ 2023-01-01 06:40:40 ┆ 2023-01-01 06:44:34 ┆ 6     ┆ 10  ┆ 11        ┆ 2023-01-01 06:45:01   ┆ 10         ┆ 8       │ # <-
# │ 8      ┆ 5   ┆ 2023-01-01 06:39:08 ┆ 2023-01-01 06:44:49 ┆ C3  ┆ 2023-01-01 06:38:08 ┆ 2023-01-01 06:45:49 ┆ 4     ┆ 11  ┆ null      ┆ null                  ┆ 0          ┆ 11      │
# │ 9      ┆ 11  ┆ 2023-01-01 06:40:20 ┆ 2023-01-01 06:45:01 ┆ A5  ┆ 2023-01-01 06:39:20 ┆ 2023-01-01 06:46:01 ┆ 5     ┆ 11  ┆ null      ┆ null                  ┆ 0          ┆ 11      │
# │ 10     ┆ 7   ┆ 2023-01-01 06:42:43 ┆ 2023-01-01 06:45:48 ┆ A5  ┆ 2023-01-01 06:41:43 ┆ 2023-01-01 06:46:48 ┆ 7     ┆ 11  ┆ null      ┆ null                  ┆ 0          ┆ 11      │
# │ 11     ┆ 8   ┆ 2023-01-01 06:44:48 ┆ 2023-01-01 06:51:10 ┆ A6  ┆ 2023-01-01 06:43:48 ┆ 2023-01-01 06:52:10 ┆ 8     ┆ 12  ┆ null      ┆ null                  ┆ 0          ┆ 12      │
# └────────┴─────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┴───────┴─────┴───────────┴───────────────────────┴────────────┴─────────┘

These adjusted start/end positions now give the exact group slice.

Testing it out with a few different inputs generates the same output as the earlier polars and duckdb attempts so it seems to be equivalent (unless I've missed something that doesn't show up in this specific example data).

Although @areeh's approach of using actual times instead of slicing row numbers makes much more sense #9691

cmdlineluser · 2023-07-28T15:06:39Z

Closing in favour of #9691 - however, it seems this will all be superseded by non-equi joins #10068

cmdlineluser added the enhancement New feature or an improvement of an existing feature label Jun 20, 2023

erinov1 mentioned this issue Jun 21, 2023

Obtain multiple ranges of data (for aggregation) from a huge dataset - similar to R's foverlaps() in the data.table package #9481

Closed

cmdlineluser mentioned this issue Jun 29, 2023

Rust: access to groupby agg from custom GroupsProxy? #7211

Closed

2 tasks

areeh mentioned this issue Jul 3, 2023

groupby using a sequence of start/end times of windows #9691

Closed

cmdlineluser closed this as completed Jul 28, 2023

avimallu mentioned this issue Jul 29, 2023

Non-equi join tracking issue #10068

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`.groupby_slices()` #9467

`.groupby_slices()` #9467

cmdlineluser commented Jun 20, 2023

avimallu commented Jun 20, 2023

erinov1 commented Jun 20, 2023

cmdlineluser commented Jun 21, 2023

erinov1 commented Jun 22, 2023

cmdlineluser commented Jun 22, 2023

avimallu commented Jun 22, 2023 •

edited

Loading

cmdlineluser commented Jun 22, 2023 •

edited

Loading

avimallu commented Jun 22, 2023 •

edited

Loading

cmdlineluser commented Jun 22, 2023 •

edited

Loading

avimallu commented Jun 22, 2023 •

edited

Loading

cmdlineluser commented Jun 22, 2023 •

edited

Loading

cmdlineluser commented Jun 23, 2023

avimallu commented Jun 28, 2023 •

edited

Loading

cmdlineluser commented Jun 30, 2023

avimallu commented Jun 30, 2023 •

edited

Loading

cmdlineluser commented Jun 30, 2023

avimallu commented Jun 30, 2023

cmdlineluser commented Jun 30, 2023

cmdlineluser commented Jul 6, 2023

cmdlineluser commented Jul 28, 2023

.groupby_slices() #9467

.groupby_slices() #9467

Comments

cmdlineluser commented Jun 20, 2023

Problem description

avimallu commented Jun 20, 2023

erinov1 commented Jun 20, 2023

cmdlineluser commented Jun 21, 2023

erinov1 commented Jun 22, 2023

cmdlineluser commented Jun 22, 2023

avimallu commented Jun 22, 2023 • edited Loading

cmdlineluser commented Jun 22, 2023 • edited Loading

avimallu commented Jun 22, 2023 • edited Loading

EDIT

cmdlineluser commented Jun 22, 2023 • edited Loading

avimallu commented Jun 22, 2023 • edited Loading

cmdlineluser commented Jun 22, 2023 • edited Loading

cmdlineluser commented Jun 23, 2023

avimallu commented Jun 28, 2023 • edited Loading

cmdlineluser commented Jun 30, 2023

avimallu commented Jun 30, 2023 • edited Loading

cmdlineluser commented Jun 30, 2023

avimallu commented Jun 30, 2023

cmdlineluser commented Jun 30, 2023

cmdlineluser commented Jul 6, 2023

cmdlineluser commented Jul 28, 2023

`.groupby_slices()` #9467

`.groupby_slices()` #9467

avimallu commented Jun 22, 2023 •

edited

Loading

cmdlineluser commented Jun 22, 2023 •

edited

Loading

avimallu commented Jun 22, 2023 •

edited

Loading

cmdlineluser commented Jun 22, 2023 •

edited

Loading

avimallu commented Jun 22, 2023 •

edited

Loading

cmdlineluser commented Jun 22, 2023 •

edited

Loading

avimallu commented Jun 28, 2023 •

edited

Loading

avimallu commented Jun 30, 2023 •

edited

Loading