"Dynamic" groupby using a second dataframe of start/end times or indexes #5891

Open
erinov1 opened this issue Dec 23, 2022 · 4 comments
Labels: enhancement (New feature or an improvement of an existing feature)

erinov1 commented Dec 23, 2022

Problem description

Using groupby_dynamic/groupby_rolling you can create temporal groups using windows of a specified length. I'd like to explicitly create groups from a second dataframe containing the start and end times for each group (based either on a date column or an index column). For example, consider the following:

from datetime import date

import polars as pl

df = pl.DataFrame(
    {
        "date": [
            date(2000, 1, 1),
            date(2000, 6, 1),
            date(2001, 1, 1),
            date(2003, 1, 1),
            date(2022, 1, 1),
        ],
        "id": [1, 1, 1, 2, 2],
        "values": ["A", "B", "C", "D", "E"],
    }
)

date_lookup = pl.DataFrame(
    {
        "id": [1, 1, 2],
        "start_date": [date(2000, 1, 1), date(2000, 1, 1), date(2002, 6, 1)],
        "end_date": [date(2000, 7, 1), date(2002, 1, 1), date(2003, 6, 1)],
    }
)
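
For contrast, the existing API can only derive the window boundaries from a rule rather than from data - a minimal sketch (note that groupby_dynamic expects the frame to be sorted by the index column):

# Rule-based windows: fixed yearly buckets whose boundaries do not come from a lookup table
df.sort("date").groupby_dynamic("date", every="1y").agg(pl.col("values").count())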

I'd like to use date_lookup to create a grouping for df. Namely, for id = 1 I'd like one group of rows where date is in the interval [2000-01-01, 2000-07-01] and another where date is in [2000-01-01, 2002-01-01], whereas for id = 2 I'd like the rows where date is in the interval [2002-06-01, 2003-06-01].

As far as I know, this is not currently possible. I can finagle a dataframe that contains the index slices I want (although the way I'm doing this is quite slow), but even then I don't know how to aggregate over each of those different slices without doing a for-loop over them.

Presumably there could be a faster native implementation (which hopefully does not involve copying data for overlapping groups). Ideally this would work for either time-like columns or integer columns (requiring that the dataframes are all sorted, of course).
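
For concreteness, a naive reference implementation of the semantics I have in mind (just a sketch, using a plain Python loop) would be:

# Sketch only: one group per (id, start_date, end_date) row of date_lookup;
# overlapping intervals simply produce overlapping groups.
groups = []
for row in date_lookup.iter_rows(named=True):
    groups.append(
        df.filter(
            (pl.col("id") == row["id"])
            & (pl.col("date") >= row["start_date"])
            & (pl.col("date") <= row["end_date"])
        )
    )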

Alternatively, if there is an existing solution I would be happy to hear about it!

erinov1 added the enhancement label Dec 23, 2022

cmdlineluser commented Dec 23, 2022

There are probably better ways - but for a start - are these the groups you'd expect from your example?

(
    df.join(date_lookup, on="id")  # one row per (df row, candidate interval) pair
    .groupby([
        pl.col("id"),
        pl.col("date").is_between(
            pl.col("start_date"), pl.col("end_date"), include_bounds=True
        ),
    ])
    .agg_list()
    .filter("is_between")  # keep only the in-interval groups
    .drop("is_between")
    .explode(pl.exclude("id"))
    .groupby(pl.col(["id", "start_date", "end_date"]))
    .agg_list()
)
shape: (3, 5)
┌─────┬────────────┬────────────┬─────────────────────────────────────┬─────────────────┐
│ id  | start_date | end_date   | date                                | values          │
│ --- | ---        | ---        | ---                                 | ---             │
│ i64 | date       | date       | list[date]                          | list[str]       │
╞═════╪════════════╪════════════╪═════════════════════════════════════╪═════════════════╡
│ 1   | 2000-01-01 | 2002-01-01 | [2000-01-01, 2000-06-01, 2001-01... | ["A", "B", "C"] │
├─────┼────────────┼────────────┼─────────────────────────────────────┼─────────────────┤
│ 1   | 2000-01-01 | 2000-07-01 | [2000-01-01, 2000-06-01]            | ["A", "B"]      │
├─────┼────────────┼────────────┼─────────────────────────────────────┼─────────────────┤
│ 2   | 2002-06-01 | 2003-06-01 | [2003-01-01]                        | ["D"]           │
└─────┴────────────┴────────────┴─────────────────────────────────────┴─────────────────┘
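
(On newer polars versions the same approach should still work with a few renames - groupby became group_by, agg_list() was removed in favour of agg(pl.all()), and is_between now takes closed="both" instead of include_bounds=True and no longer names its output "is_between", so the alias below is one I chose myself. A rough sketch:)

# Hedged sketch against the current API; "in_interval" is an assumed alias.
(
    df.join(date_lookup, on="id")
    .group_by(
        "id",
        pl.col("date")
        .is_between(pl.col("start_date"), pl.col("end_date"), closed="both")
        .alias("in_interval"),
    )
    .agg(pl.all())
    .filter(pl.col("in_interval"))
    .drop("in_interval")
    .explode(pl.exclude("id"))
    .group_by("id", "start_date", "end_date")
    .agg(pl.all())
)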

@cmdlineluser

Again not an efficient solution due to looping over each row in date_lookup:

df = df.sort("date")
date_lookup = date_lookup.sort(["start_date", "end_date"])
for n in range(date_lookup.height):
    print(
        df.join_asof(
            date_lookup.slice(n, 1),  # one interval at a time
            left_on="date",
            right_on="end_date",
            by="id",
            strategy="forward",
        ).filter(
            pl.col("date").is_between(
                pl.col("start_date"),
                pl.col("end_date"),
                include_bounds=True,
            )
        )
    )
shape: (2, 5)
┌────────────┬─────┬────────┬────────────┬────────────┐
│ date       | id  | values | start_date | end_date   │
│ ---        | --- | ---    | ---        | ---        │
│ date       | i64 | str    | date       | date       │
╞════════════╪═════╪════════╪════════════╪════════════╡
│ 2000-01-01 | 1   | A      | 2000-01-01 | 2000-07-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2000-06-01 | 1   | B      | 2000-01-01 | 2000-07-01 │
└────────────┴─────┴────────┴────────────┴────────────┘
shape: (3, 5)
┌────────────┬─────┬────────┬────────────┬────────────┐
│ date       | id  | values | start_date | end_date   │
│ ---        | --- | ---    | ---        | ---        │
│ date       | i64 | str    | date       | date       │
╞════════════╪═════╪════════╪════════════╪════════════╡
│ 2000-01-01 | 1   | A      | 2000-01-01 | 2002-01-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2000-06-01 | 1   | B      | 2000-01-01 | 2002-01-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2001-01-01 | 1   | C      | 2000-01-01 | 2002-01-01 │
└────────────┴─────┴────────┴────────────┴────────────┘
shape: (1, 5)
┌────────────┬─────┬────────┬────────────┬────────────┐
│ date       | id  | values | start_date | end_date   │
│ ---        | --- | ---    | ---        | ---        │
│ date       | i64 | str    | date       | date       │
╞════════════╪═════╪════════╪════════════╪════════════╡
│ 2003-01-01 | 2   | D      | 2002-06-01 | 2003-06-01 │
└────────────┴─────┴────────┴────────────┴────────────┘

It seems like you could generate the same groups from the result of a single .join_asof():

(
   df.sort("date").join_asof(   
      date_lookup.sort(["start_date", "end_date"]),
      left_on="date",
      right_on="end_date",
      by="id",
      strategy="forward"
   ).filter(
      pl.col("date").is_between(
         pl.col("start_date"), 
         pl.col("end_date"), 
         include_bounds=True)
   )
)
shape: (4, 5)
┌────────────┬─────┬────────┬────────────┬────────────┐
│ date       | id  | values | start_date | end_date   │
│ ---        | --- | ---    | ---        | ---        │
│ date       | i64 | str    | date       | date       │
╞════════════╪═════╪════════╪════════════╪════════════╡
│ 2000-01-01 | 1   | A      | 2000-01-01 | 2000-07-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2000-06-01 | 1   | B      | 2000-01-01 | 2000-07-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2001-01-01 | 1   | C      | 2000-01-01 | 2002-01-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2003-01-01 | 2   | D      | 2002-06-01 | 2003-06-01 │
└────────────┴─────┴────────┴────────────┴────────────┘

Grouping by ["id", "end_date"] gets you one level of grouping - but then you'd need to find the intervals that intersect.

(
   df.sort("date").join_asof(   
      date_lookup.sort(["start_date", "end_date"]),
      left_on="date",
      right_on="end_date",
      by="id",
      strategy="forward"
   ).filter(
      pl.col("date").is_between(
         pl.col("start_date"), 
         pl.col("end_date"), 
         include_bounds=True)
   )
).groupby(["id", "end_date"]).agg_list()
shape: (3, 5)
┌─────┬────────────┬──────────────────────────┬────────────┬──────────────────────────┐
│ id  | end_date   | date                     | values     | start_date               │
│ --- | ---        | ---                      | ---        | ---                      │
│ i64 | date       | list[date]               | list[str]  | list[date]               │
╞═════╪════════════╪══════════════════════════╪════════════╪══════════════════════════╡
│ 1   | 2000-07-01 | [2000-01-01, 2000-06-01] | ["A", "B"] | [2000-01-01, 2000-01-01] │
├─────┼────────────┼──────────────────────────┼────────────┼──────────────────────────┤
│ 2   | 2003-06-01 | [2003-01-01]             | ["D"]      | [2002-06-01]             │
├─────┼────────────┼──────────────────────────┼────────────┼──────────────────────────┤
│ 1   | 2002-01-01 | [2001-01-01]             | ["C"]      | [2000-01-01]             │
└─────┴────────────┴──────────────────────────┴────────────┴──────────────────────────┘


erinov1 commented Dec 23, 2022

Thanks, @cmdlineluser. Your first and second solutions work (the third does not work when date belongs to multiple start_date/end_date intervals). I'm worried that the first method is essentially doing a large cross join during the initial merge (although maybe the streaming engine will help with memory usage). Your second solution is similar to what I am doing now:

df_with_index = df.with_row_count("index").select(["date", "id", "index"])
slice_lookup = (
    # index of the first row of df at or after start_date (per id)...
    date_lookup.join_asof(
        df_with_index,
        left_on="start_date",
        right_on="date",
        by="id",
        strategy="forward",
    )
    # ...and of the last row of df at or before end_date (per id)
    .join_asof(
        df_with_index, left_on="end_date", right_on="date", by="id", strategy="backward"
    )
    .select(
        [
            "id",
            "start_date",
            "end_date",
            pl.col("index").alias("offset"),
            (pl.col("index_right") - pl.col("index") + 1).alias("length"),
        ]
    )
)

for offset, length in slice_lookup.select(["offset", "length"]).to_numpy():
    print(df.slice(offset, length))

which prints

shape: (2, 3)
┌────────────┬─────┬────────┐
│ date       | id  | values │
│ ---        | --- | ---    │
│ date       | i64 | str    │
╞════════════╪═════╪════════╡
│ 2000-01-01 | 1   | A      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2000-06-01 | 1   | B      │
└────────────┴─────┴────────┘
shape: (3, 3)
┌────────────┬─────┬────────┐
│ date       | id  | values │
│ ---        | --- | ---    │
│ date       | i64 | str    │
╞════════════╪═════╪════════╡
│ 2000-01-01 | 1   | A      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2000-06-01 | 1   | B      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2001-01-01 | 1   | C      │
└────────────┴─────┴────────┘
shape: (1, 3)
┌────────────┬─────┬────────┐
│ date       | id  | values │
│ ---        | --- | ---    │
│ date       | i64 | str    │
╞════════════╪═════╪════════╡
│ 2003-01-01 | 2   | D      │
└────────────┴─────┴────────┘
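
The Python loop could also be avoided by materialising the row indices of each slice and joining back to the indexed frame. A sketch, assuming a newer polars with pl.int_ranges (and group_by in place of groupby):

# Expand each (offset, length) slice into explicit row indices, then join
# back to the row-numbered frame and aggregate per interval. Rows belonging
# to overlapping intervals are still duplicated by the explode.
groups = (
    slice_lookup.with_columns(
        pl.int_ranges("offset", pl.col("offset") + pl.col("length")).alias("index")
    )
    .explode("index")
    .with_columns(pl.col("index").cast(pl.UInt32))  # match with_row_count's dtype
    .join(df.with_row_count("index"), on="index")
    .group_by(["id", "start_date", "end_date"])
    .agg(pl.col("date"), pl.col("values"))
)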

The problem could also be solved using non-equi joins, so I will link to this issue: #4207

@n8henrie

A cross join solution was suggested to me for a similar problem on the Discord: https://discord.com/channels/908022250106667068/1014967651656814694/1073895540997226536

# example setup code
import datetime as dt
import random

import polars as pl

start_date = dt.datetime(2020, 5, 1)
num_samples = 1_000


def random_date():
    return start_date + dt.timedelta(minutes=random.randint(0, 100_000))

df1 = pl.DataFrame(
    {
        "start_date": (random_date() for _ in range(num_samples)),
    }
).with_columns(
    pl.when(pl.col("start_date").dt.hour() < 12)
    .then(pl.col("start_date") + pl.duration(hours=10))
    .otherwise(pl.col("start_date") + pl.duration(hours=12))
    .alias("end_date")
)
df2 = pl.DataFrame(
    {
        "date": (random_date() for _ in range(num_samples)),
    }
)

# solution
df2.lazy().join(
    df1.lazy().with_row_count("time_group_id"), how="cross"
).with_columns(
    pl.col("date")
    .is_between(pl.col("start_date"), pl.col("end_date"))
    .alias("bt")
).filter(
    pl.col("bt")  # keep only the dates that fall inside each interval
).groupby(
    ["time_group_id", "start_date", "end_date"]
).agg(
    pl.col("date").count().alias("counts")
).collect(
    streaming=True
)
