"Dynamic" groupby using a second dataframe of start/end times or indexes #5891

Open
erinov1 opened this issue Dec 23, 2022 · 4 comments
Labels: enhancement (New feature or an improvement of an existing feature)

erinov1 commented Dec 23, 2022

Problem description

Using groupby_dynamic/groupby_rolling you can create temporal groups using windows of a specified length. I'd like to explicitly create groups from a second dataframe containing the start and end times for each group (based either on a date column or an index column). For example, consider the following:

from datetime import date

import polars as pl

df = pl.DataFrame(
    {
        "date": [
            date(2000, 1, 1),
            date(2000, 6, 1),
            date(2001, 1, 1),
            date(2003, 1, 1),
            date(2022, 1, 1),
        ],
        "id": [1, 1, 1, 2, 2],
        "values": ["A", "B", "C", "D", "E"],
    }
)

date_lookup = pl.DataFrame(
    {
        "id": [1, 1, 2],
        "start_date": [date(2000, 1, 1), date(2000, 1, 1), date(2002, 6, 1)],
        "end_date": [date(2000, 7, 1), date(2002, 1, 1), date(2003, 6, 1)],
    }
)
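
For contrast, the existing API can only derive the window boundaries from a rule rather than from data - a minimal sketch (note that groupby_dynamic expects the frame to be sorted by the index column):

# Rule-based windows: fixed yearly buckets whose boundaries do not come from a lookup table
df.sort("date").groupby_dynamic("date", every="1y").agg(pl.col("values").count())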

I'd like to use date_lookup to create a grouping for df. Namely, for id = 1 I'd like one group of rows where date is in the interval [2000-01-01, 2000-07-01] and another where date is in [2000-01-01, 2002-01-01], whereas for id = 2 I'd like the rows where date is in the interval [2002-06-01, 2003-06-01].

As far as I know, this is not currently possible. I can finagle a dataframe that contains the index slices I want (although the way I'm doing this is quite slow), but even then I don't know how to aggregate over each of those different slices without doing a for-loop over them.

Presumably there could be a faster native implementation (which hopefully does not involve copying data for overlapping groups). Ideally this would work for either time-like columns or integer columns (requiring that the dataframes are all sorted, of course).
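
For concreteness, a naive reference implementation of the semantics I have in mind (just a sketch, using a plain Python loop) would be:

# Sketch only: one group per (id, start_date, end_date) row of date_lookup;
# overlapping intervals simply produce overlapping groups.
groups = []
for row in date_lookup.iter_rows(named=True):
    groups.append(
        df.filter(
            (pl.col("id") == row["id"])
            & (pl.col("date") >= row["start_date"])
            & (pl.col("date") <= row["end_date"])
        )
    )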

Alternatively, if there is an existing solution I would be happy to hear about it!

erinov1 added the enhancement label Dec 23, 2022

cmdlineluser commented Dec 23, 2022

There are probably better ways - but for a start - are these the groups you'd expect from your example?

(
    df.join(date_lookup, on="id")  # one row per (df row, candidate interval) pair
    .groupby([
        pl.col("id"),
        pl.col("date").is_between(
            pl.col("start_date"), pl.col("end_date"), include_bounds=True
        ),
    ])
    .agg_list()
    .filter("is_between")  # keep only the in-interval groups
    .drop("is_between")
    .explode(pl.exclude("id"))
    .groupby(pl.col(["id", "start_date", "end_date"]))
    .agg_list()
)
shape: (3, 5)
┌─────┬────────────┬────────────┬─────────────────────────────────────┬─────────────────┐
│ id  | start_date | end_date   | date                                | values          │
│ --- | ---        | ---        | ---                                 | ---             │
│ i64 | date       | date       | list[date]                          | list[str]       │
╞═════╪════════════╪════════════╪═════════════════════════════════════╪═════════════════╡
│ 1   | 2000-01-01 | 2002-01-01 | [2000-01-01, 2000-06-01, 2001-01... | ["A", "B", "C"] │
├─────┼────────────┼────────────┼─────────────────────────────────────┼─────────────────┤
│ 1   | 2000-01-01 | 2000-07-01 | [2000-01-01, 2000-06-01]            | ["A", "B"]      │
├─────┼────────────┼────────────┼─────────────────────────────────────┼─────────────────┤
│ 2   | 2002-06-01 | 2003-06-01 | [2003-01-01]                        | ["D"]           │
└─────┴────────────┴────────────┴─────────────────────────────────────┴─────────────────┘
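
(On newer polars versions the same approach should still work with a few renames - groupby became group_by, agg_list() was removed in favour of agg(pl.all()), and is_between now takes closed="both" instead of include_bounds=True and no longer names its output "is_between", so the alias below is one I chose myself. A rough sketch:)

# Hedged sketch against the current API; "in_interval" is an assumed alias.
(
    df.join(date_lookup, on="id")
    .group_by(
        "id",
        pl.col("date")
        .is_between(pl.col("start_date"), pl.col("end_date"), closed="both")
        .alias("in_interval"),
    )
    .agg(pl.all())
    .filter(pl.col("in_interval"))
    .drop("in_interval")
    .explode(pl.exclude("id"))
    .group_by("id", "start_date", "end_date")
    .agg(pl.all())
)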

@cmdlineluser

Again not an efficient solution due to looping over each row in date_lookup:

df = df.sort("date")
date_lookup = date_lookup.sort(["start_date", "end_date"])
for n in range(date_lookup.height):
    print(
        df.join_asof(
            date_lookup.slice(n, 1),  # one interval at a time
            left_on="date",
            right_on="end_date",
            by="id",
            strategy="forward",
        ).filter(
            pl.col("date").is_between(
                pl.col("start_date"),
                pl.col("end_date"),
                include_bounds=True,
            )
        )
    )
shape: (2, 5)
┌────────────┬─────┬────────┬────────────┬────────────┐
│ date       | id  | values | start_date | end_date   │
│ ---        | --- | ---    | ---        | ---        │
│ date       | i64 | str    | date       | date       │
╞════════════╪═════╪════════╪════════════╪════════════╡
│ 2000-01-01 | 1   | A      | 2000-01-01 | 2000-07-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2000-06-01 | 1   | B      | 2000-01-01 | 2000-07-01 │
└────────────┴─────┴────────┴────────────┴────────────┘
shape: (3, 5)
┌────────────┬─────┬────────┬────────────┬────────────┐
│ date       | id  | values | start_date | end_date   │
│ ---        | --- | ---    | ---        | ---        │
│ date       | i64 | str    | date       | date       │
╞════════════╪═════╪════════╪════════════╪════════════╡
│ 2000-01-01 | 1   | A      | 2000-01-01 | 2002-01-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2000-06-01 | 1   | B      | 2000-01-01 | 2002-01-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2001-01-01 | 1   | C      | 2000-01-01 | 2002-01-01 │
└────────────┴─────┴────────┴────────────┴────────────┘
shape: (1, 5)
┌────────────┬─────┬────────┬────────────┬────────────┐
│ date       | id  | values | start_date | end_date   │
│ ---        | --- | ---    | ---        | ---        │
│ date       | i64 | str    | date       | date       │
╞════════════╪═════╪════════╪════════════╪════════════╡
│ 2003-01-01 | 2   | D      | 2002-06-01 | 2003-06-01 │
└────────────┴─────┴────────┴────────────┴────────────┘

It seems like you could generate the same groups from the result of a single .join_asof():

(
   df.sort("date").join_asof(   
      date_lookup.sort(["start_date", "end_date"]),
      left_on="date",
      right_on="end_date",
      by="id",
      strategy="forward"
   ).filter(
      pl.col("date").is_between(
         pl.col("start_date"), 
         pl.col("end_date"), 
         include_bounds=True)
   )
)
shape: (4, 5)
┌────────────┬─────┬────────┬────────────┬────────────┐
│ date       | id  | values | start_date | end_date   │
│ ---        | --- | ---    | ---        | ---        │
│ date       | i64 | str    | date       | date       │
╞════════════╪═════╪════════╪════════════╪════════════╡
│ 2000-01-01 | 1   | A      | 2000-01-01 | 2000-07-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2000-06-01 | 1   | B      | 2000-01-01 | 2000-07-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2001-01-01 | 1   | C      | 2000-01-01 | 2002-01-01 │
├────────────┼─────┼────────┼────────────┼────────────┤
│ 2003-01-01 | 2   | D      | 2002-06-01 | 2003-06-01 │
└────────────┴─────┴────────┴────────────┴────────────┘

Grouping by ["id", "end_date"] gets you one level of grouping - but then you'd need to find the intervals that intersect.

(
   df.sort("date").join_asof(   
      date_lookup.sort(["start_date", "end_date"]),
      left_on="date",
      right_on="end_date",
      by="id",
      strategy="forward"
   ).filter(
      pl.col("date").is_between(
         pl.col("start_date"), 
         pl.col("end_date"), 
         include_bounds=True)
   )
).groupby(["id", "end_date"]).agg_list()
shape: (3, 5)
┌─────┬────────────┬──────────────────────────┬────────────┬──────────────────────────┐
│ id  | end_date   | date                     | values     | start_date               │
│ --- | ---        | ---                      | ---        | ---                      │
│ i64 | date       | list[date]               | list[str]  | list[date]               │
╞═════╪════════════╪══════════════════════════╪════════════╪══════════════════════════╡
│ 1   | 2000-07-01 | [2000-01-01, 2000-06-01] | ["A", "B"] | [2000-01-01, 2000-01-01] │
├─────┼────────────┼──────────────────────────┼────────────┼──────────────────────────┤
│ 2   | 2003-06-01 | [2003-01-01]             | ["D"]      | [2002-06-01]             │
├─────┼────────────┼──────────────────────────┼────────────┼──────────────────────────┤
│ 1   | 2002-01-01 | [2001-01-01]             | ["C"]      | [2000-01-01]             │
└─────┴────────────┴──────────────────────────┴────────────┴──────────────────────────┘


erinov1 commented Dec 23, 2022

Thanks, @cmdlineluser. Your first and second solutions work (the third does not work when date belongs to multiple start_date/end_date intervals). I'm worried that the first method is essentially doing a large cross join during the initial merge (although maybe the streaming engine will help with memory usage). Your second solution is similar to what I am doing now:

df_with_index = df.with_row_count("index").select(["date", "id", "index"])
slice_lookup = (
    # index of the first row of df at or after start_date (per id)...
    date_lookup.join_asof(
        df_with_index,
        left_on="start_date",
        right_on="date",
        by="id",
        strategy="forward",
    )
    # ...and of the last row of df at or before end_date (per id)
    .join_asof(
        df_with_index, left_on="end_date", right_on="date", by="id", strategy="backward"
    )
    .select(
        [
            "id",
            "start_date",
            "end_date",
            pl.col("index").alias("offset"),
            (pl.col("index_right") - pl.col("index") + 1).alias("length"),
        ]
    )
)

for offset, length in slice_lookup.select(["offset", "length"]).to_numpy():
    print(df.slice(offset, length))

which prints

shape: (2, 3)
┌────────────┬─────┬────────┐
│ date       | id  | values │
│ ---        | --- | ---    │
│ date       | i64 | str    │
╞════════════╪═════╪════════╡
│ 2000-01-01 | 1   | A      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2000-06-01 | 1   | B      │
└────────────┴─────┴────────┘
shape: (3, 3)
┌────────────┬─────┬────────┐
│ date       | id  | values │
│ ---        | --- | ---    │
│ date       | i64 | str    │
╞════════════╪═════╪════════╡
│ 2000-01-01 | 1   | A      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2000-06-01 | 1   | B      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2001-01-01 | 1   | C      │
└────────────┴─────┴────────┘
shape: (1, 3)
┌────────────┬─────┬────────┐
│ date       | id  | values │
│ ---        | --- | ---    │
│ date       | i64 | str    │
╞════════════╪═════╪════════╡
│ 2003-01-01 | 2   | D      │
└────────────┴─────┴────────┘
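
The Python loop could also be avoided by materialising the row indices of each slice and joining back to the indexed frame. A sketch, assuming a newer polars with pl.int_ranges (and group_by in place of groupby):

# Expand each (offset, length) slice into explicit row indices, then join
# back to the row-numbered frame and aggregate per interval. Rows belonging
# to overlapping intervals are still duplicated by the explode.
groups = (
    slice_lookup.with_columns(
        pl.int_ranges("offset", pl.col("offset") + pl.col("length")).alias("index")
    )
    .explode("index")
    .with_columns(pl.col("index").cast(pl.UInt32))  # match with_row_count's dtype
    .join(df.with_row_count("index"), on="index")
    .group_by(["id", "start_date", "end_date"])
    .agg(pl.col("date"), pl.col("values"))
)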

The problem could also be solved using non-equi joins, so I will link to this issue: #4207

@n8henrie

A cross join solution was suggested to me for a similar problem on the Discord: https://discord.com/channels/908022250106667068/1014967651656814694/1073895540997226536

# example setup code
import datetime as dt
import random

import polars as pl

start_date = dt.datetime(2020, 5, 1)
num_samples = 1_000


def random_date():
    return start_date + dt.timedelta(minutes=random.randint(0, 100_000))

df1 = pl.DataFrame(
    {
        "start_date": (random_date() for _ in range(num_samples)),
    }
).with_columns(
    pl.when(pl.col("start_date").dt.hour() < 12)
    .then(pl.col("start_date") + pl.duration(hours=10))
    .otherwise(pl.col("start_date") + pl.duration(hours=12))
    .alias("end_date")
)
df2 = pl.DataFrame(
    {
        "date": (random_date() for _ in range(num_samples)),
    }
)

# solution
df2.lazy().join(
    df1.lazy().with_row_count("time_group_id"), how="cross"
).with_columns(
    pl.col("date")
    .is_between(pl.col("start_date"), pl.col("end_date"))
    .alias("bt")
).filter(
    pl.col("bt")  # keep only the dates that fall inside each interval
).groupby(
    ["time_group_id", "start_date", "end_date"]
).agg(
    pl.col("date").count().alias("counts")
).collect(
    streaming=True
)
