"Dynamic" groupby using a second dataframe of start/end times or indexes #5891
Comments
There are probably better ways, but for a start: are these the groups you'd expect from your example?

Again, not an efficient solution due to looping over each row in `slice_lookup`. It seems like you could generate the same groups from the result of a single join, grouping by the interval bounds.
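A minimal sketch of that single-join idea, assuming frames shaped like the example reconstructed at the end of this page (`df` with `date`/`id`/`values`, `date_lookup` with `id`/`start_date`/`end_date`); this is an editor's illustration, not the original poster's code, and newer Polars spells `groupby` as `group_by`:

```python
import polars as pl

# One equi-join on `id`, then keep only rows whose `date` falls inside the
# interval, then group by the interval bounds. Overlapping intervals are
# handled naturally: each interval keeps its own copy of the matching rows.
groups = (
    df.join(date_lookup, on="id")
    .filter(
        pl.col("date").is_between(
            pl.col("start_date"), pl.col("end_date"), closed="both"
        )
    )
    .groupby(["id", "start_date", "end_date"])
    .agg(pl.col("values"))  # or any other per-interval aggregation
)
```

The trade-off is that the join duplicates rows shared by overlapping intervals, whereas the offset/length approach below only stores slice bounds.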
Thanks, @cmdlineluser. Your first and second solutions work (the third does not work when …).

```python
# Attach a row index so each (id, date) row can be addressed by position.
df_with_index = df.with_row_count("index").select(["date", "id", "index"])

slice_lookup = (
    date_lookup.join_asof(
        df_with_index,
        left_on="start_date",
        right_on="date",
        by="id",
        strategy="forward",  # first row at or after start_date
    )
    .join_asof(
        df_with_index,
        left_on="end_date",
        right_on="date",
        by="id",
        strategy="backward",  # last row at or before end_date
    )
    .select(
        [
            "id",
            "start_date",
            "end_date",
            pl.col("index").alias("offset"),
            (pl.col("index_right") - pl.col("index") + 1).alias("length"),
        ]
    )
)

for slice in slice_lookup.select(["offset", "length"]).to_numpy():
    print(df.slice(slice[0], slice[1]))
```

which prints:

```
shape: (2, 3)
┌────────────┬─────┬────────┐
│ date ┆ id ┆ values │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ str │
╞════════════╪═════╪════════╡
│ 2000-01-01 ┆ 1 ┆ A │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2000-06-01 ┆ 1 ┆ B │
└────────────┴─────┴────────┘
shape: (3, 3)
┌────────────┬─────┬────────┐
│ date ┆ id ┆ values │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ str │
╞════════════╪═════╪════════╡
│ 2000-01-01 ┆ 1 ┆ A │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2000-06-01 ┆ 1 ┆ B │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2001-01-01 ┆ 1 ┆ C │
└────────────┴─────┴────────┘
shape: (1, 3)
┌────────────┬─────┬────────┐
│ date ┆ id ┆ values │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ str │
╞════════════╪═════╪════════╡
│ 2003-01-01 ┆ 2 ┆ D │
└────────────┴─────┴────────┘
```

The problem could also be solved using non-equi joins, so I will link to this issue: #4207
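For reference, a sketch of how this might look with the non-equi join API that later landed in Polars (`join_where`, Polars 1.7+). Assumptions here: mixed equality/inequality predicates are accepted (support for equality predicates has varied across versions), and the lookup's `id` is renamed first since `join_where` has no `by` argument:

```python
# Hypothetical sketch, not from the original thread. Every (row, interval)
# pair satisfying all predicates is kept, so overlapping intervals each get
# their own copy of the matching rows.
lookup = date_lookup.rename({"id": "lookup_id"})

groups = (
    df.join_where(
        lookup,
        pl.col("id") == pl.col("lookup_id"),
        pl.col("date") >= pl.col("start_date"),
        pl.col("date") <= pl.col("end_date"),
    )
    .group_by(["id", "start_date", "end_date"])
    .agg(pl.col("values"))
)
```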
Cross join solution suggested to me for a similar problem on the Discord: https://discord.com/channels/908022250106667068/1014967651656814694/1073895540997226536

```python
# example setup code
import datetime as dt
import random

import polars as pl

start_date = dt.datetime(2020, 5, 1)
num_samples = 1_000


def random_date():
    return start_date + dt.timedelta(minutes=random.randint(0, 100_000))


df1 = pl.DataFrame(
    {
        "start_date": (random_date() for _ in range(num_samples)),
    }
).with_columns(
    pl.when(pl.col("start_date").dt.hour() < 12)
    .then(pl.col("start_date") + pl.duration(hours=10))
    .otherwise(pl.col("start_date") + pl.duration(hours=12))
    .alias("end_date")
)

df2 = pl.DataFrame(
    {
        "date": (random_date() for _ in range(num_samples)),
    }
)

# solution: pair every date with every interval, keep the pairs that fall
# inside the interval, then count matches per interval
df2.lazy().join(
    df1.lazy().with_row_count("time_group_id"), how="cross"
).with_columns(
    pl.col("date")
    .is_between(pl.col("start_date"), pl.col("end_date"))
    .alias("bt")
).filter(
    # without this filter, every group would count all rows of the cross join
    pl.col("bt")
).groupby(
    ["time_group_id", "start_date", "end_date"]
).agg(
    pl.col("date").count().alias("counts")
).collect(
    streaming=True
)
```
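A note on the design: the cross join materializes every date/interval pair (`num_samples × num_samples` rows) before filtering, which is presumably why the snippet collects with `streaming=True`; the asof-join approach above avoids that blow-up by exploiting sortedness.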
Problem description
Using `groupby_dynamic`/`rolling` you can create temporal groups using windows of a specified length. I'd like to explicitly create groups from a second dataframe containing the start and end times for each group (either on a date column or an index column). For example, consider two frames like the ones sketched just after this paragraph: I'd like to use `date_lookup` to create a grouping for `df`. Namely, for `id = 1` I'd like to be able to group rows where `date` is in the interval `[2000-01-01, 2000-07]` and `[2000-01-01, 2002-01-01]`, whereas for `id = 2` I'd like rows where `date` is in the interval `[2002-06-01, 2003-06-01]`.

As far as I know this is not currently possible? I can finagle a dataframe that contains the index slices I want (although the way I'm doing this is quite slow), but even then I don't know how to aggregate over each of those different slices without doing a for-loop over the slices.
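A minimal reconstruction of the example frames, with `df` rows taken from the printed slices earlier in the thread and `date_lookup` intervals hypothesized to match the groups described above:

```python
import datetime as dt

import polars as pl

# Rows recovered from the printed slices above.
df = pl.DataFrame(
    {
        "date": [
            dt.date(2000, 1, 1),
            dt.date(2000, 6, 1),
            dt.date(2001, 1, 1),
            dt.date(2003, 1, 1),
        ],
        "id": [1, 1, 1, 2],
        "values": ["A", "B", "C", "D"],
    }
)

# Hypothetical intervals; end dates are inferred, not from the original post.
date_lookup = pl.DataFrame(
    {
        "id": [1, 1, 2],
        "start_date": [dt.date(2000, 1, 1), dt.date(2000, 1, 1), dt.date(2002, 6, 1)],
        "end_date": [dt.date(2000, 7, 1), dt.date(2002, 1, 1), dt.date(2003, 6, 1)],
    }
)
```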
Presumably there could be a faster native implementation (which hopefully does not involve copying data for overlapping groups). Ideally this would work for either time-like columns or integer columns (requiring that the dataframes are all sorted, of course).
Alternatively, if there is an existing solution I would be happy to hear about it!