.groupby_slices()
#9467
Are you restricted to using Polars exclusively? This looks like a non-equi join problem in SQL, where I would approach it with a self-join and two inequality conditions for a row being within +/- 1 minute of itself.
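(A minimal sketch of that idea — my own illustration, not the exact query, using two rows of the data discussed below — the self-join keeps every row B whose stay overlaps row A's widened window:)

```python
import duckdb as db
import polars as pl
from datetime import datetime

data = pl.DataFrame({
    "arrival_time":   [datetime(2023, 1, 1, 6, 23, 47), datetime(2023, 1, 1, 6, 26, 42)],
    "departure_time": [datetime(2023, 1, 1, 6, 25, 8),  datetime(2023, 1, 1, 6, 28, 2)],
    "ID":             ["A1", "A1"],
})

db.query("""
    SELECT
        A.arrival_time,
        A.departure_time,
        COUNT(DISTINCT B.ID) AS docked_trucks
    FROM data A
    JOIN data B
      ON  B.arrival_time   <= A.departure_time + INTERVAL 1 MINUTE  -- inequality 1
      AND B.departure_time >= A.arrival_time   - INTERVAL 1 MINUTE  -- inequality 2
    GROUP BY 1, 2
""").pl()
```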
Related issues (in addition to various requests for non-equi joins): #5891, #7211. For what it's worth, in my own applications I've found that if I just want aggregate statistics for each slice, then it is best to use polars to find the numerical indices for each slice, convert everything to numpy, then use numba to parallel-iterate (using `prange`).
Thanks @avimallu - there are no restrictions, I was mostly just curious about doing it in polars without a cross join.

Thanks @erinov1 - I should have remembered #5891 as I commented in there. Very interesting about the numba approach - do you have a code example handy? No worries if it's too much trouble - I'll experiment with it.
@cmdlineluser I have no idea if this is efficient for finding unique elements in a string column (I've been using it to compute statistics for numerical columns), but something like:

```python
import numpy as np
from numba import njit, prange

slices = df.select("start", "end").to_numpy()
ids = df.get_column("ID").to_numpy().astype(np.unicode_)

@njit(parallel=True)
def get_unique_count(slices, ids):
    unique = np.empty((slices.shape[0],))
    for n in prange(slices.shape[0]):
        start, end = slices[n]
        unique[n] = len(np.unique(ids[start:end]))
    return unique

get_unique_count(slices, ids)
# array([1., 2., 5., 5., 5., 5., 4., 4., 3.])
```
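(If string handling inside numba turns out to be slow, a possible variant — my own sketch, not part of the original comment, reusing the `slices` and `ids` arrays above — is to factorize the IDs to integer codes first with `np.unique(..., return_inverse=True)` and count codes per slice:)

```python
import numpy as np
from numba import njit, prange

# Factorize: map each string ID to a dense integer code (done once, outside numba).
codes = np.unique(ids, return_inverse=True)[1]

@njit(parallel=True)
def count_unique_codes(slices, codes, n_codes):
    out = np.empty(slices.shape[0], dtype=np.int64)
    for n in prange(slices.shape[0]):
        start, end = slices[n, 0], slices[n, 1]
        seen = np.zeros(n_codes, dtype=np.bool_)  # scratch bitmap per slice
        for i in range(start, end):
            seen[codes[i]] = True
        out[n] = seen.sum()
    return out

count_unique_codes(slices, codes, codes.max() + 1)
```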
Thanks a lot @erinov1. Yeah, doing anything with strings seems to slow it down quite a bit.

What I did do was have a look at the Polars internals. I have little rust knowledge, so this is just me taking things apart to try to get them to run:

```rust
use polars::prelude::*;

pub(super) fn groupby_slices(df: DataFrame, start: DataFrame, end: DataFrame) -> PolarsResult<DataFrame> {
    let start = start.column("start")?;
    let end = end.column("end")?;
    let sa = start.i64()?;
    let sb = end.i64()?;
    let groups = sa.into_iter().zip(sb.into_iter()).map(|(a, b)| {
        match (a, b) {
            (Some(a), Some(b)) => {
                let start = a as IdxSize;
                let end = b as IdxSize;
                let len = end - start;
                [start, len] // slice groups are [first, len] pairs
            },
            _ => [0, 0] // This is wrong. What to do here?
        }
    }).collect();
    let groups = GroupsProxy::Slice { groups, rolling: false };
    let out = GroupBy::new(&df, vec![], groups, None);
    //Ok(out.first()?)
    Ok(out.agg_list()?)
}
```

This pretty much returns the groups instantly. I found some further clues in: https://github.com/pola-rs/polars/blob/main/polars/polars-lazy/src/dsl/list.rs#L113

I can't seem to figure out how to assign the groups without actually running an aggregation, so I'm a bit lost with this Arena/LogicalPlan stuff.
A very elegant (if I say so myself) solution in SQL using non-equi joins (eagerly waiting for them to land in Polars, more so after this question):

```python
import polars as pl

data = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time ┆ departure_time ┆ ID │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1 │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1 │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5 │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6 │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3 │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3 │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6 │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5 │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6 │
└─────────────────────┴─────────────────────┴─────┘
""") Followed by See codeimport duckdb as db
db.query("""
SELECT
A.arrival_time
,A.departure_time
,A.window_open
,A.window_close
,LIST_UNIQUE(LIST(B.ID)) AS docked_trucks
FROM (
SELECT
*
,arrival_time - (INTERVAL 1 MINUTE) AS window_open
,departure_time + (INTERVAL 1 MINUTE) AS window_close
FROM data
) A
LEFT JOIN (
SELECT
*
,DATEDIFF('seconds', arrival_time, departure_time) AS duration
FROM data
) B
ON (
(B.arrival_time <= A.window_open AND (B.arrival_time + TO_SECONDS(B.duration)) >= A.window_open) OR
(B.arrival_time >= A.window_open AND B.departure_time <= A.window_close) OR
(B.arrival_time >= A.window_open AND (B.departure_time - TO_SECONDS(B.duration)) <= A.window_close)
)
GROUP BY 1, 2, 3, 4
""") ┌─────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────────┐
│ arrival_time │ departure_time │ window_open │ window_close │ docked_trucks │
│ timestamp │ timestamp │ timestamp │ timestamp │ uint64 │
├─────────────────────┼─────────────────────┼─────────────────────┼─────────────────────┼───────────────┤
│ 2023-01-01 06:23:47 │ 2023-01-01 06:25:08 │ 2023-01-01 06:22:47 │ 2023-01-01 06:26:08 │ 1 │
│ 2023-01-01 06:26:42 │ 2023-01-01 06:28:02 │ 2023-01-01 06:25:42 │ 2023-01-01 06:29:02 │ 1 │
│ 2023-01-01 06:30:20 │ 2023-01-01 06:35:01 │ 2023-01-01 06:29:20 │ 2023-01-01 06:36:01 │ 4 │
│ 2023-01-01 06:32:06 │ 2023-01-01 06:33:48 │ 2023-01-01 06:31:06 │ 2023-01-01 06:34:48 │ 4 │
│ 2023-01-01 06:33:09 │ 2023-01-01 06:36:01 │ 2023-01-01 06:32:09 │ 2023-01-01 06:37:01 │ 4 │
│ 2023-01-01 06:34:08 │ 2023-01-01 06:39:49 │ 2023-01-01 06:33:08 │ 2023-01-01 06:40:49 │ 4 │
│ 2023-01-01 06:36:40 │ 2023-01-01 06:38:34 │ 2023-01-01 06:35:40 │ 2023-01-01 06:39:34 │ 4 │
│ 2023-01-01 06:37:43 │ 2023-01-01 06:40:48 │ 2023-01-01 06:36:43 │ 2023-01-01 06:41:48 │ 3 │
│ 2023-01-01 06:39:48 │ 2023-01-01 06:46:10 │ 2023-01-01 06:38:48 │ 2023-01-01 06:47:10 │ 3 │
└─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────────┘
```

The three join conditions can be understood more easily visually.
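(Roughly, a sketch reconstructed from the `ON` clause, standing in for the original diagram:)

```text
A:           [window_open ─────────────────── window_close]
case 1:  ───────────┐          B arrives before the window and departs inside it
case 2:        ┌─────────┐     B's stay lies fully inside the window
case 3:              ┌──────── B arrives inside the window and departs after it
```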
We only care about the three overlap cases above. This will work well with any Polars based pipeline because of the zero-copy integration that DuckDB has with it.
Thank you for the code example @avimallu \o/

I'm not sure if it's my system or not, but duckdb seems to take a bit of time with this operation. If I increase the data a bit, it takes quite a while:

```python
df = pl.concat(
df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m'))
for N in range(1, 3000)
)
```

The polars approach from above takes much less time for me:

```python
import polars as pl
from codetiming import Timer
df = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time ┆ departure_time ┆ ID │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1 │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1 │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5 │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6 │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3 │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3 │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6 │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5 │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6 │
└─────────────────────┴─────────────────────┴─────┘
""")
df = pl.concat(
df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m'))
for N in range(1, 3000)
)
with Timer():
df = (
df
.with_columns(
window_open = pl.col("arrival_time").dt.offset_by("-1m"),
window_close = pl.col("departure_time").dt.offset_by("1m")
)
.sort("arrival_time")
.with_columns(event_id = pl.arange(0, pl.count()))
)
start = (
df.sort("window_open")
.join_asof(
df.sort("departure_time").with_row_count("start"),
left_on = "window_open",
right_on = "departure_time"
)
.select(pl.col("start").backward_fill())
)
end = (
df.sort("window_close")
.join_asof(
df.sort("arrival_time").with_row_count("end"),
left_on = "window_close",
right_on = "arrival_time",
strategy = "forward"
)
.select(pl.col("end").forward_fill() + 1)
)
(pl.concat([df, start, end], how="horizontal")
.with_columns(
pl.arange("start", "end").list.to_struct(n_field_strategy="max_width")
)
.unnest("arange")
.with_columns(
arrival = pl.concat_list(pl.col("arrival_time").take("^field_\d+$")),
departure = pl.concat_list(pl.col("departure_time").take("^field_\d+$")),
truck_id = pl.concat_list(pl.col("ID").take("^field_\d+$"))
)
.explode("arrival", "departure", "truck_id")
.filter(
pl.all(
pl.col("window_open") <= pl.col("departure"),
pl.col("window_close") >= pl.col("arrival")
)
)
.groupby("window_open", "window_close")
.agg(
pl.col("event_id", "arrival_time", "departure_time", "ID").first(),
docked_trucks = pl.col("truck_id").n_unique()
)
.sort("event_id") # sort and re-order columns for output
.select(
"event_id",
"arrival_time",
"departure_time",
"ID",
"window_open",
"window_close",
"docked_trucks",
)
)
```

Going from the duckdb timing to this one is quite the difference on my machine.
Hmm, odd. It takes much less time on my machine. Also, this line:

```python
pl.concat([df, start, end], how="horizontal")
```

What are `start` and `end`? Here is what I ran:

```python
import polars as pl
import duckdb as db
from codetiming import Timer
df = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time ┆ departure_time ┆ ID │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1 │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1 │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5 │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6 │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3 │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3 │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6 │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5 │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6 │
└─────────────────────┴─────────────────────┴─────┘
""")
df = pl.concat(
df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m'))
for N in range(1, 3000)
)
with Timer():
db.query("""
SELECT
A.arrival_time
,A.departure_time
,A.window_open
,A.window_close
,LIST_UNIQUE(LIST(B.ID)) AS docked_trucks
FROM (
SELECT
*
,arrival_time - (INTERVAL 1 MINUTE) AS window_open
,departure_time + (INTERVAL 1 MINUTE) AS window_close
FROM df
) A
LEFT JOIN (
SELECT
*
,DATEDIFF('seconds', arrival_time, departure_time) AS duration
FROM df
) B
ON (
(B.arrival_time <= A.window_open AND (B.arrival_time + TO_SECONDS(B.duration)) >= A.window_open) OR
(B.arrival_time >= A.window_open AND B.departure_time <= A.window_close) OR
(B.arrival_time >= A.window_open AND (B.departure_time - TO_SECONDS(B.duration)) <= A.window_close)
)
GROUP BY 1, 2, 3, 4
""") Am I missing something in the DuckDB code? By the way, DuckDB caches the query so it'll run much faster the second time - all my timings are therefore for the first time. My DuckDB version is EDITThe newer version ( |
I edited in start/end. Thanks for the timings. I tried a reboot - I thought it may be due to my system being overloaded, but I get the same results. I am also using the same version. I'm using an older laptop, so that may have something to do with it.
Thanks for the edit; you should probably raise an issue with the DuckDB team given the slower speed on your system. Here are my timings now (there is a significant variance in the timings per run):

DuckDB: `Elapsed time: 0.1050 seconds`
Polars: `…`

Given the easier query, I'd probably stick to the former for now. Pretty impressive that Polars can reach that speed even without dedicated non-equi joins. Really excited for when they land 🤯.
I was in an interactive session. The slowdown seems to be related to the duckdb progress bar somehow. Running the code in a standalone script gives the same results as you. Oddly enough, if I add …

At least I have proper timings now - no more testing in an interactive session it seems. Thank you.
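(If the progress bar is the culprit, it can be switched off explicitly before benchmarking - a sketch, assuming DuckDB's `enable_progress_bar` setting:)

```python
import duckdb as db

# Disable the progress bar for this connection before timing queries interactively.
db.query("SET enable_progress_bar = false")
```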
I realized that instead of going sideways by adding columns, it makes more sense to go lengthways.

```python
df = df.lazy()
start = start.lazy()
end = end.lazy()
(
df.select('event_id').with_context([start, end])
.select('event_id', pl.arange('start', 'end'))
.explode('arange')
.with_context(df)
.select(pl.all().take(pl.col('arange')), group = 'event_id')
.collect()
)
# shape: (47, 8)
# ┌──────────┬────────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┐
# │ event_id ┆ arange ┆ arrival_time ┆ departure_time ┆ ID ┆ window_open ┆ window_close ┆ group │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ str ┆ datetime[μs] ┆ datetime[μs] ┆ i64 │
# ╞══════════╪════════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╡
# │ 0 ┆ 0 ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1 ┆ 2023-01-01 06:22:47 ┆ 2023-01-01 06:26:08 ┆ 0 │
# │ 0 ┆ 1 ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1 ┆ 2023-01-01 06:25:42 ┆ 2023-01-01 06:29:02 ┆ 0 │
# │ 0 ┆ 0 ┆ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1 ┆ 2023-01-01 06:22:47 ┆ 2023-01-01 06:26:08 ┆ 1 │
# │ 0 ┆ 1 ┆ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1 ┆ 2023-01-01 06:25:42 ┆ 2023-01-01 06:29:02 ┆ 1 │
# │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
# │ 2 ┆ 1 ┆ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3 ┆ 2023-01-01 06:33:08 ┆ 2023-01-01 06:40:49 ┆ 8 │
# │ 2 ┆ 2 ┆ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6 ┆ 2023-01-01 06:35:40 ┆ 2023-01-01 06:39:34 ┆ 8 │
# │ 2 ┆ 3 ┆ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5 ┆ 2023-01-01 06:36:43 ┆ 2023-01-01 06:41:48 ┆ 8 │
# │ 2 ┆ 4 ┆ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6 ┆ 2023-01-01 06:38:48 ┆ 2023-01-01 06:47:10 ┆ 8 │
# └──────────┴────────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┴───────┘
```

So you actually create the rows for each group, which you can then join.

```python
df = df.lazy()
start = start.lazy()
end = end.lazy()
groups = (
df.select('event_id').with_context([start, end])
.select('event_id', pl.arange('start', 'end'))
.explode('arange')
.with_context(df)
.select(pl.all().take(pl.col('arange')), group = 'event_id')
)
(df.join(groups, left_on='event_id', right_on='group', how='left')
.filter(pl.all(
pl.col("window_open") <= pl.col("departure_time_right"),
pl.col("window_close") >= pl.col("arrival_time_right")
))
.groupby('window_open', 'window_close', maintain_order=True)
.agg(
pl.col('event_id', 'arrival_time', 'departure_time', 'ID').first(),
docked_trucks = pl.n_unique('ID_right')
)
.collect()
)
```
I was able to brainstorm with a few folks and find a simpler approach on the SQL front:

```python
import polars as pl
import duckdb as db
data = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time ┆ departure_time ┆ ID │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1 │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1 │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5 │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6 │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3 │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3 │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6 │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5 │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6 │
└─────────────────────┴─────────────────────┴─────┘
""")
db.query("""
SELECT
A.arrival_time
,A.departure_time
,A.arrival_time - (INTERVAL 1 MINUTE) AS window_open
,A.departure_time + (INTERVAL 1 MINUTE) AS window_close
,LIST_DISTINCT(LIST(B.ID)) AS docked_trucks
,LIST_UNIQUE(LIST(B.ID)) AS docked_truck_count
FROM data A, data B
WHERE B.arrival_time <= window_close
AND B.departure_time >= window_open
GROUP BY 1, 2, 3, 4
""").pl() This reduces the number of conditions to just two, and the SQL optimizer engine automatically converts the cross join to an interval join, so this is probably the most readable query while still being just as fast (or faster). |
Thank you for the update @avimallu - that does seem like a nice improvement. Some interesting things I discovered from experimenting with your example: …

Pretty neat stuff. I think I spotted where the real issue with the timing mismatch came from: when getting the result back into Polars, the time seems to be spent in the final `.pl()` conversion rather than in the query itself. Perhaps you can confirm this.
@avimallu I've received access to an M1 Mac Pro and just installed a fresh polars and duckdb. Using the code from #9467 (comment), the query itself finishes quickly.

Adding a `.pl()` call is where the time goes.

I've also tested writing the output from within duckdb, e.g. … It definitely looks like the `.pl()` step is the slow part.
Good point. I think something happens with `.pl()`:

```python
import polars as pl
import duckdb as db
from codetiming import Timer
df = pl.from_repr("""
┌─────────────────────┬─────────────────────┬─────┐
│ arrival_time ┆ departure_time ┆ ID │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ str │
╞═════════════════════╪═════════════════════╪═════╡
│ 2023-01-01 06:23:47 ┆ 2023-01-01 06:25:08 ┆ A1 │
│ 2023-01-01 06:26:42 ┆ 2023-01-01 06:28:02 ┆ A1 │
│ 2023-01-01 06:30:20 ┆ 2023-01-01 06:35:01 ┆ A5 │
│ 2023-01-01 06:32:06 ┆ 2023-01-01 06:33:48 ┆ A6 │
│ 2023-01-01 06:33:09 ┆ 2023-01-01 06:36:01 ┆ B3 │
│ 2023-01-01 06:34:08 ┆ 2023-01-01 06:39:49 ┆ C3 │
│ 2023-01-01 06:36:40 ┆ 2023-01-01 06:38:34 ┆ A6 │
│ 2023-01-01 06:37:43 ┆ 2023-01-01 06:40:48 ┆ A5 │
│ 2023-01-01 06:39:48 ┆ 2023-01-01 06:46:10 ┆ A6 │
└─────────────────────┴─────────────────────┴─────┘
""")
df = pl.concat(
df.with_columns(pl.col('arrival_time', 'departure_time').dt.offset_by(f'{5 * N}m'))
for N in range(1, 3000)
)
with Timer():
db.query("""
SELECT
A.arrival_time
,A.departure_time
,A.arrival_time - (INTERVAL 1 MINUTE) AS window_open
,A.departure_time + (INTERVAL 1 MINUTE) AS window_close
,LIST_DISTINCT(LIST(B.ID)) AS docked_trucks
,LIST_UNIQUE(LIST(B.ID)) AS docked_truck_count
FROM df A, df B
WHERE B.arrival_time <= window_close
AND B.departure_time >= window_open
GROUP BY 1, 2, 3, 4
""").pl() |
Yeah, very odd - thank you for confirming. Using the updated version, I get …
I just realized that the above approach finds the "outermost windows" and then filters them. It seems that to do it properly you would need to find the "inner windows" and do it separately for arrivals and departures.

There also seems to be the issue of "non-crossing" windows (there's definitely a technical term for this: overlapping/intersecting?):

```python
df = pl.from_repr("""
shape: (12, 6)
┌─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┐
│ _id ┆ arrival_time ┆ departure_time ┆ ID ┆ window_open ┆ window_close │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ datetime[μs] ┆ datetime[μs] ┆ str ┆ datetime[μs] ┆ datetime[μs] │
╞═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╡
│ 0 ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1 ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 │
│ 1 ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1 ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 │
│ 2 ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5 ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 │
│ 3 ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6 ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 │
│ 4 ┆ 2023-01-01 06:38:09 ┆ 2023-01-01 06:41:01 ┆ B3 ┆ 2023-01-01 06:37:09 ┆ 2023-01-01 06:42:01 │
│ 5 ┆ 2023-01-01 06:39:08 ┆ 2023-01-01 06:44:49 ┆ C3 ┆ 2023-01-01 06:38:08 ┆ 2023-01-01 06:45:49 │
│ 6 ┆ 2023-01-01 06:41:40 ┆ 2023-01-01 06:43:34 ┆ A6 ┆ 2023-01-01 06:40:40 ┆ 2023-01-01 06:44:34 │
│ 7 ┆ 2023-01-01 06:42:43 ┆ 2023-01-01 06:45:48 ┆ A5 ┆ 2023-01-01 06:41:43 ┆ 2023-01-01 06:46:48 │
│ 8 ┆ 2023-01-01 06:44:48 ┆ 2023-01-01 06:51:10 ┆ A6 ┆ 2023-01-01 06:43:48 ┆ 2023-01-01 06:52:10 │
│ 9 ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1 ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 │
│ 10 ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1 ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 │
│ 11 ┆ 2023-01-01 06:40:20 ┆ 2023-01-01 06:45:01 ┆ A5 ┆ 2023-01-01 06:39:20 ┆ 2023-01-01 06:46:01 │
└─────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┘
""")
arrive = (
df.sort('arrival_time')
.with_columns(
start = pl.col('arrival_time').search_sorted('window_open'),
end = pl.col('arrival_time').search_sorted('window_close'),
)
.with_row_count()
)
# shape: (12, 9)
# ┌────────┬─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┬─────┐
# │ row_nr ┆ _id ┆ arrival_time ┆ departure_time ┆ ID ┆ window_open ┆ window_close ┆ start ┆ end │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ datetime[μs] ┆ datetime[μs] ┆ str ┆ datetime[μs] ┆ datetime[μs] ┆ u32 ┆ u32 │
# ╞════════╪═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╪═════╡
# │ 0 ┆ 0 ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1 ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 ┆ 0 ┆ 1 │
# │ 1 ┆ 1 ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1 ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 ┆ 1 ┆ 3 │
# │ 2 ┆ 9 ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1 ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 ┆ 2 ┆ 4 │
# │ 3 ┆ 2 ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5 ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 ┆ 3 ┆ 10 │
# │ 4 ┆ 10 ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1 ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 ┆ 4 ┆ 8 │ # <-
# │ 5      ┆ 3   ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6  ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 ┆ 4     ┆ 9   │ # <-
```

Rows 4 and 5 here sort into position 4, however their windows are both fully contained inside the window of row 3 (but don't "touch"). This means their group slices start at position 4 and miss row 3 entirely.
But row 3 should be part of their groups, since its departure still falls inside their windows. So it seems that grouping by the window this way cannot detect "parent windows", only windows whose boundaries actually "touch" (cross/intersect/overlap?).

It appears we can look back and check if we have a parent window:

```python
previous = pl.when(pl.col('start') == 0).then(1).otherwise(pl.col('start')) - 1
is_sub_window = pl.all(
pl.col('arrival_time', 'departure_time') <= pl.col('departure_time').take(previous)
)
arrive = (
arrive.with_columns(
parent_id = pl.when(is_sub_window).then(pl.col('_id').take(previous)),
parent_departure_time =
pl.when(is_sub_window).then(pl.col('departure_time').take(previous))
)
)
# shape: (12, 11)
# ┌────────┬─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┬─────┬───────────┬───────────────────────┐
# │ row_nr ┆ _id ┆ arrival_time ┆ departure_time ┆ ID ┆ window_open ┆ window_close ┆ start ┆ end ┆ parent_id ┆ parent_departure_time │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ datetime[μs] ┆ datetime[μs] ┆ str ┆ datetime[μs] ┆ datetime[μs] ┆ u32 ┆ u32 ┆ u32 ┆ datetime[μs] │
# ╞════════╪═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╪═════╪═══════════╪═══════════════════════╡
# │ 0 ┆ 0 ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1 ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 ┆ 0 ┆ 1 ┆ 0 ┆ 2023-01-01 06:30:08 │
# │ 1 ┆ 1 ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1 ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 ┆ 1 ┆ 3 ┆ null ┆ null │
# │ 2 ┆ 9 ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1 ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 ┆ 2 ┆ 4 ┆ null ┆ null │
# │ 3 ┆ 2 ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5 ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 ┆ 3 ┆ 9 ┆ null ┆ null │
# │ 4 ┆ 10 ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1 ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 ┆ 4 ┆ 7 ┆ 2 ┆ 2023-01-01 06:40:01 │ # <-
# │ 5      ┆ 3   ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6  ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 ┆ 4     ┆ 8   ┆ 2         ┆ 2023-01-01 06:40:01   │ # <-
```

We can then use the parent's departure time to ensure the windows "touch".

```python
depart = (
df.sort('departure_time')
.with_columns(
start = pl.col('departure_time').search_sorted('window_open'),
end = pl.col('departure_time').search_sorted('window_close')
)
.with_row_count()
)
depart = (
depart.join(
arrive.drop_nulls().select('_id', 'parent_id', 'parent_departure_time'),
on = '_id',
how = 'left'
)
.with_columns(parent_end =
pl.col('departure_time').search_sorted('parent_departure_time', side='right'))
.with_columns(end =
pl.when(pl.col('parent_id').is_not_null())
.then(pl.col('parent_end'))
.otherwise(pl.col('end')),
old_end = 'end'
)
)
# shape: (12, 13)
# ┌────────┬─────┬─────────────────────┬─────────────────────┬─────┬─────────────────────┬─────────────────────┬───────┬─────┬───────────┬───────────────────────┬────────────┬─────────┐
# │ row_nr ┆ _id ┆ arrival_time ┆ departure_time ┆ ID ┆ window_open ┆ window_close ┆ start ┆ end ┆ parent_id ┆ parent_departure_time ┆ parent_end ┆ old_end │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ u32 ┆ u32 ┆ datetime[μs] ┆ datetime[μs] ┆ str ┆ datetime[μs] ┆ datetime[μs] ┆ u32 ┆ u32 ┆ u32 ┆ datetime[μs] ┆ u32 ┆ u32 │
# ╞════════╪═════╪═════════════════════╪═════════════════════╪═════╪═════════════════════╪═════════════════════╪═══════╪═════╪═══════════╪═══════════════════════╪════════════╪═════════╡
# │ 0 ┆ 0 ┆ 2023-01-01 06:28:47 ┆ 2023-01-01 06:30:08 ┆ A1 ┆ 2023-01-01 06:27:47 ┆ 2023-01-01 06:31:08 ┆ 0 ┆ 1 ┆ 0 ┆ 2023-01-01 06:30:08 ┆ 1 ┆ 1 │
# │ 1 ┆ 1 ┆ 2023-01-01 06:31:42 ┆ 2023-01-01 06:33:02 ┆ A1 ┆ 2023-01-01 06:30:42 ┆ 2023-01-01 06:34:02 ┆ 1 ┆ 2 ┆ null ┆ null ┆ 0 ┆ 2 │
# │ 2 ┆ 9 ┆ 2023-01-01 06:33:47 ┆ 2023-01-01 06:35:08 ┆ A1 ┆ 2023-01-01 06:32:47 ┆ 2023-01-01 06:36:08 ┆ 1 ┆ 3 ┆ null ┆ null ┆ 0 ┆ 3 │
# │ 3 ┆ 10 ┆ 2023-01-01 06:36:42 ┆ 2023-01-01 06:38:02 ┆ A1 ┆ 2023-01-01 06:35:42 ┆ 2023-01-01 06:39:02 ┆ 3 ┆ 6 ┆ 2 ┆ 2023-01-01 06:40:01 ┆ 6 ┆ 5 │ # <-
# │ 4 ┆ 3 ┆ 2023-01-01 06:37:06 ┆ 2023-01-01 06:38:48 ┆ A6 ┆ 2023-01-01 06:36:06 ┆ 2023-01-01 06:39:48 ┆ 3 ┆ 6 ┆ 2 ┆ 2023-01-01 06:40:01 ┆ 6 ┆ 5 │ # <-
# │ 5 ┆ 2 ┆ 2023-01-01 06:35:20 ┆ 2023-01-01 06:40:01 ┆ A5 ┆ 2023-01-01 06:34:20 ┆ 2023-01-01 06:41:01 ┆ 2 ┆ 6 ┆ null ┆ null ┆ 0 ┆ 6 │
# │ 6 ┆ 4 ┆ 2023-01-01 06:38:09 ┆ 2023-01-01 06:41:01 ┆ B3 ┆ 2023-01-01 06:37:09 ┆ 2023-01-01 06:42:01 ┆ 3 ┆ 7 ┆ null ┆ null ┆ 0 ┆ 7 │
# │ 7 ┆ 6 ┆ 2023-01-01 06:41:40 ┆ 2023-01-01 06:43:34 ┆ A6 ┆ 2023-01-01 06:40:40 ┆ 2023-01-01 06:44:34 ┆ 6 ┆ 10 ┆ 11 ┆ 2023-01-01 06:45:01 ┆ 10 ┆ 8 │ # <-
# │ 8 ┆ 5 ┆ 2023-01-01 06:39:08 ┆ 2023-01-01 06:44:49 ┆ C3 ┆ 2023-01-01 06:38:08 ┆ 2023-01-01 06:45:49 ┆ 4 ┆ 11 ┆ null ┆ null ┆ 0 ┆ 11 │
# │ 9 ┆ 11 ┆ 2023-01-01 06:40:20 ┆ 2023-01-01 06:45:01 ┆ A5 ┆ 2023-01-01 06:39:20 ┆ 2023-01-01 06:46:01 ┆ 5 ┆ 11 ┆ null ┆ null ┆ 0 ┆ 11 │
# │ 10 ┆ 7 ┆ 2023-01-01 06:42:43 ┆ 2023-01-01 06:45:48 ┆ A5 ┆ 2023-01-01 06:41:43 ┆ 2023-01-01 06:46:48 ┆ 7 ┆ 11 ┆ null ┆ null ┆ 0 ┆ 11 │
# │ 11 ┆ 8 ┆ 2023-01-01 06:44:48 ┆ 2023-01-01 06:51:10 ┆ A6 ┆ 2023-01-01 06:43:48 ┆ 2023-01-01 06:52:10 ┆ 8 ┆ 12 ┆ null ┆ null ┆ 0 ┆ 12 │
# └────────┴─────┴─────────────────────┴─────────────────────┴─────┴─────────────────────┴─────────────────────┴───────┴─────┴───────────┴───────────────────────┴────────────┴─────────┘
```

These adjusted start/end positions now give the exact group slice. Testing it out with a few different inputs generates the same output as the earlier polars and duckdb attempts, so it seems to be equivalent (unless I've missed something that doesn't show up in this specific example data).

Although @areeh's approach of using actual times instead of slicing row numbers makes much more sense: #9691
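(For these equivalence checks, a quadratic brute-force reference is handy - my own sketch, assuming the frame already carries `window_open`/`window_close` columns; only viable for small test inputs:)

```python
import polars as pl

def docked_trucks_bruteforce(df: pl.DataFrame) -> list:
    # For every row A, count distinct IDs B whose stay overlaps A's widened window.
    rows = df.to_dicts()
    return [
        len({
            b["ID"]
            for b in rows
            if b["arrival_time"] <= a["window_close"]
            and b["departure_time"] >= a["window_open"]
        })
        for a in rows
    ]
```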
Problem description
Inspired by https://stackoverflow.com/questions/76488314/polars-count-unique-values-over-a-time-period
Essentially I'm wondering if it would be possible to do something like `.groupby_slices()` with precomputed start/end row indices.
The original question was about counting within time windows, in which case the approach was to get the start/end slices:

- Expand via `.arange(start, end)` and into columns with `list.to_struct()`
- And use `.take('^field_\d+$')` to get the values

Which seems to work okay and is decently fast (although it probably starts to suffer when the number of `field_` columns starts to get sufficiently large).

I'm not sure if there are any other potential use-cases for being able to specify slices? It's also possible there are better ways to solve this type of problem - a rough sketch of emulating the idea is below.
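(Purely illustrative - a hypothetical `groupby_slices` helper; the name and signature are invented to show what the emulation could look like, not an existing Polars API:)

```python
import polars as pl

def groupby_slices(df: pl.DataFrame, starts: list, ends: list) -> list:
    # Hypothetical emulation: each (start, end) pair selects rows [start, end)
    # as one group via zero-copy slicing, which can then be aggregated normally.
    return [df.slice(s, e - s) for s, e in zip(starts, ends)]

df = pl.DataFrame({"ID": ["A1", "A1", "A5", "A6"]})
groups = groupby_slices(df, starts=[0, 1], ends=[2, 4])
print([g.get_column("ID").n_unique() for g in groups])  # [1, 3]
```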
Thanks all.