Obtain multiple ranges of data (for aggregation) from a huge dataset - similar to R's foverlaps() in the data.table package #9481
Comments
This is essentially the same as #9467 and #5891. In your case it is probably much faster to first find which row indices of the time series fall inside each interval and then aggregate over those slices. As I commented in #9467, in my own applications (~500M rows of data, ~50M intervals each spanning ~10K rows), it was much faster to first find the slice indices using polars, but then iterate + aggregate over the data slices with numba. In principle, one might also be able to do this on the rust side by constructing a custom implementation.
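A minimal sketch of that two-step idea (slice boundaries via polars' search_sorted, aggregation via numba), reusing the file and column names that appear later in this thread; it assumes Time is sorted ascending and that search_sorted accepts a Series of search values (recent Polars versions), and is only an illustration rather than erinov1's actual code:

import numba
import numpy as np
import polars as pl

@numba.njit
def max_per_slice(flow, starts, ends):
    # For each [start, end) row-index pair, take the max Flow in that slice.
    out = np.full(len(starts), np.nan)
    for i in range(len(starts)):
        if ends[i] > starts[i]:
            out[i] = flow[starts[i]:ends[i]].max()
    return out

ts = pl.read_parquet('flo_ts.parquet')                    # columns: Time (sorted), Flow
lookup = pl.read_parquet('events_of_interests.parquet')   # columns: ID, Start, End

# Step 1: locate each event's slice boundaries via binary search on Time.
starts = ts['Time'].search_sorted(lookup['Start']).to_numpy().astype(np.int64)
ends = ts['Time'].search_sorted(lookup['End']).to_numpy().astype(np.int64)

# Step 2: aggregate each slice in compiled code and attach the result per event.
lookup = lookup.with_columns(pl.Series('Max', max_per_slice(ts['Flow'].to_numpy(), starts, ends)))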
Oh wow, @erinov1, I did not catch the link between those issues and mine. Thank you for giving me the idea of finding the slice indices first.
This is the approach mentioned in #9467 in case you want to try it. You explode the range of slice indices and take them from the time series:

ts = pl.scan_parquet('flo_ts.parquet')
lookup = pl.scan_parquet('events_of_interests.parquet')
slices = ts.with_context(lookup).select(
    'ID',
    'Start',
    'End',
    idx = pl.arange(
        pl.col('Time').search_sorted('Start'),
        pl.col('Time').search_sorted('End')
    )
)

(ts.with_context(slices.explode('idx'))
    .select('ID', 'Start', 'End', pl.col('Time', 'Flow').take('idx'))
    .groupby('ID').agg(
        pl.col('Flow').max().alias('Max'),
        pl.col('Time').sort_by('Flow', descending=True).first().alias('Time_max'),
    )
).collect()
Wow, this is awesome! I like this one. The hard part for me is ...
pl.arange(
    pl.col('Time').search_sorted('Start'),
    pl.col('Time').search_sorted('End')
)
... I eventually saw what it does after breaking it apart and checking the docs.
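(For anyone else puzzling over that expression, here is a tiny worked example on made-up values, assuming a sorted Time column: search_sorted returns the insertion position of each boundary, and pl.arange expands those two positions into the row indices covered by the event.)

import polars as pl

time = pl.Series('Time', [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # sorted, made-up values

start_idx = time.search_sorted(2.5)  # 2 -> position of the first Time >= 2.5
end_idx = time.search_sorted(4.5)    # 4 -> position of the first Time >= 4.5

# pl.arange(2, 4) then yields [2, 3]: the row indices whose Time lies in [2.5, 4.5)
print(pl.select(pl.arange(start_idx, end_idx)).to_series().to_list())  # [2, 3]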
Yeah, it explodes out the range - it seems to perform well enough in some of these cases. If I try it with a larger version of @erinov1's #5891 example though, it gets really slow and tries to use huge amounts of memory. I guess it's not an efficient/scalable approach when there are large amounts of overlapping ranges.
Thank you for the explanation about its use case, cmdlineluser. Hopefully, Polars will get a similar function to R's foverlaps() some day.
I hit the same situation a while back and wrote some code inside Polars to get it to work (see #9691 - bear in mind it probably has errors and was copied from the groupby_dynamic/rolling functions as they were months ago). I'd love to see similar functionality added to Polars properly.
I think the new join_where() can handle this now:

(lookup
    .join_where(ts,
        pl.col.Time >= pl.col.Start,
        pl.col.Time <= pl.col.End
    )
    .group_by(pl.col.ID)
    .agg(pl.all().get(pl.col.Flow.arg_max()))
)
# shape: (4, 5)
# ┌─────┬──────────────────────────┬──────────────────────────┬──────────────────────────┬────────┐
# │ ID ┆ Start ┆ End ┆ Time ┆ Flow │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i32 ┆ datetime[μs, Etc/GMT-12] ┆ datetime[μs, Etc/GMT-12] ┆ datetime[μs, Etc/GMT-12] ┆ f64 │
# ╞═════╪══════════════════════════╪══════════════════════════╪══════════════════════════╪════════╡
# │ 3 ┆ 2014-05-23 09:28:00 +12 ┆ 2014-06-03 07:01:00 +12 ┆ 2014-05-29 05:00:00 +12 ┆ 24.522 │
# │ 4 ┆ 2014-06-25 00:43:00 +12 ┆ 2014-06-30 23:47:00 +12 ┆ 2014-06-25 12:15:00 +12 ┆ 117.61 │
# │ 1 ┆ 2013-07-02 10:24:00 +12 ┆ 2013-10-13 15:05:00 +12 ┆ 2013-07-06 04:20:00 +12 ┆ 93.504 │
# │ 2 ┆ 2013-10-13 22:44:00 +12 ┆ 2013-11-18 18:16:00 +12 ┆ 2013-10-24 08:00:00 +12 ┆ 23.561 │
# └─────┴──────────────────────────┴──────────────────────────┴──────────────────────────┴────────┘
Thank you for the nice solution. I didn't know that the join_where() method existed.
Problem description
I would like to use Polars to perform some analysis on multiple small ranges of data from a huge dataset. I will provide example data for a quick demo of my goal.
Now, I would like to get the following for each event ID in the lookup DataFrame: the Maximum and the Time to Maximum.

Before I can get those results, I need to assign the ID to the input flow time series based on the Start and End of each event, so that I can aggregate the Flow values by each corresponding ID. For the moment, based on my understanding of the Polars module, it is easy enough to write a loop like the one sketched below and then aggregate on the Flow values to get what I need:
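A minimal sketch of such a loop-plus-aggregation, reusing the file and column names from earlier in the thread (ID, Start, End, Time, Flow); this illustrates the pattern rather than reproducing the exact original code:

import polars as pl

ts = pl.read_parquet('flo_ts.parquet')                    # columns: Time, Flow
lookup = pl.read_parquet('events_of_interests.parquet')   # columns: ID, Start, End

# Label every Flow sample that falls inside an event window with that event's ID.
labelled = []
for row in lookup.iter_rows(named=True):
    labelled.append(
        ts.filter(pl.col('Time').is_between(row['Start'], row['End']))
          .with_columns(pl.lit(row['ID']).alias('ID'))
    )

# Aggregate the Flow values per event ID: the maximum and the time it occurred.
result = (
    pl.concat(labelled)
      .group_by('ID')
      .agg(
          pl.col('Flow').max().alias('Max'),
          pl.col('Time').sort_by('Flow', descending=True).first().alias('Time_max'),
      )
)
print(result)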
I only used a small portion of my dataset, and the lookup table's height is only 4, so the loop is fine here. But I sometimes need 10K events out of close to 2 million values of the input time series, and that is a problem! I have tried running around 5k loops to do the aggregation at the same time (see here for an example), but it ended up with the terminal window freezing...

I would like to request a Polars function/method to achieve this - something similar to R's foverlaps() in the data.table package. That tool can clip out all the needed time series (in my case) in a single operation, and it is very fast and convenient (you can find example data and R code here).

Many thanks to the Polars developers for making such a useful tool.