-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Join between Polars dataframes with inequality conditions #6856
Comments
Maybe the issue #4207 seems to clarify that joins on custom conditions are not supported. |
Edit: updated to use cross product. Because your issue involves a specific import polars as pl
from polars import col
from datetime import date
stock_market_value = pl.DataFrame({
"market_date": [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1)],
"price": [10.00, 12.00, 14.00]
})
stock_market_orders = pl.DataFrame({
"order_date": [date(2022, 1, 15), date(2022, 2, 15)],
"quantity": [2, 5]
})
# use a backwards join-asof to find rows in market_value that have no rows in orders with order date < market date
stock_market_value = stock_market_value.with_columns(
stock_market_value.join_asof(
stock_market_orders,
left_on="market_date",
right_on="order_date",
)["order_date"].is_not_null().alias("has_match")
)
nonmatched_rows = stock_market_value.filter(col("has_match")==False).drop("has_match")
# keep all other rows and perform a cartesian product
matched_rows = stock_market_value.filter(col("has_match")==True).drop("has_match")
df = matched_rows.join(stock_market_orders, how="cross")
# filter based on our join condition
df = df.filter(col("market_date") > col("order_date"))
# concatenate the unmatched with the filtered result for our final answer
df = pl.concat((nonmatched_rows, df), how="diagonal")
print(df) Output:
|
Isn't this a duplicate of #4206 ? |
@thomasfrederikhoeck #4206 sounds more like a request for interval joins. SELECT m.date, m.price * o.quantity AS portfolio_value
FROM stock_market_value m LEFT JOIN my_stock_orders o
ON m.date >= o.date |
@Hoeze ah my bad. Thanks for clarifying. |
What about df1.join(df2, how='inner').apply(lambda x: x[0].starts_with(x[1])) |
Are there any optimizations on |
I remember this tweet from @ritchie46 |
For now, you can also use DuckDB and immediately convert to and back from Polars, in case you aren't willing to go the streaming route. |
In case the polars query planner doesn't optimize It will still be quadratic in the worst case where all A < B (or vice-versa for ">"), but less than quadratic otherwise. One can also sort A, but in my experience it doesn't improve things much - just adds another sorting step and increases code complexity, I'm new to polars so I unfortunately can't suggest a solution for the above. |
There is a recent article how to speed up complex SQL inequality joins with binning + |
Perhaps someone can confirm if this is equivalent: import polars as pl
stock_market_value = pl.DataFrame({
"market_date": [pl.date(2022, 1, 1), pl.date(2022, 2, 1), pl.date(2022, 3, 1)],
"price": [10.00, 12.00, 14.00]
})
stock_market_orders = pl.DataFrame({
"order_date": [pl.date(2022, 1, 15), pl.date(2022, 2, 15)],
"quantity": [2, 5]
})
stock_market_value = stock_market_value.lazy()
my_stock_orders = my_stock_orders.select('quantity', order_date = 'date').lazy()
slices = (
stock_market_value
.with_context(my_stock_orders)
.with_columns(start = 0)
.with_columns(
market_date = 'date',
idx = pl.arange('start', pl.col('order_date').search_sorted('date', side='right'))
)
.explode('idx')
)
(
my_stock_orders.with_context(slices)
.select(
'market_date', 'price', pl.col('order_date', 'quantity').take('idx')
)
.with_columns(
portfolio_value = pl.col('price') * pl.col('quantity')
)
.collect()
)
Or in a more general example: df_a = pl.DataFrame(dict(A=[1, 2, 3], B=[5, 4, 3])).lazy()
df_b = pl.DataFrame(dict(C=[1, 2, 3, 4, 5], D=[11, 12, 13, 14, 15])).lazy()
# shape: (3, 2) # shape: (5, 2)
# ┌─────┬─────┐ # ┌─────┬─────┐
# │ A ┆ B │ # │ C ┆ D │
# │ --- ┆ --- │ # │ --- ┆ --- │
# │ i64 ┆ i64 │ # │ i64 ┆ i64 │
# ╞═════╪═════╡ # ╞═════╪═════╡
# │ 1 ┆ 5 │ # │ 1 ┆ 11 │
# │ 2 ┆ 4 │ # │ 2 ┆ 12 │
# │ 3 ┆ 3 │ # │ 3 ┆ 13 │
# └─────┴─────┘ # │ 4 ┆ 14 │
# │ 5 ┆ 15 │
# └─────┴─────┘
slices = (
df_a.sort('A')
.with_context(df_b.sort('C'))
.with_columns(start = 0)
.with_columns(idx = pl.arange('start', pl.col('C').search_sorted('A', side='left')))
.explode('idx')
)
df_b.with_context(slices).select(
pl.col('A', 'B'), pl.col('C', 'D').take('idx')
).collect()
# shape: (4, 4)
# ┌─────┬─────┬──────┬──────┐
# │ A ┆ B ┆ C ┆ D │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪══════╪══════╡
# │ 1 ┆ 5 ┆ null ┆ null │
# │ 2 ┆ 4 ┆ 1 ┆ 11 │
# │ 3 ┆ 3 ┆ 1 ┆ 11 │
# │ 3 ┆ 3 ┆ 2 ┆ 12 │
# └─────┴─────┴──────┴──────┘ Specifying We can go in the other direction for slices = (
df_b.sort('C')
.with_context(df_a.sort('A'))
.with_columns(start = 0)
.with_columns(
idx = pl.arange('start', pl.col('A').search_sorted('C', side='left'))
)
.explode('idx')
)
df_a.with_context(slices).select(
pl.col('A', 'B').take('idx'), pl.col('C', 'D')
).collect()
# shape: (10, 4)
# ┌──────┬──────┬─────┬─────┐
# │ A ┆ B ┆ C ┆ D │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞══════╪══════╪═════╪═════╡
# │ null ┆ null ┆ 1 ┆ 11 │
# │ 1 ┆ 5 ┆ 2 ┆ 12 │
# │ 1 ┆ 5 ┆ 3 ┆ 13 │
# │ 2 ┆ 4 ┆ 3 ┆ 13 │
# │ 1 ┆ 5 ┆ 4 ┆ 14 │
# │ 2 ┆ 4 ┆ 4 ┆ 14 │
# │ 3 ┆ 3 ┆ 4 ┆ 14 │
# │ 1 ┆ 5 ┆ 5 ┆ 15 │
# │ 2 ┆ 4 ┆ 5 ┆ 15 │
# │ 3 ┆ 3 ┆ 5 ┆ 15 │
# └──────┴──────┴─────┴─────┘ Comparing these examples to duckdb generates the same results. But perhaps there's more to it and this just solves a simplified version of the problem? |
Closing in favor of #10068 |
Research
I have searched the above polars tags on Stack Overflow for similar questions.
I have asked my usage related question on Stack Overflow.
Link to question on Stack Overflow
https://stackoverflow.com/questions/75399353/join-between-polars-dataframes-with-inequality-conditions
Question about Polars
I reopen the issue as the previous #6753 was closed when a wrong answer on SO was given.
The main point, as one can read in the comments on SO is that Polars seems to do not support one-to-many join operations based on an inequality condition.
The text was updated successfully, but these errors were encountered: