Join between Polars dataframes with inequality conditions #6856

andrearota · 2023-02-13T15:57:18Z

Research

I have searched the above polars tags on Stack Overflow for similar questions.
I have asked my usage related question on Stack Overflow.

Link to question on Stack Overflow

https://stackoverflow.com/questions/75399353/join-between-polars-dataframes-with-inequality-conditions

Question about Polars

I reopen the issue as the previous #6753 was closed when a wrong answer on SO was given.

The main point, as one can read in the comments on SO is that Polars seems to do not support one-to-many join operations based on an inequality condition.

andrearota · 2023-02-13T15:59:24Z

Maybe the issue #4207 seems to clarify that joins on custom conditions are not supported.

mcrumiller · 2023-02-13T23:18:27Z

Edit: updated to use cross product.

Because your issue involves a specific JOIN clause, here is a hacky workaround specific to your question that gives the correct result:

import polars as pl
from polars import col
from datetime import date

stock_market_value = pl.DataFrame({
    "market_date": [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1)],
    "price": [10.00, 12.00, 14.00]
})

stock_market_orders = pl.DataFrame({
    "order_date": [date(2022, 1, 15), date(2022, 2, 15)],
    "quantity": [2, 5]
})

# use a backwards join-asof to find rows in market_value that have no rows in orders with order date < market date
stock_market_value = stock_market_value.with_columns(
        stock_market_value.join_asof(
        stock_market_orders,
        left_on="market_date",
        right_on="order_date",
    )["order_date"].is_not_null().alias("has_match")
)
nonmatched_rows = stock_market_value.filter(col("has_match")==False).drop("has_match")

# keep all other rows and perform a cartesian product
matched_rows = stock_market_value.filter(col("has_match")==True).drop("has_match")
df = matched_rows.join(stock_market_orders, how="cross")

# filter based on our join condition
df = df.filter(col("market_date") > col("order_date"))

# concatenate the unmatched with the filtered result for our final answer
df = pl.concat((nonmatched_rows, df), how="diagonal")

print(df)

Output:

shape: (4, 4)
┌─────────────┬───────┬────────────┬──────────┐
│ market_date ┆ price ┆ order_date ┆ quantity │
│ ---         ┆ ---   ┆ ---        ┆ ---      │
│ date        ┆ f64   ┆ date       ┆ i64      │
╞═════════════╪═══════╪════════════╪══════════╡
│ 2022-01-01  ┆ 10.0  ┆ null       ┆ null     │
│ 2022-02-01  ┆ 12.0  ┆ 2022-01-15 ┆ 2        │
│ 2022-03-01  ┆ 14.0  ┆ 2022-01-15 ┆ 2        │
│ 2022-03-01  ┆ 14.0  ┆ 2022-02-15 ┆ 5        │
└─────────────┴───────┴────────────┴──────────┘

thomasfrederikhoeck · 2023-02-14T06:28:06Z

Isn't this a duplicate of #4206 ?

Hoeze · 2023-02-27T12:34:24Z

@thomasfrederikhoeck #4206 sounds more like a request for interval joins.
This issue is explicitly for inequality joins:

SELECT m.date, m.price * o.quantity AS portfolio_value
FROM stock_market_value m LEFT JOIN my_stock_orders o
  ON m.date >= o.date

thomasfrederikhoeck · 2023-02-27T13:59:17Z

@Hoeze ah my bad. Thanks for clarifying.

evbo · 2023-04-18T17:08:42Z

What about str inequality, such as starts_with. Could there be a way to enable apply in the context of inner joins so that you could do something like the OP has suggested or even more general?:

df1.join(df2, how='inner').apply(lambda x: x[0].starts_with(x[1]))

Hoeze · 2023-05-15T18:38:50Z

Are there any optimizations on cross joins with inequality filters?

gab23r · 2023-05-15T19:05:47Z

I remember this tweet from @ritchie46
https://twitter.com/RitchieVink/status/1587098941430145025?t=b60DEHNbTqkBlotXpSxqBg&s=19

avimallu · 2023-05-16T01:46:20Z

For now, you can also use DuckDB and immediately convert to and back from Polars, in case you aren't willing to go the streaming route.

fburic · 2023-05-23T10:35:38Z

In case the polars query planner doesn't optimize cross with inequality filters, there's a way to avoid a Cartesian, by sorting one of the join columns (ideally the longer) - call it B. Then, for each entry in A, a lookup for the first match takes log B. Since B is sorted, one can in the same step collect all e.g. higher values following this first match. So, in total log B + Q for each entry in A, where Q = n. of matches in B.

It will still be quadratic in the worst case where all A < B (or vice-versa for ">"), but less than quadratic otherwise. One can also sort A, but in my experience it doesn't improve things much - just adds another sorting step and increases code complexity,

I'm new to polars so I unfortunately can't suggest a solution for the above.
But I wrote this for the Pandas ecosystem, should anyone want to implement it for polars: https://pandance.readthedocs.io/en/latest/operations/ineq_join.html#pandance.ineq_join (a more detailed time complexity description is given there).

Hoeze · 2023-05-23T10:59:31Z

There is a recent article how to speed up complex SQL inequality joins with binning + approx_percentile:
https://select.dev/posts/snowflake-range-join-optimization

cmdlineluser · 2023-07-01T23:33:22Z

Perhaps someone can confirm if this is equivalent:

import polars as pl

stock_market_value = pl.DataFrame({
    "market_date": [pl.date(2022, 1, 1), pl.date(2022, 2, 1), pl.date(2022, 3, 1)],
    "price": [10.00, 12.00, 14.00]
})

stock_market_orders = pl.DataFrame({
    "order_date": [pl.date(2022, 1, 15), pl.date(2022, 2, 15)],
    "quantity": [2, 5]
})

stock_market_value = stock_market_value.lazy()
my_stock_orders = my_stock_orders.select('quantity', order_date = 'date').lazy()

slices = (
   stock_market_value
     .with_context(my_stock_orders)
     .with_columns(start = 0)
     .with_columns(
        market_date = 'date',
        idx = pl.arange('start', pl.col('order_date').search_sorted('date', side='right'))
      )
      .explode('idx')
)

(
   my_stock_orders.with_context(slices)
     .select(
        'market_date', 'price', pl.col('order_date', 'quantity').take('idx')
     )
     .with_columns(
        portfolio_value = pl.col('price') * pl.col('quantity')
     )
     .collect()
)

shape: (4, 5)
┌─────────────┬───────┬────────────┬──────────┬─────────────────┐
│ market_date ┆ price ┆ order_date ┆ quantity ┆ portfolio_value │
│ ---         ┆ ---   ┆ ---        ┆ ---      ┆ ---             │
│ date        ┆ f64   ┆ date       ┆ i64      ┆ f64             │
╞═════════════╪═══════╪════════════╪══════════╪═════════════════╡
│ 2022-01-01  ┆ 10.0  ┆ null       ┆ null     ┆ null            │
│ 2022-02-01  ┆ 12.0  ┆ 2022-01-15 ┆ 2        ┆ 24.0            │
│ 2022-03-01  ┆ 14.0  ┆ 2022-01-15 ┆ 2        ┆ 28.0            │
│ 2022-03-01  ┆ 14.0  ┆ 2022-02-15 ┆ 5        ┆ 70.0            │
└─────────────┴───────┴────────────┴──────────┴─────────────────┘

Or in a more general example:

df_a = pl.DataFrame(dict(A=[1, 2, 3], B=[5, 4, 3])).lazy()
df_b = pl.DataFrame(dict(C=[1, 2, 3, 4, 5], D=[11, 12, 13, 14, 15])).lazy()

# shape: (3, 2)    # shape: (5, 2)
# ┌─────┬─────┐    # ┌─────┬─────┐
# │ A   ┆ B   │    # │ C   ┆ D   │
# │ --- ┆ --- │    # │ --- ┆ --- │
# │ i64 ┆ i64 │    # │ i64 ┆ i64 │
# ╞═════╪═════╡    # ╞═════╪═════╡
# │ 1   ┆ 5   │    # │ 1   ┆ 11  │
# │ 2   ┆ 4   │    # │ 2   ┆ 12  │
# │ 3   ┆ 3   │    # │ 3   ┆ 13  │
# └─────┴─────┘    # │ 4   ┆ 14  │
                   # │ 5   ┆ 15  │
                   # └─────┴─────┘

df_a.A > df_b.C

slices = (
   df_a.sort('A')
       .with_context(df_b.sort('C'))
       .with_columns(start = 0)
       .with_columns(idx = pl.arange('start', pl.col('C').search_sorted('A', side='left')))
       .explode('idx')
)

df_b.with_context(slices).select(
   pl.col('A', 'B'), pl.col('C', 'D').take('idx')
).collect()

# shape: (4, 4)
# ┌─────┬─────┬──────┬──────┐
# │ A   ┆ B   ┆ C    ┆ D    │
# │ --- ┆ --- ┆ ---  ┆ ---  │
# │ i64 ┆ i64 ┆ i64  ┆ i64  │
# ╞═════╪═════╪══════╪══════╡
# │ 1   ┆ 5   ┆ null ┆ null │
# │ 2   ┆ 4   ┆ 1    ┆ 11   │
# │ 3   ┆ 3   ┆ 1    ┆ 11   │
# │ 3   ┆ 3   ┆ 2    ┆ 12   │
# └─────┴─────┴──────┴──────┘

Specifying side='right' in the .search_sorted() would be "or equals to".

We can go in the other direction for df_a.A < df_b.C

slices = (
   df_b.sort('C')
       .with_context(df_a.sort('A'))
       .with_columns(start = 0) 
       .with_columns(
          idx = pl.arange('start', pl.col('A').search_sorted('C', side='left'))
       )
       .explode('idx')
)

df_a.with_context(slices).select(
   pl.col('A', 'B').take('idx'), pl.col('C', 'D')
).collect()

# shape: (10, 4)
# ┌──────┬──────┬─────┬─────┐
# │ A    ┆ B    ┆ C   ┆ D   │
# │ ---  ┆ ---  ┆ --- ┆ --- │
# │ i64  ┆ i64  ┆ i64 ┆ i64 │
# ╞══════╪══════╪═════╪═════╡
# │ null ┆ null ┆ 1   ┆ 11  │
# │ 1    ┆ 5    ┆ 2   ┆ 12  │
# │ 1    ┆ 5    ┆ 3   ┆ 13  │
# │ 2    ┆ 4    ┆ 3   ┆ 13  │
# │ 1    ┆ 5    ┆ 4   ┆ 14  │
# │ 2    ┆ 4    ┆ 4   ┆ 14  │
# │ 3    ┆ 3    ┆ 4   ┆ 14  │
# │ 1    ┆ 5    ┆ 5   ┆ 15  │
# │ 2    ┆ 4    ┆ 5   ┆ 15  │
# │ 3    ┆ 3    ┆ 5   ┆ 15  │
# └──────┴──────┴─────┴─────┘

Comparing these examples to duckdb generates the same results.

But perhaps there's more to it and this just solves a simplified version of the problem?

stinodego · 2024-01-11T13:45:39Z

Closing in favor of #10068

andrearota added the question label Feb 13, 2023

mcrumiller mentioned this issue Feb 13, 2023

Add cartesian product #6863

Closed

stinodego added enhancement New feature or an improvement of an existing feature and removed question labels Aug 22, 2023

Hoeze mentioned this issue Sep 27, 2023

Non-equi join tracking issue #10068

Closed

stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Jan 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Join between Polars dataframes with inequality conditions #6856

Join between Polars dataframes with inequality conditions #6856

andrearota commented Feb 13, 2023

andrearota commented Feb 13, 2023

mcrumiller commented Feb 13, 2023 •

edited

Loading

thomasfrederikhoeck commented Feb 14, 2023

Hoeze commented Feb 27, 2023

thomasfrederikhoeck commented Feb 27, 2023

evbo commented Apr 18, 2023

Hoeze commented May 15, 2023

gab23r commented May 15, 2023

avimallu commented May 16, 2023 •

edited

Loading

fburic commented May 23, 2023

Hoeze commented May 23, 2023 •

edited

Loading

cmdlineluser commented Jul 1, 2023 •

edited

Loading

stinodego commented Jan 11, 2024

Join between Polars dataframes with inequality conditions #6856

Join between Polars dataframes with inequality conditions #6856

Comments

andrearota commented Feb 13, 2023

Research

Link to question on Stack Overflow

Question about Polars

andrearota commented Feb 13, 2023

mcrumiller commented Feb 13, 2023 • edited Loading

thomasfrederikhoeck commented Feb 14, 2023

Hoeze commented Feb 27, 2023

thomasfrederikhoeck commented Feb 27, 2023

evbo commented Apr 18, 2023

Hoeze commented May 15, 2023

gab23r commented May 15, 2023

avimallu commented May 16, 2023 • edited Loading

fburic commented May 23, 2023

Hoeze commented May 23, 2023 • edited Loading

cmdlineluser commented Jul 1, 2023 • edited Loading

stinodego commented Jan 11, 2024

mcrumiller commented Feb 13, 2023 •

edited

Loading

avimallu commented May 16, 2023 •

edited

Loading

Hoeze commented May 23, 2023 •

edited

Loading

cmdlineluser commented Jul 1, 2023 •

edited

Loading