Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join between Polars dataframes with inequality conditions #6856

Closed
2 tasks done
andrearota opened this issue Feb 13, 2023 · 13 comments
Closed
2 tasks done

Join between Polars dataframes with inequality conditions #6856

andrearota opened this issue Feb 13, 2023 · 13 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@andrearota
Copy link

Research

  • I have searched the above polars tags on Stack Overflow for similar questions.

  • I have asked my usage related question on Stack Overflow.

Link to question on Stack Overflow

https://stackoverflow.com/questions/75399353/join-between-polars-dataframes-with-inequality-conditions

Question about Polars

I reopen the issue as the previous #6753 was closed when a wrong answer on SO was given.

The main point, as one can read in the comments on SO is that Polars seems to do not support one-to-many join operations based on an inequality condition.

@andrearota
Copy link
Author

Maybe the issue #4207 seems to clarify that joins on custom conditions are not supported.

@mcrumiller
Copy link
Contributor

mcrumiller commented Feb 13, 2023

Edit: updated to use cross product.

Because your issue involves a specific JOIN clause, here is a hacky workaround specific to your question that gives the correct result:

import polars as pl
from polars import col
from datetime import date

stock_market_value = pl.DataFrame({
    "market_date": [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1)],
    "price": [10.00, 12.00, 14.00]
})

stock_market_orders = pl.DataFrame({
    "order_date": [date(2022, 1, 15), date(2022, 2, 15)],
    "quantity": [2, 5]
})

# use a backwards join-asof to find rows in market_value that have no rows in orders with order date < market date
stock_market_value = stock_market_value.with_columns(
        stock_market_value.join_asof(
        stock_market_orders,
        left_on="market_date",
        right_on="order_date",
    )["order_date"].is_not_null().alias("has_match")
)
nonmatched_rows = stock_market_value.filter(col("has_match")==False).drop("has_match")

# keep all other rows and perform a cartesian product
matched_rows = stock_market_value.filter(col("has_match")==True).drop("has_match")
df = matched_rows.join(stock_market_orders, how="cross")

# filter based on our join condition
df = df.filter(col("market_date") > col("order_date"))

# concatenate the unmatched with the filtered result for our final answer
df = pl.concat((nonmatched_rows, df), how="diagonal")

print(df)

Output:

shape: (4, 4)
┌─────────────┬───────┬────────────┬──────────┐
│ market_date ┆ price ┆ order_date ┆ quantity │
│ ---         ┆ ---   ┆ ---        ┆ ---      │
│ date        ┆ f64   ┆ date       ┆ i64      │
╞═════════════╪═══════╪════════════╪══════════╡
│ 2022-01-01  ┆ 10.0  ┆ null       ┆ null     │
│ 2022-02-01  ┆ 12.0  ┆ 2022-01-15 ┆ 2        │
│ 2022-03-01  ┆ 14.0  ┆ 2022-01-15 ┆ 2        │
│ 2022-03-01  ┆ 14.0  ┆ 2022-02-15 ┆ 5        │
└─────────────┴───────┴────────────┴──────────┘

@thomasfrederikhoeck
Copy link
Contributor

Isn't this a duplicate of #4206 ?

@Hoeze
Copy link

Hoeze commented Feb 27, 2023

@thomasfrederikhoeck #4206 sounds more like a request for interval joins.
This issue is explicitly for inequality joins:

SELECT m.date, m.price * o.quantity AS portfolio_value
FROM stock_market_value m LEFT JOIN my_stock_orders o
  ON m.date >= o.date

@thomasfrederikhoeck
Copy link
Contributor

@Hoeze ah my bad. Thanks for clarifying.

@evbo
Copy link

evbo commented Apr 18, 2023

What about str inequality, such as starts_with. Could there be a way to enable apply in the context of inner joins so that you could do something like the OP has suggested or even more general?:

df1.join(df2, how='inner').apply(lambda x: x[0].starts_with(x[1]))

@Hoeze
Copy link

Hoeze commented May 15, 2023

Are there any optimizations on cross joins with inequality filters?

@gab23r
Copy link
Contributor

gab23r commented May 15, 2023

@avimallu
Copy link
Contributor

avimallu commented May 16, 2023

For now, you can also use DuckDB and immediately convert to and back from Polars, in case you aren't willing to go the streaming route.

@fburic
Copy link

fburic commented May 23, 2023

In case the polars query planner doesn't optimize cross with inequality filters, there's a way to avoid a Cartesian, by sorting one of the join columns (ideally the longer) - call it B. Then, for each entry in A, a lookup for the first match takes log B. Since B is sorted, one can in the same step collect all e.g. higher values following this first match. So, in total log B + Q for each entry in A, where Q = n. of matches in B.

It will still be quadratic in the worst case where all A < B (or vice-versa for ">"), but less than quadratic otherwise. One can also sort A, but in my experience it doesn't improve things much - just adds another sorting step and increases code complexity,

I'm new to polars so I unfortunately can't suggest a solution for the above.
But I wrote this for the Pandas ecosystem, should anyone want to implement it for polars: https://pandance.readthedocs.io/en/latest/operations/ineq_join.html#pandance.ineq_join (a more detailed time complexity description is given there).

@Hoeze
Copy link

Hoeze commented May 23, 2023

There is a recent article how to speed up complex SQL inequality joins with binning + approx_percentile:
https://select.dev/posts/snowflake-range-join-optimization

@cmdlineluser
Copy link
Contributor

cmdlineluser commented Jul 1, 2023

Perhaps someone can confirm if this is equivalent:

import polars as pl

stock_market_value = pl.DataFrame({
    "market_date": [pl.date(2022, 1, 1), pl.date(2022, 2, 1), pl.date(2022, 3, 1)],
    "price": [10.00, 12.00, 14.00]
})

stock_market_orders = pl.DataFrame({
    "order_date": [pl.date(2022, 1, 15), pl.date(2022, 2, 15)],
    "quantity": [2, 5]
})

stock_market_value = stock_market_value.lazy()
my_stock_orders = my_stock_orders.select('quantity', order_date = 'date').lazy()

slices = (
   stock_market_value
     .with_context(my_stock_orders)
     .with_columns(start = 0)
     .with_columns(
        market_date = 'date',
        idx = pl.arange('start', pl.col('order_date').search_sorted('date', side='right'))
      )
      .explode('idx')
)

(
   my_stock_orders.with_context(slices)
     .select(
        'market_date', 'price', pl.col('order_date', 'quantity').take('idx')
     )
     .with_columns(
        portfolio_value = pl.col('price') * pl.col('quantity')
     )
     .collect()
)
shape: (4, 5)
┌─────────────┬───────┬────────────┬──────────┬─────────────────┐
│ market_date ┆ price ┆ order_date ┆ quantity ┆ portfolio_value │
│ ---         ┆ ---   ┆ ---        ┆ ---      ┆ ---             │
│ date        ┆ f64   ┆ date       ┆ i64      ┆ f64             │
╞═════════════╪═══════╪════════════╪══════════╪═════════════════╡
│ 2022-01-01  ┆ 10.0  ┆ null       ┆ null     ┆ null            │
│ 2022-02-01  ┆ 12.0  ┆ 2022-01-15 ┆ 2        ┆ 24.0            │
│ 2022-03-01  ┆ 14.0  ┆ 2022-01-15 ┆ 2        ┆ 28.0            │
│ 2022-03-01  ┆ 14.0  ┆ 2022-02-15 ┆ 5        ┆ 70.0            │
└─────────────┴───────┴────────────┴──────────┴─────────────────┘

Or in a more general example:

df_a = pl.DataFrame(dict(A=[1, 2, 3], B=[5, 4, 3])).lazy()
df_b = pl.DataFrame(dict(C=[1, 2, 3, 4, 5], D=[11, 12, 13, 14, 15])).lazy()

# shape: (3, 2)    # shape: (5, 2)
# ┌─────┬─────┐    # ┌─────┬─────┐
# │ A   ┆ B   │    # │ C   ┆ D   │
# │ --- ┆ --- │    # │ --- ┆ --- │
# │ i64 ┆ i64 │    # │ i64 ┆ i64 │
# ╞═════╪═════╡    # ╞═════╪═════╡
# │ 1   ┆ 5   │    # │ 1   ┆ 11  │
# │ 2   ┆ 4   │    # │ 2   ┆ 12  │
# │ 3   ┆ 3   │    # │ 3   ┆ 13  │
# └─────┴─────┘    # │ 4   ┆ 14  │
                   # │ 5   ┆ 15  │
                   # └─────┴─────┘

df_a.A > df_b.C

slices = (
   df_a.sort('A')
       .with_context(df_b.sort('C'))
       .with_columns(start = 0)
       .with_columns(idx = pl.arange('start', pl.col('C').search_sorted('A', side='left')))
       .explode('idx')
)

df_b.with_context(slices).select(
   pl.col('A', 'B'), pl.col('C', 'D').take('idx')
).collect()

# shape: (4, 4)
# ┌─────┬─────┬──────┬──────┐
# │ A   ┆ B   ┆ C    ┆ D    │
# │ --- ┆ --- ┆ ---  ┆ ---  │
# │ i64 ┆ i64 ┆ i64  ┆ i64  │
# ╞═════╪═════╪══════╪══════╡
# │ 1   ┆ 5   ┆ null ┆ null │
# │ 2   ┆ 4   ┆ 1    ┆ 11   │
# │ 3   ┆ 3   ┆ 1    ┆ 11   │
# │ 3   ┆ 3   ┆ 2    ┆ 12   │
# └─────┴─────┴──────┴──────┘

Specifying side='right' in the .search_sorted() would be "or equals to".

We can go in the other direction for df_a.A < df_b.C

slices = (
   df_b.sort('C')
       .with_context(df_a.sort('A'))
       .with_columns(start = 0) 
       .with_columns(
          idx = pl.arange('start', pl.col('A').search_sorted('C', side='left'))
       )
       .explode('idx')
)

df_a.with_context(slices).select(
   pl.col('A', 'B').take('idx'), pl.col('C', 'D')
).collect()

# shape: (10, 4)
# ┌──────┬──────┬─────┬─────┐
# │ A    ┆ B    ┆ C   ┆ D   │
# │ ---  ┆ ---  ┆ --- ┆ --- │
# │ i64  ┆ i64  ┆ i64 ┆ i64 │
# ╞══════╪══════╪═════╪═════╡
# │ null ┆ null ┆ 1   ┆ 11  │
# │ 1    ┆ 5    ┆ 2   ┆ 12  │
# │ 1    ┆ 5    ┆ 3   ┆ 13  │
# │ 2    ┆ 4    ┆ 3   ┆ 13  │
# │ 1    ┆ 5    ┆ 4   ┆ 14  │
# │ 2    ┆ 4    ┆ 4   ┆ 14  │
# │ 3    ┆ 3    ┆ 4   ┆ 14  │
# │ 1    ┆ 5    ┆ 5   ┆ 15  │
# │ 2    ┆ 4    ┆ 5   ┆ 15  │
# │ 3    ┆ 3    ┆ 5   ┆ 15  │
# └──────┴──────┴─────┴─────┘

Comparing these examples to duckdb generates the same results.

But perhaps there's more to it and this just solves a simplified version of the problem?

@stinodego stinodego added enhancement New feature or an improvement of an existing feature and removed question labels Aug 22, 2023
@stinodego
Copy link
Member

Closing in favor of #10068

@stinodego stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

10 participants