Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend asof_join for between #4206

Closed
alexanderpils opened this issue Aug 1, 2022 · 2 comments
Closed

Extend asof_join for between #4206

alexanderpils opened this issue Aug 1, 2022 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@alexanderpils
Copy link

alexanderpils commented Aug 1, 2022

Describe your feature request

This feature request is related to my question on stackoverflow, which has been already answered by cbilot.

The problem is that currently a "between" join is syntactical not very nice because two operations are necessary to do the job therefor I am thinking to extend the asof_join function for the parameter strategy with the keyword between.

Example

Data

import polars as pl
from datetime import date

df_1 = pl.DataFrame(
    {
        "id": ["abc", "abc", "456"],
        "start": [
              date(2022, 1, 1), 
             date(2022, 3, 4), 
             date(2022, 5, 11)
         ],
        "end": [
             date(2022, 2, 4), 
             date(2022, 3, 10), 
             date(2022, 5, 16)],
        "value": [10, 3, 4]

    }
)

df_2 = pl.DataFrame(
    {
        "id": ["abc", "abc", "456", "abc", "abc", "456", "abc", "abc", "abc"],
        "dates": [
            date(2022, 1, 2),
            date(2022, 3, 4),
            date(2022, 5, 11),
            date(2022, 1, 4),
            date(2022, 3, 7),
            date(2022, 5, 13),
            date(2021, 12, 31),
            date(2022, 3, 1),
            date(2023, 1, 1),
        ],
    }
).with_row_count()
df_2

So the aim is to join on id and that dates is in between start and end

##Current Solution (by cbilot)

(
    df_2
    .sort("dates")
    .join_asof(
        df_1.sort("start").with_column(pl.col("start").alias("start_df1")),
        by="id",
        left_on="dates",
        right_on="start",
        strategy="backward",
    )
    .with_column(
        pl.when(pl.col('dates') <= pl.col('end'))
        .then(pl.col('value'))
        .otherwise(None)
    )
    .sort("row_nr")
)

Result:

shape: (9, 6)
┌────────┬─────┬────────────┬────────────┬───────┬────────────┐
│ row_nr ┆ id  ┆ dates      ┆ end        ┆ value ┆ start_df1  │
│ ---    ┆ --- ┆ ---        ┆ ---        ┆ ---   ┆ ---        │
│ u32    ┆ str ┆ date       ┆ date       ┆ i64   ┆ date       │
╞════════╪═════╪════════════╪════════════╪═══════╪════════════╡
│ 0      ┆ abc ┆ 2022-01-02 ┆ 2022-02-04 ┆ 10    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1      ┆ abc ┆ 2022-03-04 ┆ 2022-03-10 ┆ 3     ┆ 2022-03-04 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 456 ┆ 2022-05-11 ┆ 2022-05-16 ┆ 4     ┆ 2022-05-11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ abc ┆ 2022-01-04 ┆ 2022-02-04 ┆ 10    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4      ┆ abc ┆ 2022-03-07 ┆ 2022-03-10 ┆ 3     ┆ 2022-03-04 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5      ┆ 456 ┆ 2022-05-13 ┆ 2022-05-16 ┆ 4     ┆ 2022-05-11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6      ┆ abc ┆ 2021-12-31 ┆ null       ┆ null  ┆ null       │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7      ┆ abc ┆ 2022-03-01 ┆ 2022-02-04 ┆ null  ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 8      ┆ abc ┆ 2023-01-01 ┆ 2022-03-10 ┆ null  ┆ 2022-03-04 │
└────────┴─────┴────────────┴────────────┴───────┴────────────┘

Suggested Solution

(
    df_2
    .join_asof(
        df_1,
        by="id",
        left_on="dates",
        right_on=[("start", "end")],
        strategy="between",
    )
)

Result:

┌────────┬─────┬────────────┬────────────┬───────┬────────────┐
│ row_nr ┆ id  ┆ dates      ┆ end        ┆ value ┆ start  │
│ ---    ┆ --- ┆ ---        ┆ ---        ┆ ---   ┆ ---        │
│ u32    ┆ str ┆ date       ┆ date       ┆ i64   ┆ date       │
╞════════╪═════╪════════════╪════════════╪═══════╪════════════╡
│ 0      ┆ abc ┆ 2022-01-02 ┆ 2022-02-04 ┆ 10    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1      ┆ abc ┆ 2022-03-04 ┆ 2022-03-10 ┆ 3     ┆ 2022-03-04 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 456 ┆ 2022-05-11 ┆ 2022-05-16 ┆ 4     ┆ 2022-05-11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ abc ┆ 2022-01-04 ┆ 2022-02-04 ┆ 10    ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4      ┆ abc ┆ 2022-03-07 ┆ 2022-03-10 ┆ 3     ┆ 2022-03-04 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5      ┆ 456 ┆ 2022-05-13 ┆ 2022-05-16 ┆ 4     ┆ 2022-05-11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤

So three parameters would have to be adjusted for this new feature. The parameter strategy gets an additional keyword like between, the parameters right_on and left_on take list of tuples/ tuples for lower and upper bound.

@thomasfrederikhoeck
Copy link
Contributor

thomasfrederikhoeck commented Oct 12, 2022

I guess it is more than just syntactical as the answer in SO assumes non-overlapping while this doesn't :-)
From SO:

I'm going to assume that your intervals in df_1 do not overlap for a particular id - otherwise, there may not be a unique value that we can assign to the id/dates combinations in df_2

I think it would be great with a version that supports overlapping intervals. Something similar to what DuckDB is supporting here with IEJOIN https://duckdb.org/2022/05/27/iejoin.html

@ritchie46
Copy link
Member

Added non-equi joins in #18365

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

4 participants