Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend the parameter on of the join function for more complex operations #4207

Closed
alexanderpils opened this issue Aug 1, 2022 · 5 comments
Closed

Comments

@alexanderpils
Copy link

This is partially related to the other ticket I created #4206, since this would be also a solution to the problem but since I think it would be a major restructure and changes of the underlying code.
So my suggestion is to allow for complex expression in the parameter on in the join function to give the user more flexibility and kind of sql feeling.

So what to I mean with more complex expression, so for that I want to give an example how it could look like
So the example is the same as in #4206

Example

Data

import polars as pl
from datetime import date

df_1 = pl.DataFrame(
    {
        "id": ["abc", "abc", "456"],
        "start": [
              date(2022, 1, 1), 
             date(2022, 3, 4), 
             date(2022, 5, 11)
         ],
        "end": [
             date(2022, 2, 4), 
             date(2022, 3, 10), 
             date(2022, 5, 16)],
        "value": [10, 3, 4]

    }
)

df_2 = pl.DataFrame(
    {
        "id": ["abc", "abc", "456", "abc", "abc", "456", "abc", "abc", "abc"],
        "dates": [
            date(2022, 1, 2),
            date(2022, 3, 4),
            date(2022, 5, 11),
            date(2022, 1, 4),
            date(2022, 3, 7),
            date(2022, 5, 13),
            date(2021, 12, 31),
            date(2022, 3, 1),
            date(2023, 1, 1),
        ],
    }
).with_row_count()
df_2

So the aim is to join on id and that dates is in between start and end

Solution

(
    df_2
    .join(
        df_1,
        on = [
                  "id", 
                  (df_2.select(pl.col("dates")) >= df_1.select(pl.col("start"))) & 
                  (df_2.select(pl.col("dates")) <= df_2.select(pl.col("end")))
      ]
    )
)

So the feature request is to allow a syntax like this (df_2.select(pl.col("dates")) >= df_1.select(pl.col("start"))) which results in a series of TRUE and FALSE values in the on parameter.

@alexanderpils alexanderpils changed the title Extend the parameter on of the join function for more comple operations Extend the parameter on of the join function for more complex operations Aug 1, 2022
@ritchie46
Copy link
Member

Thanks for your suggestions.

These are non trivial requests. non-equi joins might be on the road map in the future. But is currently not in the road map as it does not fit on the join architectures that are implemented.

@alexanderpils
Copy link
Author

Ok thank you for the information. Do you want that I close then the ticket for now?

@zundertj
Copy link
Collaborator

zundertj commented Aug 1, 2022

This is a duplicate of #3438, please add your vote/support there. Closing this issue.

@evbo
Copy link

evbo commented Apr 18, 2023

@ritchie46 just curious to know if you could clarify on how non-equi joins aren't supported yet there does appear to be support for apply of custom filtering logic with cross joins, e.g.:

https://stackoverflow.com/a/74392766/1080804

So are cross-joins an exception because they exhaustively create every combination whereas other joins don't?

And.. are cross-joins the best path forward for those of us that need custom logic like this? The only other approach I could think of was filtering from dynamic list of conditions (where the list is collected from a separate df, rather than joining), e.g.:
https://stackoverflow.com/a/74994913/1080804

@ritchie46
Copy link
Member

Added non-equi joins in #18365

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants