[QST/DOC]: Join key matching semantics for expression keys #17184

Open
wence- opened this issue Jun 25, 2024 · 2 comments

@wence-
Collaborator

wence- commented Jun 25, 2024

After the merge of #17061, expression-based join keys do not appear in the output of the join. However, there are still a few open issues (especially around matching with pl.lit join keys, e.g. #9603).

A few questions about what the desired behaviour should be in some corner cases (note: these arise because in the cudf-polars work I had written slightly different broadcasting semantics for these edge cases compared to polars, so I'm trying to figure out what is "right"):

Multiple join keys provided, but they don't all produce columns of the same length

import polars as pl

left = pl.LazyFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
right = pl.LazyFrame({"a": [1, 2, 3], "b": [3, 4, 5]})

q = left.join(
    right, 
    left_on=[pl.col("a"), pl.col("b")],
    right_on=[pl.col("a").slice(0, 2), pl.col("b")],
    how="inner"
)

actual = q.collect()
print(actual)
shape: (2, 4)
┌─────┬─────┬─────────┬─────────┐
│ a   ┆ b   ┆ a_right ┆ b_right │
│ --- ┆ --- ┆ ---     ┆ ---     │
│ i64 ┆ i64 ┆ i64     ┆ i64     │
╞═════╪═════╪═════════╪═════════╡
│ 1   ┆ 3   ┆ 1       ┆ 3       │
│ 2   ┆ 4   ┆ 2       ┆ 4       │
└─────┴─────┴─────────┴─────────┘

Since pl.col("a").slice(0, 2) and pl.col("b") produce columns of different lengths when evaluated, I was expecting a ComputeError here.
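The check expected here can be sketched in pure Python. This is a model of the semantics, not polars code: `check_key_lengths` is a hypothetical helper, `ValueError` stands in for polars' `ComputeError`, and broadcasting length-1 keys is an assumption about what the desired rule might be.

```python
def check_key_lengths(keys, height):
    """Broadcast length-1 keys to `height`; reject any other mismatch."""
    out = []
    for name, values in keys.items():
        if len(values) == height:
            out.append(list(values))
        elif len(values) == 1:
            # Assumed rule: a scalar-like key is repeated for every row.
            out.append(list(values) * height)
        else:
            raise ValueError(
                f"join key {name!r} has length {len(values)}, expected {height}"
            )
    return out


right = {"a": [1, 2, 3], "b": [3, 4, 5]}
# Stand-ins for evaluating pl.col("a").slice(0, 2) and pl.col("b") on `right`.
right_keys = {"a.slice(0, 2)": right["a"][0:2], "b": right["b"]}

try:
    check_key_lengths(right_keys, height=3)
except ValueError as exc:
    print(exc)  # join key 'a.slice(0, 2)' has length 2, expected 3
```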

Join key coalescing

#17061 turns off join key coalescing if any key is not a column reference. But this seems a bit over-eager: if we join on multiple keys, only some of which are expressions, then I might expect that column references are still coalesced.

import polars as pl

left = pl.LazyFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
right = pl.LazyFrame({"a": [1, 2, 3], "b": [3, 4, 5]})

q = left.join(
    right, 
    left_on=[pl.col("a"), pl.col("b") % 4],
    right_on=[pl.col("a"), pl.col("b") % 5],
    how="inner",
    coalesce=True,
)
actual = q.collect()
print(actual)
shape: (1, 4)
┌─────┬─────┬─────────┬─────────┐
│ a   ┆ b   ┆ a_right ┆ b_right │
│ --- ┆ --- ┆ ---     ┆ ---     │
│ i64 ┆ i64 ┆ i64     ┆ i64     │
╞═════╪═════╪═════════╪═════════╡
│ 1   ┆ 3   ┆ 1       ┆ 3       │
└─────┴─────┴─────────┴─────────┘

I might expect that a is coalesced.
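One way to picture per-key coalescing is a rule that drops a right-hand column only when both sides of that key pair are plain column references. The sketch below is a pure-Python model of that rule, not how polars decides; `output_columns` and the `("col", ...)` / `("expr", ...)` key encoding are hypothetical.

```python
def output_columns(left_cols, right_cols, left_on, right_on, suffix="_right"):
    """Return output column names, coalescing only column-to-column key pairs.

    Each key is ("col", name) for a plain column reference,
    or ("expr", text) for any other expression.
    """
    dropped = set()
    for lk, rk in zip(left_on, right_on):
        if lk[0] == "col" and rk[0] == "col":
            dropped.add(rk[1])  # the right key column is folded into the left one
    out = list(left_cols)
    for name in right_cols:
        if name in dropped:
            continue
        out.append(name + suffix if name in left_cols else name)
    return out


cols = output_columns(
    ["a", "b"],
    ["a", "b"],
    left_on=[("col", "a"), ("expr", "b % 4")],
    right_on=[("col", "a"), ("expr", "b % 5")],
)
print(cols)  # ['a', 'b', 'b_right']
```

Under this rule the query above would return `a`, `b`, and `b_right`, with `a_right` coalesced away.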

Difference in behaviour when literals are keys

import polars as pl

left = pl.LazyFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
right = pl.LazyFrame({"a": [1, 2, 3], "b": [3, 4, 5]})

q1 = left.join(
    right, 
    left_on=pl.col("b"),
    right_on=pl.lit(5).cast(int),
    how="semi",
    coalesce=True,
)
actual1 = q1.collect()
print(actual1)
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 5   │
└─────┴─────┘

q2 = left.join(
    right,
    left_on=[pl.col("a"), pl.col("b")],
    right_on=[pl.col("a"), pl.lit(5).cast(int)],
    how="semi",
    coalesce=True,
)
actual2 = q2.collect()
print(actual2)
shape: (0, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
└─────┴─────┘

I was expecting these two to produce the same result because the first is (to my mind) equivalent to left.filter(pl.col("b") == pl.lit(5).cast(int)), and the second to left.filter(pl.all_horizontal(pl.col("a") == pl.col("a"), pl.col("b") == pl.lit(5).cast(int))).
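The expectation can be checked against a small pure-Python model of a semi join in which a literal key is broadcast to the height of its frame. This is only a model of the semantics described above, not polars' implementation; `evaluate`, `semi_join`, and the `("col", ...)` / `("lit", ...)` key encoding are hypothetical.

```python
def evaluate(key, frame):
    """Evaluate a key: ("col", name) reads a column, ("lit", value) broadcasts."""
    height = len(next(iter(frame.values())))
    kind, payload = key
    return list(frame[payload]) if kind == "col" else [payload] * height


def semi_join(left, right, left_on, right_on):
    """Keep left rows whose key tuple appears among the right key tuples."""
    left_keys = list(zip(*(evaluate(k, left) for k in left_on)))
    right_keys = set(zip(*(evaluate(k, right) for k in right_on)))
    rows = zip(*left.values())
    return [row for row, key in zip(rows, left_keys) if key in right_keys]


left = {"a": [1, 2, 3], "b": [3, 4, 5]}
right = {"a": [1, 2, 3], "b": [3, 4, 5]}

q1 = semi_join(left, right, [("col", "b")], [("lit", 5)])
q2 = semi_join(
    left, right, [("col", "a"), ("col", "b")], [("col", "a"), ("lit", 5)]
)
print(q1, q2)  # [(3, 5)] [(3, 5)]
```

With broadcasting, both queries keep only the row where a is 3 and b is 5, which is the result the first query returns above.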

@ritchie46
Member

I completely missed this one.

Multiple join keys provided, but they don't all produce columns of the same length

This should definitely raise. We should check if the join key expression is elementwise during conversion from DSL to IR.
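As a rough illustration of such a check, the toy expression tree below marks which nodes preserve row count and rejects keys that contain anything else. The classes and `validate_join_keys` are hypothetical and do not mirror polars' DSL or IR types; treating literals as acceptable is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class BinaryOp:
    op: str
    left: object
    right: object


@dataclass(frozen=True)
class Slice:
    input: object
    offset: int
    length: int


def is_elementwise(expr):
    """True if the expression yields one output row per input row."""
    if isinstance(expr, (Col, Lit)):
        return True  # assumption: literals broadcast, so they are accepted
    if isinstance(expr, BinaryOp):
        return is_elementwise(expr.left) and is_elementwise(expr.right)
    return False  # Slice, and anything unknown, may change the column length


def validate_join_keys(keys):
    for key in keys:
        if not is_elementwise(key):
            raise ValueError(f"join key is not elementwise: {key!r}")


validate_join_keys([Col("a"), BinaryOp("%", Col("b"), Lit(4))])  # passes
```

Calling `validate_join_keys([Slice(Col("a"), 0, 2), Col("b")])` raises in this model, which matches the first example in the issue.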

#17061 turns off join key coalescing if any key is not a column reference. But this seems a bit over-eager: if we join on multiple keys, only some of which are expressions, then I might expect that column references are still coalesced.

I think we should warn if a user asks for explicit coalescing and passes some non-column expressions as keys, since this requires more complicated coalescing logic that isn't supported at the moment.
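A minimal sketch of what such a warning could look like, using Python's standard `warnings` module. `check_coalesce`, the key encoding, and the message text are hypothetical, and falling back to no coalescing is an assumption rather than confirmed polars behaviour.

```python
import warnings


def check_coalesce(left_on, right_on, coalesce):
    """Return the effective coalesce flag, warning if the request cannot be met.

    Each key is ("col", name) for a plain column reference,
    or ("expr", text) for any other expression.
    """
    all_columns = all(k[0] == "col" for k in [*left_on, *right_on])
    if coalesce and not all_columns:
        warnings.warn(
            "coalesce=True is ignored because some join keys are not "
            "plain column references",
            UserWarning,
            stacklevel=2,
        )
        return False  # assumed fallback: do not coalesce at all
    return bool(coalesce)


effective = check_coalesce(
    [("col", "a"), ("expr", "b % 4")],
    [("col", "a"), ("expr", "b % 5")],
    coalesce=True,
)
print(effective)  # False, after emitting a UserWarning
```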

Difference in behaviour when literals are keys

I am also surprised by that one. Seems like a bug. 🤔 Would have to investigate.

@wence-
Collaborator Author

wence- commented Jul 9, 2024

Thanks. I opened #17517 for the first two points, but didn't manage to dig into the last one at all.
