Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

join changes: broadcast left/right_on expressions + omit left_on/right_on expressions in result #14007

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

edavisau
Copy link
Contributor

@edavisau edavisau commented Jan 26, 2024

Hi @ritchie46, I'm seeking early feedback on this PR

At first I was investigating #9603. The problem there is that literals (/ single value expressions) are not being broadcasted properly.

Then I was reading the code and noticed some other issues. We have disallowed aliases in join expressions (#6312), however, this still has some flaws when there are duplicate names: e.g.

df1 = pl.DataFrame({
    'a': [1, 2, 3],
})

df2 = pl.DataFrame({
    'a': [2],
    'b': [4],
    'c': [6],
    'extra_col': ["foo"]
})


df1.join(
    df2,
    left_on=[
    	'a', 
    	pl.col('a') * 2,  # also named "a"
    	pl.col('a') * 3,  # also named "a'"
    ],
    right_on=['a', 'b', 'c'],
    how="left",
)
shape: (3, 2)
┌─────┬───────────┐
│ a   ┆ extra_col │
│ --- ┆ ---       │
│ i64 ┆ str       │
╞═════╪═══════════╡
│ 3   ┆ null      │
│ 6   ┆ foo       │
│ 9   ┆ null      │
└─────┴───────────┘

The column a is overridden with the last expression in left_on. I later saw similar reports in #8874, #13220

My idea for the latter problem is to not hstack the left_on/right_on expressions onto the dataframes before the join (as currently done). I assume these series do not need to be kept in the result, so they are dropped when the join has finished. If we do want to keep them, we could either error when there are duplicate column names, or add suffixes to columns e.g. a_left1, a_left2 in the example above.

Assuming we don't keep those series, the next problem is knowing which ones to drop. Example

df1 = pl.DataFrame([
    pl.Series("a", ["1", "2"], pl.String),
])

df2 = pl.DataFrame([
    pl.Series("b", ["1", "2"], pl.String),
    pl.Series("c", ["c", "c"], pl.String),
    pl.Series("d", [3, 5], pl.Int32)
])


df1.join(
    df2,
    left_on=["a", pl.lit("c")],
    right_on=["b", "c"],
    how="left",
)

Currently we will get the following result. Note that "c" is dropped, it seems to me that you want to keep it in this situation.

shape: (2, 2)
┌─────┬─────┐
│ a   ┆ d   │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 5   │
└─────┴─────┘

A sensible new condition for dropping columns in the right df could be: if the left_on/right_on expressions were already columns in the original dfs. I have added an implementation to do this for left joins in the PR so far.

Example

df1 = pl.DataFrame([
    pl.Series("a", ["1", "2"], pl.String),
    pl.Series("b", [1, 1], pl.Int32),
])

df2 = pl.DataFrame([
    pl.Series("c", ["2", "3"], pl.String),
    pl.Series("d", [2, 2], pl.Int32),
])


df1.join(
    df2,
    left_on=["a", pl.lit(2), "b", pl.lit("foo")],
    right_on=["c", "d", pl.lit(1), pl.lit("foo")],
    how="left",
)
shape: (2, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ d    │
│ --- ┆ --- ┆ ---  │
│ str ┆ i32 ┆ i32  │
╞═════╪═════╪══════╡
│ 1   ┆ 1   ┆ null │
│ 2   ┆ 1   ┆ 2    │
└─────┴─────┴──────┘

eprintln: Dropping names in right df: ["c"]

We dropped c here because a and c are not calculated expressions and they already exist in the left/right dataframes.

I hope these examples are not too confusing 😄

Final thing, if we are not including the left_on/right_on series, do we still need to disallow aliases?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant