join changes: broadcast left/right_on expressions + omit left_on/right_on expressions in result #14007

edavisau · 2024-01-26T06:34:53Z

Hi @ritchie46, I'm seeking early feedback on this PR

At first I was investigating #9603. The problem there is that literals (and other single-value expressions) are not being broadcast properly.
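To make "broadcast" concrete, here is a minimal pure-Python sketch of what I would expect to happen to a single-value key before the join. It is not the Polars implementation; `broadcast_key` is a hypothetical helper and the key columns are plain lists.

```python
def broadcast_key(values, height):
    """Stretch a single-value key to `height` rows; leave full-length keys alone."""
    if len(values) == height:
        return list(values)
    if len(values) == 1:
        # A literal evaluates to one value, so repeat it for every row.
        return list(values) * height
    raise ValueError(f"key has length {len(values)}, expected 1 or {height}")


left_a = ["1", "2"]   # a regular column with two rows
literal_c = ["c"]     # what a literal key evaluates to: a single value

# Every key column must end up with the same height as the frame.
keys = [broadcast_key(k, len(left_a)) for k in (left_a, literal_c)]
print(keys)  # [['1', '2'], ['c', 'c']]
```

Without this step the literal key has one row while the other keys have two, so the key columns cannot be compared row by row.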

Then I was reading the code and noticed some other issues. We have disallowed aliases in join expressions (#6312), but this still has flaws when there are duplicate names, e.g.:

import polars as pl

df1 = pl.DataFrame({
    'a': [1, 2, 3],
})

df2 = pl.DataFrame({
    'a': [2],
    'b': [4],
    'c': [6],
    'extra_col': ["foo"]
})


df1.join(
    df2,
    left_on=[
        'a',
        pl.col('a') * 2,  # also named "a"
        pl.col('a') * 3,  # also named "a"
    ],
    right_on=['a', 'b', 'c'],
    how="left",
)

shape: (3, 2)
┌─────┬───────────┐
│ a   ┆ extra_col │
│ --- ┆ ---       │
│ i64 ┆ str       │
╞═════╪═══════════╡
│ 3   ┆ null      │
│ 6   ┆ foo       │
│ 9   ┆ null      │
└─────┴───────────┘

The column a is overridden by the last expression in left_on. I later saw similar reports in #8874 and #13220.
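The overriding can be reproduced without Polars. The sketch below is only an illustration under the assumption that the key expressions are stacked onto the frame by output name; the frame is modelled as a plain dict, which is my simplification and not the actual code path.

```python
df1 = {"a": [1, 2, 3]}

# (output name, values) for each left_on expression.
# Aliases are disallowed, so all three expressions are named "a".
left_on = [
    ("a", df1["a"]),
    ("a", [v * 2 for v in df1["a"]]),
    ("a", [v * 3 for v in df1["a"]]),
]

stacked = dict(df1)
for name, values in left_on:
    # Same name as an existing column: the previous values are replaced.
    stacked[name] = values

print(stacked)  # {'a': [3, 6, 9]}
```

Only the last expression survives, which matches the `a` column in the output above.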

My idea for the latter problem is to not hstack the left_on/right_on expressions onto the dataframes before the join (as currently done). I assume these series do not need to be kept in the result, so they are dropped when the join has finished. If we do want to keep them, we could either error when there are duplicate column names, or add suffixes to columns e.g. a_left1, a_left2 in the example above.
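As a rough sketch of the idea (again pure Python over dict-of-lists frames, not the Rust code in this PR), the key columns can be passed next to the frames instead of being stacked onto them. `left_join` and its `right_out` parameter are hypothetical names; which right-hand columns belong in the output is a separate question.

```python
def left_join(left, right, left_keys, right_keys, right_out):
    """Left join two dict-of-lists frames on key columns passed separately.

    left_keys / right_keys are already evaluated key columns. They are used
    for matching only and are never added to the output.
    Assumes the names in right_out do not clash with names in left.
    """
    n_left = len(next(iter(left.values())))
    n_right = len(next(iter(right.values())))

    # Map each right-hand key tuple to the rows where it occurs.
    index = {}
    for j in range(n_right):
        index.setdefault(tuple(k[j] for k in right_keys), []).append(j)

    out = {name: [] for name in list(left) + list(right_out)}
    for i in range(n_left):
        key = tuple(k[i] for k in left_keys)
        for j in index.get(key, [None]):  # None marks "no match"
            for name in left:
                out[name].append(left[name][i])
            for name in right_out:
                out[name].append(None if j is None else right[name][j])
    return out


df1 = {"a": [1, 2, 3]}
df2 = {"a": [2], "b": [4], "c": [6], "extra_col": ["foo"]}

a = df1["a"]
result = left_join(
    df1,
    df2,
    left_keys=[a, [v * 2 for v in a], [v * 3 for v in a]],
    right_keys=[df2["a"], df2["b"], df2["c"]],
    right_out=["extra_col"],
)
print(result)  # {'a': [1, 2, 3], 'extra_col': [None, 'foo', None]}
```

Because the three key expressions never get a column name, they cannot collide, and the original `a` column comes through unchanged.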

Assuming we don't keep those series, the next problem is knowing which ones to drop. Example:

df1 = pl.DataFrame([
    pl.Series("a", ["1", "2"], pl.String),
])

df2 = pl.DataFrame([
    pl.Series("b", ["1", "2"], pl.String),
    pl.Series("c", ["c", "c"], pl.String),
    pl.Series("d", [3, 5], pl.Int32)
])


df1.join(
    df2,
    left_on=["a", pl.lit("c")],
    right_on=["b", "c"],
    how="left",
)

Currently we get the following result. Note that "c" is dropped; it seems to me that you would want to keep it in this situation.

shape: (2, 2)
┌─────┬─────┐
│ a   ┆ d   │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 5   │
└─────┴─────┘

A sensible new condition for dropping a column in the right df could be: drop it only if the left_on/right_on expressions were already columns in the original dfs (i.e. not computed expressions). I have added an implementation of this for left joins in the PR so far.

Example

df1 = pl.DataFrame([
    pl.Series("a", ["1", "2"], pl.String),
    pl.Series("b", [1, 1], pl.Int32),
])

df2 = pl.DataFrame([
    pl.Series("c", ["2", "3"], pl.String),
    pl.Series("d", [2, 2], pl.Int32),
])


df1.join(
    df2,
    left_on=["a", pl.lit(2), "b", pl.lit("foo")],
    right_on=["c", "d", pl.lit(1), pl.lit("foo")],
    how="left",
)

shape: (2, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ d    │
│ --- ┆ --- ┆ ---  │
│ str ┆ i32 ┆ i32  │
╞═════╪═════╪══════╡
│ 1   ┆ 1   ┆ null │
│ 2   ┆ 1   ┆ 2    │
└─────┴─────┴──────┘

Debug output (eprintln): Dropping names in right df: ["c"]

We dropped c here because a and c are not computed expressions: they already exist as columns in the left and right dataframes, respectively.
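The rule can be written down as a small sketch. This reflects my reading of the condition (a right-hand key is dropped only when both sides of the pair are plain columns) and is not the code in the PR; `Col`, `Lit`, and `right_names_to_drop` are hypothetical stand-ins for the real expression types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    """A plain reference to a column that already exists in the frame."""
    name: str


@dataclass(frozen=True)
class Lit:
    """A literal; stands in for any computed (non-column) expression."""
    value: object


def right_names_to_drop(left_on, right_on):
    """Names of right-hand key columns to leave out of the join result."""
    return [
        right.name
        for left, right in zip(left_on, right_on)
        if isinstance(left, Col) and isinstance(right, Col)
    ]


# The example directly above: only the ("a", "c") pair is column-to-column.
print(right_names_to_drop(
    [Col("a"), Lit(2), Col("b"), Lit("foo")],
    [Col("c"), Col("d"), Lit(1), Lit("foo")],
))  # ['c']

# The earlier example: "b" is dropped, "c" is kept because its partner is a literal.
print(right_names_to_drop(
    [Col("a"), Lit("c")],
    [Col("b"), Col("c")],
))  # ['b']
```

Under this reading `d` stays in the result because its left-hand partner is a literal, so its values are not already available from the left frame.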

I hope these examples are not too confusing 😄

One final thing: if we are not including the left_on/right_on series in the result, do we still need to disallow aliases?

Thanks!

edavisau added 6 commits January 26, 2024 10:06

extract inner function from with_column

ce7d3ed

add todos

cd836f6

refactor

32d08fd

add todo

5a9887a

drop columns conditionally in left join

bf37ec5

clippy

b68f68e

ritchie46 force-pushed the main branch from 0a696ff to 9c29683 Compare July 28, 2024 08:11
