-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
join/on expression mutates column #13220
Comments
It's not really a bug. That's what is intended when you use an expression in the join. The way to get what you want is
The above could be more concise if this one is implemented. Maybe, this could be reformulated as a feature request to allow something like temp_left_on and/or temp_right_on so that it would join by that expression but not keep it. It would only make sense in the context of allowing the retention of right keys. With both features you could then do
|
Oh, ok.
I don't think this feature would be necessary, I'm ok with creating and dropping the temporary column. |
I would argue that this is indeed a bug as it differs from the SQL standard (which is what polars seems to be wanting to align to particularly in the case of joins). The docstring for In SQL, regardless of whether a join key is an expression or just a plain column, it does not mutate the data. I think another example using df_left = pl.DataFrame({"category": ["amount_1", "amount_2"]})
df_right = pl.DataFrame({
"amount": ["1", "2"],
"value": [25, 75]
})
print(
df_left
.join(
df_right,
# I only want to use this expression to join the dfs,
# not mutate data in this column
left_on=pl.col("category").str.extract(r"(\d)"),
right_on="amount"
)
# Now I need to do another join on the `category` column in the left table
# but it has been mutated by `left_on` to something it isn't.
)
# shape: (2, 2)
# ┌──────────┬───────┐
# │ category ┆ value │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞══════════╪═══════╡
# │ 1 ┆ 25 │
# │ 2 ┆ 75 │
# └──────────┴───────┘ Compare this to SQL via duckdb query = r"""
select df_left.*, df_right.value
from df_left
inner join df_right
-- `regexp_extract` is akin to polars `pl.col("...").str.extract`
on regexp_extract(df_left.category, '\d') = df_right.amount
"""
print(duckdb.sql(query))
# ┌──────────┬───────┐
# │ category │ value │
# │ varchar │ int64 │
# ├──────────┼───────┤
# │ amount_1 │ 25 │
# │ amount_2 │ 75 │
# └──────────┴───────┘ Postgres has the same behaviour as duckdb. Polars can't run this query due to "InvalidOperationError: Unsupported SQL join constraint" As a final point as to why I think the current behaviour is buggy, putting the expression in print(
df_right.join(
df_left,
# `amount` doesn't get mutated even though `category` did in the previous example
left_on="amount",
right_on=pl.col("category").str.extract(r"(\d)"),
)
)
# shape: (2, 2)
# ┌────────┬───────┐
# │ amount ┆ value │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞════════╪═══════╡
# │ 1 ┆ 25 │
# │ 2 ┆ 75 │
# └────────┴───────┘ And for good measure, an example like OPs showing df_left = pl.DataFrame({"category": ["amount_1", "amount_2"]})
df_right = pl.DataFrame({
"category": ["1", "2"],
"value": [25, 75]
})
# Again, the `category` is mutated in the left table
print(df_left.join(df_right, on=pl.col("category").str.extract(r"(\d)")))
# shape: (2, 2)
# ┌──────────┬───────┐
# │ category ┆ value │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞══════════╪═══════╡
# │ 1 ┆ 25 │
# │ 2 ┆ 75 │
# └──────────┴───────┘ FYI @stinodego for your thoughts |
Hi @ritchie46 and @stinodego, may I be request your opinion on whether this issue is a bug, or intended bahviour? |
This looks wrong to me. A bit more of a compact repro: from datetime import datetime
import polars as pl
datetimes = [datetime(2022, 1, 1), datetime(2022, 1, 1, 12), datetime(2022, 1, 2)]
df1 = pl.DataFrame({"dt": datetimes, "value": 0})
df2 = pl.DataFrame({"dt": datetimes, "value": 1})
result = df1.join(df2, on=pl.col("dt").dt.date())
print(result)
|
I think this is actually related to our story on I definitely agree it's a bug that expressions in However, when you pass an expression instead of just a column into these arguments, |
I think I will just turn off coalescing if the join expressions are anything other than |
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
Issue description
When joining 2 DataFrames, using expressions as join conditions may mutate the original column values.
Expected behavior
Installed versions
The text was updated successfully, but these errors were encountered: