Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

join/on expression mutates column #13220

Closed
2 tasks done
accer-bndes opened this issue Dec 23, 2023 · 7 comments · Fixed by #17061
Closed
2 tasks done

join/on expression mutates column #13220

accer-bndes opened this issue Dec 23, 2023 · 7 comments · Fixed by #17061
Assignees
Labels
accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars

Comments

@accer-bndes
Copy link

accer-bndes commented Dec 23, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from datetime import datetime

test0 = pl.DataFrame({
    "datetime": pl.datetime_range(datetime(2022, 1, 1), datetime(2022, 1, 2), "1h", eager=True),
    "value": 0,
})

test1 = pl.DataFrame({
    "datetime": pl.datetime_range(datetime(2022, 1, 1), datetime(2022, 1, 2), "1h", eager=True),
    "value": 1,
})

print(test0)
print(test0.join(test1, on=pl.col("datetime").dt.date()))

Log output

shape: (25, 2)
┌─────────────────────┬───────┐
│ datetime            ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2022-01-01 00:00:00 ┆ 0     │
│ 2022-01-01 01:00:00 ┆ 0     │
│ 2022-01-01 02:00:00 ┆ 0     │
│ 2022-01-01 03:00:00 ┆ 0     │
│ …                   ┆ …     │
│ 2022-01-01 21:00:00 ┆ 0     │
│ 2022-01-01 22:00:00 ┆ 0     │
│ 2022-01-01 23:00:00 ┆ 0     │
│ 2022-01-02 00:00:00 ┆ 0     │
└─────────────────────┴───────┘
shape: (577, 3)
┌────────────┬───────┬─────────────┐
│ datetime   ┆ value ┆ value_right │
│ ---        ┆ ---   ┆ ---         │
│ date       ┆ i64   ┆ i64         │
╞════════════╪═══════╪═════════════╡
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ …          ┆ …     ┆ …           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-02 ┆ 0     ┆ 1           │
└────────────┴───────┴─────────────┘

Issue description

When joining 2 DataFrames, using expressions as join conditions may mutate the original column values.

Expected behavior

shape: (577, 4)
┌─────────────────────┬───────┬─────────────────────┬─────────────┐
│ datetime            ┆ value ┆ datetime_right      ┆ value_right │
│ ---                 ┆ ---   ┆ ---                 ┆ ---         │
│ datetime[μs]        ┆ i64   ┆ datetime[μs]        ┆ i64         │
╞═════════════════════╪═══════╪═════════════════════╪═════════════╡
│ 2022-01-01 00:00:00 ┆ 0     ┆ 2022-01-01 00:00:00 ┆ 1           │
│ 2022-01-01 00:00:00 ┆ 0     ┆ 2022-01-01 01:00:00 ┆ 1           │
│ 2022-01-01 00:00:00 ┆ 0     ┆ 2022-01-01 02:00:00 ┆ 1           │
│ 2022-01-01 00:00:00 ┆ 0     ┆ 2022-01-01 03:00:00 ┆ 1           │
│ …                   ┆ …     ┆ …                   ┆ …           │
│ 2022-01-01 23:00:00 ┆ 0     ┆ 2022-01-01 21:00:00 ┆ 1           │
│ 2022-01-01 23:00:00 ┆ 0     ┆ 2022-01-01 22:00:00 ┆ 1           │
│ 2022-01-01 23:00:00 ┆ 0     ┆ 2022-01-01 23:00:00 ┆ 1           │
│ 2022-01-02 00:00:00 ┆ 0     ┆ 2022-01-02 00:00:00 ┆ 1           │
└─────────────────────┴───────┴─────────────────────┴─────────────┘

Installed versions

--------Version info---------
Polars:               0.20.2
Index type:           UInt32
Platform:             Linux-6.6.7-zen1-1-zen-x86_64-with-glibc2.38
Python:               3.12.0 (main, Oct 12 2023, 16:15:51) [GCC 13.2.1 20230801]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               2.1.4
pyarrow:              <not installed>
pydantic:             2.5.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@accer-bndes accer-bndes added bug Something isn't working python Related to Python Polars labels Dec 23, 2023
@deanm0000
Copy link
Collaborator

It's not really a bug. That's what is intended when you use an expression in the join. The way to get what you want is

(test0
      .with_columns(
          pl.col("datetime").dt.date()
          )
       .join(
           test1
           .with_columns(
               pl.col("datetime").dt.date(),
               datetime_right="datetime"
           ), 
                on="datetime")
)

The above could be more concise if this one is implemented.

Maybe, this could be reformulated as a feature request to allow something like temp_left_on and/or temp_right_on so that it would join by that expression but not keep it.

It would only make sense in the context of allowing the retention of right keys.

With both features you could then do

test0.join(test1, left_on=pl.col("datetime").dt.date(),
                              temp_right_on=pl.col("datetime").dt.date(),
                               keep_right=True
)

@deanm0000 deanm0000 added enhancement New feature or an improvement of an existing feature and removed bug Something isn't working labels Dec 23, 2023
@accer-bndes
Copy link
Author

accer-bndes commented Dec 24, 2023

Oh, ok.
That workaround loses the time from the left dataframe, but I know I can:

date_tmp = pl.col("datetime").dt.date().alias("date")

test0.with_columns(date_tmp).join(
    test1.with_columns(date_tmp), on="date"
).drop("date")

I don't think this feature would be necessary, I'm ok with creating and dropping the temporary column.
Just didn't expect on to change the original column... maybe a documentation issue?

@henryharbeck
Copy link
Contributor

I would argue that this is indeed a bug as it differs from the SQL standard (which is what polars seems to be wanting to align to particularly in the case of joins). The docstring for join even says "Join in SQL-like fashion."

In SQL, regardless of whether a join key is an expression or just a plain column, it does not mutate the data.

I think another example using left_on and right_on really highlights why I believe this is a bug

df_left = pl.DataFrame({"category": ["amount_1", "amount_2"]})
df_right = pl.DataFrame({
    "amount": ["1", "2"],
    "value": [25, 75]
})

print(
    df_left
    .join(
        df_right,
        # I only want to use this expression to join the dfs,
        # not mutate data in this column
        left_on=pl.col("category").str.extract(r"(\d)"),
        right_on="amount"
    )
    # Now I need to do another join on the `category` column in the left table
    # but it has been mutated by `left_on` to something it isn't.
)

# shape: (2, 2)
# ┌──────────┬───────┐
# │ category ┆ value │
# │ ---      ┆ ---   │
# │ str      ┆ i64   │
# ╞══════════╪═══════╡
# │ 1        ┆ 25    │
# │ 2        ┆ 75    │
# └──────────┴───────┘

Compare this to SQL via duckdb

query = r"""
select df_left.*, df_right.value
from df_left
inner join df_right
-- `regexp_extract` is akin to polars `pl.col("...").str.extract`
on regexp_extract(df_left.category, '\d') = df_right.amount
"""

print(duckdb.sql(query))

# ┌──────────┬───────┐
# │ category │ value │
# │ varchar  │ int64 │
# ├──────────┼───────┤
# │ amount_1 │    25 │
# │ amount_2 │    75 │
# └──────────┴───────┘

Postgres has the same behaviour as duckdb. Polars can't run this query due to "InvalidOperationError: Unsupported SQL join constraint"

As a final point as to why I think the current behaviour is buggy, putting the expression in right_on expectedly won't mutate anything (but left_on does). Although the output here is expected, it does highlight the inconsistency

print(
    df_right.join(
        df_left,
        # `amount` doesn't get mutated even though `category` did in the previous example
        left_on="amount",
        right_on=pl.col("category").str.extract(r"(\d)"),
    )
)

# shape: (2, 2)
# ┌────────┬───────┐
# │ amount ┆ value │
# │ ---    ┆ ---   │
# │ str    ┆ i64   │
# ╞════════╪═══════╡
# │ 1      ┆ 25    │
# │ 2      ┆ 75    │
# └────────┴───────┘

And for good measure, an example like OPs showing on mutating the data in the left table

df_left = pl.DataFrame({"category": ["amount_1", "amount_2"]})
df_right = pl.DataFrame({
    "category": ["1", "2"],
    "value": [25, 75]
})

# Again, the `category` is mutated in the left table
print(df_left.join(df_right, on=pl.col("category").str.extract(r"(\d)")))

# shape: (2, 2)
# ┌──────────┬───────┐
# │ category ┆ value │
# │ ---      ┆ ---   │
# │ str      ┆ i64   │
# ╞══════════╪═══════╡
# │ 1        ┆ 25    │
# │ 2        ┆ 75    │
# └──────────┴───────┘

FYI @stinodego for your thoughts

@henryharbeck
Copy link
Contributor

Hi @ritchie46 and @stinodego, may I be request your opinion on whether this issue is a bug, or intended bahviour?

@stinodego
Copy link
Member

stinodego commented Jan 8, 2024

This looks wrong to me.

A bit more of a compact repro:

from datetime import datetime

import polars as pl

datetimes = [datetime(2022, 1, 1), datetime(2022, 1, 1, 12), datetime(2022, 1, 2)]
df1 = pl.DataFrame({"dt": datetimes, "value": 0})
df2 = pl.DataFrame({"dt": datetimes, "value": 1})
result = df1.join(df2, on=pl.col("dt").dt.date())
print(result)
shape: (5, 3)
┌────────────┬───────┬─────────────┐
│ dt         ┆ value ┆ value_right │
│ ---        ┆ ---   ┆ ---         │
│ date       ┆ i64   ┆ i64         │
╞════════════╪═══════╪═════════════╡
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-01 ┆ 0     ┆ 1           │
│ 2022-01-02 ┆ 0     ┆ 1           │
└────────────┴───────┴─────────────┘

@stinodego stinodego added the accepted Ready for implementation label Jan 8, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 8, 2024
@stinodego stinodego added bug Something isn't working P-high Priority: high and removed enhancement New feature or an improvement of an existing feature labels Jan 8, 2024
@stinodego stinodego moved this from Ready to Candidate in Backlog Jan 8, 2024
@stinodego stinodego removed the accepted Ready for implementation label Jan 12, 2024
@ritchie46 ritchie46 added P-medium Priority: medium and removed P-medium Priority: medium labels Feb 29, 2024
@orlp
Copy link
Collaborator

orlp commented Apr 12, 2024

I think this is actually related to our story on coalesce in joins.

I definitely agree it's a bug that expressions in on, left_on and right_on modify the output, they should not.

However, when you pass an expression instead of just a column into these arguments, coalesce-ing is no longer allowed if the expression isn't an injective function.

@ritchie46 ritchie46 self-assigned this Jun 19, 2024
@ritchie46
Copy link
Member

I think I will just turn off coalescing if the join expressions are anything other than col.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

7 participants