Joining on expression can override column value #8874

Open
2 tasks done
thomasaarholt opened this issue May 16, 2023 · 1 comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@thomasaarholt
Contributor

thomasaarholt commented May 16, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

In the following Polars code, I expect the join expression pl.col("two_coords").list.first() to only tell Polars which rows match; it should not influence the value of any column in the resulting dataframe (IMO?). Instead, the two_coords column is overwritten with the value of that expression.

I'm assuming this is related to how with_columns behaves: the result of an expression replaces the first column named in that expression, unless the expression is given an alias.

Reproducible example

import polars as pl
df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "two_coords": [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
    }
)
df_allowed_left_coords = pl.DataFrame(
    {
        "allowed_first_coord": [1.0, 3.0]
    }
)
df.join(
    df_allowed_left_coords,
    how="inner",
    left_on=pl.col("two_coords").list.first(), # `two_coords` incorrectly takes on this value
    right_on="allowed_first_coord",
)
# results in:
# pl.DataFrame(
#     {
#         'id': [1, 2],
#         'two_coords': [1.0, 3.0]
#     }
# )


df_expected = pl.DataFrame(
    {
        "id": [1, 2],
        "two_coords": [(1.0, 2.0), (3.0, 4.0)]
    }
)

Expected behavior

df_expected as above, where only the first two rows are kept, with both values intact in two_coords.

Installed versions

--------Version info---------
Polars:      0.17.13
Index type:  UInt32
Platform:    macOS-13.3.1-arm64-arm-64bit
Python:      3.11.1 (main, Mar  2 2023, 10:47:50) [Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
numpy:       1.24.2
pandas:      1.5.3
pyarrow:     12.0.0
connectorx:  <not installed>
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  3.7.1
xlsx2csv:    <not installed>
xlsxwriter:  <not installed>
@thomasaarholt thomasaarholt added bug Something isn't working python Related to Python Polars labels May 16, 2023
@havspect

havspect commented Aug 1, 2023

Hi,

The bug is still present in the latest Polars version. I came across the same problem, and while it is easy to work around (e.g., by creating the join key with with_columns and dropping that column afterward), the behavior is highly unexpected.

In my opinion, two possible fixes exist:

  1. Include a warning in the documentation. The warning should state that passing an expression as a join key overwrites the column that the expression refers to.
  2. Fix the bug within the code. Here I am not quite certain where to start.
