Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix duplicate column names after join if suffix already present #21315

Merged
merged 14 commits into from
Feb 19, 2025

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Feb 18, 2025

Updates to use (newly added) Schema::try_insert() instead of Schema::with_column during IR resolution to catch duplicate errors.

Note, there are a lot more places in the codebase that should probably be using try_insert() instead of with_column.

@nameexhaustion nameexhaustion changed the title fix: Fix duplicate column names after join if suffixed already present fix: Fix duplicate column names after join if suffix already present Feb 18, 2025
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Feb 18, 2025
// will result in
//
// df(cols=[B, A, B_right])
JoinType::Right if options.args.should_coalesce() => {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive-by - we only need the dedicated right-join schema resolution branch for coalesce=True

@nameexhaustion
Copy link
Collaborator Author

nameexhaustion commented Feb 18, 2025

*Edit - outdated, see further comments

Note that as a side effect of logic to prevent duplicate errors, this changes the output when joining on unnamed relations:

import polars as pl

df = pl.DataFrame({"a": 1, "b": 1})

with pl.SQLContext({"df1": df, "df2": df, "df3": df}) as ctx:
    q = ctx.execute("""\
SELECT *
FROM df1
JOIN (df2 JOIN df3 ON df2.a = df3.a) ON df1.a = df2.a
""")
    print(q.collect())

# Before
# shape: (1, 6)
# ┌─────┬─────┬─────┬─────┬───────┬───────┐
# │ a   ┆ b   ┆ a:  ┆ b:  ┆ a:df3 ┆ b:df3 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ---   ┆ ---   │
# │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64   │
# ╞═════╪═════╪═════╪═════╪═══════╪═══════╡
# │ 1   ┆ 1   ┆ 1   ┆ 1   ┆ 1     ┆ 1     │
# └─────┴─────┴─────┴─────┴───────┴───────┘
# After
# shape: (1, 6)
# ┌─────┬─────┬───────────────────┬───────────────────┬───────┬───────┐
# │ a   ┆ b   ┆ a:__UNNAMED_TBL_0 ┆ b:__UNNAMED_TBL_0 ┆ a:df3 ┆ b:df3 │
# │ --- ┆ --- ┆ ---               ┆ ---               ┆ ---   ┆ ---   │
# │ i64 ┆ i64 ┆ i64               ┆ i64               ┆ i64   ┆ i64   │
# ╞═════╪═════╪═══════════════════╪═══════════════════╪═══════╪═══════╡
# │ 1   ┆ 1   ┆ 1                 ┆ 1                 ┆ 1     ┆ 1     │
# └─────┴─────┴───────────────────┴───────────────────┴───────┴───────┘

@alexander-beedie does this look okay to you?

Copy link

codecov bot commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 92.62295% with 9 lines in your changes missing coverage. Please review.

Project coverage is 79.92%. Comparing base (de1d9d5) to head (ca2e92f).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-sql/src/context.rs 68.75% 5 Missing ⚠️
crates/polars-plan/src/plans/schema.rs 94.20% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #21315      +/-   ##
==========================================
+ Coverage   79.85%   79.92%   +0.07%     
==========================================
  Files        1596     1596              
  Lines      228642   228716      +74     
  Branches     2615     2615              
==========================================
+ Hits       182571   182801     +230     
+ Misses      45473    45317     -156     
  Partials      598      598              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Feb 18, 2025

@alexander-beedie does this look okay to you?

Hmm... it might actually be a better idea to raise an error here, to ensure that reasonable aliasing exists for subsequent selections/operations. Something like: polars_bail!(SQLInterface: "cannot join on unnamed relation; please provide an alias") 🤔

Some SQL dialects allow joining on unnamed relations, but others don't (I think Oracle is one of the latter, IIRC), and those that do have different ways to resolve the resulting column names; duckdb can return duplicate columns, for example...

import duckdb as dd

dd.sql("""
 SELECT * FROM df1
 JOIN (df2 JOIN df3 ON df2.a = df3.a) ON df1.a = df2.a
""")
# ┌───────┬───────┬───────┬───────┬───────┬───────┐
# │   a   │   b   │   a   │   b   │   a   │   b   │
# │ int64 │ int64 │ int64 │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┼───────┼───────┼───────┤
# │     1 │     1 │     1 │     1 │     1 │     1 │
# └───────┴───────┴───────┴───────┴───────┴───────┘

...or:

dd.sql("""
  SELECT * FROM df1
  JOIN (df2 JOIN df3 ON df2.a = df3.a) AS dfx ON df1.a = dfx.a
""")
# ┌───────┬───────┬───────┬───────┬───────┬───────┐
# │   a   │   b   │   a   │   b   │  a_1  │  b_1  │
# │ int64 │ int64 │ int64 │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┼───────┼───────┼───────┤
# │     1 │     1 │     1 │     1 │     1 │     1 │
# └───────┴───────┴───────┴───────┴───────┴───────┘

If we raise on unnamed join relations, we'll get clearer output when the SQL is adjusted to include an alias (like so):

pl.sql("""
  SELECT * FROM df1
  JOIN (df2 JOIN df3 ON df2.a = df3.a) AS dfx ON df1.a = dfx.a
""").collect()
# shape: (1, 6)
# ┌─────┬─────┬───────┬───────┬───────┬───────┐
# │ a   ┆ b   ┆ a:dfx ┆ b:dfx ┆ a:df3 ┆ b:df3 │
# │ --- ┆ --- ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
# │ i64 ┆ i64 ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
# ╞═════╪═════╪═══════╪═══════╪═══════╪═══════╡
# │ 1   ┆ 1   ┆ 1     ┆ 1     ┆ 1     ┆ 1     │
# └─────┴─────┴───────┴───────┴───────┴───────┘

@nameexhaustion
Copy link
Collaborator Author

Thanks @alexander-beedie. I also like the clearer output from raising, so I've decided on that 🙂

@ritchie46 ritchie46 merged commit 109cc7e into pola-rs:main Feb 19, 2025
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: join may result in duplicate column names
3 participants