fix: Fix duplicate column names after join if suffix already present #21315

nameexhaustion · 2025-02-18T10:10:12Z

Fixes bug: join may result in duplicate column names #21048

Updates IR resolution to use the (newly added) Schema::try_insert() instead of Schema::with_column(), so that duplicate column names raise an error.

Note that many more places in the codebase should probably use try_insert() instead of with_column().
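To make the difference concrete, here is a minimal pure-Python sketch of the idea. It is not the polars code: the `Schema` class, `DuplicateError`, and `resolve_join_schema` are hypothetical stand-ins, and the join model assumes a coalescing join on a single key with the default "_right" suffix.

```python
class DuplicateError(Exception):
    """Hypothetical stand-in for the error raised on a duplicate column name."""


class Schema:
    """Toy ordered schema mapping column name -> dtype (not the polars type)."""

    def __init__(self, columns=None):
        self._columns = dict(columns or {})

    def names(self):
        return list(self._columns)

    def with_column(self, name, dtype):
        # Overwrites silently when `name` already exists.
        self._columns[name] = dtype

    def try_insert(self, name, dtype):
        # Refuses to overwrite, so a name collision surfaces as an error.
        if name in self._columns:
            raise DuplicateError(f"column with name '{name}' already exists")
        self._columns[name] = dtype


def resolve_join_schema(left, right, on, suffix="_right"):
    """Simplified output schema of a coalescing join on the key `on`."""
    out = Schema(left)
    for name, dtype in right.items():
        if name == on:
            continue  # the coalesced key column comes from the left side
        if name in left:
            name = name + suffix  # clash with a left column: apply the suffix
        out.try_insert(name, dtype)  # raises if the suffixed name also exists
    return out
```

In this model, joining a left table that already has a "b_right" column with a right table that has "b" raises DuplicateError, whereas a with_column-style insert would have overwritten the existing entry without complaint.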

nameexhaustion · 2025-02-18T12:09:43Z

crates/polars-plan/src/plans/schema.rs

+        // will result in
+        //
+        // df(cols=[B, A, B_right])
+        JoinType::Right if options.args.should_coalesce() => {


drive-by - we only need the dedicated right-join schema resolution branch for coalesce=True

nameexhaustion · 2025-02-18T12:34:05Z

Edit: outdated, see further comments.

~~Note that as a side effect of logic to prevent duplicate errors, this changes the output when joining on unnamed relations:~~

import polars as pl

df = pl.DataFrame({"a": 1, "b": 1})

with pl.SQLContext({"df1": df, "df2": df, "df3": df}) as ctx:
    q = ctx.execute("""\
SELECT *
FROM df1
JOIN (df2 JOIN df3 ON df2.a = df3.a) ON df1.a = df2.a
""")
    print(q.collect())

# Before
# shape: (1, 6)
# ┌─────┬─────┬─────┬─────┬───────┬───────┐
# │ a   ┆ b   ┆ a:  ┆ b:  ┆ a:df3 ┆ b:df3 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ---   ┆ ---   │
# │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64   ┆ i64   │
# ╞═════╪═════╪═════╪═════╪═══════╪═══════╡
# │ 1   ┆ 1   ┆ 1   ┆ 1   ┆ 1     ┆ 1     │
# └─────┴─────┴─────┴─────┴───────┴───────┘
# After
# shape: (1, 6)
# ┌─────┬─────┬───────────────────┬───────────────────┬───────┬───────┐
# │ a   ┆ b   ┆ a:__UNNAMED_TBL_0 ┆ b:__UNNAMED_TBL_0 ┆ a:df3 ┆ b:df3 │
# │ --- ┆ --- ┆ ---               ┆ ---               ┆ ---   ┆ ---   │
# │ i64 ┆ i64 ┆ i64               ┆ i64               ┆ i64   ┆ i64   │
# ╞═════╪═════╪═══════════════════╪═══════════════════╪═══════╪═══════╡
# │ 1   ┆ 1   ┆ 1                 ┆ 1                 ┆ 1     ┆ 1     │
# └─────┴─────┴───────────────────┴───────────────────┴───────┴───────┘

@alexander-beedie does this look okay to you?

codecov · 2025-02-18T12:39:31Z

Codecov Report

Attention: Patch coverage is 92.62295% with 9 lines in your changes missing coverage. Please review.

Project coverage is 79.92%. Comparing base (de1d9d5) to head (ca2e92f).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/polars-sql/src/context.rs	68.75%	5 Missing ⚠️
crates/polars-plan/src/plans/schema.rs	94.20%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #21315      +/-   ##
==========================================
+ Coverage   79.85%   79.92%   +0.07%     
==========================================
  Files        1596     1596              
  Lines      228642   228716      +74     
  Branches     2615     2615              
==========================================
+ Hits       182571   182801     +230     
+ Misses      45473    45317     -156     
  Partials      598      598


alexander-beedie · 2025-02-18T12:47:59Z

> @alexander-beedie does this look okay to you?

Hmm... it might actually be a better idea to raise an error here, to ensure that reasonable aliasing exists for subsequent selections/operations. Something like: polars_bail!(SQLInterface: "cannot join on unnamed relation; please provide an alias") 🤔

Some SQL dialects allow joining on unnamed relations, but others don't (I think Oracle is one of the latter, IIRC), and those that do have different ways to resolve the resulting column names; duckdb can return duplicate columns, for example...

import duckdb as dd

dd.sql("""
 SELECT * FROM df1
 JOIN (df2 JOIN df3 ON df2.a = df3.a) ON df1.a = df2.a
""")
# ┌───────┬───────┬───────┬───────┬───────┬───────┐
# │   a   │   b   │   a   │   b   │   a   │   b   │
# │ int64 │ int64 │ int64 │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┼───────┼───────┼───────┤
# │     1 │     1 │     1 │     1 │     1 │     1 │
# └───────┴───────┴───────┴───────┴───────┴───────┘

...or:

dd.sql("""
  SELECT * FROM df1
  JOIN (df2 JOIN df3 ON df2.a = df3.a) AS dfx ON df1.a = dfx.a
""")
# ┌───────┬───────┬───────┬───────┬───────┬───────┐
# │   a   │   b   │   a   │   b   │  a_1  │  b_1  │
# │ int64 │ int64 │ int64 │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┼───────┼───────┼───────┤
# │     1 │     1 │     1 │     1 │     1 │     1 │
# └───────┴───────┴───────┴───────┴───────┴───────┘

If we raise on unnamed join relations, we'll get clearer output when the SQL is adjusted to include an alias (like so):

pl.sql("""
  SELECT * FROM df1
  JOIN (df2 JOIN df3 ON df2.a = df3.a) AS dfx ON df1.a = dfx.a
""").collect()
# shape: (1, 6)
# ┌─────┬─────┬───────┬───────┬───────┬───────┐
# │ a   ┆ b   ┆ a:dfx ┆ b:dfx ┆ a:df3 ┆ b:df3 │
# │ --- ┆ --- ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
# │ i64 ┆ i64 ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
# ╞═════╪═════╪═══════╪═══════╪═══════╪═══════╡
# │ 1   ┆ 1   ┆ 1     ┆ 1     ┆ 1     ┆ 1     │
# └─────┴─────┴───────┴───────┴───────┴───────┘
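As a rough illustration of why an alias matters here, the sketch below models the column naming seen in the outputs above. It is a toy model, not the polars-sql implementation: `join_columns` and the local `SQLInterfaceError` class are hypothetical, and the rule that clashing right-side columns get a ":<alias>" suffix is inferred from the printed column names.

```python
class SQLInterfaceError(Exception):
    """Local stand-in for the error kind named in the suggestion above."""


def join_columns(left, right, right_alias):
    """Toy column naming for a SQL join (hypothetical helper)."""
    if not right_alias:
        # Without an alias the suffix would be a bare ":", giving names like "a:".
        raise SQLInterfaceError(
            "cannot join on unnamed relation; please provide an alias"
        )
    out = list(left)
    for name in right:
        # Right-side names that clash with the left get a ":<alias>" suffix.
        out.append(f"{name}:{right_alias}" if name in left else name)
    return out


# (df2 JOIN df3) AS dfx, then df1 JOIN dfx
inner = join_columns(["a", "b"], ["a", "b"], "df3")
outer = join_columns(["a", "b"], inner, "dfx")
print(outer)
```

Passing an empty alias for the outer join raises instead of producing the ambiguous "a:" and "b:" names.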

nameexhaustion · 2025-02-18T19:56:39Z

Thanks @alexander-beedie. I also like the clearer output from raising, so I've decided on that 🙂

c

229d965

nameexhaustion changed the title ~~fix: Fix duplicate column names after join if suffixed already present~~ fix: Fix duplicate column names after join if suffix already present Feb 18, 2025

github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Feb 18, 2025

nameexhaustion added 6 commits February 18, 2025 21:29

c

d0e35ff

c

592c986

c

62fc47a

c

48e9229

fix sql 🥵

90999f2

uncomment test

b7f9dec


nameexhaustion added 3 commits February 18, 2025 23:10

c

8e2be20

c

3ae482c

c

c3d6cfe

nameexhaustion added 4 commits February 19, 2025 06:12

c

7e36e51

add test forbid join unnamed

4d4f86f

adjust join perf test

53d1474

c

ca2e92f

nameexhaustion marked this pull request as ready for review February 18, 2025 19:57

nameexhaustion requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli, reswqa and orlp as code owners February 18, 2025 19:57

ritchie46 merged commit 109cc7e into pola-rs:main Feb 19, 2025
27 checks passed

cmdlineluser mentioned this pull request Feb 24, 2025

1.23 join, duplicate column regression #21448

Closed

2 tasks
