-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Fix duplicate column names after join if suffix already present #21315
Conversation
// will result in | ||
// | ||
// df(cols=[B, A, B_right]) | ||
JoinType::Right if options.args.should_coalesce() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive-by - we only need the dedicated right-join schema resolution branch for coalesce=True
*Edit - outdated, see further comments
import polars as pl
df = pl.DataFrame({"a": 1, "b": 1})
with pl.SQLContext({"df1": df, "df2": df, "df3": df}) as ctx:
q = ctx.execute("""\
SELECT *
FROM df1
JOIN (df2 JOIN df3 ON df2.a = df3.a) ON df1.a = df2.a
""")
print(q.collect())
# Before
# shape: (1, 6)
# ┌─────┬─────┬─────┬─────┬───────┬───────┐
# │ a ┆ b ┆ a: ┆ b: ┆ a:df3 ┆ b:df3 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═════╪═══════╪═══════╡
# │ 1 ┆ 1 ┆ 1 ┆ 1 ┆ 1 ┆ 1 │
# └─────┴─────┴─────┴─────┴───────┴───────┘
# After
# shape: (1, 6)
# ┌─────┬─────┬───────────────────┬───────────────────┬───────┬───────┐
# │ a ┆ b ┆ a:__UNNAMED_TBL_0 ┆ b:__UNNAMED_TBL_0 ┆ a:df3 ┆ b:df3 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═══════════════════╪═══════════════════╪═══════╪═══════╡
# │ 1 ┆ 1 ┆ 1 ┆ 1 ┆ 1 ┆ 1 │
# └─────┴─────┴───────────────────┴───────────────────┴───────┴───────┘ @alexander-beedie does this look okay to you? |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #21315 +/- ##
==========================================
+ Coverage 79.85% 79.92% +0.07%
==========================================
Files 1596 1596
Lines 228642 228716 +74
Branches 2615 2615
==========================================
+ Hits 182571 182801 +230
+ Misses 45473 45317 -156
Partials 598 598 ☔ View full report in Codecov by Sentry. |
Hmm... it might actually be a better idea to raise an error here, to ensure that reasonable aliasing exists for subsequent selections/operations. Something like: Some SQL dialects allow joining on unnamed relations, but others don't (I think Oracle is one of the latter, IIRC), and those that do have different ways to resolve the resulting column names; import duckdb as dd
dd.sql("""
SELECT * FROM df1
JOIN (df2 JOIN df3 ON df2.a = df3.a) ON df1.a = df2.a
""")
# ┌───────┬───────┬───────┬───────┬───────┬───────┐
# │ a │ b │ a │ b │ a │ b │
# │ int64 │ int64 │ int64 │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┼───────┼───────┼───────┤
# │ 1 │ 1 │ 1 │ 1 │ 1 │ 1 │
# └───────┴───────┴───────┴───────┴───────┴───────┘ ...or: dd.sql("""
SELECT * FROM df1
JOIN (df2 JOIN df3 ON df2.a = df3.a) AS dfx ON df1.a = dfx.a
""")
# ┌───────┬───────┬───────┬───────┬───────┬───────┐
# │ a │ b │ a │ b │ a_1 │ b_1 │
# │ int64 │ int64 │ int64 │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┼───────┼───────┼───────┤
# │ 1 │ 1 │ 1 │ 1 │ 1 │ 1 │
# └───────┴───────┴───────┴───────┴───────┴───────┘ If we raise on unnamed join relations, we'll get clearer output when the SQL is adjusted to include an alias (like so): pl.sql("""
SELECT * FROM df1
JOIN (df2 JOIN df3 ON df2.a = df3.a) AS dfx ON df1.a = dfx.a
""").collect()
# shape: (1, 6)
# ┌─────┬─────┬───────┬───────┬───────┬───────┐
# │ a ┆ b ┆ a:dfx ┆ b:dfx ┆ a:df3 ┆ b:df3 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═══════╪═══════╪═══════╪═══════╡
# │ 1 ┆ 1 ┆ 1 ┆ 1 ┆ 1 ┆ 1 │
# └─────┴─────┴───────┴───────┴───────┴───────┘ |
Thanks @alexander-beedie. I also like the clearer output from raising, so I've decided on that 🙂 |
join
may result in duplicate column names #21048Updates to use (newly added)
Schema::try_insert()
instead ofSchema::with_column
during IR resolution to catch duplicate errors.Note, there are a lot more places in the codebase that should probably be using
try_insert()
instead ofwith_column
.