feat: Support duplicate column names in Substrait consumer #11048

Blizzara · 2024-06-21T12:38:52Z

Which issue does this PR close?

Closes #10815 , and also adds support for joining two VirtualTables with overlapping column names.

Rationale for this change

Substrait doesn't contain aliases inside the plan: neither table aliases (SubqueryAlias), nor column aliases. Columns are only named at the beginning and end of the plan, while tables are not named by substrait at all. Substrait referes to columns only by their relative indices, so that's name-independent.

When we consumer Substrait plans' ReadRels, the created DataFusion tables may or may not have qualifiers. If the tables are NamedTable reads, then DataFusion uses the name as the qualifier. If the tables are VirtualTables, then we have unqualified tables.

This causes trouble in some cases for DataFusion, as DF requires columns to be uniquely named in at least two places:

when joining schemas, in DFSchema::check_names. This throws when there are two duplicate unqualified names, or if a unqualified name matches the name of a qualified name. This was problematic for joining VirtualTables since they would be unqualified.
Interestingly, the (1) does NOT throw when there are two duplicate qualified names. That's why the existing tests were working. However, this results in a schema that contains duplicate fields, which then fail if those fields are actually trying to be used - that's Projects require unique expressions names error in substrait producer/consumer #10815

What changes are included in this PR?

In Substrait consumer, we now qualify all tables resulting from a ReadRel, with a pseudo-qualifier if there is no qualifier originally. Also, we maintain a set of qualifiers we've seen so far, and we re-qualify tables with an incremental postfix if the original qualifier has been seen already.

This ensures all tables have unique qualifiers at least at the beginning of the plan.

Note: there is probably a chance this doesn't fix cases where a project happens before the join, since the project may lose the qualifiers and that may cause duplicates again. An alternative solution could be to check the names when creating the join in Substrait consumer, and (re-)qualify one or both sides of the join at that point in case of conflicts.

Are these changes tested?

Tested by some stricter roundtrip unit tests, + adapting existing tests.

Are there any user-facing changes?

Plans produced by Substrait consumer may get different qualifiers for columns. I think that's unlikely to affect anyone since I dunno why people would be looking at those qualifiers.

Otherwise just improved support for Substrait plans.

Blizzara · 2024-06-21T14:10:04Z

Closing in favor of #11049

Blizzara added 2 commits June 21, 2024 13:55

add tests for joining tables with same name

00f01bb

implement unique qualifiers for tables

a6b52ec

github-actions bot added the substrait label Jun 21, 2024

fix tests

1b1b300

Blizzara mentioned this pull request Jun 21, 2024

feat: Support duplicate column names in Joins in Substrait consumer #11049

Merged

Blizzara closed this Jun 21, 2024

Blizzara deleted the avo/substrait-uniquely-qualify-tables-for-joins branch June 28, 2024 07:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Support duplicate column names in Substrait consumer #11048

feat: Support duplicate column names in Substrait consumer #11048

Blizzara commented Jun 21, 2024 •

edited

Loading

Blizzara commented Jun 21, 2024

feat: Support duplicate column names in Substrait consumer #11048

feat: Support duplicate column names in Substrait consumer #11048

Conversation

Blizzara commented Jun 21, 2024 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Blizzara commented Jun 21, 2024

Blizzara commented Jun 21, 2024 •

edited

Loading