feat: Support duplicate column names in Substrait consumer #11048
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #10815 , and also adds support for joining two VirtualTables with overlapping column names.
Rationale for this change
Substrait doesn't contain aliases inside the plan: neither table aliases (SubqueryAlias), nor column aliases. Columns are only named at the beginning and end of the plan, while tables are not named by substrait at all. Substrait referes to columns only by their relative indices, so that's name-independent.
When we consumer Substrait plans' ReadRels, the created DataFusion tables may or may not have qualifiers. If the tables are NamedTable reads, then DataFusion uses the name as the qualifier. If the tables are VirtualTables, then we have unqualified tables.
This causes trouble in some cases for DataFusion, as DF requires columns to be uniquely named in at least two places:
What changes are included in this PR?
In Substrait consumer, we now qualify all tables resulting from a ReadRel, with a pseudo-qualifier if there is no qualifier originally. Also, we maintain a set of qualifiers we've seen so far, and we re-qualify tables with an incremental postfix if the original qualifier has been seen already.
This ensures all tables have unique qualifiers at least at the beginning of the plan.
Note: there is probably a chance this doesn't fix cases where a project happens before the join, since the project may lose the qualifiers and that may cause duplicates again. An alternative solution could be to check the names when creating the join in Substrait consumer, and (re-)qualify one or both sides of the join at that point in case of conflicts.
Are these changes tested?
Tested by some stricter roundtrip unit tests, + adapting existing tests.
Are there any user-facing changes?
Plans produced by Substrait consumer may get different qualifiers for columns. I think that's unlikely to affect anyone since I dunno why people would be looking at those qualifiers.
Otherwise just improved support for Substrait plans.