Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: check ambiguous column reference #12467

Merged
merged 1 commit into from
Sep 19, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions datafusion/common/src/dfschema.rs
Original file line number Diff line number Diff line change
Expand Up @@ -412,6 +412,33 @@ impl DFSchema {
}
}

/// Check whether the column reference is ambiguous
pub fn check_ambiguous_name(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the newly introduced functions qualified_field_with_qualified_name and qualified_fields_with_qualified_name do more than what we need. They might be unnecessary.
If I understand correctly, this function can be implemented as follows:

pub fn check_ambiguous_name(
        &self,
        qualifier: Option<&TableReference>,
        name: &str,
    ) -> Result<()> {
        let count = self
            .iter()
            .filter(|(field_q, f)| match (field_q, qualifier) {
                (Some(q1), Some(q2)) => q1.resolved_eq(q2) && f.name() == name,
                (None, None) => f.name() == name,
                _ => false,
            })
            .take(2)
            .count();
        if count > 1 {
            _schema_err!(SchemaError::AmbiguousReference {
                field: Column {
                    relation: None,
                    name: name.to_string(),
                },
            })
        } else {
            Ok(())
        }
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we no longer need to add these two functions🤔, qualified_field_with_qualified_name and qualified_fields_with_qualified_name, along with the modifications to qualified_field_with_name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions will check the ambiguous column reference, it's necessary, or we could not identify the case below:

create table t1(v1 int);

select count(*)
from t1
right outer join t1
on t1.v1 > 0;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need add check_ambiguous_name at L132.

The changes to qualified_field_with_name will cause the following query to produce an incorrect result.

DataFusion CLI v41.0.0
> create table t1(v1 int) as values(100);
0 row(s) fetched.

> select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);
+--------+
| t1[v1] |
+--------+
| foo    |
+--------+

On the main branch, it gives

> select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);
+-----+
| v1  |
+-----+
| 100 |
+-----+

This is because it breaks search_dfschema, making it unable to find t1.v1, but instead found the nested column t1.

Copy link
Contributor Author

@HuSen8891 HuSen8891 Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The column reference 't1.v1' is ambiguous in query.

select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);

Fix it later and also add to test case.

&self,
qualifier: Option<&TableReference>,
name: &str,
) -> Result<()> {
let count = self
.iter()
.filter(|(field_q, f)| match (field_q, qualifier) {
(Some(q1), Some(q2)) => q1.resolved_eq(q2) && f.name() == name,
(None, None) => f.name() == name,
_ => false,
})
.take(2)
.count();
if count > 1 {
_schema_err!(SchemaError::AmbiguousReference {
field: Column {
relation: None,
name: name.to_string(),
},
})
} else {
Ok(())
}
}

/// Find the qualified field with the given name
pub fn qualified_field_with_name(
&self,
Expand Down
12 changes: 11 additions & 1 deletion datafusion/sql/src/expr/identifier.rs
Original file line number Diff line number Diff line change
Expand Up @@ -132,7 +132,12 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
nested_names,
) {
match planner_result {
PlannerResult::Planned(expr) => return Ok(expr),
PlannerResult::Planned(expr) => {
// sanity check on column
schema
.check_ambiguous_name(qualifier, field.name())?;
return Ok(expr);
}
PlannerResult::Original(_args) => {}
}
}
Expand All @@ -143,6 +148,8 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
}
// found matching field with no spare identifier(s)
Some((field, qualifier, _nested_names)) => {
// sanity check on column
schema.check_ambiguous_name(qualifier, field.name())?;
Ok(Expr::Column(Column::from((qualifier, field))))
}
None => {
Expand Down Expand Up @@ -186,6 +193,9 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
let s = &ids[0..ids.len()];
// safe unwrap as s can never be empty or exceed the bounds
let (relation, column_name) = form_identifier(s).unwrap();
// sanity check on column
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the comment says, how about adding a check_ambiguous_qualified_name method to DFSchema?
I think this would be more intuitive and simpler. The code here can be changed to the following.

// sanity check on column
schema.check_ambiguous_qualified_name(relation.as_ref(), column_name)?;
Ok(Expr::Column(Column::new(relation, column_name)))

check_ambiguous_qualified_name is used to check whether the specified qualified name appears multiple times in the schema.

I suspect that we should not allow two identical qualified fields within a schema, but this would violate PR 6091. Therefore, introducing a method like check_ambiguous_qualified_name to be called when needed might be a workaround.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That‘s good idea.
Add method 'check_ambiguous_name' to check ambiguous column reference.

schema
.check_ambiguous_name(relation.as_ref(), column_name)?;
Ok(Expr::Column(Column::new(relation, column_name)))
}
}
Expand Down
17 changes: 17 additions & 0 deletions datafusion/sqllogictest/test_files/join.slt
Original file line number Diff line number Diff line change
Expand Up @@ -1209,3 +1209,20 @@ drop table t1;

statement ok
drop table t2;

# Test SQLancer issue: https://github.com/apache/datafusion/issues/12337
statement ok
create table t1(v1 int) as values(100);

## Query with Ambiguous column reference
query error DataFusion error: Schema error: Ambiguous reference to unqualified field v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be an "Ambiguous reference to qualified field", but currently the SchemaError cannot describe it, so the current situation is acceptable to me.

select count(*)
from t1
right outer join t1
on t1.v1 > 0;

query error DataFusion error: Schema error: Ambiguous reference to unqualified field v1
select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


statement ok
drop table t1;