Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: check ambiguous column reference #12467

Merged
merged 1 commit into from
Sep 19, 2024
Merged

Conversation

HuSen8891
Copy link
Contributor

Which issue does this PR close?

Closes #12337

Rationale for this change

check ambiguous column reference in query

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Sep 15, 2024
@github-actions github-actions bot removed the core Core DataFusion crate label Sep 16, 2024
@HuSen8891 HuSen8891 force-pushed the datafusion branch 2 times, most recently from fcdee8d to a3ef9a2 Compare September 16, 2024 04:57
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @HuSen8891 -- this is a great contribution

I have one concern about handling Err but otherwise looks good to me.

I think it would be good if someone else (perhaps @jonahgao ) also took a look at this PR to see if we are missing anything else

Comment on lines 484 to 515
match column {
Column {
relation: Some(r),
name: column_name,
} => &r == qq && column_name == name,
_ => false,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to express this logic I would find clearer would be

Suggested change
match column {
Column {
relation: Some(r),
name: column_name,
} => &r == qq && column_name == name,
_ => false,
}
if let Some(r) = column.relation.as_ref() {
r == qq && column.name == name
} else {
false
}

(though this is fine too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!Refine this part by the clearer way.

@@ -186,7 +186,22 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
let s = &ids[0..ids.len()];
// safe unwrap as s can never be empty or exceed the bounds
let (relation, column_name) = form_identifier(s).unwrap();
Ok(Expr::Column(Column::new(relation, column_name)))
// sanity check on column
match schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are trying to avoid matching on Err and converting to Ok because each Err results in at least one String allocation (for the error message)

Since this code is potentially invoked many times during planning the allocations can add up and slow down planning

Is there any way to use a method that doesn't return Err here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result returned is not Err in most cases, we can ignore the additional cost here?

@jonahgao
Copy link
Member

Thanks @HuSen8891. I plan to review it later today.

@@ -186,7 +186,22 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
let s = &ids[0..ids.len()];
// safe unwrap as s can never be empty or exceed the bounds
let (relation, column_name) = form_identifier(s).unwrap();
Ok(Expr::Column(Column::new(relation, column_name)))
// sanity check on column
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the comment says, how about adding a check_ambiguous_qualified_name method to DFSchema?
I think this would be more intuitive and simpler. The code here can be changed to the following.

// sanity check on column
schema.check_ambiguous_qualified_name(relation.as_ref(), column_name)?;
Ok(Expr::Column(Column::new(relation, column_name)))

check_ambiguous_qualified_name is used to check whether the specified qualified name appears multiple times in the schema.

I suspect that we should not allow two identical qualified fields within a schema, but this would violate PR 6091. Therefore, introducing a method like check_ambiguous_qualified_name to be called when needed might be a workaround.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That‘s good idea.
Add method 'check_ambiguous_name' to check ambiguous column reference.

@@ -412,17 +412,32 @@ impl DFSchema {
}
}

/// Check whether the column reference is ambiguous
pub fn check_ambiguous_name(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the newly introduced functions qualified_field_with_qualified_name and qualified_fields_with_qualified_name do more than what we need. They might be unnecessary.
If I understand correctly, this function can be implemented as follows:

pub fn check_ambiguous_name(
        &self,
        qualifier: Option<&TableReference>,
        name: &str,
    ) -> Result<()> {
        let count = self
            .iter()
            .filter(|(field_q, f)| match (field_q, qualifier) {
                (Some(q1), Some(q2)) => q1.resolved_eq(q2) && f.name() == name,
                (None, None) => f.name() == name,
                _ => false,
            })
            .take(2)
            .count();
        if count > 1 {
            _schema_err!(SchemaError::AmbiguousReference {
                field: Column {
                    relation: None,
                    name: name.to_string(),
                },
            })
        } else {
            Ok(())
        }
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we no longer need to add these two functions🤔, qualified_field_with_qualified_name and qualified_fields_with_qualified_name, along with the modifications to qualified_field_with_name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two functions will check the ambiguous column reference, it's necessary, or we could not identify the case below:

create table t1(v1 int);

select count(*)
from t1
right outer join t1
on t1.v1 > 0;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need add check_ambiguous_name at L132.

The changes to qualified_field_with_name will cause the following query to produce an incorrect result.

DataFusion CLI v41.0.0
> create table t1(v1 int) as values(100);
0 row(s) fetched.

> select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);
+--------+
| t1[v1] |
+--------+
| foo    |
+--------+

On the main branch, it gives

> select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);
+-----+
| v1  |
+-----+
| 100 |
+-----+

This is because it breaks search_dfschema, making it unable to find t1.v1, but instead found the nested column t1.

Copy link
Contributor Author

@HuSen8891 HuSen8891 Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The column reference 't1.v1' is ambiguous in query.

select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);

Fix it later and also add to test case.

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me👍 Thanks @HuSen8891 ❤️

create table t1(v1 int) as values(100);

## Query with Ambiguous column reference
query error DataFusion error: Schema error: Ambiguous reference to unqualified field v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be an "Ambiguous reference to qualified field", but currently the SchemaError cannot describe it, so the current situation is acceptable to me.

@jonahgao jonahgao merged commit eaf8d16 into apache:main Sep 19, 2024
24 checks passed
@jonahgao
Copy link
Member

jonahgao commented Sep 19, 2024

We may check ambiguous names during search_dfschema to achieve one pass search. However, before doing this, it would be better to remove the following logic; otherwise, the code in dfschema.rs will become too redundant. After introducing qualified alias, we should no longer need it.

(Some(qq), None) => {
// the original field may now be aliased with a name that matches the
// original qualified name
let column = Column::from_qualified_name(f.name());
match column {
Column {
relation: Some(r),
name: column_name,
} => &r == qq && column_name == name,
_ => false,
}
}

On the other hand, having a separate check_ambiguous_name in the current implementation would be clearer.

Let's merge it first, and then explore other improve possibilities later.

on t1.v1 > 0;

query error DataFusion error: Schema error: Ambiguous reference to unqualified field v1
select t1.v1 from t1 join t1 using(v1) cross join (select struct('foo' as v1) as t1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

A join SQL query with ambiguous table reference executes without error (SQLancer-NoREC)
3 participants