-
Notifications
You must be signed in to change notification settings - Fork 14.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(parsedquery): resolve conflict when aliases and tables have the same name #21535
fix(parsedquery): resolve conflict when aliases and tables have the same name #21535
Conversation
Related to this issue: #21207 |
e205f72
to
42a52b6
Compare
Codecov Report
@@ Coverage Diff @@
## master #21535 +/- ##
==========================================
- Coverage 66.68% 65.34% -1.34%
==========================================
Files 1793 1794 +1
Lines 68538 68687 +149
Branches 7282 7282
==========================================
- Hits 45702 44881 -821
- Misses 20974 21944 +970
Partials 1862 1862
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
81c34cf
to
0a92c17
Compare
0a92c17
to
5a507bb
Compare
Hi @zhaoyongjie @john-bodley @michael-s-molina I'm looking to upgrade to the superset 2.1 release but we've had to make a number of small patches and security fixes to our fork of the 2.0 release patch issues such as this one, which was contributed here by a former member of our team. I'm trying to figure out if this patch is still needed or if a similar fix has made it into the 2.1 release as there hasn't been any activity on the issue or PR here. Are you able to confirm whether the issue described here has been resolved? We'd also be interested in any guidance you can offer in how we can avoid straying too far from what's being worked on by the open source team. What we'd really like to do is work with your team to find sensible ways of discussing and contributing some of the features and short-term fixes we've made back to the project for the benefit of both our organisation and the wider community. As we've been working at a fairly fast pace we've unfortunately diverged to get patches and customisations in quick and avoid waiting for releases, and we haven't previously allowed time to reach out to your team to see what we can give back other than by raising a couple of PRs like this one. We'll try to reach out via your slack and discuss a way forward, but wanted to add a heads up here in case you can advise who best to speak to and how best to get in touch. Apologies for tagging you directly, I took your names from the top recent contributor list, and saw @zhaoyongjie previously had eyes on this PR 👀 |
Hi @giftig.
@eschutho will be the best person for this one as she was the release manager for 2.1
This is really nice. Contributions are always welcome. The best person to help you here is @rusackas who established a similar model with CCCS. |
Hi @eschutho would you be able to weigh in on the above? |
HI @giftig, thanks for the PR. I didn't read through it thoroughly so excuse me if I missed anything in the implementation, but at first glance, I do know that we added rls support to subqueries in 2.1.0. What version are you currently on? Also I agree on the suggestion to connect with @rusackas on contributing back and working more closely with the team here. Evan will be a great resource for getting you plugged in. |
@eschutho thanks, I'll take a look at the updates around there and try to retest the issue we were having here. We're currently upgrading to the 2.1 release (from 2.0.0), and merging in the changes we had on our fork in order to do the upgrade. Unfortunately with some of those changes, like this one, I don't have an especially good idea of what the problem was, how the fix works, or whether the 2.1 release has independently fixed it, so I thought I'd check here and if this is fixed in the 2.1 release we can close this PR and it's one fewer commit to try to merge in. I put this one at the end of my list to test while I checked with you so I'll try to verify the problem described here against 2.1 and if it's already fixed we can close this. Thanks to pointing me to the relevant changes in 2.1. Thanks @rusackas, I believe our product owner has already reached out to you and we have a call scheduled this week, so we'll speak shortly and I'll sign up to your slack as well. |
Hi, I've just run into this issue in 2.1 so I can confirm that the issue still exists. I'll attempt to rebase this patch onto the latest state of the project and raise it as a new PR. I'd like to flag that this is a serious security issue around bypassing datasource access permissions, though, so it'd be good to get some eyes on this issue from your side if possible. As @victorarbuesmallada has raised here, currently in superset a user with no permissions to access SELECT * FROM (SELECT * FROM forbidden_table) AS forbidden_table; |
@rusackas do you know who would be best to take a look at what's being done here? I see in git blame that several names have made major contributions to the SQL parse logic in this file. Obviously this is some of the most complex / low-level code so I'd like to get opinions on whether this fix is the "right" fix and what we can do to get this merged. Surprisingly I was able to cherry-pick this into 2.1 without any conflicts and am just retesting that the fix still works on our fork; it doesn't seem like the sql parse logic has changed much at all between 2.0 and 2.1. Hopefully this will make it easier to get this looked at and merged. Happy to make changes where needed and submit a new PR as Victor is no longer in our team and unlikely to be active here, and I notice your build is complaining about lack of test coverage in some parts / I'm sure someone on your team will have the context to come up with improvements for this fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a first round of comments
class TableTuple(NamedTuple): | ||
table: Table | ||
level: int | ||
|
||
class AliasTuple(NamedTuple): | ||
name: str | ||
level: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: while I'm having a hard time coming up with a better name for these, I'm not a big fan of Hungarian notation. An alternative could be TableWithLevel
and AliasWithLevel
to clarify the contents rather than the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a fellow hater of Hungarian notation, I'd like to stress I'm not the original author :D I'll change these to something nicer.
} | ||
result = self._extract_from_token(statement) | ||
if result: | ||
(table_tuples, alias_name_tuples) = result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing my naming rant.. 😄 Here we coud probably just call then tables
and aliases
, and then later extract the table_names
/table_levels
from them etc.
""" | ||
) | ||
== {Table("q2"), Table("src")} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we really sure that the q2
reference in the q1
CTE will in fact reference the table q2
in the database? I did the following test on my devenv which has examples data and the superset metadata in the same database, and
with q1 as ( select distinct gender from birth_names),
birth_names as ( select * from ab_role)
select * from (select gender from q1) a
produces the following error:
sqlite error: no such column: gender
This would indicate to me that it was looking for gender
in ab_role
, not the actual birth_names
table. So to me it appears as if only src
is in fact only referenced here, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried something similar but it's definitely referencing the database table, not the alias which came after it:
postgres=# CREATE TABLE hodor(age INT, name VARCHAR);
CREATE TABLE
postgres=# INSERT INTO hodor(age, name) VALUES (35, 'Hodor');
INSERT 0 1
postgres=# CREATE TABLE potatoes(count INT);
CREATE TABLE
postgres=# INSERT INTO potatoes(count) VALUES (100);
INSERT 0 1
postgres=# WITH
result AS (SELECT * FROM hodor),
hodor AS (SELECT * FROM potatoes)
SELECT * FROM result;
age | name
-----+-------
35 | Hodor
(1 row)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what the point of the extra CTE in the final SELECT was here, but it didn't change the result for me either:
postgres=# WITH
result AS (SELECT * FROM hodor),
hodor AS (SELECT * FROM potatoes)
SELECT * FROM (SELECT * FROM result) final;
age | name
-----+-------
35 | Hodor
(1 row)
postgres=# WITH
result AS (SELECT * FROM hodor),
hodor AS (SELECT * FROM potatoes)
SELECT * FROM (SELECT age FROM result) final;
age
-----
35
(1 row)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having said that, we're using different databases. It wouldn't surprise me at all if postgres and sqlite handled CTEs slightly differently. If that's the case this endeavour is slightly doomed though; we'd need a lot more logic here to handle various different databases.
Arguably it's impossible to maintain this without somehow asking the database for the table list instead of trying to work it out ourselves. Unfortunately for the db my team really cares about, trino, that's not readily possible either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried it with sqlite and indeed it behaves differently:
sqlite> CREATE TABLE hodor(age INT, name VARCHAR);
sqlite> INSERT INTO hodor(age, name) VALUES (35, 'Hodor');
sqlite> CREATE TABLE potatoes(count INT);
sqlite> INSERT INTO potatoes(count) VALUES (100);
sqlite> WITH
result AS (SELECT * FROM hodor),
hodor AS (SELECT * FROM potatoes)
SELECT * FROM result; ...> ...> ...>
100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok good to know, this is likely one of those cases where different dbs behave differently. To support both use cases we could probably add a flag on BaseEngineSpec
that specifies which variant the db in question supports. Something like cte_forward_alias_reference
that we'd set to False
for SQLite and True
for Postgres (we should probably do some additional testing to see which is the more common flavor before setting that on BaseEngineSpec
).
Not in the scope of this PR, but in the neat future it might be good to switch to |
@betodealmeida my team had similar ideas but suggested I'm in two minds about this PR now as it does have value to my team: we're very readily falling afoul of this particular issue as one of our jinja templates uses the alias-overriding-table pattern described in the issue, so we've kept it in our fork so that users can't escalate their permissions purely by accident. However, given @villebro 's comment on the test I'm not certain this doesn't worsen the experience for some users / backends by restricting legitimate access, and if that's a poor trade off in that sense. And what's clear is that this PR won't fix the real issue, we'd need a more fundamental change in approach to make this work as intended. |
@giftig @betodealmeida Incidentally, I've also been playing around with |
creator of sqlglot here. happy to chat. sqlglot has very advanced capabilities like multi dialect support, scope traversal, column qualification, and other optimizations. not exactly sure what you're looking to do but you can also do column level lineage, so you can determine if any select is selecting from any source. |
Going to close this one as it was always a bit too specific to my team's problem and the change has some issues which are complex to solve. We've moved in a different direction to solve the underlying issue we were seeing. Might still be an idea to discuss the polyglot approach to generally improving the logic / performance here in future though; sounds like a good candidate for a SIP. |
SUMMARY
We discovered that in queries that have aliases with the same name as tables, that removes said tables from the list returned in the
ParsedQuery
tables
method. This affects the way we prevent users to access to certain tables, since they can circumvent the restricted access by retrieving data through subqueries that have the same alias as the name of the involved table.In essence, this won't work:
select * from forbidden_table
This will work:
select * from (select * from forbidden_table) forbidden_table
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
A normal query shows that the table can't be accessed:
A query with a forbidden table on a subquery:
TESTING INSTRUCTIONS
Create a user that has only certain tables permission and try to perform a query with a forbidden table inside a subquery with the alias the same as the table name.
ADDITIONAL INFORMATION