Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(parsedquery): resolve conflict when aliases and tables have the same name #21535

Closed

Conversation

victorarbuesmallada
Copy link

@victorarbuesmallada victorarbuesmallada commented Sep 21, 2022

SUMMARY

We discovered that in queries that have aliases with the same name as tables, that removes said tables from the list returned in the ParsedQuery tables method. This affects the way we prevent users to access to certain tables, since they can circumvent the restricted access by retrieving data through subqueries that have the same alias as the name of the involved table.
In essence, this won't work:
select * from forbidden_table
This will work:
select * from (select * from forbidden_table) forbidden_table

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

A normal query shows that the table can't be accessed:

blocked

A query with a forbidden table on a subquery:

allowed

TESTING INSTRUCTIONS

Create a user that has only certain tables permission and try to perform a query with a forbidden table inside a subquery with the alias the same as the table name.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@victorarbuesmallada
Copy link
Author

Related to this issue: #21207

@codecov
Copy link

codecov bot commented Sep 21, 2022

Codecov Report

Merging #21535 (a3a44b2) into master (4d12e37) will decrease coverage by 1.33%.
The diff coverage is 42.00%.

@@            Coverage Diff             @@
##           master   #21535      +/-   ##
==========================================
- Coverage   66.68%   65.34%   -1.34%     
==========================================
  Files        1793     1794       +1     
  Lines       68538    68687     +149     
  Branches     7282     7282              
==========================================
- Hits        45702    44881     -821     
- Misses      20974    21944     +970     
  Partials     1862     1862              
Flag Coverage Δ
hive 52.84% <38.00%> (-0.24%) ⬇️
mysql 78.11% <42.00%> (-0.10%) ⬇️
postgres 78.16% <42.00%> (-0.11%) ⬇️
presto 52.74% <38.00%> (-0.24%) ⬇️
python 78.58% <42.00%> (-2.83%) ⬇️
sqlite ?
unit ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/sql_parse.py 62.20% <42.00%> (-34.87%) ⬇️
superset/tables/schemas.py 0.00% <0.00%> (-100.00%) ⬇️
superset/columns/schemas.py 0.00% <0.00%> (-100.00%) ⬇️
...set/advanced_data_type/plugins/internet_address.py 16.32% <0.00%> (-79.60%) ⬇️
superset/utils/pandas_postprocessing/boxplot.py 20.51% <0.00%> (-79.49%) ⬇️
superset/charts/post_processing.py 11.76% <0.00%> (-77.95%) ⬇️
...perset/advanced_data_type/plugins/internet_port.py 18.75% <0.00%> (-77.09%) ⬇️
superset/utils/pandas_postprocessing/rolling.py 21.87% <0.00%> (-68.75%) ⬇️
...perset/utils/pandas_postprocessing/contribution.py 34.61% <0.00%> (-65.39%) ⬇️
superset/utils/pandas_postprocessing/resample.py 37.50% <0.00%> (-62.50%) ⬇️
... and 66 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@victorarbuesmallada victorarbuesmallada force-pushed the parse-subqueries branch 6 times, most recently from 81c34cf to 0a92c17 Compare September 22, 2022 13:47
@giftig
Copy link
Contributor

giftig commented Apr 20, 2023

Hi @zhaoyongjie @john-bodley @michael-s-molina I'm looking to upgrade to the superset 2.1 release but we've had to make a number of small patches and security fixes to our fork of the 2.0 release patch issues such as this one, which was contributed here by a former member of our team.

I'm trying to figure out if this patch is still needed or if a similar fix has made it into the 2.1 release as there hasn't been any activity on the issue or PR here. Are you able to confirm whether the issue described here has been resolved?

We'd also be interested in any guidance you can offer in how we can avoid straying too far from what's being worked on by the open source team. What we'd really like to do is work with your team to find sensible ways of discussing and contributing some of the features and short-term fixes we've made back to the project for the benefit of both our organisation and the wider community. As we've been working at a fairly fast pace we've unfortunately diverged to get patches and customisations in quick and avoid waiting for releases, and we haven't previously allowed time to reach out to your team to see what we can give back other than by raising a couple of PRs like this one. We'll try to reach out via your slack and discuss a way forward, but wanted to add a heads up here in case you can advise who best to speak to and how best to get in touch.

Apologies for tagging you directly, I took your names from the top recent contributor list, and saw @zhaoyongjie previously had eyes on this PR 👀

@michael-s-molina
Copy link
Member

Hi @giftig.

I'm trying to figure out if this patch is still needed or if a similar fix has made it into the 2.1 release as there hasn't been any activity on the issue or PR here. Are you able to confirm whether the issue described here has been resolved?

@eschutho will be the best person for this one as she was the release manager for 2.1

We'd also be interested in any guidance you can offer in how we can avoid straying too far from what's being worked on by the open source team. What we'd really like to do is work with your team to find sensible ways of discussing and contributing some of the features and short-term fixes we've made back to the project for the benefit of both our organisation and the wider community. As we've been working at a fairly fast pace we've unfortunately diverged to get patches and customisations in quick and avoid waiting for releases, and we haven't previously allowed time to reach out to your team to see what we can give back other than by raising a couple of PRs like this one. We'll try to reach out via your slack and discuss a way forward, but wanted to add a heads up here in case you can advise who best to speak to and how best to get in touch.

This is really nice. Contributions are always welcome. The best person to help you here is @rusackas who established a similar model with CCCS.

@giftig
Copy link
Contributor

giftig commented Apr 25, 2023

Hi @eschutho would you be able to weigh in on the above?

@eschutho
Copy link
Member

Hi @eschutho would you be able to weigh in on the above?

HI @giftig, thanks for the PR. I didn't read through it thoroughly so excuse me if I missed anything in the implementation, but at first glance, I do know that we added rls support to subqueries in 2.1.0. What version are you currently on?

Also I agree on the suggestion to connect with @rusackas on contributing back and working more closely with the team here. Evan will be a great resource for getting you plugged in.

@rusackas
Copy link
Member

@giftig happy to chat! I don't see you on Slack or see an email on your GitHub profile, so feel free to DM me on Slack if you join, and we can take it from there!

@giftig
Copy link
Contributor

giftig commented May 2, 2023

@eschutho thanks, I'll take a look at the updates around there and try to retest the issue we were having here. We're currently upgrading to the 2.1 release (from 2.0.0), and merging in the changes we had on our fork in order to do the upgrade. Unfortunately with some of those changes, like this one, I don't have an especially good idea of what the problem was, how the fix works, or whether the 2.1 release has independently fixed it, so I thought I'd check here and if this is fixed in the 2.1 release we can close this PR and it's one fewer commit to try to merge in.

I put this one at the end of my list to test while I checked with you so I'll try to verify the problem described here against 2.1 and if it's already fixed we can close this. Thanks to pointing me to the relevant changes in 2.1.

Thanks @rusackas, I believe our product owner has already reached out to you and we have a call scheduled this week, so we'll speak shortly and I'll sign up to your slack as well.

@giftig
Copy link
Contributor

giftig commented May 4, 2023

Hi, I've just run into this issue in 2.1 so I can confirm that the issue still exists. I'll attempt to rebase this patch onto the latest state of the project and raise it as a new PR. I'd like to flag that this is a serious security issue around bypassing datasource access permissions, though, so it'd be good to get some eyes on this issue from your side if possible.

As @victorarbuesmallada has raised here, currently in superset a user with no permissions to access forbidden_table can do so regardless using a query like

SELECT * FROM (SELECT * FROM forbidden_table) AS forbidden_table;

@giftig
Copy link
Contributor

giftig commented May 5, 2023

@rusackas do you know who would be best to take a look at what's being done here? I see in git blame that several names have made major contributions to the SQL parse logic in this file. Obviously this is some of the most complex / low-level code so I'd like to get opinions on whether this fix is the "right" fix and what we can do to get this merged.

Surprisingly I was able to cherry-pick this into 2.1 without any conflicts and am just retesting that the fix still works on our fork; it doesn't seem like the sql parse logic has changed much at all between 2.0 and 2.1. Hopefully this will make it easier to get this looked at and merged. Happy to make changes where needed and submit a new PR as Victor is no longer in our team and unlikely to be active here, and I notice your build is complaining about lack of test coverage in some parts / I'm sure someone on your team will have the context to come up with improvements for this fix.

@rusackas rusackas requested review from dpgaspar, villebro and eschutho May 8, 2023 21:56
Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a first round of comments

Comment on lines +190 to +196
class TableTuple(NamedTuple):
table: Table
level: int

class AliasTuple(NamedTuple):
name: str
level: int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: while I'm having a hard time coming up with a better name for these, I'm not a big fan of Hungarian notation. An alternative could be TableWithLevel and AliasWithLevel to clarify the contents rather than the type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a fellow hater of Hungarian notation, I'd like to stress I'm not the original author :D I'll change these to something nicer.

}
result = self._extract_from_token(statement)
if result:
(table_tuples, alias_name_tuples) = result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing my naming rant.. 😄 Here we coud probably just call then tables and aliases, and then later extract the table_names/table_levels from them etc.

Comment on lines +561 to +564
"""
)
== {Table("q2"), Table("src")}
)
Copy link
Member

@villebro villebro May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we really sure that the q2 reference in the q1 CTE will in fact reference the table q2 in the database? I did the following test on my devenv which has examples data and the superset metadata in the same database, and

with q1 as ( select distinct gender from birth_names),
birth_names as ( select * from ab_role)
select * from (select gender from q1) a

produces the following error:

sqlite error: no such column: gender

This would indicate to me that it was looking for gender in ab_role, not the actual birth_names table. So to me it appears as if only src is in fact only referenced here, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried something similar but it's definitely referencing the database table, not the alias which came after it:

postgres=# CREATE TABLE hodor(age INT, name VARCHAR);
CREATE TABLE
postgres=# INSERT INTO hodor(age, name) VALUES (35, 'Hodor');
INSERT 0 1
postgres=# CREATE TABLE potatoes(count INT);
CREATE TABLE
postgres=# INSERT INTO potatoes(count) VALUES (100);
INSERT 0 1
postgres=# WITH
  result AS (SELECT * FROM hodor),
  hodor AS (SELECT * FROM potatoes)
SELECT * FROM result;
 age | name  
-----+-------
  35 | Hodor
(1 row)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the point of the extra CTE in the final SELECT was here, but it didn't change the result for me either:

postgres=# WITH
  result AS (SELECT * FROM hodor),
  hodor AS (SELECT * FROM potatoes)
SELECT * FROM (SELECT * FROM result) final;
 age | name  
-----+-------
  35 | Hodor
(1 row)

postgres=# WITH
  result AS (SELECT * FROM hodor),
  hodor AS (SELECT * FROM potatoes)
SELECT * FROM (SELECT age FROM result) final;
 age 
-----
  35
(1 row)

Copy link
Contributor

@giftig giftig May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having said that, we're using different databases. It wouldn't surprise me at all if postgres and sqlite handled CTEs slightly differently. If that's the case this endeavour is slightly doomed though; we'd need a lot more logic here to handle various different databases.

Arguably it's impossible to maintain this without somehow asking the database for the table list instead of trying to work it out ourselves. Unfortunately for the db my team really cares about, trino, that's not readily possible either.

Copy link
Contributor

@giftig giftig May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried it with sqlite and indeed it behaves differently:

sqlite> CREATE TABLE hodor(age INT, name VARCHAR);
sqlite> INSERT INTO hodor(age, name) VALUES (35, 'Hodor');
sqlite> CREATE TABLE potatoes(count INT);
sqlite> INSERT INTO potatoes(count) VALUES (100);
sqlite> WITH
  result AS (SELECT * FROM hodor),
  hodor AS (SELECT * FROM potatoes)
SELECT * FROM result;   ...>    ...>    ...> 
100

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok good to know, this is likely one of those cases where different dbs behave differently. To support both use cases we could probably add a flag on BaseEngineSpec that specifies which variant the db in question supports. Something like cte_forward_alias_reference that we'd set to False for SQLite and True for Postgres (we should probably do some additional testing to see which is the more common flavor before setting that on BaseEngineSpec).

@betodealmeida
Copy link
Member

Not in the scope of this PR, but in the neat future it might be good to switch to sqloxide to check for tables. We're already using it were performance is important, but it's also more robust in determining the accessed tables (from my experience).

@giftig
Copy link
Contributor

giftig commented May 15, 2023

@betodealmeida my team had similar ideas but suggested sqlglot, for similar reasons. I haven't got much experience with either tool so not sure which one would do the job better, but I think it's clear from the conversation above about different results with different databases that the issue can't be fixed without a major change to the approach, e.g. tackling this with a more robust tool for the job like sqlglot or sqloxide.

I'm in two minds about this PR now as it does have value to my team: we're very readily falling afoul of this particular issue as one of our jinja templates uses the alias-overriding-table pattern described in the issue, so we've kept it in our fork so that users can't escalate their permissions purely by accident. However, given @villebro 's comment on the test I'm not certain this doesn't worsen the experience for some users / backends by restricting legitimate access, and if that's a poor trade off in that sense. And what's clear is that this PR won't fix the real issue, we'd need a more fundamental change in approach to make this work as intended.

@villebro
Copy link
Member

@betodealmeida my team had similar ideas but suggested sqlglot, for similar reasons. I haven't got much experience with either tool so not sure which one would do the job better, but I think it's clear from the conversation above about different results with different databases that the issue can't be fixed without a major change to the approach, e.g. tackling this with a more robust tool for the job like sqlglot or sqloxide.

I'm in two minds about this PR now as it does have value to my team: we're very readily falling afoul of this particular issue as one of our jinja templates uses the alias-overriding-table pattern described in the issue, so we've kept it in our fork so that users can't escalate their permissions purely by accident. However, given @villebro 's comment on the test I'm not certain this doesn't worsen the experience for some users / backends by restricting legitimate access, and if that's a poor trade off in that sense. And what's clear is that this PR won't fix the real issue, we'd need a more fundamental change in approach to make this work as intended.

@giftig @betodealmeida Incidentally, I've also been playing around with sqlglot lately and been very impressed with it (check the benchmark results which are pretty impressive). However I can't comment on which is better/worse, but my gut feeling is that if either makes it possible to deprecate sqlparse I think we should consider switching (having said that, it seems like 0.4.2-0.4.4 have come out with a pretty healthy cadence after a long hiatus).

@tobymao
Copy link

tobymao commented May 16, 2023

creator of sqlglot here. happy to chat. sqlglot has very advanced capabilities like multi dialect support, scope traversal, column qualification, and other optimizations.

not exactly sure what you're looking to do but you can also do column level lineage, so you can determine if any select is selecting from any source.

@giftig
Copy link
Contributor

giftig commented Dec 19, 2023

Going to close this one as it was always a bit too specific to my team's problem and the change has some issues which are complex to solve. We've moved in a different direction to solve the underlying issue we were seeing.

Might still be an idea to discuss the polyglot approach to generally improving the logic / performance here in future though; sounds like a good candidate for a SIP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants