Make schema name for the CTA queries and limit configurable #8867

bkyryliuk · 2019-12-19T01:51:48Z

Description

This PR makes the logic that generates the schema name for create table as (CTA) queries configurable, based on the database, the user, the selected schema and the executed SQL.
There are 2 new parameters in the config; both are off by default.

# Function accepts database object, user object, schema name and sql that will be run.
SQLLAB_CTA_SCHEMA_NAME_FUNC = (
    None
)  # type: Optional[Callable[["Database", "User", str, str], str]]
# Flag that controls if limit should be enforced on the CTA (create table as queries).
SQLLAB_CTA_NO_LIMIT = False

Use cases

automatically put sensitive data into the sensitive schema and that data is accessed
put user CTA results into the user's personal schema; this can be extended to team schemas as well
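A minimal sketch of what such a function could look like, using the four-argument signature from the description. The `Database` and `User` classes are hypothetical stand-ins for Superset's models, and the `payments` table name and the `sensitive` and `tmp_<username>` schema names are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical stand-ins for Superset's Database and User models; only the
# attributes used below are assumed to exist.
@dataclass
class Database:
    database_name: str


@dataclass
class User:
    username: str


def user_schema_name(database: Database, user: User, schema: str, sql: str) -> str:
    """Route CTA results into a per-user schema, or a shared sensitive one."""
    # Assumption: any query mentioning "payments" is treated as sensitive.
    if "payments" in sql.lower():
        return "sensitive"
    # Otherwise each user gets a personal schema named after the username.
    return f"tmp_{user.username}"


# Config entry in the annotated form; None would leave the feature off.
SQLLAB_CTA_SCHEMA_NAME_FUNC: Optional[
    Callable[[Database, User, str, str], str]
] = user_schema_name
```

The `database` and `schema` arguments are unused here; a per-database or per-team policy would branch on them in the same way.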

Other changes

get_query_with_new_limit function is modified not to apply the limit if the user's limit is lower, e.g. if the user runs SELECT * FROM user LIMIT 1, it returns 1 row, not 1000
fixed a bug for the Postgres database where CTA results were not committed and the table was not created, e.g. https://github.com/apache/incubator-superset/blob/df2ee5cbcb8bc6e5cd310b9b9015509db744a256/tests/celery_tests.py#L175
created additional schemas in travis for testing purposes
removed unused SQL_SELECT_AS_CTA field from the test config.

Tests

[x] unit test + integration tests
[x] tested locally on mysql & postgres
[x] has been used in production @ dropbox for over a month

request-info · 2019-12-19T01:51:51Z

We would appreciate it if you could provide us with more info about this issue/pr! Please do not leave the title or description empty.

mistercrunch · 2019-12-21T00:32:34Z

superset/sql_lab.py

@@ -366,6 +368,11 @@ def execute_sql_statements(
                    payload = handle_query_error(msg, query, session, payload)
                    return payload

+        # Commit the connection so CTA queries will create the table.
+        # TODO(bk): consider if it's only needed for postgres.
+        if conn:


When would conn be falsy?

mistercrunch · 2019-12-21T00:42:06Z

superset/sql_parse.py

@@ -222,7 +222,10 @@ def get_query_with_new_limit(self, new_limit: int) -> str:
                limit_pos = pos
                break
        _, limit = statement.token_next(idx=limit_pos)
-        if limit.ttype == sqlparse.tokens.Literal.Number.Integer:
+        # Override the limit only when it exceeds the configured value.
+        if limit.ttype == sqlparse.tokens.Literal.Number.Integer and new_limit < int(


Mmmh, in theory this changes what the function is expected to do ("returns the query with the specified limit"), so either we change the name/docstring to reflect that, or we move the conditional logic to where the function is called.

Updated the docstring; yes, there was a mismatch. It's hard to move this condition outside of the function, as it needs the existing limit value from the query, which would require parsing the query twice. I can't think of a use case where we would want to override a lower user limit with the higher configured value, e.g. I would expect to see 1 row when I query select * from bla limit 1, rather than 100.
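The behavior discussed here can be sketched as follows. This is not the PR's implementation: it swaps the sqlparse token walk for a simple regular expression on a trailing LIMIT clause, and the function name `apply_limit` is hypothetical. Appending a limit when the query has none is an assumption of this sketch.

```python
import re

# Matches a trailing "LIMIT <n>" clause; a deliberately simple stand-in for
# the sqlparse-based token walk in the real function.
_LIMIT_RE = re.compile(r"\bLIMIT\s+(\d+)$", re.IGNORECASE)


def apply_limit(sql: str, new_limit: int) -> str:
    """Cap the query at new_limit, but keep a lower user-supplied LIMIT."""
    # Drop surrounding whitespace and a trailing semicolon before matching.
    stripped = sql.strip().rstrip(";").rstrip()
    match = _LIMIT_RE.search(stripped)
    if match is None:
        # No LIMIT in the query: append the configured one.
        return f"{stripped} LIMIT {new_limit}"
    if new_limit < int(match.group(1)):
        # The user asked for more rows than allowed: replace the limit.
        return f"{stripped[:match.start()]}LIMIT {new_limit}"
    # The user's limit is already at or below the configured one: keep it.
    return stripped
```

With a configured limit of 1000, `SELECT * FROM user LIMIT 1` is left unchanged, while `SELECT * FROM user LIMIT 5000` is rewritten to use `LIMIT 1000`.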

mistercrunch · 2019-12-21T00:49:17Z

superset/config.py

+# Flag that controls if limit should be inforced on the create table as queries.
+SQLLAB_CTA_NO_LIMIT = False
+
+# Function accepts username, schema name and sql that will be run e.g.:


It's not super clear here what this does or why you'd want to use this config element. I had to read a bit of code to understand it.

This allows you to define custom logic around the "CREATE TABLE AS" or CTAS feature in SQL Lab that defines where the target schema should be for a given user.

Then add a proper example with type annotation

I think we want the database object as a param too as the policy may differ across databases. Also could be good to pass in the user object instead of the username in case someone would want to dig into roles/perms.

mistercrunch · 2019-12-21T00:54:22Z

superset/views/core.py

@@ -2591,6 +2600,15 @@ def sql_json_exec(
        # Set tmp_table_name for CTA
        if select_as_cta and mydb.force_ctas_schema:
            tmp_table_name = f"{mydb.force_ctas_schema}.{tmp_table_name}"
+        elif select_as_cta:
+            dest_schema_name = get_cta_schema_name(
+                schema, sql, g.user.username if g.user else None


g.user will only be there if you're in sync mode (inside the scope of a web request), won't work on Celery.

views/core.py runs within the scope of the web request; this happens before the query object is created and passed to the Celery worker, so we should be safe here.

mistercrunch · 2019-12-21T00:55:39Z

superset/views/core.py

@@ -2644,8 +2662,9 @@ def sql_json_exec(
            )

        # set LIMIT after template processing
-        limits = [mydb.db_engine_spec.get_limit_from_sql(rendered_query), limit]
-        query.limit = min(lim for lim in limits if lim is not None)
+        if not config.get("SQLLAB_CTA_NO_LIMIT", False):


We can assume that the key exists, so there is no need to default to False here, and it would be falsy anyway if the key didn't exist.

good point :)

villebro

Very nice refactor! My only remaining issue is with the following scenarios:

user enters a fully qualified name in the table field (highlighted in your TODO comment)
user has chars in CTAS schema/table name that require quotes in fully qualified name.

I think we can live with these restrictions for now, so leaning towards getting this merged soon. Let me test this locally before final approval.

villebro · 2020-02-21T06:18:48Z

superset/migrations/versions/72428d1ea401_add_tmp_schema_name_to_the_query_object.py

+
+def upgrade():
+    op.add_column(
+        "query", sa.Column("tmp_schema_name", sa.String(length=256), nullable=True)


(note to other reviewers) At first I was confused by the choice of column name here, but it turns out there was already a tmp_table_name for the CTAS target table. A more appropriate name might be target_table_name or ctas_table_name, but there is no point in changing the current convention in this PR.

villebro · 2020-02-21T06:38:11Z

tests/celery_tests.py

+            # sqlite doesn't support schemas
+            return
+        tmp_table_name = "tmp_async_22"
+        expected_full_table_name = f"{CTAS_SCHEMA_NAME}.{tmp_table_name}"


Back to the topic of quoting: select_star in BaseEngineSpec quotes schema and table_name, and I'm wondering if we shouldn't do that here (views/core.py:Superset.sql_json_exec), too. If not, then we just need to assume users won't be doing CTAS into schemas/tables with periods, which seems like a reasonable assumption for now.

I am open to either, quote while generating SQL or quote early on when getting the user input.
I think both approaches are good, the latter one would be a bit involved as table object creation should be quoted for SQLATable, Column and Metric. This test case just demonstrates the existing behavior, modifying it probably would be out of scope of this change, but definitely a useful improvement.

Let's not push too much complexity into this PR, we can deal with this later.

Fixing unit tests
Fix table quoting
Mypy
Split tests out for sqlite
Grant more permissions for mysql user
Postgres doesn't support if not exists
More logging
Commit for table creation
Priviliges for postgres
Update tests
Resolve comments
Lint
No limits for the CTA queries if configures

villebro

Tested locally and works well. Had missed the type comments before, so would like to get those fixed; beyond that this looks good to go for me.

villebro · 2020-02-26T20:41:12Z

superset/config.py

+SQLLAB_CTA_SCHEMA_NAME_FUNC = (
+    None
+)  # type: Optional[Callable[["Database", "models.User", str, str], str]]


A bit of a nit, but as Superset has deprecated support for Python 2.7, we prefer regular type annotations over type comments, i.e.

SQLLAB_CTA_SCHEMA_NAME_FUNC: Optional[
    Callable[["Database", "models.User", str, str], str]
] = None

villebro · 2020-02-26T20:46:43Z

superset/views/core.py

+    func = config.get(
+        "SQLLAB_CTA_SCHEMA_NAME_FUNC"
+    )  # type: Optional[Callable[[Database, ab_models.User, str, str], str]]


Same as above. Also, we prefer using square brackets when reading configs, i.e. config["SQLLAB_CTA_SCHEMA_NAME_FUNC"] to avoid having to check for None (doesn't really apply in this case, but it's a convention nonetheless).
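The difference between the two access styles can be illustrated with a plain dict standing in for the application config. The dict below is a hypothetical stand-in, and the misspelled key is deliberate.

```python
from typing import Any, Dict

# Hypothetical config mapping; in Superset the defaults come from config.py,
# so every documented key is expected to be present.
config: Dict[str, Any] = {
    "SQLLAB_CTA_NO_LIMIT": False,
    "SQLLAB_CTA_SCHEMA_NAME_FUNC": None,
}

# Square brackets: a misspelled or missing key fails loudly with KeyError.
no_limit = config["SQLLAB_CTA_NO_LIMIT"]

# .get(): a misspelled key silently yields None, which hides the mistake.
typo_value = config.get("SQLLAB_CTA_NO_LIMITS")
```

For `SQLLAB_CTA_SCHEMA_NAME_FUNC` the stored value is itself `None` by default, so a `None` check is still needed there whichever style is used, as the comment above notes.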

villebro · 2020-02-26T20:46:57Z

superset/views/core.py

+        # Set tmp_schema_name for CTA
+        # TODO(bkyryliuk): consider parsing, splitting tmp_schema_name from tmp_table_name if user enters
+        # <schema_name>.<table_name>
+        tmp_schema_name = schema  # type: Optional[str]


villebro

LGTM. @mistercrunch @john-bodley @dpgaspar a second opinion would be good here, as the review process has been fairly lengthy with lots of back and forth, possibly resulting in us overlooking something important.

john-bodley · 2020-02-27T06:52:51Z

superset/sql_parse.py

        :param overwrite: table_name will be dropped if true
        :return: Create table as query
        """
        exec_sql = ""
        sql = self.stripped()
+        # TODO(bkyryliuk): quote full_table_name
+        full_table_name = f"{schema_name}.{table_name}" if schema_name else table_name


Shouldn’t we address the TODO? Note the quoter needs to be dialect specific.

@john-bodley this would be an additional feature. I kept the logic as it was before.
It is worth resolving this TODO, and it is an existing bug in Superset, but I think this PR is not the right place for the fix, as it is already quite large and hard to review and comprehend.
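For reference, a rough sketch of what dialect-aware quoting of the full table name could look like. This is not code from the PR: the `QUOTE_CHARS` mapping and both function names are hypothetical, and a real fix would more likely rely on the quoting helpers of the database dialect in use.

```python
from typing import Optional

# Assumed quote characters per dialect name; real engines expose this through
# their own dialect-specific quoting helpers.
QUOTE_CHARS = {"mysql": "`", "postgresql": '"', "sqlite": '"'}


def quote_identifier(name: str, dialect: str) -> str:
    """Wrap an identifier in the dialect's quote char, doubling embedded ones."""
    q = QUOTE_CHARS[dialect]
    return f"{q}{name.replace(q, q + q)}{q}"


def full_table_name(table_name: str, schema_name: Optional[str], dialect: str) -> str:
    """Quote schema and table separately, then join them with a period."""
    quoted_table = quote_identifier(table_name, dialect)
    if schema_name:
        return f"{quote_identifier(schema_name, dialect)}.{quoted_table}"
    return quoted_table
```

Quoting the two parts separately is what keeps a period inside a schema or table name from being read as the schema separator.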

john-bodley · 2020-02-27T06:57:09Z

.travis.yml

@@ -64,8 +64,10 @@ jobs:
        - redis-server
      before_script:
        - mysql -u root -e "DROP DATABASE IF EXISTS superset; CREATE DATABASE superset DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci"
+        - mysql -u root -e "DROP DATABASE IF EXISTS sqllab_test_db; CREATE DATABASE sqllab_test_db DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci"


It’s not clear from this PR why we need these additional two databases.

@john-bodley this is purely for testing, e.g. we need 2 different schemas in both mysql & postgres to test the CTA behavior

villebro · 2020-02-29T18:44:31Z

As this PR includes a db migration, it would be nice to merge soon if there are no objections. I'll leave this open for a few more days, then merge if no further comments surface.

request-info bot added the need:more-info Requires more information from author label Dec 19, 2019

pull-request-size bot added the size/M label Dec 19, 2019

bkyryliuk force-pushed the bogdan/cta_userschema branch 17 times, most recently from fabcf75 to 99b5b6c Compare December 20, 2019 22:27

mistercrunch reviewed Dec 21, 2019

bkyryliuk force-pushed the bogdan/cta_userschema branch 5 times, most recently from c52c861 to 47c7e38 Compare December 26, 2019 23:00

bkyryliuk force-pushed the bogdan/cta_userschema branch 2 times, most recently from 861a68d to 07787c2 Compare February 20, 2020 19:28

villebro reviewed Feb 21, 2020

bkyryliuk force-pushed the bogdan/cta_userschema branch from 07787c2 to 5117648 Compare February 24, 2020 18:04

bogdan-dbx added 10 commits February 26, 2020 10:27

CTA -> CTAS and dict -> {}

0f4d5a1

Move database creation to the .travis file

763a8ef

Black

93a393f

Move tweaks to travis db setup

6e7dc8c

Remove left over version

abe1bc7

Address comments

2df033a

Quote table names in the CTAS queries

689967c

Pass tmp_schema_name for the query execution

e2a96e5

Rebase alembic migration

6661e39

bkyryliuk force-pushed the bogdan/cta_userschema branch from 5117648 to 6661e39 Compare February 26, 2020 18:27

villebro requested changes Feb 26, 2020

bogdan-dbx added 3 commits February 26, 2020 13:06

Switch to python3 mypy

4cc14b2

SQLLAB_CTA_SCHEMA_NAME_FUNC -> SQLLAB_CTAS_SCHEMA_NAME_FUNC

802fcae

Black

6c72dcc

villebro approved these changes Feb 27, 2020

john-bodley reviewed Feb 27, 2020

villebro merged commit 4e1fa95 into apache:master Mar 3, 2020

john-bodley added a commit that referenced this pull request Mar 5, 2020

[UPDATING] Adding notes regarding #8867

572e6df

john-bodley mentioned this pull request Mar 5, 2020

[UPDATING] Adding notes regarding #8867 #9246

Merged

12 tasks

villebro added a commit that referenced this pull request Mar 5, 2020

[UPDATING] Adding notes regarding #8867 (#9246)

4ffee8c

john-bodley mentioned this pull request Aug 12, 2020

feat: add extra column to tables and sql_metrics #10592

Merged

6 tasks

mistercrunch added the 🏷️ bot and 🚢 0.36.0 labels Feb 28, 2024

cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024

[UPDATING] Adding notes regarding apache#8867

6bdb409
