Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(hive): Update CSV to Hive upload prefix #14255

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions UPDATING.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,8 @@ assists people when migrating to a new version.

- [13980](https://github.com/apache/superset/pull/13980): Data health checks no longer use the metadata database as an interim cache. Though non-breaking, deployments which implement complex logic should likely memoize the callback function. Refer to documentation in the confg.py file for more detail.

- [14255](https://github.com/apache/superset/pull/14255): The default `CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC` callable logic has been updated to leverage the specified database and schema to ensure the upload S3 key prefix is unique. Previously tables generated via upload from CSV with the same name but differ schema and/or cluster would use the same S3 key prefix. Note this change does not impact previously imported tables.

### Breaking Changes
### Potential Downtime
### Deprecations
Expand Down
13 changes: 10 additions & 3 deletions superset/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -791,9 +791,16 @@ class CeleryConfig: # pylint: disable=too-few-public-methods
CSV_TO_HIVE_UPLOAD_DIRECTORY = "EXTERNAL_HIVE_TABLES/"
# Function that creates upload directory dynamically based on the
# database used, user and schema provided.
CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC: Callable[
["Database", "models.User", str], Optional[str]
] = lambda database, user, schema: CSV_TO_HIVE_UPLOAD_DIRECTORY
def CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC(
database: "Database",
user: "models.User", # pylint: disable=unused-argument
schema: Optional[str],
) -> str:
Copy link
Member Author

@john-bodley john-bodley Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the return type has changed from Optional[str] to str since a path must be specified.

# Note the final empty path enforces a trailing slash.
return os.path.join(
CSV_TO_HIVE_UPLOAD_DIRECTORY, str(database.id), schema or "", ""
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using database.database_name instead of database.id to have more readable file paths (as well as allowing re-creating database yet still uploading to the same directory)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ktmud that was my original thinking as well, however these can have spaces etc. in them which could be problematic for S3 keys. I thought a database ID was the safest and also remains unchanged should someone rename a database (the name is purely for presentation purposes).



# The namespace within hive where the tables created from
# uploading CSVs will be stored.
Expand Down
24 changes: 20 additions & 4 deletions tests/db_engine_specs/hive_tests.py
Original file line number Diff line number Diff line change
Expand Up @@ -173,7 +173,7 @@ def test_create_table_from_csv_append() -> None:

@mock.patch(
"superset.db_engine_specs.hive.config",
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: True},
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: ""},
)
@mock.patch("superset.db_engine_specs.hive.g", spec={})
@mock.patch("tableschema.Table")
Expand All @@ -190,7 +190,7 @@ def test_create_table_from_csv_if_exists_fail(mock_table, mock_g):

@mock.patch(
"superset.db_engine_specs.hive.config",
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: True},
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: ""},
)
@mock.patch("superset.db_engine_specs.hive.g", spec={})
@mock.patch("tableschema.Table")
Expand All @@ -211,7 +211,7 @@ def test_create_table_from_csv_if_exists_fail_with_schema(mock_table, mock_g):

@mock.patch(
"superset.db_engine_specs.hive.config",
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: True},
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: ""},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC function returns a str and not a boolean.

)
@mock.patch("superset.db_engine_specs.hive.g", spec={})
@mock.patch("tableschema.Table")
Expand Down Expand Up @@ -239,7 +239,7 @@ def test_create_table_from_csv_if_exists_replace(mock_upload_to_s3, mock_table,

@mock.patch(
"superset.db_engine_specs.hive.config",
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: True},
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: ""},
)
@mock.patch("superset.db_engine_specs.hive.g", spec={})
@mock.patch("tableschema.Table")
Expand Down Expand Up @@ -347,6 +347,22 @@ def is_readonly(sql: str) -> bool:
assert is_readonly("WITH (SELECT 1) bla SELECT * from bla")


@pytest.mark.parametrize(
"schema,upload_prefix",
[("foo", "EXTERNAL_HIVE_TABLES/1/foo/"), (None, "EXTERNAL_HIVE_TABLES/1/")],
)
def test_s3_upload_prefix(schema: str, upload_prefix: str) -> None:
mock_database = mock.MagicMock()
mock_database.id = 1

assert (
app.config["CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC"](
database=mock_database, user=mock.MagicMock(), schema=schema
)
== upload_prefix
)


def test_upload_to_s3_no_bucket_path():
with pytest.raises(
Exception,
Expand Down