-
Notifications
You must be signed in to change notification settings - Fork 14.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(hive): Update CSV to Hive upload prefix #14255
fix(hive): Update CSV to Hive upload prefix #14255
Conversation
569f20b
to
7f711a8
Compare
@@ -209,7 +209,7 @@ def test_create_table_from_csv_if_exists_fail_with_schema(mock_table, mock_g): | |||
|
|||
@mock.patch( | |||
"superset.db_engine_specs.hive.config", | |||
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: True}, | |||
{**app.config, "CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC": lambda *args: ""}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC
function returns a str
and not a boolean
.
7f711a8
to
01aa899
Compare
defaced
to
c4e75a5
Compare
database: "Database", | ||
user: "models.User", # pylint: disable=unused-argument | ||
schema: Optional[str], | ||
) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the return type has changed from Optional[str]
to str
since a path must be specified.
Codecov Report
@@ Coverage Diff @@
## master #14255 +/- ##
==========================================
- Coverage 76.97% 76.87% -0.11%
==========================================
Files 953 953
Lines 48061 48062 +1
Branches 5971 5971
==========================================
- Hits 36997 36947 -50
- Misses 10862 10913 +51
Partials 202 202
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
# Note the final empty path enforces a trailing slash. | ||
return os.path.join( | ||
CSV_TO_HIVE_UPLOAD_DIRECTORY, str(database.id), schema or "", "" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using database.database_name
instead of database.id
to have more readable file paths (as well as allowing re-creating database yet still uploading to the same directory)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ktmud that was my original thinking as well, however these can have spaces etc. in them which could be problematic for S3 keys. I thought a database ID was the safest and also remains unchanged should someone rename a database (the name is purely for presentation purposes).
e51582c
to
71b2443
Compare
71b2443
to
0ea817e
Compare
* fix(hive): Update CSV to Hive upload prefix * Trigger notification Co-authored-by: John Bodley <john.bodley@airbnb.com>
SUMMARY
This PR updates the default
CSV_TO_HIVE_UPLOAD_DIRECTORY_FUNC
to leverage the previously unused database and schema. Previously the S3 key (which is used as the table location in Hive) was agnostic of database (cluster) or table schema meaning uploading a CSV to an existing table with the same name (sans cluster and schema) created via the CSV flow would overwrite the S3 key.The fix is to ensure that the database ID and schema (if defined) are used in the default callable. Not an undefined schema infers the default (database dependent) schema.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TEST PLAN
CI.
ADDITIONAL INFORMATION