Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix correctness issue with double slash and add test for S3 and Glue metastore #17804

Closed
wants to merge 1 commit into from

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Jun 8, 2023

Description

Fixes #17803

  • Add test for Hive Thrift metastore and MinIO

Release notes

# Hive
* Fix correctness issue when location contains double slash. ({issue}`17803`)

@cla-bot cla-bot bot added the cla-signed label Jun 8, 2023
@github-actions github-actions bot added delta-lake Delta Lake connector hive Hive connector iceberg Iceberg connector tests:hive labels Jun 8, 2023
@ebyhr ebyhr force-pushed the ebi/hive-glue-s3-test branch 4 times, most recently from a4c6287 to 39ece22 Compare June 13, 2023 11:46
@ebyhr ebyhr changed the title Add test for S3 and Glue metastore Fix correctness issue with double slash and add test for S3 and Glue metastore Jun 14, 2023
@ebyhr ebyhr force-pushed the ebi/hive-glue-s3-test branch from 39ece22 to 2da154d Compare June 14, 2023 04:43
@ebyhr ebyhr force-pushed the ebi/hive-glue-s3-test branch 3 times, most recently from 1a9eab5 to a317fd8 Compare June 14, 2023 08:42
@ebyhr ebyhr self-assigned this Jun 14, 2023
@ebyhr ebyhr force-pushed the ebi/hive-glue-s3-test branch from a317fd8 to 76a7baf Compare June 14, 2023 08:51
@ebyhr ebyhr marked this pull request as ready for review June 14, 2023 08:52
@@ -562,7 +563,7 @@ public synchronized void createTable(
ConnectorSession session,
Table table,
PrincipalPrivileges principalPrivileges,
Optional<Path> currentPath,
Optional<Location> currentPath,
Copy link
Member Author

@ebyhr ebyhr Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a concern about other usage of org.apache.hadoop.fs.Path, but hesitated to migrate everything in a PR. Let me know if I should migrate more in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably cannot solve all these problems at once.
we should revert 8bd9f75 to unblock the 420 release and then re-commit it behind a config toggle

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to maintain compatibility with Hive HMS (or Athena), then per @hashhar 's comment #17923 (review) we may need to keep normalization for ever. May require further consideration though

cc @electrum @alexjo2144 @findinpath @losipiuk

@@ -630,10 +639,10 @@ private static boolean isManagedTable(Table table)
return table.getTableType().equals(MANAGED_TABLE.name());
}

private static void deleteDir(HdfsContext context, HdfsEnvironment hdfsEnvironment, Path path, boolean recursive)
private void deleteDir(HdfsContext context, Location path)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes minor behavior change. For instance, we can't delete the directory having # anymore. I suppose this is acceptable and we should deny creating such a database or table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this has no real impact since for example if partition columns include # (in their name or value) the directory name contains it in URL encoded form and not raw #.

@ebyhr ebyhr force-pushed the ebi/hive-glue-s3-test branch from 76a7baf to 95e7f04 Compare June 14, 2023 09:20
@ebyhr ebyhr requested review from findinpath and pajaks June 14, 2023 09:48
@ebyhr ebyhr force-pushed the ebi/hive-glue-s3-test branch from 95e7f04 to 60ecb5a Compare June 14, 2023 13:29
@martint
Copy link
Member

martint commented Jun 14, 2023

Removed RELEASE-BLOCKER tag since issue is already tagged as such.

@ebyhr ebyhr force-pushed the ebi/hive-glue-s3-test branch from 60ecb5a to 0f7bc2e Compare June 15, 2023 04:23
Copy link
Contributor

@findinpath findinpath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that #17848 goes 1st, this submission LGTM.

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;

public class TestHiveS3AndGlueMetastoreTest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all these tests pass without your following commit - specifically the tests dealing with double slash within the schema location for Hive fails.

Copy link
Member Author

@ebyhr ebyhr Jun 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2nd commit doesn't run double slash test because the location is filtered in TestHiveS3AndGlueMetastoreTest.locationPatterns. Am I missing something?

@ebyhr ebyhr requested a review from findepi June 15, 2023 10:51
@findinpath
Copy link
Contributor

findinpath commented Jun 15, 2023

I'm looking through the productive code of Trino OSS and see that the constructor org.apache.hadoop.fs.Path#Path(java.lang.String) is still used extensively.

One such usage on which I've stumbled is:

io.trino.plugin.hive.HiveLocationService#forNewTableAsSelect

This should probably be modified.

In the test (even though currently 🟢 )

io.trino.plugin.hive.TestHiveS3AndGlueMetastoreTest#testBasicOperationsWithProvidedTableLocation for the value "s3://%s/%s//double_slash/%s", the lines should probably be modified

There are as @ebyhr pointed out a lot of other usages which, in special constellations, will probably turn out to be other bugs related to exotic paths.
In order to be pragmatical and not end up with a giant, I'd advocate that this PR is not supposed to deal with doing all the refactorings.

@ebyhr ebyhr force-pushed the ebi/hive-glue-s3-test branch 2 times, most recently from 49cac1e to 9d79c36 Compare June 15, 2023 12:14
@findepi
Copy link
Member

findepi commented Jun 15, 2023

@ebyhr please see test failures

@ebyhr
Copy link
Member Author

ebyhr commented Jun 16, 2023

CI failures are unrelated to this change in my opinion. I will file issues.

ci / test (plugin/trino-delta-lake, cloud-tests)
Error:    TestDeltaLakeGcsConnectorSmokeTest>BaseDeltaLakeConnectorSmokeTest.testStatsSplitPruningBasedOnSepCreatedCheckpointOnTopOfCheckpointWithJustStructStats:1552->registerTableFromResources:172 » NullPointer Cannot invoke "com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorage.create(com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.StorageResourceId, com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.CreateObjectOptions)" because "this.gcs" is null

ci / test (plugin/trino-kafka)
TestKafkaPlugin.testSslSpinup

ci / pt (default, suite-delta-lake-databricks104, )
TestDeltaLakeChangeDataFeedCompatibility.testMergeUpdateIntoTableWithCdfEnabled

@ebyhr ebyhr force-pushed the ebi/hive-glue-s3-test branch from d521c36 to 488e5c1 Compare June 16, 2023 06:53
@ebyhr
Copy link
Member Author

ebyhr commented Jun 16, 2023

Updated HiveLocationService. I confirmed testCreateTableWithDoubleSlash & testCtasWithDoubleSlash fail without the fix.

@ebyhr
Copy link
Member Author

ebyhr commented Jun 16, 2023

Extracted testing commit into #17930

@ebyhr
Copy link
Member Author

ebyhr commented Jun 19, 2023

Rebased on master to resolve conflicts.

@findepi
Copy link
Member

findepi commented Jun 19, 2023

TestSemiTransactionalHiveMetastore.testParallelUpdateStatisticsOperations failure is related.

https://github.com/trinodb/trino/actions/runs/5308675667/jobs/9608455959?pr=17804

@ebyhr ebyhr force-pushed the ebi/hive-glue-s3-test branch from fececf7 to 8cb4bd1 Compare June 19, 2023 10:46
@ebyhr
Copy link
Member Author

ebyhr commented Jun 19, 2023

The failure should be fixed by the new push.

Comment on lines +256 to +293
@Test
public void testCreateTableWithDoubleSlash()
{
String schemaName = "test_create_table_with_double_slash_" + randomNameSuffix();
String schemaLocation = "s3://%s/%s/double_slash//test_schema".formatted(bucketName, schemaName);
String tableName = "test_create_table_with_double_slash_" + randomNameSuffix();
String schemaTableName = schemaName + "." + tableName;

// Previously, HiveLocationService replaced double slash with single slash
assertUpdate("CREATE SCHEMA " + schemaName + " WITH (location = '" + schemaLocation + "')");
String existingKey = "%s/double_slash/test_schema/%s".formatted(schemaName, tableName);
s3.putObject(bucketName, existingKey, "test content");

assertUpdate("CREATE TABLE " + schemaTableName + "(col_int int)");
assertUpdate("INSERT INTO " + schemaTableName + " VALUES 1", 1);
assertQuery("SELECT * FROM " + schemaTableName, "VALUES 1");
assertUpdate("DROP TABLE " + schemaTableName);
s3.deleteObject(bucketName, existingKey);
}

@Test
public void testCtasWithDoubleSlash()
{
String schemaName = "test_ctas_with_double_slash_" + randomNameSuffix();
String schemaLocation = "s3://%s/%s/double_slash//test_schema".formatted(bucketName, schemaName);
String tableName = "test_create_table_with_double_slash_" + randomNameSuffix();
String schemaTableName = schemaName + "." + tableName;

// Previously, HiveLocationService replaced double slash with single slash
assertUpdate("CREATE SCHEMA " + schemaName + " WITH (location = '" + schemaLocation + "')");
String existingKey = "%s/double_slash/test_schema/%s".formatted(schemaName, tableName);
s3.putObject(bucketName, existingKey, "test content");

assertUpdate("CREATE TABLE " + schemaTableName + " AS SELECT 1 AS col_int", 1);
assertQuery("SELECT * FROM " + schemaTableName, "VALUES 1");
assertUpdate("DROP TABLE " + schemaTableName);
s3.deleteObject(bucketName, existingKey);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these new tests needed?
When working on #17958 i was under impression that other existing tests (those with data provider) cover these cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are needed for changes in HiveLocationService. Existing tests don't cover this scenario.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testBasicOperationsWithProvidedSchemaLocation exercises HiveLocationService
is it missing more coverage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testBasicOperationsWithProvidedSchemaLocation doesn't cover the test when the normalized table location already exists in S3. This test puts an object on single-slash table location and then creates a table on double-slash location. HiveLocationService threw an incorrect already-exists error because the class internally replaced double-slash with single-slash.

It wasn't a silent correctness issue, so no need to fix it in this release, but it would be nice to add the test and improve it later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed.
let's have explanatory comment in the test. it seems at least for me understanding its importance wasn't easy

@findepi
Copy link
Member

findepi commented Aug 1, 2023

Fixes #17803

given that issue is closed, can you please remind me which portions of this PR we still want to go ahead with?

@ebyhr
Copy link
Member Author

ebyhr commented Oct 26, 2023

Superseded by #17958

@ebyhr ebyhr closed this Oct 26, 2023
@ebyhr ebyhr deleted the ebi/hive-glue-s3-test branch October 26, 2023 01:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector hive Hive connector iceberg Iceberg connector
Development

Successfully merging this pull request may close these issues.

Correctness issue with double slash location on Glue & S3 in Hive connector
5 participants