[SPARK-27661][SQL] Add SupportsNamespaces API #24560
Conversation
@mccheah, @jzhuge, @cloud-fan, and @marmbrus, it would be great to get feedback on this early prototype of the namespace API. I think we may want to split this into two APIs depending on how catalogs handle namespaces. Not all catalogs will explicitly create namespaces. For example, you could think of paths as namespaces for tables in S3, but paths don't exist in S3 without objects, so there is no "create" operation. Namespaces exist only if there are tables or objects in them. So we could have separate interfaces for the list and exists calls and another for namespaces that have metadata, like Hive.
Test build #105264 has finished for PR 24560 at commit
I feel separating creating and listing namespaces may make the API too complicated. I think it's OK if some implementations throw an exception in
Test build #105410 has finished for PR 24560 at commit
@cloud-fan, any more comments? Should I close this PR?
retest this please
Test build #108142 has finished for PR 24560 at commit
sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/NamespaceChange.java
Generally looks good. Two comments:
Would love a dummy (test) implementation as an example for reference. One thing we can do if we add
Force-pushed from 3aa9ebd to ab5d21c
@cloud-fan & @brkyvz, I've added an implementation of the interfaces to the test catalog and added tests.
Thanks for the implementation. I have one question and one suggestion.
}

/**
 * List top-level namespaces from the catalog.
Should this list the top-level namespaces or those within the defaultNamespace? If it's also just going to return a String[], why does it need to return an array of arrays instead of just an array?
It should list top-level namespaces. The default namespace is the one used as Spark's current namespace when the catalog is the default, and it should not change the behavior of the catalog.
It returns an array of arrays so you can pass each result into listNamespaces(Array[String]). Array[String] is the type that we consistently use for a namespace, so I think it is correct to return it here.
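To illustrate why the array-of-arrays return type composes, here is a hypothetical in-memory sketch (the catalog contents, class name, and static methods are invented for this example): each String[] returned by the top-level call can be passed straight back into the parent-scoped overload.

```java
import java.util.Arrays;

// Hypothetical sketch: a fake catalog containing namespaces a, a.b, and a.c,
// showing how the String[][] result of one listNamespaces call feeds the other.
public class ListNamespacesSketch {
    // Top-level listing returns only the first-level namespaces.
    static String[][] listNamespaces() {
        return new String[][] { {"a"} };
    }

    // Parent-scoped listing returns the namespaces directly under `parent`.
    static String[][] listNamespaces(String[] parent) {
        if (Arrays.equals(parent, new String[] {"a"})) {
            return new String[][] { {"a", "b"}, {"a", "c"} };
        }
        return new String[0][];
    }

    public static void main(String[] args) {
        // Each top-level result is itself a valid argument for the overload.
        for (String[] ns : listNamespaces()) {
            for (String[] child : listNamespaces(ns)) {
                System.out.println(String.join(".", child)); // a.b, then a.c
            }
        }
    }
}
```

With a single flat array, the result of the top-level call could not be fed back in to walk deeper levels.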
Adding to my comment about listing top-level namespaces: if the default namespace changed the catalog's behavior, we would have to document exactly how it should change the behavior and have a way to configure it.
I like keeping this simple instead, so these methods do the same thing every time and Spark doesn't require the plugin to maintain a current namespace as state.
 * @throws NamespaceAlreadyExistsException If the namespace already exists
 * @throws UnsupportedOperationException If create is not a supported operation
 */
void createNamespaceMetadata(
createNamespace instead of createNamespaceMetadata
I recently changed this and I think it fits better with the semantics that we require. Namespaces are required to exist if they contain a table. We don't check whether the namespace exists on table creation and leave that to the implementation. So it makes more sense if this is create namespace metadata, because the namespace may already exist and we don't want to throw NamespaceAlreadyExists if there isn't already metadata. Does that make sense?
See this test: https://github.com/apache/spark/pull/24560/files#diff-dbc34e865cad1087fa8c24afe5e5e1aeR761-R773
Because the table exists, the namespace already exists. It would be surprising to fail the create because it was implicitly created earlier, and it would also be surprising to allow a create when it already exists. Adding "metadata" to the method name makes this more obvious because you can create the namespace metadata.
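A tiny in-memory model of the semantics being argued for here (the class, fields, and exception choice are invented for illustration): the namespace exists implicitly because a table lives in it, yet creating its metadata succeeds; only a second metadata create fails.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: namespaces can exist implicitly (they contain a table)
// without any stored metadata. createNamespaceMetadata fails only when
// metadata has already been recorded, not when the namespace merely exists.
public class CreateNamespaceMetadataSketch {
    // Namespaces that exist implicitly because they hold tables.
    static Set<String> implicitNamespaces = new HashSet<>(Set.of("db1"));
    // Namespaces with explicitly recorded metadata.
    static Map<String, Map<String, String>> metadata = new HashMap<>();

    static void createNamespaceMetadata(String ns, Map<String, String> props) {
        if (metadata.containsKey(ns)) {
            throw new IllegalStateException("NamespaceAlreadyExists: " + ns);
        }
        metadata.put(ns, new HashMap<>(props));
    }

    public static void main(String[] args) {
        // Succeeds even though db1 already exists implicitly (it has a table).
        createNamespaceMetadata("db1", Map.of("owner", "spark"));
        try {
            createNamespaceMetadata("db1", Map.of());
        } catch (IllegalStateException e) {
            System.out.println("second create fails: metadata already present");
        }
    }
}
```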
got it, that's fair.
Do you mean sometimes createNamespaceMetadata is the same as adding properties to an existing namespace?
@rdblue If the namespace already exists (implicitly for that matter), wouldn't you use alterNamespace to add the metadata?
@brkyvz, that also works (and has a test in this PR).
@mccheah, I would be surprised if tryCreateNamespace and getOrCreateNamespace updated the namespace metadata, which I think is part of the intent here.
Another option is ensureNamespaceExists without metadata, which would create the namespace or do nothing if it already exists. Then metadata could be added using alterNamespace. I don't think that is a very good option either, because there is no clean create-with-metadata operation.
I'm not a fan of "implementation is free to decide the behaviors". When end-users run SQL queries in Spark, Spark should behave deterministically and consistently, instead of telling end-users "you will see different behaviors if you connect to different catalogs".
IMO, CREATE TABLE should fail if the namespace doesn't exist, as that's intuitive and commonly seen in other databases. If a catalog stores tables in an object store and it's hard to detect namespace existence, then it should not support namespaces, or it should try harder to figure out a workaround.
We can add an option or SQL syntax to allow CREATE TABLE to create namespaces automatically, but it should be the end users who decide which behavior to pick, not the catalog. This is similar to mkdir: users must add the -p option to create non-existing parent directories. It would be pretty bad if we said "mkdir may or may not create parent directories, depending on which Linux distribution you run it on".
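The same contrast can be reproduced with java.nio (the class name and temp paths here are illustrative): Files.createDirectory behaves like plain mkdir and fails when the parent is missing, while Files.createDirectories is the explicit mkdir -p opt-in.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the mkdir vs. mkdir -p analogy: strict creation fails when the
// parent is missing; creating parents happens only when the caller asks.
public class MkdirAnalogy {
    static boolean strictCreate(Path dir) {
        try {
            Files.createDirectory(dir);   // like plain mkdir: no parent, no dice
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("ns-demo");
        Path nested = base.resolve("a").resolve("b");
        System.out.println(strictCreate(nested));        // false: parent "a" missing
        Files.createDirectories(nested);                 // explicit opt-in, like -p
        System.out.println(Files.isDirectory(nested));   // true
    }
}
```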
@cloud-fan But the whole goal of DSv2 is to let the implementation decide the (internal) behavior, while Spark enforces the semantics. If you truncate a table with Spark, all Spark needs is that the next time the table is queried, there is no data in it. It doesn't care whether the table (data source) physically deleted all the data or just did a logical deletion.
- CREATE TABLE should fail if the namespace doesn't exist: this makes sense to me. I'd say this is the behavior users expect. I don't think it would be a blocker if people start to implement object-store-backed catalogs; they can run CREATE NAMESPACE a.b.c and then try the CREATE TABLE again.
- If a namespace already exists, then IMHO CREATE NAMESPACE should throw an error, and users should run ALTER NAMESPACE to add the metadata.
- Regarding implicit namespaces in object stores, I feel this still means that someone created the namespace through some other means (imagine using a GUI to create a MySQL table instead of the shell), and ALTER NAMESPACE should be used to add the metadata.
@brkyvz when I say "behavior", I don't mean internal behavior that is invisible to end users; I mean the semantics that are visible to end users.
 * @throws IllegalStateException If the namespace is not empty
 * @throws UnsupportedOperationException If drop is not a supported operation
 */
boolean dropNamespace(String[] namespace) throws NoSuchNamespaceException;
Can we not have a boolean flag here called cascade that deletes all the namespaces and tables under this namespace if it is not empty?
+1
We can consider adding it later. We should decide whether we want to require implementations to build that, or whether that should be done by Spark.
I see what you mean. Spark can list all tables within all namespaces and delete them if cascade = true, rather than having the catalog do it. I can see the pros of that, though I think something like that opens you up to race conditions too. This is fine for now; I just wanted to bring it up. If there's a real need (ask), we can always add it later.
I'm generally in favor of Spark taking care of operations like these. Otherwise it puts a lot of responsibility on catalog plugins and we want to avoid making those too complex. Required complexity in a plugin will lead to buggy implementations and unreliable behavior across plugins.
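A sketch of that Spark-side approach (the maps stand in for catalog calls, and all names are invented for illustration): Spark walks the namespace tree depth-first using only non-cascading list and drop operations, so the plugin never has to implement cascade itself.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Spark-side cascade: recursively drop child namespaces, then
// tables, then the now-empty namespace, using only non-cascading primitives.
public class CascadeDropSketch {
    // Fake catalog state: parent namespace -> child namespaces, ns -> tables.
    static Map<String, List<String>> children = new HashMap<>(Map.of(
        "a", List.of("a.b"), "a.b", List.of()));
    static Map<String, List<String>> tables = new HashMap<>(Map.of(
        "a", List.of("t1"), "a.b", List.of("t2")));

    static void dropCascade(String ns) {
        for (String child : children.getOrDefault(ns, List.of())) {
            dropCascade(child);          // depth-first into child namespaces
        }
        tables.remove(ns);               // drop the tables in this namespace
        children.remove(ns);             // then drop the now-empty namespace
        System.out.println("dropped " + ns);
    }

    public static void main(String[] args) {
        dropCascade("a");
        System.out.println(children.isEmpty() && tables.isEmpty()); // true
    }
}
```

The race-condition concern above is visible here: a table created under "a" between the listing and the final drop would be missed.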
Test build #108186 has finished for PR 24560 at commit
Force-pushed from ab5d21c to 14de5b6
@cloud-fan, @brkyvz, @mccheah, from the discussion, I think the rename from createNamespace to createNamespaceMetadata was not the right direction. I've reverted the rename, so createNamespace fails if the namespace already exists. Any objections to that behavior?
Test build #108595 has finished for PR 24560 at commit
Force-pushed from 14de5b6 to 44afb5c
Test build #108631 has finished for PR 24560 at commit
LGTM. Thanks @rdblue, merging to master!
A late LGTM
Thanks for reviewing, everyone!
What changes were proposed in this pull request?
This adds an interface for catalog plugins that exposes namespace operations:
- listNamespaces
- namespaceExists
- loadNamespaceMetadata
- createNamespace
- alterNamespace
- dropNamespace
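A minimal in-memory sketch of how these operations fit together (signatures are simplified assumptions, not the exact Spark interface): a namespace is a String[] path mapped to a property map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical in-memory model of the namespace operations listed above.
public class NamespaceApiSketch {
    static Map<List<String>, Map<String, String>> namespaces = new HashMap<>();

    static boolean namespaceExists(String[] ns) {
        return namespaces.containsKey(Arrays.asList(ns));
    }

    static void createNamespace(String[] ns, Map<String, String> props) {
        if (namespaceExists(ns)) {
            throw new IllegalStateException("NamespaceAlreadyExists");
        }
        namespaces.put(Arrays.asList(ns), new HashMap<>(props));
    }

    static Map<String, String> loadNamespaceMetadata(String[] ns) {
        Map<String, String> meta = namespaces.get(Arrays.asList(ns));
        if (meta == null) throw new NoSuchElementException("NoSuchNamespace");
        return meta;
    }

    static boolean dropNamespace(String[] ns) {
        return namespaces.remove(Arrays.asList(ns)) != null;
    }

    public static void main(String[] args) {
        String[] ns = {"db", "schema"};
        createNamespace(ns, Map.of("location", "/tmp/db"));
        System.out.println(namespaceExists(ns));                    // true
        System.out.println(loadNamespaceMetadata(ns).get("location"));
        System.out.println(dropNamespace(ns));                      // true
        System.out.println(namespaceExists(ns));                    // false
    }
}
```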
How was this patch tested?
API only. Existing tests for regressions.