Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27661][SQL] Add SupportsNamespaces API #24560

Closed

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented May 8, 2019

What changes were proposed in this pull request?

This adds an interface for catalog plugins that exposes namespace operations:

  • listNamespaces
  • namespaceExists
  • loadNamespaceMetadata
  • createNamespace
  • alterNamespace
  • dropNamespace

How was this patch tested?

API only. Existing tests for regressions.

@rdblue
Copy link
Contributor Author

rdblue commented May 8, 2019

@mccheah, @jzhuge, @cloud-fan, and @marmbrus, it would be great to get feedback on this early prototype of the namespace API.

I think we may want to split this into two APIs depending on how catalogs handle namespaces. Not all catalogs will explicitly create namespaces. For example, you could think of paths as namespaces for tables in S3, but paths don't exist in S3 without objects so there is no "create" operation. Namespaces exist only if there are tables or objects in them. So we could have separate interfaces for the list and exists calls and another for namespaces that have metadata, like Hive.

@SparkQA
Copy link

SparkQA commented May 8, 2019

Test build #105264 has finished for PR 24560 at commit 3aa9ebd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class SetProperty implements NamespaceChange
  • final class RemoveProperty implements NamespaceChange

@cloud-fan
Copy link
Contributor

I feel separating creating and listing namespaces may make the API too complicated. I think it's OK if some implementation throws an exception in createNamespace.

@SparkQA
Copy link

SparkQA commented May 15, 2019

Test build #105410 has finished for PR 24560 at commit 3aa9ebd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class SetProperty implements NamespaceChange
  • final class RemoveProperty implements NamespaceChange

@rdblue
Copy link
Contributor Author

rdblue commented May 31, 2019

@cloud-fan, any more comments? Should I close this PR?

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Jul 25, 2019

Test build #108142 has finished for PR 24560 at commit 3aa9ebd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class SetProperty implements NamespaceChange
  • final class RemoveProperty implements NamespaceChange

@cloud-fan
Copy link
Contributor

generally looks good. Tow comments:

  1. Shall we add String[] defaultNamespace as we discussed in the community sync?
  2. Shall we add a dummy implementation and the unit tests?

@brkyvz
Copy link
Contributor

brkyvz commented Jul 25, 2019

Would love a dummy (test) implementation as an example as reference.

One thing we can do if we add String[] defaultNamespace is that when someone calls listNamespaces, we call the argumented listNamespaces with the defaultNamespace

@rdblue rdblue force-pushed the SPARK-27661-add-catalog-namespace-api branch from 3aa9ebd to ab5d21c Compare July 25, 2019 21:58
@rdblue
Copy link
Contributor Author

rdblue commented Jul 25, 2019

@cloud-fan & @brkyvz, I've added an implementation of the interfaces to the test catalog and added tests.

Copy link
Contributor

@brkyvz brkyvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the implementation. Have one question and one suggestion

}

/**
* List top-level namespaces from the catalog.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this list the top-level namespaces or within the defaultNamespace? If it's also just going to return a, why does it need to return an array of arrays instead of just an array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should list top-level namespaces. The default namespace is the one used as Spark's current namespace when the catalog is the default and should not change the behavior of the catalog.

It returns an array of arrays so you can pass each result into listNamespaces(Array[String]). Array[String] is the type that we consistently use for a namespace, so I think it is correct to return it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to my comment about listing top-level namespaces: if the default namespace changed the catalog's behavior, we would have to document exactly how it should change the behavior and have a way to configure it.

I like keeping this simple instead, so these methods do the same thing every time and Spark doesn't require the plugin to maintain a current namespace as state.

* @throws NamespaceAlreadyExistsException If the namespace already exists
* @throws UnsupportedOperationException If create is not a supported operation
*/
void createNamespaceMetadata(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

createNamespace instead of createNamespaceMetadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently changed this and I think it fits better with the semantics that we require. Namespaces are required to exist if they contain a table. We don't check whether the namespace exists on table creation and leave that to the implementation. So it makes more sense if this is create namespace metadata, because the namespace may already exist and we don't want to throw NamespaceAlreadyExists if there isn't already metadata. Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this test: https://github.com/apache/spark/pull/24560/files#diff-dbc34e865cad1087fa8c24afe5e5e1aeR761-R773

Because the table exists, the namespace already exists. It would be surprising to fail the create because it was implicitly created earlier, and it would also be surprising to allow a create when it already exists. Adding "metadata" to the method name makes this more obvious because you can create the namespace metadata.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, that's fair.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean sometimes createNamespaceMetadata is the same as adding properties to an existing namespace?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue If the namespace already exists (implicitly for that matter), wouldn't you use alterNamespace to add the metadata?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brkyvz, that also works (and has a test in this PR).

@mccheah, I would be surprised if tryCreateNamespace and getOrCreateNamespace updated the namespace metadata, which I think is part of the intent here.

Another option is ensureNamespaceExists without metadata, which would create or do nothing if it already exists. Then metadata could be added using alterNamespace. I don't think that is a very good option either because there is no clean create with metadata operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of "implementation is free to decide the behaviors". When end-users run SQL queries in Spark, Spark should behave deterministically and consistently, instead of telling end-users "you will see different behaviors if you connect to different catalogs".

IMO, CREATE TABLE should fail if the namespace doesn't exist, as it's intuitive and commonly seen in other databases. If a catalog stores tables in an object store, and is hard to detect namespace existence, then it should not support namespace, or try harder to figure out a workaround.

We can add an option/SQL syntax to allow CREATE TABLE to create namespaces automatically, but it should be the end-users who decide which behavior to pick, not the catalog. This is similar to mkdir. Users must add the -p option to create non-existing parent directories. It's pretty bad if we say "mkdir may or may not create parent directories, it depends on which Linux distribution you run it".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan But the whole goal of DSV2 is to let the implementation decide the (internal) behavior, but have Spark to enforce the semantics. If you truncate a table with Spark, all Spark needs is that the next time that table is queried that there is no data in the table. It doesn't care if the table (datasource) decided to physically delete all the data or just did a logical deletion.

  • CREATE TABLE should fail if the namespace doesn't exist => This makes sense to me. I'd say this is the behavior that users expect. I don't think this would be a blocker if people start to implement object store backed catalogs. Then they can run CREATE NAMESPACE a.b.c, and then try the CREATE TABLE again
  • If a namespace already exists, then IMHO CREATE NAMESPACE should throw an error and they should run ALTER NAMESPACE to add the metadata
  • Regarding implicit namespaces in object stores, I feel that this still means that someone created the namespace through some other means (imagine using a GUI to create a MySQL table instead of the shell), and alter namespace should be used to add the metadata.

wdyt @rdblue @mccheah

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brkyvz when I say "behavior", I don't mean the internal behavior that is invisible to end-users, I actually mean the semantic that is visible to end-users.

* @throws IllegalStateException If the namespace is not empty
* @throws UnsupportedOperationException If drop is not a supported operation
*/
boolean dropNamespace(String[] namespace) throws NoSuchNamespaceException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not have a boolean flag here called cascade that deletes all the Namespaces and Tables under this namespace if it is not empty?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can consider adding it later. We should decide whether we want to require implementations to build that, or whether that should be done by Spark.

Copy link
Contributor

@brkyvz brkyvz Jul 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. Spark can list all tables within all namespaces and delete them if cascade = true, and not have the catalog do it. I can see the pros of that. I think something like that opens you up to race conditions too though. This is fine for now, just wanted to bring it up. If there's a real need (ask) we can always add it later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally in favor of Spark taking care of operations like these. Otherwise it puts a lot of responsibility on catalog plugins and we want to avoid making those too complex. Required complexity in a plugin will lead to buggy implementations and unreliable behavior across plugins.

@SparkQA
Copy link

SparkQA commented Jul 26, 2019

Test build #108186 has finished for PR 24560 at commit ab5d21c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-27661-add-catalog-namespace-api branch from ab5d21c to 14de5b6 Compare August 3, 2019 23:03
@rdblue
Copy link
Contributor Author

rdblue commented Aug 3, 2019

@cloud-fan, @brkyvz, @mccheah, from the discussion, I think that the rename from createNamespace to createNamespaceMetadata was a bad idea. I think there is consensus around using createNamespace and failing calls if the namespace already exists because a table exists.

I've reverted the rename to createNamespaceMetadata and updated the tests. Now, the method is createNamespace and the tests validate that the namespace creation fails when a table exists.

Any objections to that behavior?

@SparkQA
Copy link

SparkQA commented Aug 3, 2019

Test build #108595 has finished for PR 24560 at commit 14de5b6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-27661-add-catalog-namespace-api branch from 14de5b6 to 44afb5c Compare August 4, 2019 21:21
@SparkQA
Copy link

SparkQA commented Aug 5, 2019

Test build #108631 has finished for PR 24560 at commit 44afb5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@brkyvz brkyvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @rdblue, merging to master!

@cloud-fan
Copy link
Contributor

A late LGTM

@rdblue
Copy link
Contributor Author

rdblue commented Aug 5, 2019

Thanks for reviewing, everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants