[SPARK-27661][SQL] Add SupportsNamespaces API #24560
Conversation
@mccheah, @jzhuge, @cloud-fan, and @marmbrus, it would be great to get feedback on this early prototype of the namespace API. I think we may want to split this into two APIs depending on how catalogs handle namespaces. Not all catalogs will explicitly create namespaces. For example, you could think of paths as namespaces for tables in S3, but paths don't exist in S3 without objects, so there is no "create" operation. Namespaces exist only if there are tables or objects in them. So we could have separate interfaces for the list and exists calls and another for namespaces that have metadata, like Hive.
Test build #105264 has finished for PR 24560 at commit
I feel separating creating and listing namespaces may make the API too complicated. I think it's OK if some implementations throw an exception in
Test build #105410 has finished for PR 24560 at commit
@cloud-fan, any more comments? Should I close this PR?
retest this please
Test build #108142 has finished for PR 24560 at commit
sql/catalyst/src/main/java/org/apache/spark/sql/catalog/v2/NamespaceChange.java
Generally looks good. Two comments:
Would love a dummy (test) implementation as an example for reference. One thing we can do if we add
Force-pushed from 3aa9ebd to ab5d21c
@cloud-fan & @brkyvz, I've added an implementation of the interfaces to the test catalog and added tests.
Thanks for the implementation. I have one question and one suggestion.
}

/**
 * List top-level namespaces from the catalog.
Should this list the top-level namespaces or those within the defaultNamespace? If it's also just going to return a String[], why does it need to return an array of arrays instead of just an array?
It should list top-level namespaces. The default namespace is the one used as Spark's current namespace when the catalog is the default, and it should not change the behavior of the catalog.
It returns an array of arrays so you can pass each result into listNamespaces(Array[String]). Array[String] is the type that we consistently use for a namespace, so I think it is correct to return it here.
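To illustrate why the array-of-arrays return type composes, here is a hypothetical in-memory sketch (the catalog contents, class name, and static methods are invented for this example): each String[] returned by the top-level call can be passed straight back into the parent-scoped overload.

```java
import java.util.Arrays;

// Hypothetical sketch: a fake catalog containing namespaces a, a.b, and a.c,
// showing how the String[][] result of one listNamespaces call feeds the other.
public class ListNamespacesSketch {
    // Top-level listing returns only the first-level namespaces.
    static String[][] listNamespaces() {
        return new String[][] { {"a"} };
    }

    // Parent-scoped listing returns the namespaces directly under `parent`.
    static String[][] listNamespaces(String[] parent) {
        if (Arrays.equals(parent, new String[] {"a"})) {
            return new String[][] { {"a", "b"}, {"a", "c"} };
        }
        return new String[0][];
    }

    public static void main(String[] args) {
        // Each top-level result is itself a valid argument for the overload.
        for (String[] ns : listNamespaces()) {
            for (String[] child : listNamespaces(ns)) {
                System.out.println(String.join(".", child)); // a.b, then a.c
            }
        }
    }
}
```

With a single flat array, the result of the top-level call could not be fed back in to walk deeper levels.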
Adding to my comment about listing top-level namespaces: if the default namespace changed the catalog's behavior, we would have to document exactly how it should change the behavior and have a way to configure it.
I like keeping this simple instead, so these methods do the same thing every time and Spark doesn't require the plugin to maintain a current namespace as state.
 * @throws NamespaceAlreadyExistsException If the namespace already exists
 * @throws UnsupportedOperationException If create is not a supported operation
 */
void createNamespaceMetadata(
createNamespace instead of createNamespaceMetadata
I recently changed this and I think it fits better with the semantics that we require. Namespaces are required to exist if they contain a table. We don't check whether the namespace exists on table creation and leave that to the implementation. So it makes more sense if this is create namespace metadata, because the namespace may already exist and we don't want to throw NamespaceAlreadyExists if there isn't already metadata. Does that make sense?
See this test: https://github.com/apache/spark/pull/24560/files#diff-dbc34e865cad1087fa8c24afe5e5e1aeR761-R773
Because the table exists, the namespace already exists. It would be surprising to fail the create because it was implicitly created earlier, and it would also be surprising to allow a create when it already exists. Adding "metadata" to the method name makes this more obvious because you can create the namespace metadata.
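A tiny in-memory model of the semantics being argued for here (the class, fields, and exception choice are invented for illustration): the namespace exists implicitly because a table lives in it, yet creating its metadata succeeds; only a second metadata create fails.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: namespaces can exist implicitly (they contain a table)
// without any stored metadata. createNamespaceMetadata fails only when
// metadata has already been recorded, not when the namespace merely exists.
public class CreateNamespaceMetadataSketch {
    // Namespaces that exist implicitly because they hold tables.
    static Set<String> implicitNamespaces = new HashSet<>(Set.of("db1"));
    // Namespaces with explicitly recorded metadata.
    static Map<String, Map<String, String>> metadata = new HashMap<>();

    static void createNamespaceMetadata(String ns, Map<String, String> props) {
        if (metadata.containsKey(ns)) {
            throw new IllegalStateException("NamespaceAlreadyExists: " + ns);
        }
        metadata.put(ns, new HashMap<>(props));
    }

    public static void main(String[] args) {
        // Succeeds even though db1 already exists implicitly (it has a table).
        createNamespaceMetadata("db1", Map.of("owner", "spark"));
        try {
            createNamespaceMetadata("db1", Map.of());
        } catch (IllegalStateException e) {
            System.out.println("second create fails: metadata already present");
        }
    }
}
```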
got it, that's fair.
Do you mean sometimes createNamespaceMetadata is the same as adding properties to an existing namespace?
@rdblue If the namespace already exists (implicitly for that matter), wouldn't you use alterNamespace to add the metadata?
@brkyvz, that also works (and has a test in this PR).
@mccheah, I would be surprised if tryCreateNamespace and getOrCreateNamespace updated the namespace metadata, which I think is part of the intent here.
Another option is ensureNamespaceExists without metadata, which would create the namespace or do nothing if it already exists. Then metadata could be added using alterNamespace. I don't think that is a very good option either, because there is no clean create-with-metadata operation.
I'm not a fan of "implementation is free to decide the behaviors". When end-users run SQL queries in Spark, Spark should behave deterministically and consistently, instead of telling end-users "you will see different behaviors if you connect to different catalogs".
IMO, CREATE TABLE should fail if the namespace doesn't exist, as that's intuitive and commonly seen in other databases. If a catalog stores tables in an object store and it's hard to detect namespace existence, then it should not support namespaces, or it should try harder to figure out a workaround.
We can add an option or SQL syntax to allow CREATE TABLE to create namespaces automatically, but it should be the end users who decide which behavior to pick, not the catalog. This is similar to mkdir: users must add the -p option to create non-existing parent directories. It would be pretty bad if we said "mkdir may or may not create parent directories, depending on which Linux distribution you run it on".
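The same contrast can be reproduced with java.nio (the class name and temp paths here are illustrative): Files.createDirectory behaves like plain mkdir and fails when the parent is missing, while Files.createDirectories is the explicit mkdir -p opt-in.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the mkdir vs. mkdir -p analogy: strict creation fails when the
// parent is missing; creating parents happens only when the caller asks.
public class MkdirAnalogy {
    static boolean strictCreate(Path dir) {
        try {
            Files.createDirectory(dir);   // like plain mkdir: no parent, no dice
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("ns-demo");
        Path nested = base.resolve("a").resolve("b");
        System.out.println(strictCreate(nested));        // false: parent "a" missing
        Files.createDirectories(nested);                 // explicit opt-in, like -p
        System.out.println(Files.isDirectory(nested));   // true
    }
}
```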
@cloud-fan But the whole goal of DSv2 is to let the implementation decide the (internal) behavior, while Spark enforces the semantics. If you truncate a table with Spark, all Spark needs is that the next time the table is queried, there is no data in it. It doesn't care whether the table (data source) physically deleted all the data or just did a logical deletion.
- CREATE TABLE should fail if the namespace doesn't exist: this makes sense to me. I'd say this is the behavior users expect. I don't think it would be a blocker if people start to implement object-store-backed catalogs; they can run CREATE NAMESPACE a.b.c and then try the CREATE TABLE again.
- If a namespace already exists, then IMHO CREATE NAMESPACE should throw an error, and users should run ALTER NAMESPACE to add the metadata.
- Regarding implicit namespaces in object stores, I feel this still means that someone created the namespace through some other means (imagine using a GUI to create a MySQL table instead of the shell), and ALTER NAMESPACE should be used to add the metadata.
@brkyvz when I say "behavior", I don't mean internal behavior that is invisible to end users; I mean the semantics that are visible to end users.
 * @throws IllegalStateException If the namespace is not empty
 * @throws UnsupportedOperationException If drop is not a supported operation
 */
boolean dropNamespace(String[] namespace) throws NoSuchNamespaceException;
Can we not have a boolean flag here called cascade that deletes all the namespaces and tables under this namespace if it is not empty?
+1
We can consider adding it later. We should decide whether we want to require implementations to build that, or whether that should be done by Spark.
I see what you mean. Spark can list all tables within all namespaces and delete them if cascade = true, rather than having the catalog do it. I can see the pros of that, though I think something like that opens you up to race conditions too. This is fine for now; I just wanted to bring it up. If there's a real need (ask), we can always add it later.
I'm generally in favor of Spark taking care of operations like these. Otherwise it puts a lot of responsibility on catalog plugins and we want to avoid making those too complex. Required complexity in a plugin will lead to buggy implementations and unreliable behavior across plugins.
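A sketch of that Spark-side approach (the maps stand in for catalog calls, and all names are invented for illustration): Spark walks the namespace tree depth-first using only non-cascading list and drop operations, so the plugin never has to implement cascade itself.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Spark-side cascade: recursively drop child namespaces, then
// tables, then the now-empty namespace, using only non-cascading primitives.
public class CascadeDropSketch {
    // Fake catalog state: parent namespace -> child namespaces, ns -> tables.
    static Map<String, List<String>> children = new HashMap<>(Map.of(
        "a", List.of("a.b"), "a.b", List.of()));
    static Map<String, List<String>> tables = new HashMap<>(Map.of(
        "a", List.of("t1"), "a.b", List.of("t2")));

    static void dropCascade(String ns) {
        for (String child : children.getOrDefault(ns, List.of())) {
            dropCascade(child);          // depth-first into child namespaces
        }
        tables.remove(ns);               // drop the tables in this namespace
        children.remove(ns);             // then drop the now-empty namespace
        System.out.println("dropped " + ns);
    }

    public static void main(String[] args) {
        dropCascade("a");
        System.out.println(children.isEmpty() && tables.isEmpty()); // true
    }
}
```

The race-condition concern above is visible here: a table created under "a" between the listing and the final drop would be missed.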
Test build #108186 has finished for PR 24560 at commit
Force-pushed from ab5d21c to 14de5b6
@cloud-fan, @brkyvz, @mccheah, from the discussion, I think the rename from createNamespace to createNamespaceMetadata was not the right direction. I've reverted the rename, so createNamespace fails if the namespace already exists. Any objections to that behavior?
Test build #108595 has finished for PR 24560 at commit
Force-pushed from 14de5b6 to 44afb5c
Test build #108631 has finished for PR 24560 at commit
LGTM. Thanks @rdblue, merging to master!
A late LGTM
Thanks for reviewing, everyone!
What changes were proposed in this pull request?
This adds an interface for catalog plugins that exposes namespace operations:
- listNamespaces
- namespaceExists
- loadNamespaceMetadata
- createNamespace
- alterNamespace
- dropNamespace
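A minimal in-memory sketch of how these operations fit together (signatures are simplified assumptions, not the exact Spark interface): a namespace is a String[] path mapped to a property map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical in-memory model of the namespace operations listed above.
public class NamespaceApiSketch {
    static Map<List<String>, Map<String, String>> namespaces = new HashMap<>();

    static boolean namespaceExists(String[] ns) {
        return namespaces.containsKey(Arrays.asList(ns));
    }

    static void createNamespace(String[] ns, Map<String, String> props) {
        if (namespaceExists(ns)) {
            throw new IllegalStateException("NamespaceAlreadyExists");
        }
        namespaces.put(Arrays.asList(ns), new HashMap<>(props));
    }

    static Map<String, String> loadNamespaceMetadata(String[] ns) {
        Map<String, String> meta = namespaces.get(Arrays.asList(ns));
        if (meta == null) throw new NoSuchElementException("NoSuchNamespace");
        return meta;
    }

    static boolean dropNamespace(String[] ns) {
        return namespaces.remove(Arrays.asList(ns)) != null;
    }

    public static void main(String[] args) {
        String[] ns = {"db", "schema"};
        createNamespace(ns, Map.of("location", "/tmp/db"));
        System.out.println(namespaceExists(ns));                    // true
        System.out.println(loadNamespaceMetadata(ns).get("location"));
        System.out.println(dropNamespace(ns));                      // true
        System.out.println(namespaceExists(ns));                    // false
    }
}
```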
How was this patch tested?
API only. Existing tests for regressions.