[SPARK-24252][SQL] Add TableCatalog API #24246
Conversation
@mccheah, @cloud-fan, here's the TableCatalog API.
Test build #104101 has finished for PR 24246 at commit
Force-pushed from 1773230 to 1c2b2cd
Test build #104175 has finished for PR 24246 at commit
@cloud-fan, please also start reviewing the last commit on this PR so that we can get it in fairly quickly after #24117 is in.
/**
* Refresh table metadata for {@link Identifier identifier}.
* <p>
* If the table is already loaded or cached, drop cached data and reload it. If the table is not
Not sure if this is in the scope of this PR, but I was wondering if we could make the metadata/schema for a table uncacheable by default. For example, in the current V1 code, if I reference a relation in the Hive catalog, the session will cache the metadata associated with the table. This means that I can't update a user's view of the in-catalog table without a manual user refresh.
I was hoping we can signal via the catalog api that a table's metadata is dynamic and should not be cached.
Right now, we've implemented caching in our catalogs. Spark always calls load and uses the version that is returned instead of caching. Our catalog then uses a weak reference in the cache so that the version is periodically updated when no more queries are running against that version.
I'm not sure whether Spark should have some caching above the catalog or if the catalog should be responsible for this. Let's plan on talking about this at the next sync.
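For illustration, here is a minimal sketch of the weak-reference caching described above. It is not the implementation referenced in the comment; the wrapper class is hypothetical, and the Identifier/TableCatalog/Table package locations are assumed to match this PR's diff (they differ in later Spark releases).

```scala
// Illustrative only: a catalog-side wrapper that caches loaded Table versions by
// weak reference, so an old version can be collected and re-loaded once no
// running query still holds it.
import com.google.common.cache.{Cache, CacheBuilder}
import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog}
import org.apache.spark.sql.sources.v2.Table

class WeaklyCachedCatalog(delegate: TableCatalog) {
  private val cache: Cache[Identifier, Table] = CacheBuilder.newBuilder()
    .weakValues() // entries become collectable when no query references the Table
    .build[Identifier, Table]()

  def loadTable(ident: Identifier): Table =
    cache.get(ident, () => delegate.loadTable(ident)) // load from the delegate on a miss

  def invalidateTable(ident: Identifier): Unit =
    cache.invalidate(ident) // drop the cached version; the next load fetches fresh metadata
}
```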
sgtm
#24117 merged, this should be rebased
Force-pushed from e589ebb to 9aebe0f
@@ -47,23 +47,23 @@
* <p>
* Truncating a table removes all existing rows.
* <p>
* See {@link org.apache.spark.sql.sources.v2.writer.SupportsTruncate}.
We should consider moving the v2 source API into the catalyst module as well. That would fix these links, but it is a much larger issue. I'll plan on bringing it up at the next v2 sync.
+1 to move everything to catalyst
@@ -96,7 +96,7 @@ public Transform bucket(int numBuckets, String... columns) {
* @param column an input column
* @return a logical identity transform with name "identity"
*/
public Transform identity(String column) {
public static Transform identity(String column) {
Fixing these methods that should have been static is required here: https://github.com/apache/spark/pull/24246/files#diff-d921fee79b4e63ab684dec9d10afeb44R63.
@mccheah and @cloud-fan, this is now rebased on master and is ready for review. Thanks!
Test build #104526 has finished for PR 24246 at commit
Test build #104532 has finished for PR 24246 at commit
More stuff incoming
Table createTable(Identifier ident,
    StructType schema,
    Transform[] partitions,
    Map<String, String> properties) throws TableAlreadyExistsException;
If we move `properties` before `partitions`, we can make `partitions` a `Transform...` vararg. Might be something to consider, but I don't feel too strongly about it.
This is called from Spark, so I don't think it is valuable to change it to `Transform...`. Either way, the implementation gets a `Transform[]`.
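For reference, a small sketch of the Spark-side call under discussion, assuming the createTable signature shown in this diff and the package locations used by this PR (both may differ in released Spark). Since Spark already holds the transforms as an array, varargs would not simplify the call site much.

```scala
// Illustrative call site: Spark builds the Transform[] itself (empty here for
// brevity) and passes it along with the schema and table properties.
import java.util.Collections
import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog}
import org.apache.spark.sql.catalog.v2.expressions.Transform
import org.apache.spark.sql.types.{LongType, StringType, StructType}

def createExample(catalog: TableCatalog): Unit = {
  val ident = Identifier.of(Array("db"), "events")
  val schema = new StructType().add("id", LongType).add("region", StringType)
  val partitions = Array.empty[Transform] // produced by the parser/analyzer in practice
  catalog.createTable(ident, schema, partitions, Collections.emptyMap[String, String]())
}
```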
* @param newDataType the new data type
* @return a TableChange for the update
*/
static TableChange updateColumn(String[] fieldNames, DataType newDataType, boolean isNullable) {
For all of these methods, do we want to provide variants where a single field name can be updated without having to pass in an array? E.g. static TableChange updateColumn(String fieldName, DataType newDataType, boolean isNullable) {...
I think it's better for UX, but we will have too many overloads. How about we create a class `ColumnBuilder`?

case class Column(nameParts: Array[String], dt: DataType, nullable: Boolean, comment: Option[String])

class ColumnBuilder {
  def withName(name: String)
  def withNameParts(nameParts: String)
  def withNullable(nullable: Boolean)
  ...
}

And in `TableChange` we just need to declare `Column` in the methods.
@mccheah, I don't see much value in adding overloads here. This is called from Spark to create changes to pass to an implementation, so I think it would be better to have more consistency than to have a slightly friendlier call in a few places.

@cloud-fan, these are factory methods for specific changes that are called by Spark. The classes themselves encapsulate the data for the change, so it would be redundant to add a `Column` class: `UpdateColumn(Column(fieldNames, newDataType, isNullable))` instead of `UpdateColumn(fieldNames, newDataType, isNullable)`.

Also, it isn't correct to pass extra information that wasn't in the `ALTER TABLE` statement (or other APIs) into the source. See https://github.com/apache/spark/pull/24246/files#r276865657
@@ -183,17 +149,10 @@ private[sql] final case class LiteralValue[T](value: T, dataType: DataType) exte
}

private[sql] final case class FieldReference(parts: Seq[String]) extends NamedReference {
  import org.apache.spark.sql.catalog.v2.CatalogV2Implicits._
Do we only need `CatalogV2Implicits` for `quoted`?
Yes.
I don't think we should import everything from `CatalogV2Implicits` then. Importing implicits can invoke unexpected calls to said implicit methods.
Can you be more specific about what the risk is?
My understanding of Scala implicits is that the implicit conversions or methods are exactly that - implicit - so they can be called without specifically invoking the method names. So a common philosophy around implicits is to only import them if such conversions are definitely required. Otherwise we can create side effects of calling the implicit methods when we don't mean to.
This makes all of the implicit conversions available, but does not apply them unless it is necessary. For these implicit classes, the only time it is necessary is when one of the methods, like `quoted`, is called. (The compiler would also implicitly convert when an object is coerced to the implicit type, but there are no methods here that require an `IdentifierHelper`.)

It is fine to import unused implicits. The cases where the compiler will use them are predictable, and they can be used to provide a richer Scala API for classes that are limited because they are part of the public API.

I'm also wary of using Scala's `implicit`, but the ability to add methods to existing classes is one of the cases where I find them reliable and useful.
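As a concrete illustration of the point above, here is a minimal, self-contained sketch of the pattern. It is a stand-in, not the actual CatalogV2Implicits code: importing the object only makes the conversion available, and it is applied only when a method such as `quoted` is invoked on a value of the wrapped type.

```scala
// Illustrative stand-in for the implicit-class pattern discussed above.
object ExampleImplicits {
  implicit class MultipartIdentifierHelper(parts: Seq[String]) {
    // Quote each name part only when it needs quoting, then join with dots.
    def quoted: String = parts.map(quoteIfNeeded).mkString(".")

    private def quoteIfNeeded(part: String): String =
      if (part.matches("[a-zA-Z_][a-zA-Z0-9_]*")) part
      else s"`${part.replace("`", "``")}`"
  }
}

object ImplicitsDemo extends App {
  import ExampleImplicits._
  // The conversion is applied only here, where .quoted is called on a Seq[String].
  println(Seq("prod", "db", "my table").quoted) // prod.db.`my table`
}
```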
I still think that here we should only be importing the specific method that's used - maybe move `quoted` to a helper class. If we really need the implicits functionality, we can import it later on. Can we do that?
I can import just the required implicit class, but I don't think there is any risk to using implicits like this.
I've updated this.
Shall we open a PR to move DS v2 classes from sql/core to sql/catalyst first? This PR is easier to review without those import changes.
* @return the table's metadata
* @throws NoSuchTableException If the table doesn't exist.
*/
Table loadTable(Identifier ident) throws NoSuchTableException;
In `ExternalCatalog`, we have `getTable` and `loadTable`. It's a little weird to see `loadTable` here, which is actually the same as `ExternalCatalog.getTable`.
Also, the word "load" implies that the table will be loaded somewhere. Will we cache the table metadata somewhere inside Spark?
The table is loaded so Spark can use it. This usage is not unusual for the word "load".

The name `getTable` is not appropriate here because, in the Java world, "get" signals that the operation is merely returning something that already exists, like an instance field. This is expected to make a remote call to fetch information from another store, so it is not an appropriate use of "get".

I think that the fact that this takes an identifier and returns a `Table` makes the operation clear. If you have a better verb for this (that isn't "get"), I'd consider it. I can't think of anything that works as well as `loadTable`.
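A small usage sketch of the semantics described above, using the signatures shown in this diff (the package locations and helper name are assumptions): loadTable performs a lookup that may involve a remote call and can fail with NoSuchTableException, which is why "load" rather than "get" fits.

```scala
// Illustrative only: loading a table and handling the missing-table case.
import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog}
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException

def describeTable(catalog: TableCatalog, ident: Identifier): String =
  try {
    val table = catalog.loadTable(ident) // may fetch metadata from a remote store
    s"${table.name}: ${table.schema.simpleString}"
  } catch {
    case _: NoSuchTableException => s"$ident does not exist"
  }
```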
Will we support the Hive LOAD TABLE in the future?
@cloud-fan, `LOAD TABLE` is a SQL operation, not a catalog operation.

`LOAD TABLE` can be implemented as a write to some `Table`. If we wanted to support a path to load compatible data files directly into a table as an optimization, then we would need to build an API for that. Such an API would be part of a source API and not a catalog API.
* @param newComment the new comment
* @return a TableChange for the update
*/
static TableChange updateComment(String[] fieldNames, String newComment) {
Why do we need a new entity for updating the column comment? Can we do it via `UpdateColumn`?
We don't want to pass data from the source back into the source. For example, if a user runs `ALTER TABLE t ALTER COLUMN c COMMENT 'some doc'`, why would we pass the type of column `c` back to the source? If Spark passed the type, then the source would need to detect what changed (the comment) and what didn't (the type) even though Spark already had that information.

It is also not correct to pass the type in when the user only requested a change to the comment. If the table has changed since the source loaded it, passing the type could tell the source to revert another change. Regardless of whether this is likely to happen, it would be a correctness bug.
Sounds reasonable. Shall we do it for all the column fields, like changing the column data type only, or changing column nullability only? Then we don't need the general UpdateColumn.
@cloud-fan, no. The type of a field is both the `DataType` and nullability in SQL, which is why types are expressed as `data_type [NOT NULL]`.

For example:

ALTER TABLE t ALTER COLUMN c COMMENT 'doc'
ALTER TABLE t ALTER COLUMN i TYPE bigint
ALTER TABLE t ALTER COLUMN f TYPE double NOT NULL
ALTER TABLE t ALTER COLUMN d TYPE decimal(18,2) NOT NULL COMMENT 'doc'
To confirm I understand your statement, are you saying:

- data type and nullability together mean the field type; although users can alter one of them via a SQL command, the catalog should always update them together.
- comment is a separate property and can be updated separately.

If this is true, I think we should have 3 methods: `updateType`, `updateComment`, and `updateColumn`.
We want to keep it simple and avoid creating a class to pass every combination of changes. If a user makes multiple changes at the same time, they are all passed to `alterTable`. That avoids the need to create combinations of these changes, like an `updateColumn` that combines updating the comment and updating the type.
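To make this concrete, here is a sketch of how the earlier ALTER TABLE examples would map onto changes, using the factory signatures visible in this diff. It assumes alterTable takes an Identifier plus varargs TableChange, and the package locations follow this PR; the helper is illustrative, not Spark code.

```scala
// Illustrative mapping from ALTER TABLE statements to TableChange values.
import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog, TableChange}
import org.apache.spark.sql.types.{DecimalType, LongType}

def alterExamples(catalog: TableCatalog): Unit = {
  val ident = Identifier.of(Array("db"), "t")

  // ALTER TABLE t ALTER COLUMN c COMMENT 'doc' -> only the comment is sent
  catalog.alterTable(ident, TableChange.updateComment(Array("c"), "doc"))

  // ALTER TABLE t ALTER COLUMN i TYPE bigint -> only the type is sent
  catalog.alterTable(ident, TableChange.updateColumn(Array("i"), LongType))

  // ALTER TABLE t ALTER COLUMN d TYPE decimal(18,2) NOT NULL COMMENT 'doc'
  // -> two changes, passed together in a single alterTable call
  catalog.alterTable(ident,
    TableChange.updateColumn(Array("d"), DecimalType(18, 2), false),
    TableChange.updateComment(Array("d"), "doc"))
}
```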
* )
* </pre>
*/
public interface TableChange {
Do we need APIs to manage namespaces?
I'll open a PR for this, but as we discussed in the sync, it is a separate feature and a separate PR.
I added a PR for this: #24560
* @param comment the new field's comment string
* @return a TableChange for the addition
*/
static TableChange addColumn(
Since we decided to drop the overloads in the first version, shall we also remove the overloads for `addColumn`? Then it's consistent with the API design of `createTable`, and we can change them together in the future if we agree to.
These are different cases. The `addColumn` overloads are there to ensure that consistent defaults are passed to sources, and they are implemented in Spark, not in sources. The `createTable` overloads allowed implementations to accept those calls directly without using defaults, but added to the number of methods that implementations could possibly implement. That second part was the motivation for removing the `createTable` methods, which doesn't apply here.
I see, makes sense.
* @param newDataType the new data type
* @return a TableChange for the update
*/
static TableChange updateColumn(String[] fieldNames, DataType newDataType) {
It creates an `UpdateColumnType` now; shall we rename this method to `updateType`?
If others agree, I can do this. I don't see a compelling reason to do it, so I'll wait for other people to add their opinions.
I think it's fine as is (since there are no other ways to "update" a column); if you were going to change it, I would go with updateColumnType anyway.
> since there are no other ways to "update" a column

There is an `updateComment` method below, which updates the column comment. Shall we name these 2 methods coherently?
There is one thing worth discussing: by default, Spark is case insensitive and case preserving. There is one config to change the behavior, but it's internal and we don't expect users to set it. I think it's simpler to ask
Different ideas are welcome!
*
* @param ident a table identifier
*/
default void invalidateTable(Identifier ident) {
What's the use case of this method? I can only think of one: when end users run `REFRESH TABLE`. Are there any more use cases?
This will be called by `REFRESH TABLE` to invalidate cached data. This was previously named `refreshTable`, but you preferred a void return type. In that case, `invalidateTable` describes what the method does better than `refresh`.
How about `invalidateTableCache`? I think that would be more straightforward.
That is a misleading name. This invalidates a single table's cached data, not the entire cache.
The method accepts one parameter of Identifier type, so it should not be misleading. Just a suggestion; overall I don't have a strong opinion on this.
One quick question: this sounds like it only invalidates cached table metadata and does not touch the query cache. But `Catalog.refreshTable`, which is used by the v1 `REFRESH TABLE` command, also drops the `InMemoryRelation` if the table is cached in Spark's query cache.

Is the difference intentional?
The table cache is managed by Spark, so the API here cannot affect the table cache inside Spark.
@cloud-fan, I think even though the table cache is managed by Spark, the behavior is currently different between v1 and v2 (which doesn't invalidate the table cache). When a cached temp view is created, such as with `CACHE TABLE ... AS SELECT` from a v2 table, and later the `REFRESH TABLE` command is called on the v2 table, the cache won't get invalidated. Is this something we should resolve?
Yes, we should add these kinds of missing actions to the v2 commands as well. I think DROP TABLE has the same issue.
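A hypothetical sketch of the gap being described, not PR code: a v2 REFRESH TABLE would need to touch both caches. Everything here other than invalidateTable and Identifier is illustrative, and the v1-style refreshTable call stands in for whatever the v2 command would do internally.

```scala
// Hypothetical outline: invalidate the catalog's cached metadata and also drop
// Spark's own cached plan/data for the table, matching what the v1 command does.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog}

def refreshV2Table(session: SparkSession, catalog: TableCatalog, ident: Identifier): Unit = {
  catalog.invalidateTable(ident) // catalog-side metadata cache (this API)
  val name = (ident.namespace :+ ident.name).mkString(".")
  // Spark-side cache: shown with the public v1-style call as a placeholder.
  session.catalog.refreshTable(name)
}
```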
@cloud-fan, I don't think sources need to be case insensitive. Spark should assume that sources are case sensitive and apply its case sensitivity setting. There are 2 main cases: table identifiers and column identifiers:
@gatorsmile, any additional comments on case sensitivity?
Do you mean end users can no longer look up tables from the catalog in a case-insensitive style? e.g.
@cloud-fan, it depends on the catalog. The default catalog will be case insensitive and will not break compatibility. But you can't force a catalog to be case insensitive because they are external systems. How would you make a case sensitive path-based table case insensitive?
That's a good point. I took another look at the
Shall we add some doc to
What is the value of requiring a catalog to be case sensitive? If a user requests
Do you mean the case sensitivity behavior should be decided entirely by the catalog implementation itself? If that's the case, I think at least the catalog implementation should inform Spark whether it's case sensitive or not.
@cloud-fan, why? When identifying tables, what would Spark do differently depending on whether the catalog is case sensitive or not? Like I said, Spark can only pass the identifier from the query and rely on the catalog to either find a table or not.
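A minimal sketch of the Spark-side behavior argued for above, assuming only Spark's StructType and the spark.sql.caseSensitive setting (the helper name is illustrative, not PR code): Spark resolves the user's column name against the schema the source reported and then passes the source's exact spelling onward.

```scala
// Illustrative helper: resolve a user-supplied column name against a source's
// schema using Spark's own case sensitivity setting, returning the exact name
// the source declared (or None if it is missing or ambiguous).
import org.apache.spark.sql.types.StructType

def resolveColumn(schema: StructType, userName: String, caseSensitive: Boolean): Option[String] = {
  val matches =
    if (caseSensitive) schema.fields.filter(_.name == userName)
    else schema.fields.filter(_.name.equalsIgnoreCase(userName))
  matches match {
    case Array(single) => Some(single.name) // hand the source its own spelling
    case _             => None              // not found, or ambiguous under case folding
  }
}
```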
I followed up with @cloud-fan and we came to an agreement on case sensitivity:
@gatorsmile, any other comments on case sensitivity?
LGTM if tests pass
/**
* Catalog methods for working with Tables.
* <p>
* TableCatalog implementations may be case sensitive or case insensitive. Spark will pass
It may be worth adding a method to report this information. We can think about it later.
Thanks, Xiao! I agree, we can still revisit the decisions in this PR after further review.
LGTM except for one minor comment
Test build #105224 has finished for PR 24246 at commit
Thanks, merging to master!
Thanks @cloud-fan!
## What changes were proposed in this pull request?

This adds the TableCatalog API proposed in the [Table Metadata API SPIP](https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d).

For `TableCatalog` to use `Table`, it needed to be moved into the catalyst module where the v2 catalog API is located. This also required moving `TableCapability`. Most of the files touched by this PR are import changes needed by this move.

## How was this patch tested?

This adds a test implementation and contract tests.

Closes apache#24246 from rdblue/SPARK-24252-add-table-catalog-api.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

Currently we are in a strange status: some data source v2 interfaces (catalog related) are in sql/catalyst, and some data source v2 interfaces (Table, ScanBuilder, DataReader, etc.) are in sql/core. I don't see a reason to keep the data source v2 API in 2 modules. If we should pick one module, I think sql/catalyst is the one to go with. The catalyst module already has some user-facing stuff like DataType, Row, etc. And we have to update `Analyzer` and `SessionCatalog` to support the new catalog plugin, which needs to be in the catalyst module.

This PR can solve the problem we have in apache#24246.

## How was this patch tested?

existing tests

Closes apache#24416 from cloud-fan/move.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
* @param ident a table identifier
* @return true if a table was deleted, false if no table exists for the identifier
*/
boolean dropTable(Identifier ident);
Hi @rdblue, dropTable() here does not support specifying whether to purge or not. Would you please share your idea about why it is designed like that?
Purge is an option specific to some sources, like Hive. If you drop a table in most JDBC databases, you don't have an option to keep the data around somewhere. If an option can't be satisfied by sources, then it isn't a good candidate to be in the API.
Even in Hive, purge is optional because managed tables will drop data automatically, but external tables will not. Using table configuration for sources that support "external" data is a better option, I think.
For now, maybe we can fail the analysis if PURGE is specified but we are dropping a v2 table. @waterlx, would you like to send a PR to do it?
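A hypothetical sketch of that suggestion (the statement class and check are illustrative, not Spark code): during analysis, reject PURGE when the target resolves to a v2 table, since the v2 dropTable API shown above has no purge flag.

```scala
// Hypothetical analysis-time check, illustrative names only.
import org.apache.spark.sql.AnalysisException

case class DropTableStatement(nameParts: Seq[String], ifExists: Boolean, purge: Boolean)

def checkDropTable(stmt: DropTableStatement, resolvesToV2Table: Boolean): Unit = {
  if (resolvesToV2Table && stmt.purge) {
    throw new AnalysisException(
      s"DROP TABLE ... PURGE is not supported for v2 table ${stmt.nameParts.mkString(".")}")
  }
}
```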
What changes were proposed in this pull request?

This adds the TableCatalog API proposed in the Table Metadata API SPIP.

For TableCatalog to use Table, it needed to be moved into the catalyst module where the v2 catalog API is located. This also required moving TableCapability. Most of the files touched by this PR are import changes needed by this move.

How was this patch tested?

This adds a test implementation and contract tests.