
[SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalog #14155

Closed
wants to merge 19 commits

Conversation

@cloud-fan (Contributor) commented Jul 12, 2016

What changes were proposed in this pull request?

Spark SQL doesn't have its own metastore yet and currently uses Hive's. However, Hive's metastore has some limitations (e.g. the number of columns is capped, names are not case-preserving, decimal type support is poor, etc.), so we have some hacks to store data source table metadata in the Hive metastore successfully, i.e. we put all the information in table properties.

This PR moves these hacks into HiveExternalCatalog, trying to isolate Hive-specific logic in one place.

changes overview:

  1. Before this PR: we need to put the metadata (schema, partition columns, etc.) of data source tables into table properties before saving it to the external catalog, even if the external catalog doesn't use the Hive metastore (e.g. InMemoryCatalog).
    After this PR: the table-properties trick lives only in HiveExternalCatalog; the caller side doesn't need to take care of it anymore.
  2. Before this PR: because the table-properties trick is done outside of the external catalog, we also need to revert it whenever we read table metadata back from the external catalog and use it, e.g. DescribeTableCommand reads the schema and partition columns from table properties.
    After this PR: the table metadata read from the external catalog is exactly the same as what we saved to it.

Bonus: now we can create a data source table using SessionCatalog, if a schema is specified.
Breaking changes: schemaStringLengthThreshold is no longer configurable. hive.default.rcfile.serde is no longer configurable.
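To make the "table properties" trick concrete, here is a minimal, self-contained sketch of the kind of encoding this PR isolates in HiveExternalCatalog: the schema is serialized to JSON and the partition columns are written under numbered keys. The helper object and property key names are illustrative assumptions, not the exact code in this PR (which, among other things, also splits long schema strings; see the note on schemaStringLengthThreshold above).

```scala
// Illustrative sketch only: helper and key names are assumptions, not Spark's exact code.
import org.apache.spark.sql.types.{DataType, StructType}

object TablePropsSketch {
  // Encode data source table metadata into plain string properties that a
  // Hive-backed metastore can store without understanding them.
  def encode(schema: StructType, partitionCols: Seq[String]): Map[String, String] = {
    val schemaProp = Map("spark.sql.sources.schema" -> schema.json)
    val partProps = partitionCols.zipWithIndex.map { case (col, i) =>
      s"spark.sql.sources.partCol.$i" -> col
    }.toMap
    schemaProp ++ partProps + ("spark.sql.sources.numPartCols" -> partitionCols.size.toString)
  }

  // Decode the schema back when reading the table from the catalog.
  def decodeSchema(props: Map[String, String]): Option[StructType] =
    props.get("spark.sql.sources.schema").map(DataType.fromJson(_).asInstanceOf[StructType])
}
```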

How was this patch tested?

existing tests.

@SparkQA commented Jul 12, 2016

Test build #62170 has finished for PR 14155 at commit d519968.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateDataSourceTableCommand(table: CatalogTable, ifNotExists: Boolean)

@@ -34,7 +34,7 @@ import org.apache.spark.sql.types.DataType
abstract class AbstractSqlParser extends ParserInterface with Logging {

/** Creates/Resolves DataType for a given SQL string. */
-  def parseDataType(sqlText: String): DataType = parse(sqlText) { parser =>
+  override def parseDataType(sqlText: String): DataType = parse(sqlText) { parser =>
// TODO add this to the parser interface.
Contributor

Remove TODO :)
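For context, a small usage sketch of the method being promoted to the parser interface here, assuming the concrete Catalyst parser object (CatalystSqlParser) that implements AbstractSqlParser; this is REPL-style illustration, not code from the PR:

```scala
// Parse a SQL type string into a Catalyst DataType via the parser interface.
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

val dt: DataType = CatalystSqlParser.parseDataType("struct<a:int,b:array<string>>")
// dt is a StructType with an IntegerType field `a` and an ArrayType(StringType) field `b`.
```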

@@ -49,12 +50,12 @@ case class CatalogStorageFormat(
outputFormat: Option[String],
serde: Option[String],
compressed: Boolean,
-    serdeProperties: Map[String, String]) {
+    properties: Map[String, String]) {
Contributor Author

Renamed it to properties, as data source tables also store their options here, which have nothing to do with a serde.
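To illustrate, a simplified stand-in for the case class above (the real CatalogStorageFormat also has fields such as locationUri): for a data source table there is no serde at all, and the user-supplied options land in the generic properties map. The path value below is hypothetical.

```scala
// Simplified stand-in for CatalogStorageFormat, for illustration only.
case class StorageFormatSketch(
    inputFormat: Option[String],
    outputFormat: Option[String],
    serde: Option[String],
    compressed: Boolean,
    properties: Map[String, String])

// A parquet data source table: no serde involved, and the options (here just a
// hypothetical path) live in the renamed `properties` map.
val parquetStorage = StorageFormatSketch(
  inputFormat = None,
  outputFormat = None,
  serde = None,
  compressed = false,
  properties = Map("path" -> "/tmp/users.parquet"))
```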

@SparkQA commented Jul 13, 2016

Test build #62282 has finished for PR 14155 at commit b8e0eee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateDataSourceTableCommand(table: CatalogTable, ifNotExists: Boolean)

@@ -146,6 +151,15 @@ case class CatalogTable(
requireSubsetOfSchema(sortColumnNames, "sort")
requireSubsetOfSchema(bucketColumnNames, "bucket")

+  lazy val userSpecifiedSchema: Option[StructType] = if (schema.nonEmpty) {
Contributor

What is this?

Contributor

Oh, is this here because CatalogColumn uses a string as the type? I think we should just use StructType as the schema and remove CatalogColumn.

Contributor Author

I'm not quite sure if it's safe to do so. Why do we have CatalogColumn in the first place?
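For what it's worth, a sketch of the idea behind userSpecifiedSchema under the assumption discussed above, i.e. that column data types are kept as SQL strings (CatalogColumn-style): parse each type string back into a Catalyst DataType and assemble a StructType. The tuple shape and helper name are illustrative, not the PR's code.

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.{StructField, StructType}

// (name, typeString, nullable) triples standing in for CatalogColumn entries.
def toUserSpecifiedSchema(columns: Seq[(String, String, Boolean)]): Option[StructType] =
  if (columns.isEmpty) None
  else Some(StructType(columns.map { case (name, typeStr, nullable) =>
    StructField(name, CatalystSqlParser.parseDataType(typeStr), nullable)
  }))

// e.g. toUserSpecifiedSchema(Seq(("id", "bigint", false), ("name", "string", true)))
```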

@SparkQA commented Jul 19, 2016

Test build #62515 has finished for PR 14155 at commit bb06818.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-16498][SQL][WIP] move hive hack for data source table into HiveExternalCatalog [SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalog Jul 19, 2016
@@ -136,15 +141,14 @@ case class CatalogTable(
comment: Option[String] = None,
unsupportedFeatures: Seq[String] = Seq.empty) {

// Verify that the provided columns are part of the schema
Contributor Author

I'll move this check somewhere else.

@SparkQA commented Jul 19, 2016

Test build #62534 has finished for PR 14155 at commit 1586465.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 20, 2016

Test build #62614 has finished for PR 14155 at commit 4ce7e2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2016

Test build #62892 has finished for PR 14155 at commit 3c913f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 28, 2016

Test build #62971 has finished for PR 14155 at commit 63fd9ed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 29, 2016

Test build #62992 has finished for PR 14155 at commit dddda52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -359,40 +357,6 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
}
}

test("Describe Table with Corrupted Schema") {
Contributor Author

I moved this test to MetastoreDataSourceSuite, as the error now happens when we read the table metadata.

@SparkQA commented Aug 1, 2016

Test build #63073 has finished for PR 14155 at commit 9ae7a71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Serializable
    • case class MonotonicallyIncreasingID() extends LeafExpression with Nondeterministic
    • case class SparkPartitionID() extends LeafExpression with Nondeterministic
    • case class AggregateExpression(
    • case class CurrentDatabase() extends LeafExpression with Unevaluable
    • class GenericInternalRow(val values: Array[Any]) extends BaseGenericInternalRow
    • class AbstractScalaRowIterator[T] extends Iterator[T]

@gatorsmile (Member) commented Aug 18, 2016

[attached diagram: newcreatedatasourcetable]

I tried to draw a data flow for the new changes. I am wondering if we should also change the interface of CreateDataSourceTableCommand and CreateDataSourceTableAsSelectCommand: use CatalogTable as the input and output through the whole flow?

Update: in DDLStrategy, we do not check the validity of tableDesc (CatalogTable), and we have multiple ways to create tables, so I am afraid future code changes might introduce bugs. IMO, we should just pass the whole tableDesc to the RunnableCommand instead of passing a subset of its fields, and do more checking in both the RunnableCommand and createTable APIs, which actually consume it.

More updates: CreateHiveTableAsSelectCommand already uses tableDesc: CatalogTable. I am fine with unifying them later. For now, I will focus on checking the data flow of each field.
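A purely hypothetical sketch of what that suggestion amounts to: the command carries the whole CatalogTable and validates it where it is consumed, rather than in DDLStrategy. This is not the actual Spark command signature.

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hypothetical command shape: take the full table description instead of a subset of its fields.
case class CreateDataSourceTableCommandSketch(tableDesc: CatalogTable, ifNotExists: Boolean) {
  def run(): Unit = {
    // Validate here, where the CatalogTable is actually consumed, rather than in DDLStrategy.
    require(tableDesc.provider.isDefined, "a data source table must declare a provider")
    // ... then hand the whole tableDesc to the catalog's createTable ...
  }
}
```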

@gatorsmile (Member)

Also attaching the flow of the existing one. The new one looks much cleaner! :)

[attached diagram: createdatasourcedataflow]

@cloud-fan (Contributor Author)

> I am wondering if we should also change the interface of CreateDataSourceTableCommand and CreateDataSourceTableAsSelectCommand: use CatalogTable as the input and output through the whole flow?

Yeah, we should do it in a follow-up.

@gatorsmile (Member)

: ) Then, it becomes very straightforward to combine CreateDataSourceTableCommand and CreateDataSourceTableAsSelectCommand into the same node.

Now, let me check the data flow of each field.


sessionState.catalog.createTable(table, ignoreIfExists)
Member

We made a change here. Before, ignoreIfExists was always set to false when we called createTable. Now, if we want to let the underlying createTable handle it, we should remove this code: https://github.com/cloud-fan/spark/blob/96d57b665ac65750eb5c6f9757e5827ea9c14ca4/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L58-L64

Contributor Author

When we hit this branch, the table does not exist, so ignoreIfExists doesn't matter here. I'll change it to false and add some comments.
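A sketch of that resolution with assumed names (not the exact PR code): existence is checked up front, so by the time createTable is called the table is known not to exist and ignoreIfExists can simply be a literal false.

```scala
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, SessionCatalog}

// Assumed helper for illustration; mirrors the logic discussed in this thread.
def createIfAbsent(catalog: SessionCatalog, table: CatalogTable, ifNotExists: Boolean): Unit = {
  if (catalog.tableExists(table.identifier)) {
    if (!ifNotExists) {
      sys.error(s"Table ${table.identifier} already exists.")
    }
    // IF NOT EXISTS was specified: silently skip creation.
  } else {
    // The table does not exist at this point, so ignoreIfExists is irrelevant; pass false.
    catalog.createTable(table, ignoreIfExists = false)
  }
}
```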

@gatorsmile (Member)

[attached diagram: newcreatedatasourcetable-2]

Above is the data flow of all the fields in CatalogTable for Create Data Source Tables and CTAS. : )

provider = Some(provider),
partitionColumnNames = getPartitionColumnsFromTableProperties(table),
bucketSpec = getBucketSpecFromTableProperties(table),
properties = getOriginalTableProperties(table))
Member

If you look at the data flow, you may notice that the table properties are always empty for data source tables after CREATE TABLE or CTAS commands. For data source tables, the actual table properties are stored in the storage properties, right?

@cloud-fan (Contributor Author) commented Aug 19, 2016

If you take a look at the SQL syntax for data source tables, we can't set table properties for them currently. But the external catalog doesn't need to make this assumption.

BTW, data source options are stored in storage properties, but they are not table properties.

@gatorsmile (Member) commented Aug 19, 2016

We do not have actual table properties for data source tables. Do the options function like table properties here?

BTW, I found a bug when we convert a Hive serde CTAS to a data source CTAS: we lose the table properties in that case. Will submit a PR soon.

Contributor

We just use table properties to store the schema and metadata that are not defined by users. All user options are stored in the serde properties.

Contributor

To users, there is no table property or serde property. They only see options.

Member

Yeah, but when users describe data source tables, they will see these options in the serde properties.

The issue becomes more complicated when we support conversion from Hive serde tables to data source tables: the actual table properties will be lost in some cases.

Contributor Author

The previous code also stores options in serde properties. I'm not going to fix everything in this PR, and I'm not sure if it's a real problem, but let's continue the discussion in a follow-up.

Contributor

Just want to double-check: we are not talking about no longer using serde properties to store options, right?

@cloud-fan (Contributor Author) commented Aug 22, 2016

We are talking about storing data source table properties in storage properties, and that discussion is over; we won't do this. See #14727.

@SparkQA commented Aug 19, 2016

Test build #64058 has finished for PR 14155 at commit 6ca8909.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -175,7 +127,8 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
} else {
val qualifiedTable =
MetastoreRelation(
-          qualifiedTableName.database, qualifiedTableName.name)(table, client, sparkSession)
+          qualifiedTableName.database, qualifiedTableName.name)(
+            table.copy(provider = Some("hive")), client, sparkSession)
Member

If we use the ExternalCatalog API to fetch table metadata, we do not need this change. That means we just need to update the following line:

https://github.com/cloud-fan/spark/blob/6ca8909d355b14abcc0099a53928bba437d98442/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L113

Contributor Author

Then we will restore table metadata from table properties twice. As this class will be removed soon, I don't want to change too much.
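For reference, a minimal sketch of what restoring metadata from table properties looks like on the read side, in the spirit of the getPartitionColumnsFromTableProperties helper quoted earlier; the property key names are assumptions matching the encode sketch near the top of this page, not necessarily Spark's exact keys.

```scala
// Illustrative decode helper; key names are assumptions, not necessarily Spark's exact keys.
def partitionColumnsFromProps(props: Map[String, String]): Seq[String] = {
  val numPartCols = props.get("spark.sql.sources.numPartCols").map(_.toInt).getOrElse(0)
  (0 until numPartCols).map { i =>
    props.getOrElse(
      s"spark.sql.sources.partCol.$i",
      throw new IllegalStateException(s"Corrupted metadata: missing partition column #$i"))
  }
}
```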

@gatorsmile (Member) commented Aug 20, 2016

LGTM pending tests

@SparkQA commented Aug 20, 2016

Test build #64122 has finished for PR 14155 at commit 38b838a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author)

retest this please

@SparkQA commented Aug 20, 2016

Test build #64149 has finished for PR 14155 at commit 38b838a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Aug 22, 2016

LGTM. There are two things we need to address in follow-up PRs. The first is whether we can consolidate location in CatalogStorageFormat and path in options. The second is to read the conf value of spark.sql.sources.schemaStringLengthThreshold in ExternalCatalog.

I am merging this to master.
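As background for the second follow-up item: the threshold exists because the Hive metastore caps the length of a single table-property value, so a long schema JSON string has to be split into chunks stored under numbered keys. A sketch of that idea, with an assumed default threshold and illustrative key names:

```scala
import org.apache.spark.sql.types.StructType

// Illustrative only: split a long schema JSON string across numbered properties so each
// value stays under a metastore-imposed length limit. Threshold and keys are assumptions.
def schemaToParts(schema: StructType, threshold: Int = 4000): Map[String, String] = {
  val parts = schema.json.grouped(threshold).toSeq
  val indexed = parts.zipWithIndex.map { case (part, i) =>
    s"spark.sql.sources.schema.part.$i" -> part
  }
  (indexed :+ ("spark.sql.sources.schema.numParts" -> parts.size.toString)).toMap
}
```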
