
[SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalog #14155

Closed
wants to merge 19 commits

Conversation

@cloud-fan (Contributor) commented Jul 12, 2016

What changes were proposed in this pull request?

Spark SQL doesn't have its own metastore yet and currently uses Hive's. However, Hive's metastore has some limitations (e.g. the number of columns is capped, names are not case-preserving, decimal type support is poor, etc.), so we have some hacks to store data source table metadata in the Hive metastore successfully, i.e. we put all the information in table properties.

This PR moves these hacks into HiveExternalCatalog, trying to isolate Hive-specific logic in one place.

changes overview:

  1. Before this PR: we need to put the metadata (schema, partition columns, etc.) of data source tables into table properties before saving it to the external catalog, even if the external catalog doesn't use the Hive metastore (e.g. InMemoryCatalog).
    After this PR: the table-properties trick lives only in HiveExternalCatalog; the caller side doesn't need to take care of it anymore.
  2. Before this PR: because the table-properties trick is done outside of the external catalog, we also need to revert it whenever we read table metadata back from the external catalog and use it, e.g. DescribeTableCommand reads the schema and partition columns from table properties.
    After this PR: the table metadata read from the external catalog is exactly the same as what we saved to it.

Bonus: now we can create a data source table using SessionCatalog, if a schema is specified.
Breaking changes: schemaStringLengthThreshold is no longer configurable. hive.default.rcfile.serde is no longer configurable.
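To make the "table properties" trick concrete, here is a minimal, self-contained sketch of the kind of encoding this PR isolates in HiveExternalCatalog: the schema is serialized to JSON and the partition columns are written under numbered keys. The helper object and property key names are illustrative assumptions, not the exact code in this PR (which, among other things, also splits long schema strings; see the note on schemaStringLengthThreshold above).

```scala
// Illustrative sketch only: helper and key names are assumptions, not Spark's exact code.
import org.apache.spark.sql.types.{DataType, StructType}

object TablePropsSketch {
  // Encode data source table metadata into plain string properties that a
  // Hive-backed metastore can store without understanding them.
  def encode(schema: StructType, partitionCols: Seq[String]): Map[String, String] = {
    val schemaProp = Map("spark.sql.sources.schema" -> schema.json)
    val partProps = partitionCols.zipWithIndex.map { case (col, i) =>
      s"spark.sql.sources.partCol.$i" -> col
    }.toMap
    schemaProp ++ partProps + ("spark.sql.sources.numPartCols" -> partitionCols.size.toString)
  }

  // Decode the schema back when reading the table from the catalog.
  def decodeSchema(props: Map[String, String]): Option[StructType] =
    props.get("spark.sql.sources.schema").map(DataType.fromJson(_).asInstanceOf[StructType])
}
```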

How was this patch tested?

existing tests.

@SparkQA commented Jul 12, 2016

Test build #62170 has finished for PR 14155 at commit d519968.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateDataSourceTableCommand(table: CatalogTable, ifNotExists: Boolean)

@@ -34,7 +34,7 @@ import org.apache.spark.sql.types.DataType
abstract class AbstractSqlParser extends ParserInterface with Logging {

/** Creates/Resolves DataType for a given SQL string. */
-  def parseDataType(sqlText: String): DataType = parse(sqlText) { parser =>
+  override def parseDataType(sqlText: String): DataType = parse(sqlText) { parser =>
// TODO add this to the parser interface.
Contributor

Remove TODO :)
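For context, a small usage sketch of the method being promoted to the parser interface here, assuming the concrete Catalyst parser object (CatalystSqlParser) that implements AbstractSqlParser; this is REPL-style illustration, not code from the PR:

```scala
// Parse a SQL type string into a Catalyst DataType via the parser interface.
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

val dt: DataType = CatalystSqlParser.parseDataType("struct<a:int,b:array<string>>")
// dt is a StructType with an IntegerType field `a` and an ArrayType(StringType) field `b`.
```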

@@ -49,12 +50,12 @@ case class CatalogStorageFormat(
outputFormat: Option[String],
serde: Option[String],
compressed: Boolean,
-    serdeProperties: Map[String, String]) {
+    properties: Map[String, String]) {
Contributor Author

Renamed it to properties, as data source tables also store their options here, which have nothing to do with a serde.
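To illustrate, a simplified stand-in for the case class above (the real CatalogStorageFormat also has fields such as locationUri): for a data source table there is no serde at all, and the user-supplied options land in the generic properties map. The path value below is hypothetical.

```scala
// Simplified stand-in for CatalogStorageFormat, for illustration only.
case class StorageFormatSketch(
    inputFormat: Option[String],
    outputFormat: Option[String],
    serde: Option[String],
    compressed: Boolean,
    properties: Map[String, String])

// A parquet data source table: no serde involved, and the options (here just a
// hypothetical path) live in the renamed `properties` map.
val parquetStorage = StorageFormatSketch(
  inputFormat = None,
  outputFormat = None,
  serde = None,
  compressed = false,
  properties = Map("path" -> "/tmp/users.parquet"))
```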

@SparkQA commented Jul 13, 2016

Test build #62282 has finished for PR 14155 at commit b8e0eee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateDataSourceTableCommand(table: CatalogTable, ifNotExists: Boolean)

@@ -146,6 +151,15 @@ case class CatalogTable(
requireSubsetOfSchema(sortColumnNames, "sort")
requireSubsetOfSchema(bucketColumnNames, "bucket")

+  lazy val userSpecifiedSchema: Option[StructType] = if (schema.nonEmpty) {
Contributor

What is this?

Contributor

Oh, is this here because CatalogColumn uses a string as the type? I think we should just use StructType as the schema and remove CatalogColumn.

Contributor Author

I'm not quite sure if it's safe to do so. Why do we have CatalogColumn in the first place?
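For what it's worth, a sketch of the idea behind userSpecifiedSchema under the assumption discussed above, i.e. that column data types are kept as SQL strings (CatalogColumn-style): parse each type string back into a Catalyst DataType and assemble a StructType. The tuple shape and helper name are illustrative, not the PR's code.

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.{StructField, StructType}

// (name, typeString, nullable) triples standing in for CatalogColumn entries.
def toUserSpecifiedSchema(columns: Seq[(String, String, Boolean)]): Option[StructType] =
  if (columns.isEmpty) None
  else Some(StructType(columns.map { case (name, typeStr, nullable) =>
    StructField(name, CatalystSqlParser.parseDataType(typeStr), nullable)
  }))

// e.g. toUserSpecifiedSchema(Seq(("id", "bigint", false), ("name", "string", true)))
```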

@SparkQA commented Jul 19, 2016

Test build #62515 has finished for PR 14155 at commit bb06818.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-16498][SQL][WIP] move hive hack for data source table into HiveExternalCatalog [SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalog Jul 19, 2016
@@ -136,15 +141,14 @@ case class CatalogTable(
comment: Option[String] = None,
unsupportedFeatures: Seq[String] = Seq.empty) {

// Verify that the provided columns are part of the schema
Contributor Author

I'll move this check somewhere else.

@SparkQA commented Jul 19, 2016

Test build #62534 has finished for PR 14155 at commit 1586465.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 20, 2016

Test build #62614 has finished for PR 14155 at commit 4ce7e2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2016

Test build #62892 has finished for PR 14155 at commit 3c913f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 28, 2016

Test build #62971 has finished for PR 14155 at commit 63fd9ed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 29, 2016

Test build #62992 has finished for PR 14155 at commit dddda52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -359,40 +357,6 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
}
}

test("Describe Table with Corrupted Schema") {
Contributor Author

I moved this test to MetastoreDataSourceSuite, as the error now happens when we read the table metadata.

@SparkQA commented Aug 1, 2016

Test build #63073 has finished for PR 14155 at commit 9ae7a71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Serializable
    • case class MonotonicallyIncreasingID() extends LeafExpression with Nondeterministic
    • case class SparkPartitionID() extends LeafExpression with Nondeterministic
    • case class AggregateExpression(
    • case class CurrentDatabase() extends LeafExpression with Unevaluable
    • class GenericInternalRow(val values: Array[Any]) extends BaseGenericInternalRow
    • class AbstractScalaRowIterator[T] extends Iterator[T]

@gatorsmile (Member) commented Aug 18, 2016

[attached diagram: newcreatedatasourcetable]

I tried to draw a data flow for the new changes. I am wondering if we should also change the interface of CreateDataSourceTableCommand and CreateDataSourceTableAsSelectCommand: use CatalogTable as the input and output through the whole flow?

Update: in DDLStrategy, we do not check the validity of tableDesc (CatalogTable), and we have multiple ways to create tables, so I am afraid future code changes might introduce bugs. IMO, we should just pass the whole tableDesc to the RunnableCommand instead of passing a subset of its fields, and do more checking in both the RunnableCommand and createTable APIs, which actually consume it.

More updates: CreateHiveTableAsSelectCommand already uses tableDesc: CatalogTable. I am fine with unifying them later. For now, I will focus on checking the data flow of each field.
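A purely hypothetical sketch of what that suggestion amounts to: the command carries the whole CatalogTable and validates it where it is consumed, rather than in DDLStrategy. This is not the actual Spark command signature.

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Hypothetical command shape: take the full table description instead of a subset of its fields.
case class CreateDataSourceTableCommandSketch(tableDesc: CatalogTable, ifNotExists: Boolean) {
  def run(): Unit = {
    // Validate here, where the CatalogTable is actually consumed, rather than in DDLStrategy.
    require(tableDesc.provider.isDefined, "a data source table must declare a provider")
    // ... then hand the whole tableDesc to the catalog's createTable ...
  }
}
```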

@gatorsmile (Member)

Also attaching the flow of the existing one. The new one looks much cleaner! :)

[attached diagram: createdatasourcedataflow]

@cloud-fan (Contributor Author)

> I am wondering if we should also change the interface of CreateDataSourceTableCommand and CreateDataSourceTableAsSelectCommand: use CatalogTable as the input and output through the whole flow?

Yeah, we should do it in a follow-up.

@gatorsmile (Member)

: ) Then, it becomes very straightforward to combine CreateDataSourceTableCommand and CreateDataSourceTableAsSelectCommand into the same node.

Now, let me check the data flow of each field.


sessionState.catalog.createTable(table, ignoreIfExists)
Member

We made a change here. Before, ignoreIfExists was always set to false when we called createTable. Now, if we want to let the underlying createTable handle it, we should remove this code: https://github.com/cloud-fan/spark/blob/96d57b665ac65750eb5c6f9757e5827ea9c14ca4/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L58-L64

Contributor Author

When we hit this branch, the table does not exist, so ignoreIfExists doesn't matter here. I'll change it to false and add some comments.
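A sketch of that resolution with assumed names (not the exact PR code): existence is checked up front, so by the time createTable is called the table is known not to exist and ignoreIfExists can simply be a literal false.

```scala
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, SessionCatalog}

// Assumed helper for illustration; mirrors the logic discussed in this thread.
def createIfAbsent(catalog: SessionCatalog, table: CatalogTable, ifNotExists: Boolean): Unit = {
  if (catalog.tableExists(table.identifier)) {
    if (!ifNotExists) {
      sys.error(s"Table ${table.identifier} already exists.")
    }
    // IF NOT EXISTS was specified: silently skip creation.
  } else {
    // The table does not exist at this point, so ignoreIfExists is irrelevant; pass false.
    catalog.createTable(table, ignoreIfExists = false)
  }
}
```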

@gatorsmile (Member)

[attached diagram: newcreatedatasourcetable-2]

Above is the data flow of all the fields in CatalogTable for Create Data Source Tables and CTAS. : )

provider = Some(provider),
partitionColumnNames = getPartitionColumnsFromTableProperties(table),
bucketSpec = getBucketSpecFromTableProperties(table),
properties = getOriginalTableProperties(table))
Member

If you look at the data flow, you may notice that the table properties are always empty for data source tables after CREATE TABLE or CTAS commands. For data source tables, the actual table properties are stored in the storage properties, right?

@cloud-fan (Contributor Author) commented Aug 19, 2016

If you take a look at the SQL syntax for data source tables, we can't set table properties for them currently. But the external catalog doesn't need to make this assumption.

BTW, data source options are stored in storage properties, but they are not table properties.

@gatorsmile (Member) commented Aug 19, 2016

We do not have actual table properties for data source tables. Do the options function like table properties here?

BTW, I found a bug when we convert a Hive serde CTAS to a data source CTAS: we lose the table properties in that case. Will submit a PR soon.

Contributor

We just use table properties to store the schema and metadata that are not defined by users. All user options are stored in the serde properties.

Contributor

To users, there is no table property or serde property. They only see options.

Member

Yeah, but when users describe data source tables, they will see these options in the serde properties.

The issue becomes more complicated when we support conversion from Hive serde tables to data source tables: the actual table properties will be lost in some cases.

Contributor Author

The previous code also stores options in serde properties. I'm not going to fix everything in this PR, and I'm not sure if it's a real problem, but let's continue the discussion in a follow-up.

Contributor

Just want to double-check: we are not talking about no longer using serde properties to store options, right?

@cloud-fan (Contributor Author) commented Aug 22, 2016

We are talking about storing data source table properties in storage properties, and that discussion is over; we won't do this. See #14727.

@SparkQA commented Aug 19, 2016

Test build #64058 has finished for PR 14155 at commit 6ca8909.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -175,7 +127,8 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
} else {
val qualifiedTable =
MetastoreRelation(
-          qualifiedTableName.database, qualifiedTableName.name)(table, client, sparkSession)
+          qualifiedTableName.database, qualifiedTableName.name)(
+            table.copy(provider = Some("hive")), client, sparkSession)
Member

If we use the ExternalCatalog API to fetch table metadata, we do not need this change. That means we just need to update the following line:

https://github.com/cloud-fan/spark/blob/6ca8909d355b14abcc0099a53928bba437d98442/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L113

Contributor Author

Then we will restore table metadata from table properties twice. As this class will be removed soon, I don't want to change too much.
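For reference, a minimal sketch of what restoring metadata from table properties looks like on the read side, in the spirit of the getPartitionColumnsFromTableProperties helper quoted earlier; the property key names are assumptions matching the encode sketch near the top of this page, not necessarily Spark's exact keys.

```scala
// Illustrative decode helper; key names are assumptions, not necessarily Spark's exact keys.
def partitionColumnsFromProps(props: Map[String, String]): Seq[String] = {
  val numPartCols = props.get("spark.sql.sources.numPartCols").map(_.toInt).getOrElse(0)
  (0 until numPartCols).map { i =>
    props.getOrElse(
      s"spark.sql.sources.partCol.$i",
      throw new IllegalStateException(s"Corrupted metadata: missing partition column #$i"))
  }
}
```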

@gatorsmile (Member) commented Aug 20, 2016

LGTM pending tests

@SparkQA commented Aug 20, 2016

Test build #64122 has finished for PR 14155 at commit 38b838a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author)

retest this please

@SparkQA commented Aug 20, 2016

Test build #64149 has finished for PR 14155 at commit 38b838a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Aug 22, 2016

LGTM. There are two things we need to address in follow-up PRs. The first is whether we can consolidate location in CatalogStorageFormat and path in options. The second is to read the conf value of spark.sql.sources.schemaStringLengthThreshold in ExternalCatalog.

I am merging this to master.
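As background for the second follow-up item: the threshold exists because the Hive metastore caps the length of a single table-property value, so a long schema JSON string has to be split into chunks stored under numbered keys. A sketch of that idea, with an assumed default threshold and illustrative key names:

```scala
import org.apache.spark.sql.types.StructType

// Illustrative only: split a long schema JSON string across numbered properties so each
// value stays under a metastore-imposed length limit. Threshold and keys are assumptions.
def schemaToParts(schema: StructType, threshold: Int = 4000): Map[String, String] = {
  val parts = schema.json.grouped(threshold).toSeq
  val indexed = parts.zipWithIndex.map { case (part, i) =>
    s"spark.sql.sources.schema.part.$i" -> part
  }
  (indexed :+ ("spark.sql.sources.schema.numParts" -> parts.size.toString)).toMap
}
```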
