[SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalog #14155
Conversation
Test build #62170 has finished for PR 14155 at commit
@@ -34,7 +34,7 @@ import org.apache.spark.sql.types.DataType
 abstract class AbstractSqlParser extends ParserInterface with Logging {

   /** Creates/Resolves DataType for a given SQL string. */
-  def parseDataType(sqlText: String): DataType = parse(sqlText) { parser =>
+  override def parseDataType(sqlText: String): DataType = parse(sqlText) { parser =>
     // TODO add this to the parser interface.
Remove TODO :)
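For context, a minimal sketch of what this entry point does, using the public CatalystSqlParser object (a concrete AbstractSqlParser in org.apache.spark.sql.catalyst.parser); the type string below is just an example:

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types._

// parseDataType turns a SQL type string into a Catalyst DataType,
// e.g. when restoring a schema that was serialized as text.
val t: DataType = CatalystSqlParser.parseDataType("array<struct<a:int,b:string>>")
assert(t == ArrayType(new StructType().add("a", IntegerType).add("b", StringType)))
```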
@@ -49,12 +50,12 @@ case class CatalogStorageFormat(
     outputFormat: Option[String],
     serde: Option[String],
     compressed: Boolean,
-    serdeProperties: Map[String, String]) {
+    properties: Map[String, String]) {
Rename it to properties, as data source tables also store their options here, which have nothing to do with serde.
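To make the rename concrete, a minimal sketch of what a data source table's storage could look like after this change (field list as in the Spark 2.0-era CatalogStorageFormat; the path option is an illustrative assumption):

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat

// For a data source table, the generic `properties` map carries the data
// source options (such as `path`), which have nothing to do with a serde.
val storage = CatalogStorageFormat(
  locationUri = None,
  inputFormat = None,
  outputFormat = None,
  serde = None,
  compressed = false,
  properties = Map("path" -> "/tmp/my_table"))
```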
Test build #62282 has finished for PR 14155 at commit
@@ -146,6 +151,15 @@ case class CatalogTable(
   requireSubsetOfSchema(sortColumnNames, "sort")
   requireSubsetOfSchema(bucketColumnNames, "bucket")

+  lazy val userSpecifiedSchema: Option[StructType] = if (schema.nonEmpty) {
What is this?
Oh, having this is because CatalogColumn is using a string as the type? I think we should just use StructType as the schema and remove CatalogColumn?
I'm not quite sure if it's safe to do so. Why do we have CatalogColumn in the first place?
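For background on why a lazy userSpecifiedSchema is needed at all, a simplified sketch (the CatalogColumn below is a stripped-down stand-in for the real class): because the column type is kept as a raw SQL string, recovering a StructType means re-parsing every column type.

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.{StructField, StructType}

// Simplified stand-in: the real CatalogColumn stores the type as a string.
case class CatalogColumn(name: String, dataType: String, nullable: Boolean = true)

// Recovering a StructType requires parsing each type string back to a DataType.
def toStructType(columns: Seq[CatalogColumn]): StructType =
  StructType(columns.map { col =>
    StructField(col.name, CatalystSqlParser.parseDataType(col.dataType), col.nullable)
  })

// Example: toStructType(Seq(CatalogColumn("id", "bigint"), CatalogColumn("name", "string")))
```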
Test build #62515 has finished for PR 14155 at commit
@@ -136,15 +141,14 @@ case class CatalogTable(
     comment: Option[String] = None,
     unsupportedFeatures: Seq[String] = Seq.empty) {

-  // Verify that the provided columns are part of the schema
I'll move this check somewhere else.
Test build #62534 has finished for PR 14155 at commit
Test build #62614 has finished for PR 14155 at commit
Test build #62892 has finished for PR 14155 at commit
Test build #62971 has finished for PR 14155 at commit
Test build #62992 has finished for PR 14155 at commit
@@ -359,40 +357,6 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
     }
   }

-  test("Describe Table with Corrupted Schema") {
I moved this test to MetastoreDataSourceSuite, as now the error will happen when we read the table metadata.
Test build #63073 has finished for PR 14155 at commit
Trying to draw a data flow for the new changes. I am wondering if we should also change the interface of [...]. Update: [...]. More update: [...]
Yea, we should do it in a follow-up.
:) Then, it becomes very straightforward to combine [...]. Now, let me check the data flow of each field.
sessionState.catalog.createTable(table, ignoreIfExists) |
We made a change here. Before, ignoreIfExists was always set to false when we called createTable. Now, if we want to let the underlying createTable handle it, we should remove this code: https://github.com/cloud-fan/spark/blob/96d57b665ac65750eb5c6f9757e5827ea9c14ca4/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L58-L64
When we hit this branch, the table does not exist, so ignoreIfExists doesn't matter here. I'll change it to false and add some comments.
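A toy illustration of the semantics being discussed (names are made up, not Spark's API): the ignoreIfExists flag only changes behavior when the table name is already taken, so on a branch where the table is known not to exist, passing false is safe.

```scala
import scala.collection.mutable

// Hypothetical mini-catalog, only to show the ignoreIfExists contract.
class ToyCatalog {
  private val tables = mutable.Map.empty[String, String] // name -> metadata

  def createTable(name: String, metadata: String, ignoreIfExists: Boolean): Unit = {
    if (tables.contains(name)) {
      // The flag only matters here: either silently skip, or fail loudly.
      if (!ignoreIfExists) sys.error(s"Table $name already exists")
    } else {
      tables(name) = metadata
    }
  }
}
```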
      provider = Some(provider),
      partitionColumnNames = getPartitionColumnsFromTableProperties(table),
      bucketSpec = getBucketSpecFromTableProperties(table),
      properties = getOriginalTableProperties(table))
If you look at the data flow, you might realize the table properties are always empty for data source tables after CREATE TABLE or CTAS commands. For data source tables, the actual table properties are stored in the storage properties, right?
If you take a look at the SQL syntax for data source tables, we can't set table properties for them currently. But the external catalog doesn't need to make this assumption.
BTW, data source options are stored in storage properties, but they are not table properties.
We do not have the actual table properties for data source tables. The options function like table properties here?
BTW, I found a bug when we convert a Hive Serde CTAS to a Data Source CTAS: we lose table properties in that case. Will submit a PR soon.
We just use table properties to store schema and metadata that is not defined by users. All user options will be stored in serde properties.
To users, there is no table property or serde property. They only see options.
Yeah, but when users describe data source tables, they will see these options in the serde properties.
The issues become more complicated when we support conversion from Hive Serde tables to Data Source tables: the actual table properties will be lost in some cases.
The previous code also stored options in serde properties. I'm not going to fix everything in this PR, and I'm not sure it's a real problem, but let's continue the discussion in a follow-up.
Just want to double-check: we are not talking about no longer using serde properties to store options, right?
We are talking about storing data source table properties in storage properties, and the discussion is over: we won't do this, see #14727.
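For readers following the data flow, a sketch of the table-properties trick that helpers like getPartitionColumnsFromTableProperties undo. The property keys below follow the spark.sql.sources.* convention used around this era of Spark, but treat the exact names as illustrative:

```scala
// Partition columns are flattened into numbered table-property entries so
// they survive the round trip through the Hive metastore.
def getPartitionColumnsFromTableProperties(props: Map[String, String]): Seq[String] = {
  val numPartCols = props.get("spark.sql.sources.schema.numPartCols").map(_.toInt).getOrElse(0)
  (0 until numPartCols).map(i => props(s"spark.sql.sources.schema.partCol.$i"))
}

val props = Map(
  "spark.sql.sources.provider" -> "parquet",
  "spark.sql.sources.schema.numPartCols" -> "2",
  "spark.sql.sources.schema.partCol.0" -> "year",
  "spark.sql.sources.schema.partCol.1" -> "month")

assert(getPartitionColumnsFromTableProperties(props) == Seq("year", "month"))
```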
Test build #64058 has finished for PR 14155 at commit
@@ -175,7 +127,8 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     } else {
       val qualifiedTable =
         MetastoreRelation(
-          qualifiedTableName.database, qualifiedTableName.name)(table, client, sparkSession)
+          qualifiedTableName.database, qualifiedTableName.name)(
+            table.copy(provider = Some("hive")), client, sparkSession)
If we use the ExternalCatalog API to fetch table metadata, we do not need this change. That means we just need to update the following line:
Then we will restore table metadata from table properties twice. As this class will be removed soon, I don't want to change too much.
LGTM pending tests
Test build #64122 has finished for PR 14155 at commit
retest this please
Test build #64149 has finished for PR 14155 at commit
LGTM. There are two things that we need to address in follow-up PRs. The first one is whether we can consolidate location in CatalogStorageFormat and [...]. I am merging this to master.
What changes were proposed in this pull request?
Spark SQL doesn't have its own metastore yet and currently uses Hive's. However, Hive's metastore has some limitations (e.g. columns can't be too many, it is not case-preserving, it has bad decimal type support, etc.), so we have some hacks to successfully store data source table metadata in the Hive metastore, i.e. putting all the information in table properties.
This PR moves these hacks to HiveExternalCatalog and tries to isolate Hive-specific logic in one place.

Changes overview:

1. before this PR: we need to put metadata (schema, partition columns, etc.) of data source tables into table properties before saving the table to the external catalog, even if the external catalog doesn't use the Hive metastore (e.g. InMemoryCatalog).
after this PR: the table properties tricks are only in HiveExternalCatalog; the caller side doesn't need to take care of them anymore.
2. before this PR: because the tricks are applied at the caller side, we also need to undo them when reading table metadata back from the external catalog, e.g. in DescribeTableCommand we will read schema and partition columns from table properties.
after this PR: the table metadata read from the external catalog is exactly the same as what we saved to it.
Bonus: now we can create a data source table using SessionCatalog, if the schema is specified (see the sketch after the list below).

Breaks:

- schemaStringLengthThreshold is not configurable anymore.
- hive.default.rcfile.serde is not configurable anymore.
is not configurable anymore.How was this patch tested?
Existing tests.