[SPARK-24923][SQL] Implement v2 CreateTableAsSelect #24570

Closed · 8 commits

Conversation

@rdblue (Contributor) commented May 9, 2019

What changes were proposed in this pull request?

This adds a v2 implementation for CTAS queries:

  • Update the SQL parser to parse CREATE queries using multi-part identifiers
  • Update `CheckAnalysis` to validate partitioning references against the CTAS query schema
  • Add the `CreateTableAsSelect` v2 logical plan and the `CreateTableAsSelectExec` v2 physical plan (a sketch of the logical plan follows this list)
  • Update the conversion from `CreateTableAsSelectStatement` to produce the new v2 logical plan
  • Update `DataSourceV2Strategy` to convert the v2 CTAS logical plan to the new physical plan
  • Add `findNestedField` to `StructType` to support reference validation
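For orientation, the new logical plan node might look roughly like the following. This is a hedged sketch: the class name matches what the test builds below report, but the field names, types, and package paths are assumptions inferred from the description, not the PR's verbatim code.

```scala
import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog}
import org.apache.spark.sql.catalog.v2.expressions.Transform
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Sketch only: fields are assumptions, not copied from the actual change.
case class CreateTableAsSelect(
    catalog: TableCatalog,            // v2 catalog that will own the new table
    tableName: Identifier,            // multi-part identifier within that catalog
    partitioning: Seq[Transform],     // partition transforms checked by CheckAnalysis
    query: LogicalPlan,               // the SELECT that supplies schema and data
    properties: Map[String, String],  // table properties (provider, comment, ...)
    ignoreIfExists: Boolean)          // IF NOT EXISTS semantics
```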

How was this patch tested?

We have been running these changes in production for several months. Also:

  • Add a test suite, `CreateTablePartitioningValidationSuite`, for the new analysis checks
  • Add a test suite for v2 SQL, `DataSourceV2SQLSuite`
  • Update the catalyst `DDLParserSuite` to use multi-part identifiers (`Seq[String]`)
  • Add test cases to `PlanResolutionSuite` for v2 CTAS: known catalog and v2 source implementation

@SparkQA commented May 9, 2019

Test build #105292 has finished for PR 24570 at commit b13a8e2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateTableAsSelect(
  • case class DataSourceResolution(
  • case class CreateTableAsSelectExec(

@SparkQA commented May 10, 2019

Test build #105296 has finished for PR 24570 at commit a22c335.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 10, 2019

Test build #105297 has finished for PR 24570 at commit cdf3805.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor, Author) commented May 10, 2019

@mccheah, @cloud-fan, here's a v2 implementation for CTAS. Please review if you have time. Thanks!


        // v1 path: fall back to the session catalog with a CreateTable plan
        val tableDesc = buildCatalogTable(table, new StructType, partitionCols, bucketSpec,
          properties, provider, options, location, comment, ifNotExists)
        val mode = if (ifNotExists) SaveMode.Ignore else SaveMode.ErrorIfExists

        CreateTable(tableDesc, mode, Some(query))

      case create: CreateTableAsSelectStatement =>
Contributor:
shall we implement CREATE TABLE as well?

Contributor (Author):
I thought this PR was getting a little large, so I was going to do CREATE TABLE in a separate one. I can add it to this one if you'd prefer.

@rdblue (Contributor, Author) commented May 13, 2019

@cloud-fan, I've updated this PR to address your review comments, so please have another look. I also added DataSourceV2SQLSuite, a test suite I had accidentally left out. That is where I will be putting end-to-end test cases that use in-memory tables to validate the physical plans.
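As a rough illustration of what an end-to-end case in that suite could look like (a hedged sketch: `testcat`, `foo`, and `source` follow the test setup quoted later in this thread, but the test body is an assumption, not a verbatim excerpt):

```scala
// Hedged sketch of a DataSourceV2SQLSuite-style test: run CTAS against the
// in-memory test catalog, then check that the new table holds the query output.
test("CreateTableAsSelect: v2 CTAS with an explicit catalog") {
  spark.sql("CREATE TABLE testcat.table_name USING foo AS SELECT id, data FROM source")

  // The physical plan should have written exactly the rows produced by the query.
  checkAnswer(spark.table("testcat.table_name"), spark.table("source"))
}
```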

@SparkQA commented May 13, 2019

Test build #105364 has finished for PR 24570 at commit b19a70d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


    // convert USING, LOCATION, and COMMENT clauses to table properties
    properties += ("provider" -> ctas.provider)
    ctas.comment.foreach(text => properties += ("comment" -> text))
Contributor:
What happens if these collide with table properties of the same name set in the properties part of the SQL statement itself? Do these overwrite them? Are these special property keys documented anywhere?

@cloud-fan (Contributor) commented May 14, 2019:
This is a good point. When saving Spark tables to the Hive metastore, we need to store some Spark-specific information in the table properties, and we always use `spark.sql.` as the prefix for the property keys.

Shall we follow that convention and add a prefix here as well?

EDIT:
Since data source implementations need to know these properties, shall we just document these special properties?

Contributor (Author):
I've added validations so that the properties and the clauses cannot both be used. Setting "comment" and using a COMMENT clause will result in an AnalysisException.

Passing these as well-known properties was included in the SPIP, but exactly which properties are used should be documented. I'll open a documentation issue and add it as a blocker for the 3.0 release.

If we want to define a prefix for the clauses that are passed as properties, that sounds fine to me. What is a reasonable prefix?

Contributor (Author):
With the latest changes, conflicting properties are no longer allowed.

Also, I opened https://issues.apache.org/jira/browse/SPARK-27708 to track documentation.
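To make the resulting behavior concrete, the check could look something like this sketch (the reserved keys match the clauses discussed above; the helper name, location, and exact error message are assumptions):

```scala
import org.apache.spark.sql.AnalysisException

// Hedged sketch: reject explicit table properties that collide with the
// dedicated USING / LOCATION / COMMENT clauses, as described above.
private val ReservedProperties = Seq("provider", "location", "comment")

private def validateProperties(properties: Map[String, String]): Unit = {
  ReservedProperties.foreach { reserved =>
    if (properties.contains(reserved)) {
      throw new AnalysisException(
        s"Cannot use reserved table property '$reserved'; use the corresponding clause instead")
    }
  }
}
```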

      .withQueryId(UUID.randomUUID().toString)
    val batchWrite = builder match {
      case supportsSaveMode: SupportsSaveMode =>
        supportsSaveMode.mode(SaveMode.Append).buildForBatch()
Contributor:
I'm not sure why we have to specifically tell the writer to use append mode. Can you elaborate? I think I'm missing something. It would be simpler to remove this branch entirely if possible.

Contributor:
I think `SupportsSaveMode` is a hack for `TableProvider` only; it seems we don't need to deal with it for `TableCatalog`. In any case, it will be removed soon.
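In other words, once `SupportsSaveMode` goes away, the branch quoted above could presumably collapse to a direct call. A sketch of the eventual shape, not the actual follow-up change:

```scala
// Hedged sketch: with no save-mode special case, the configured WriteBuilder
// can produce the batch write directly.
val batchWrite = builder.buildForBatch()
```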

    assert(!plan.resolved)
    assertAnalysisError(plan, Seq(
      "Invalid partitioning",
      "does_not_exist.z is missing or is in a map or array"))
Contributor:
Not a blocker, but we can improve the error message in the future. It's better to let users know exactly which column/field is missing. For example, for `a.b`, it's possible that column `a` exists but is not a struct, or that it doesn't have a `b` field.
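For reference, this check is driven by the `findNestedField` helper added in this PR (see the summary above). A minimal sketch of such a lookup, assuming a simple recursive walk; the real method's signature and failure reporting may differ:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Hedged sketch of a nested-field lookup: follow a multi-part name through
// nested structs, returning None when a segment is missing or the path runs
// into a non-struct type such as a map or array.
def findNestedField(struct: StructType, path: Seq[String]): Option[StructField] =
  path match {
    case Seq(name) =>
      struct.fields.find(_.name == name)
    case name +: rest =>
      struct.fields.find(_.name == name).flatMap { field =>
        field.dataType match {
          case nested: StructType => findNestedField(nested, rest)
          case _ => None // parent exists but is not a struct
        }
      }
    case _ => None // empty path
  }
```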


  before {
    spark.conf.set("spark.sql.catalog.testcat", classOf[TestInMemoryTableCatalog].getName)
    spark.conf.set("spark.sql.default.catalog", "testcat")
Contributor:
I think it has been moved to another PR?

Contributor (Author):
It has. Does this need to be removed? I marked the tests that rely on it as ignored, so we should only need to revert that. I'd rather keep this so we don't have to figure out why those tests are failing when the other PR is merged.

@SparkQA commented May 14, 2019

Test build #105387 has finished for PR 24570 at commit 99ebc00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):
Thanks, merging to master!

@cloud-fan closed this in 2da5b21 on May 15, 2019
@rdblue (Contributor, Author) commented May 15, 2019

Thanks for reviewing, @cloud-fan and @mccheah! I'll get the next few PRs posted.

mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019

Closes apache#24570 from rdblue/SPARK-24923-add-v2-ctas.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>