[SPARK-25530][SQL] data source v2 API refactor (batch write) #23208
Conversation
* The major responsibility of this interface is to return a {@link Table} for read/write. If you
* want to allow end-users to write data to non-existing tables via write APIs in `DataFrameWriter`
* with `SaveMode`, you must return a {@link Table} instance even if the table doesn't exist. The
* table schema can be empty in this case.
Generally only the file source (and maybe the JDBC data source) needs to do this. For new data sources, I'd expect them to either implement only the new write APIs (replaceWhere, append, etc.), or define the behavior of `SaveMode.Append` clearly so that it fails if the table doesn't exist.
What does it mean to write to a non-existing table? If you're writing somewhere, the table must exist.
This is for creating a table directly from configuration and an implementation class in the DataFrameWriter API. The target of the write still needs to exist.
"Exist" is a relative concept, I suppose. I think we need to somehow allow for create-on-write functionality, even if many table providers won't want to support it.
@jose-torres, create on write is done by CTAS. It should not be left up to the source whether to fail or create.
I think the confusion here is that this is a degenerate case where Spark has no ability to interact with the table's metadata. Spark must assume that it exists because the caller is writing to it.
The caller is indicating that a table exists, is identified by some configuration, and that a specific implementation can be used to write to it. That's what happens today when source implementations are directly specified.
Maybe it should also be part of the `TableProvider` contract that if the table can't be located, it throws an exception?
I think we can remove SaveMode right away. We don't need to break existing use cases if we add the OverwriteData plan and use it when the user's mode is Overwrite. That helps us get to the point where we can integrate SQL on top of this faster.
I'm not convinced it's safe to remove `SaveMode` right away, when only an `Append` operator is implemented currently. If we do that, `DataFrameWriter.save` needs to throw an exception in a lot of cases, for every mode except append. I don't think this is acceptable right now.

Can we discuss the removal of `SaveMode` at least after all the necessary new write operators are implemented?
`SaveMode` is incompatible with the SPIP to standardize behavior that was voted on and accepted. The save mode in `DataFrameWriter` must be used to create v2 plans that have well-defined behavior, and cannot be passed to implementations in the final version of the v2 read/write API.

I see no reason to put off removing `SaveMode` from the API. If we remove it now, we will avoid having more versions of this API that are fundamentally broken, and we will avoid more implementations that rely on it without being aware that it will be removed.

To your point about whether it is safe: the only cases that are actually used are `SaveMode.Overwrite` and `SaveMode.Append`. To replace those, all that needs to happen is to define what kind of overwrite should happen here (dynamic or truncate).

I can supply the logical plan and physical implementation in a follow-up PR because I already have all of this written and waiting to go in. Or, I can open a PR to merge first if you'd like these changes to depend on that implementation.
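(For illustration, here is a minimal, self-contained sketch of that mapping. The plan names and the `dynamicPartitionOverwrite` flag are simplified stand-ins for the proposed v2 operators, not the actual Spark classes.)

```scala
// Simplified stand-ins for the proposed v2 write operators -- not the real Spark classes.
sealed trait V2WritePlan
case object AppendData extends V2WritePlan
case object DynamicPartitionOverwrite extends V2WritePlan // replace only the partitions being written
case object TruncateAndAppend extends V2WritePlan         // remove all existing data, then write

sealed trait Mode
case object Append extends Mode
case object Overwrite extends Mode

object SaveModeMapping {
  // The decision Spark itself would make, instead of handing SaveMode to the source.
  def toV2Plan(mode: Mode, dynamicPartitionOverwrite: Boolean): V2WritePlan = mode match {
    case Append                                 => AppendData
    case Overwrite if dynamicPartitionOverwrite => DynamicPartitionOverwrite
    case Overwrite                              => TruncateAndAppend
  }
}
```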
What about the file source behavior difference between `SaveMode.Append` and the new append operator? Are you saying we should accept it and ask users to change their code? The file source is widely used with the `df.write.save` API...
I suggest we use the v1 file source as the basis for the behavior for v2. You can see an implementation of that behavior in my other comment. If the table exists, overwrite is a dynamic partition overwrite, append is an append, and ignore does nothing. If the table doesn't exist, then the operation is a CTAS. (Note that we can also check properties to correctly mirror the behavior for static overwrite.)
Your concern is addressed by not using the `Append` plan when the file source would have needed to create the table.

The critical difference is that this behavior is all implemented in Spark instead of passing `SaveMode` to the source. If you pass `SaveMode` to the source, Spark can't guarantee that it is consistent across sources. We are trying to fix inconsistent behavior in v2.
case table: SupportsBatchWrite =>
  val relation = DataSourceV2Relation.create(table, dsOptions)
  // TODO: revisit it. We should not create the `AppendData` operator for `SaveMode.Append`.
  // We should create new end-users APIs for the `AppendData` operator.
According to the discussion in #22688 (comment), the behavior of the append operator and `SaveMode.Append` can be different. We should revisit it when we have the new end-user write APIs.
The example in the referenced comment is this:
spark.range(1).write.format("source").save("non-existent-path")
If a path for a path-based table doesn't exist, then I think that the table doesn't exist. If a table doesn't exist, then the operation for `save` should be CTAS instead of AppendData.
Here, I think the right behavior is to check whether the provider returns a table. If it doesn't, then the table doesn't exist and the plan should be CTAS. If it does, then it must provide the schema used to validate the AppendData operation. Since we don't currently have CTAS, this should throw an exception stating that the table doesn't exist and can't be created.
More generally, the meaning of SaveMode with v1 is not always reliable. I think the right approach is what @cloud-fan suggests: create a new write API for v2 tables that is clear for the new logical plans (I've proposed one and would be happy to open a PR). Once the logical plans are in place, we can go back through this API and move it over to v2 where the behaviors match.
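(A small, self-contained sketch of that check, with hypothetical names. This is a simplified model of the behavior described above, without CTAS, and is distinct from the branch code quoted below.)

```scala
// Simplified model: resolve the table via the provider; without CTAS, a missing table is an
// error rather than a create, and an existing table's schema is what validates the append.
final case class TableSchema(fieldNames: Seq[String])
final case class ResolvedTable(name: String, schema: TableSchema)

object SaveWithoutCtas {
  def planSave(maybeTable: Option[ResolvedTable], querySchema: TableSchema): String =
    maybeTable match {
      case None =>
        throw new IllegalStateException(
          "Table does not exist and cannot be created: CTAS is not implemented yet")
      case Some(t) if t.schema.fieldNames != querySchema.fieldNames =>
        throw new IllegalArgumentException(
          s"Cannot append to ${t.name}: query columns ${querySchema.fieldNames.mkString(", ")} " +
            s"do not match table columns ${t.schema.fieldNames.mkString(", ")}")
      case Some(t) =>
        s"AppendData to ${t.name}" // placeholder for building the real AppendData plan
    }
}
```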
Here is what my branch uses for this logic:

val maybeTable = provider.getTable(identifier)
val exists = maybeTable.isDefined
(exists, mode) match {
  case (true, SaveMode.ErrorIfExists) =>
    throw new AnalysisException(s"Table already exists: ${identifier.quotedString}")
  case (true, SaveMode.Overwrite) =>
    val relation = DataSourceV2Relation.create(
      catalog.name, identifier, maybeTable.get, options)
    runCommand(df.sparkSession, "insertInto") {
      OverwritePartitionsDynamic.byName(relation, df.logicalPlan)
    }
  case (true, SaveMode.Append) =>
    val relation = DataSourceV2Relation.create(
      catalog.name, identifier, maybeTable.get, options)
    runCommand(df.sparkSession, "save") {
      AppendData.byName(relation, df.logicalPlan)
    }
  case (false, _) =>
    runCommand(df.sparkSession, "save") {
      CreateTableAsSelect(catalog, identifier, Seq.empty, df.logicalPlan, options,
        ignoreIfExists = mode == SaveMode.Ignore)
    }
  case _ =>
    // table exists and mode is ignore
}
The identifier handling would be different, but the basic idea is the same.
Also, in our environment we always use dynamic overwrites for the overwrite case. We would need to handle that depending on the environment.
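(One way to handle that, as a sketch: let the choice follow session configuration rather than always being dynamic. The sketch below assumes the existing `spark.sql.sources.partitionOverwriteMode` setting is the signal and uses simplified types, not the real planner code.)

```scala
// Pick the overwrite flavor from configuration instead of hard-coding dynamic overwrite.
sealed trait OverwriteFlavor
case object DynamicOverwrite extends OverwriteFlavor // replace only the partitions being written
case object StaticOverwrite  extends OverwriteFlavor // truncate the matching data, then write

object OverwriteChoice {
  def overwriteFlavor(conf: Map[String, String]): OverwriteFlavor =
    conf.getOrElse("spark.sql.sources.partitionOverwriteMode", "static").toLowerCase match {
      case "dynamic" => DynamicOverwrite
      case _         => StaticOverwrite
    }
}
```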
Yeah, that's why I only left a comment and asked to revisit it later. I think we can see a clearer picture after we migrate the file source.
I see no reason to make this API depend on migrating the file source. We know that `SaveMode` must be removed. It makes no sense to create a broken file source implementation and then remove this afterward.
output: Seq[AttributeReference],
options: Map[String, String],
userSpecifiedSchema: Option[StructType] = None)
// TODO: use a simple case insensitive map instead.
I'll do it in my next PR.
Thanks for posting this PR @cloud-fan! I'll have a look in the next day or so.
@@ -25,14 +25,14 @@
import org.apache.spark.sql.types.StructType;

/**
 * A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to
 * A mix-in interface for {@link Table}. Data sources can implement this interface to
 * provide data writing ability for batch processing.
 *
 * This interface is used to create {@link BatchWriteSupport} instances when end users run
I don't have a better name in mind, so I'll leave it as `WriteSupport` for now. Better naming is welcome, to match `Scan`!
What if we just call it `BatchWrite`?
* provide data writing ability for batch processing.
*
* This interface is used to create {@link BatchWriteSupport} instances when end users run
* {@code Dataset.write.format(...).option(...).save()}.
*/
@Evolving
public interface BatchWriteSupportProvider extends DataSourceV2 {
public interface SupportsBatchWrite extends Table {
To me, it is quite confusing to have both `BatchWriteSupport` and `SupportsBatchWrite`.
That's why I left #23208 (comment). Naming suggestions are welcome!
`Table` exposes `newScanBuilder` without an interface. Why should the write side be different? Doesn't Spark support sources that are read-only and write-only?

I think that both reads and writes should use interfaces to mix support into `Table`, or both should be exposed by `Table` and throw `UnsupportedOperationException` by default, not a mix of the two options.

If `newWriteBuilder` were added to `Table`, then this interface wouldn't be necessary and the name problem goes away.
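(To make the two consistent options concrete, here is a rough sketch in Scala. These are trait shapes only, with assumed method and option types, not the actual Java interfaces under review.)

```scala
trait ScanBuilder
trait WriteBuilder

// Option 1: Table stays minimal; read and write support are both mix-ins.
trait Table { def name: String }
trait SupportsRead  extends Table { def newScanBuilder(options: Map[String, String]): ScanBuilder }
trait SupportsWrite extends Table { def newWriteBuilder(options: Map[String, String]): WriteBuilder }

// Option 2: Table exposes both and throws by default; sources override what they support.
trait TableWithDefaults {
  def name: String
  def newScanBuilder(options: Map[String, String]): ScanBuilder =
    throw new UnsupportedOperationException(s"$name does not support reads")
  def newWriteBuilder(options: Map[String, String]): WriteBuilder =
    throw new UnsupportedOperationException(s"$name does not support writes")
}
```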
I do think read-only or write-only is a necessary feature, according to what I've seen on the dev list. Maybe we should move `newScanBuilder` from `Table` to the mixin traits.
I'm fine either way, as long as we are consistent between the read and write sides.
@cloud-fan, I see that this adds […]. The main parts that we discussed there were: […]

We don't need to add the overwrite mix-ins here, but I would expect to see a WriteBuilder that returns a Write. The Write would expose BatchWrite and StreamWrite (if they are different), or could directly expose the WriteFactory, commit, abort, etc. WriteBuilder would be extensible so that SupportsOverwrite and SupportsDynamicOverwrite can be added as mix-ins at some point.
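(As a rough sketch of that life-cycle: the names follow the comment above but are simplified stand-ins, not the finalized API; the filter argument in particular is only a placeholder.)

```scala
// The builder configures a write, the resulting Write exposes the batch implementation,
// and overwrite capabilities are added as optional mix-ins on the builder.
trait WriterCommitMessage
trait DataWriterFactory // placeholder for the factory that creates per-partition writers

trait BatchWrite {
  def createWriterFactory(): DataWriterFactory
  def commit(messages: Array[WriterCommitMessage]): Unit
  def abort(messages: Array[WriterCommitMessage]): Unit
}

trait Write { def toBatch: BatchWrite } // a streaming variant could live alongside toBatch

trait WriteBuilder { def build(): Write }
trait SupportsOverwrite        extends WriteBuilder { def overwrite(filter: String): WriteBuilder }
trait SupportsDynamicOverwrite extends WriteBuilder { def overwriteDynamicPartitions(): WriteBuilder }
```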
options: Map[String, String],
userSpecifiedSchema: Option[StructType] = None)
// TODO: use a simple case insensitive map instead.
options: DataSourceOptions)
Why change this now, when DataSourceOptions will be replaced? I would say just leave it as a map and update it once later.
Because this makes the code cleaner; otherwise I need to write more code to convert a map to `DataSourceOptions` multiple times inside `DataSourceV2Relation`.

I don't have a strong preference here and just picked the easiest approach for me. If you think using a map here is clearer, I can add the extra code.
A private method to do that existed in the past. Why not just revive it?
It was done in multiple places before:
https://github.com/apache/spark/pull/23208/files#diff-35ba4ffb5ccb9b18b43226f1d5effa23L62
https://github.com/apache/spark/pull/23208/files#diff-35ba4ffb5ccb9b18b43226f1d5effa23L153
https://github.com/apache/spark/pull/23208/files#diff-35ba4ffb5ccb9b18b43226f1d5effa23L171
If you strongly prefer it, I can follow that approach and update the code.
I think it is a good idea to avoid needless churn, so I would prefer using the original Map[String, String].
@rdblue I tried to add […]. Because of this, I feel we don't need […]. Let me know if you have other ideas. Thanks for your review!
@cloud-fan, what are you suggesting to use as a design? If you think this shouldn't mirror the read side, then let's be clear on what it should look like. Maybe that's a design doc, or maybe that's a discussion thread on the mailing list. Whatever option we go for, we still need to have a plan for exposing the replace-by-filter and replace-dynamic-partitions methods, whatever they end up being. We also need the life-cycle to match.
Let's move the high-level discussion to the doc: https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing
def newScanBuilder(): ScanBuilder = {
  val dsOptions = new DataSourceOptions(options.asJava)
  table.newScanBuilder(dsOptions)
def newWriteSupport(inputSchema: StructType, mode: SaveMode): Optional[BatchWriteSupport] = {
Nit: add a comment for the method, especially about when it will return None, although it is explained in `SupportsBatchWrite.createBatchWriteSupport`.
I would hold off on this discussion for now. I think this is going to require significant changes.
import org.apache.spark.sql.SaveMode;

// A temporary mixin trait for `WriteBuilder` to support `SaveMode`. Will be removed before
// Spark 3.0 when all the new write operators are finished.
What is the blocker issue that tracks this?
https://issues.apache.org/jira/browse/SPARK-26356
Let me put it in the doc.
/**
 * An interface for building the {@link BatchWrite}. Implementations can mix in interfaces like
 * {@link SupportsSaveMode} to support different ways to write data to data sources.
`SupportsSaveMode` is essentially deprecated, so the documentation should recommend using `SupportsOverwrite` or some other mix-in.
public interface WriteBuilder {

/**
 * Returns a new builder with the `queryId`. `queryId` is a unique string of the query. It's
The important part of this documentation is that it passes Spark's query ID. The fact that this returns a builder for method chaining is secondary, so I would recommend noting that in `@return` but not in the main description. Otherwise, it appears like this could be a refinement pattern (returns a new builder) and that isn't the intent.
}

/**
 * Returns a new builder with the schema of the input data to write.
Same as above: this should state that it sets the schema in the builder, not that it returns a new builder.
*
* Note that, the returned {@link BatchWrite} can be null if the implementation supports SaveMode,
* to indicate that no writing is needed. We can clean it up after removing
* {@link SupportsSaveMode}.
Returning null for now sounds fine to me.
* {@link SupportsSaveMode} to support different ways to write data to data sources.
*/
@Evolving
public interface WriteBuilder {
I think this documentation needs to state that unless otherwise modified by a mix-in, the write that is configured by this builder is to append data without affecting existing data.
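(A short, self-contained sketch of what that default means at the call site. The builder below is hypothetical and only mirrors the interface under review; method names and types are assumptions. Configuring nothing beyond the query ID and schema yields a plain append.)

```scala
// Hypothetical builder modeling the documented default: if no overwrite-style mix-in is
// invoked, the configured write is a plain append that leaves existing data untouched.
final case class SketchWriteBuilder(
    queryId: String = "",
    columns: Seq[String] = Nil,
    truncateFirst: Boolean = false) {

  def withQueryId(id: String): SketchWriteBuilder = copy(queryId = id)
  def withInputColumns(fields: Seq[String]): SketchWriteBuilder = copy(columns = fields)
  def truncate(): SketchWriteBuilder = copy(truncateFirst = true) // mix-in behavior, modeled inline

  def buildForBatch(): String =
    if (truncateFirst) s"truncate then write (query $queryId, columns ${columns.mkString(",")})"
    else s"append (query $queryId, columns ${columns.mkString(",")})" // the default
}

object SketchUsage {
  // Nothing beyond the query ID and schema is configured, so this describes an append.
  val describe: String =
    SketchWriteBuilder().withQueryId("q-123").withInputColumns(Seq("id", "data")).buildForBatch()
}
```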
}

case _ => throw new AnalysisException(
  s"data source ${table.name} does not support SaveMode")
The error should be that the source doesn't support the mode, not that it doesn't support SaveMode generally.
@cloud-fan, I think this is looking good overall. The use of […]
Hi @rdblue, thanks for the review! It will be great to finish all the write operations soon, and adding overwrite is a good next step!
I recall this was discussed during the latest sync call? I think we are good?
  }
}
if (mode == SaveMode.Overwrite) {
  fs.delete(hadoopPath, true)
IMO this is semantically different from what I expect: overwrite means that once all the data is there, replace the existing dir. This is implemented as: remove the existing dir, then place the data there. The failure-mode behavior is different.
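(The difference being pointed out, as a sketch with Hadoop `FileSystem` calls. The staged variant is only an illustration of "replace the existing dir once all the data is there", not what the test source does; the helper names are hypothetical.)

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

object OverwriteSemanticsSketch {
  // Delete-then-write: if the write fails midway, the old data is already gone.
  def overwriteByDeletingFirst(fs: FileSystem, target: Path)(write: Path => Unit): Unit = {
    fs.delete(target, true)
    write(target)
  }

  // Write-then-swap: stage the new data first and replace the target only once it is
  // complete, so a failed write leaves the old data in place.
  def overwriteByStaging(fs: FileSystem, target: Path, staging: Path)(write: Path => Unit): Unit = {
    write(staging)
    fs.delete(target, true)
    if (!fs.rename(staging, target)) {
      throw new java.io.IOException(s"Failed to rename $staging to $target")
    }
  }
}
```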
Because of the ambiguity of SaveMode, I think a data source is free to define its own behavior. It's fine that this simple data source in the test defines the overwrite behavior in this way.
I think we can do it this way in the test case to keep it simple.
LGTM
LGTM
Thanks! Merged to master.
What changes were proposed in this pull request?
Adjust the batch write API to match the read API refactor after #23086
The doc with high-level ideas:
https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing
Basically it renames `BatchWriteSupportProvider` to `SupportsBatchWrite` and makes it extend `Table`. It renames `WriteSupport` to `Write`. It also cleans up some code as the batch API is completed.

This PR also removes the test from #22688. Now a data source must return a table for read/write.
A few notes about future changes:
1. We will create `SupportsStreamingWrite` later for streaming APIs.
2. We will create `SupportsBatchReplaceWhere`, `SupportsBatchAppend`, etc. for the new end-user write APIs. I think streaming APIs would remain to use `OutputMode`, and new end-user write APIs will apply to batch only, at least in the near future.
3. We will remove `SaveMode` from the data source API: https://issues.apache.org/jira/browse/SPARK-26356

How was this patch tested?
existing tests