[SPARK-25188][SQL] Add WriteConfig to v2 write API. #22190

Closed
wants to merge 3 commits

Conversation

@rdblue rdblue commented Aug 22, 2018

What changes were proposed in this pull request?

This updates the v2 write path to a structure similar to the v2 read path's. Individual writes are configured and tracked using a WriteConfig (analogous to ScanConfig), and this config is passed to the WriteSupport methods that are specific to a single write, like commit and abort.

This new config will be used to communicate overwrite options to data sources that implement new support classes, BatchOverwriteSupport and BatchPartitionOverwriteSupport. The new config could also be used by implementations to get and hold locks to make operations atomic.

Streaming is also updated to use a StreamingWriteConfig, which carries the options that are specific to a write, like the schema, output mode, and write options.
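
To make the intended structure concrete, here is a rough sketch of the shape described above (illustrative only; the exact method names and signatures in the patch may differ):

import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;
import org.apache.spark.sql.types.StructType;

// A WriteConfig holds the state of one configured write (schema, options,
// overwrite filters, locks, ...); implementations define what goes in it.
interface WriteConfig {
}

// Per-write operations take the config instead of relying on mutable state
// in the WriteSupport instance, mirroring ScanConfig on the read path.
interface BatchWriteSupport {
  WriteConfig createWriteConfig(StructType schema, DataSourceOptions options);

  DataWriterFactory<?> createWriterFactory(WriteConfig config);

  void commit(WriteConfig config, WriterCommitMessage[] messages);

  void abort(WriteConfig config, WriterCommitMessage[] messages);
}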

How was this patch tested?

This is primarily an API change and should pass existing tests.

@rdblue rdblue changed the title SPARK-25188: Add WriteConfig to v2 write API. [SPARK-25188][SQL] Add WriteConfig to v2 write API. Aug 22, 2018
@@ -279,10 +277,7 @@ private[kafka010] class KafkaSourceProvider extends DataSourceRegister
// We convert the options argument from V2 -> Java map -> scala mutable -> scala immutable.
val producerParams = kafkaParamsForProducer(options.asMap.asScala.toMap)

KafkaWriter.validateQuery(
Contributor Author

This query validation happens in KafkaStreamingWriteSupport. It was duplicated here and in that class. Now, it happens just once when creating the write config.

* <code>$"day" === '2018-08-22'</code>, to remove that data and commit the replacement data at
* the same time.
*/
public interface BatchOverwriteSupport extends BatchWriteSupport {
Contributor Author

This interface will be used to create the WriteConfig for idempotent overwrite operations. It would be triggered by an overwrite like this (the API could be different):

df.writeTo("table").overwrite($"day" === "2018-08-22")

That would produce an OverwriteData(source, deleteFilter, query) logical plan, which would result in the exec node calling this to create the write config.
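
A hypothetical sketch of what this could look like, reusing the names above (the createOverwriteConfig method and its Filter[]-based signature are assumptions for illustration, not the final API):

import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.types.StructType;

// Illustrative only: the exec node for OverwriteData would ask the source for
// a WriteConfig that remembers which rows to delete, so that commit() can
// remove the matching data and add the replacement data in one atomic step.
interface BatchOverwriteSupport extends BatchWriteSupport {
  WriteConfig createOverwriteConfig(StructType schema,
                                    DataSourceOptions options,
                                    Filter[] deleteFilters);
}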

* <p>
* This is used to implement INSERT OVERWRITE ... PARTITIONS.
*/
public interface BatchPartitionOverwriteSupport {
Contributor Author

This interface will be used to create a WriteConfig that instructs the data source to replace partitions in the existing data with the partitions of a dataframe. The logical plan would be DynamicPartitionOverwrite.

* <p>
* This is used to implement INSERT OVERWRITE ... PARTITIONS.
*/
public interface BatchPartitionOverwriteSupport extends BatchWriteSupport {
Contributor Author

This interface will be used to create a WriteConfig that instructs the data source to replace partitions in the existing data with the partitions of a dataframe. The logical plan would be DynamicPartitionOverwrite.
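
As a rough illustration of the dynamic case (the method name is assumed for the sketch; no explicit delete filters are passed, since the partitions to replace are derived from the data being written):

import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.types.StructType;

// Illustrative only: the source replaces exactly those partitions that the
// incoming data touches and leaves all other partitions untouched, matching
// INSERT OVERWRITE ... PARTITIONS semantics with dynamic partition values.
interface BatchPartitionOverwriteSupport extends BatchWriteSupport {
  WriteConfig createDynamicOverwriteConfig(StructType schema,
                                           DataSourceOptions options);
}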


/**
* A [[BatchWriteSupport]] used to hook V2 stream writers into a microbatch plan. It implements
* the non-streaming interface, forwarding the epoch ID determined at construction to a wrapped
* streaming write support.
*/
- class MicroBatchWritSupport(eppchId: Long, val writeSupport: StreamingWriteSupport)
+ class MicroBatchWriteSupport(eppchId: Long, val writeSupport: StreamingWriteSupport)
Contributor Author

This fixed a typo in the class name.

@rdblue
Contributor Author

rdblue commented Aug 22, 2018

This is related to #21308, which adds DeleteSupport. Both BatchOverwriteSupport and DeleteSupport use the same input to remove data (Filter[]) and can reject deletes that don't align with partition boundaries.
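
For example, an implementation could validate the incoming Filter[] against its partition columns like this (a sketch; the helper class is hypothetical):

import java.util.Set;

import org.apache.spark.sql.sources.Filter;

// Hypothetical helper: reject delete filters that reference non-partition
// columns, so an overwrite or delete stays aligned with partition boundaries.
final class PartitionAlignedFilters {
  static void validate(Filter[] deleteFilters, Set<String> partitionColumns) {
    for (Filter filter : deleteFilters) {
      for (String column : filter.references()) {
        if (!partitionColumns.contains(column)) {
          throw new IllegalArgumentException(
              "Cannot delete by non-partition column: " + column);
        }
      }
    }
  }
}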

@rdblue
Contributor Author

rdblue commented Aug 23, 2018

Retest this please.

@rdblue rdblue force-pushed the SPARK-25188-add-write-config branch from 37d5087 to 847300f on August 23, 2018 at 17:49
@SparkQA

SparkQA commented Aug 23, 2018

Test build #95174 has finished for PR 22190 at commit 847300f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Aug 23, 2018

@rxin, @cloud-fan, @jose-torres: this is the update to add WriteConfig. There's one failed test that I think is unrelated, so this is ready for you to have a look. This will probably need to be updated for the current changes under discussion.

@rdblue
Contributor Author

rdblue commented Aug 23, 2018

Retest this please

@SparkQA

SparkQA commented Aug 24, 2018

Test build #95188 has finished for PR 22190 at commit 847300f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
