
[SPARK-25700][SQL] Creates ReadSupport in only Append Mode in Data Source V2 write path #22688

Closed
wants to merge 5 commits into master from HyukjinKwon:append-revert-2

Conversation

HyukjinKwon
Member

HyukjinKwon commented Oct 10, 2018

What changes were proposed in this pull request?

This PR proposes to avoid creating a ReadSupport and reading the schema at write time in save modes other than Append.

5fef6e3 happened to create a ReadSupport in the write path, which ended up reading the schema from the ReadSupport during writes.

This breaks the `spark.range(1).format("source").write.save("non-existent-path")` case, since there is no way to read a schema from "non-existent-path".

See also #22009 (comment)
See also #22697
See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html
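
For concreteness, here is a minimal sketch of the failing pattern. The format name "source" is a stand-in for any user DataSourceV2 implementation on the classpath, and the snippet assumes a local SparkSession; only the Spark calls shown are real APIs.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Works on 2.3. On the 2.4 RC the write path first creates a ReadSupport
// to infer a schema, and fails because "non-existent-path" cannot be read.
spark.range(1).write.format("source").save("non-existent-path")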

How was this patch tested?

Unit test and manual tests.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 10, 2018

cc @cloud-fan and @rdblue, this is a more conservative fix, but I would prefer to revert it rather than exposing append mode in 2.4. I don't think it's a great idea that read code paths are executed in the write path.


@cloud-fan
Contributor

Do we have users complaining about it? In the new write API design, a data source must provide a schema.

I don't think it's practical for people to need a write-only data source that can accept data of any schema.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 10, 2018

The point is not a write-only data source, @cloud-fan. For instance, take spark.range(1).format("source").write.save("non-existent-path"): there's no way to read the schema. There have been no complaints so far because this problem appears only in the RC; it works fine in 2.3. I faced this problem myself and was surprised that the read code path is executed in the write path.

@HyukjinKwon
Member Author

Other data sources such as Parquet, ORC, and JDBC won't be able to adopt the current design, since we can't read the schema from the information given in the code I provided above.

@HyukjinKwon
Copy link
Member Author

Another concern is that it doesn't seem straightforward to me that a ReadSupport is created and executed in the write path.

So, at least, we should only read the schema when it's needed (when the save mode is Append). Even then the problem still exists: if the target path does not exist in Append mode, there is again no way to read the schema.

If we instead disallow creating the target path (or target table) and throw an exception, that might make sense, but I think that would be a breaking change in save-mode behavior.

Another option we should consider is to add an option to control this, or to add an interface on the write side.

I hope we can at least take this out of 2.4 and bring it back in 3.0 after more discussion.
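
To make the proposal concrete, here is a self-contained, hypothetical sketch of the control flow being argued for. The Source trait and save helper below are made up to mirror the DSv2 ReadSupport/WriteSupport split; they are not Spark's API.

sealed trait Mode
case object Append extends Mode
case object Overwrite extends Mode
case object ErrorIfExists extends Mode

trait Source {
  def createReadSupport(): Unit   // may fail when the target does not exist
  def createWriteSupport(): Unit
}

def save(source: Source, mode: Mode): Unit = mode match {
  case Append =>
    // Only Append needs the existing schema, so only Append should
    // touch the read path.
    source.createReadSupport()
    source.createWriteSupport()
  case _ =>
    source.createWriteSupport()
}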

@cloud-fan
Contributor

Ah, good point. I think the original design of the append operator assumes the table already exists, so a schema should be provided. If we treat a file path as a table, then append should fail in your case because the path does not exist, and we should use CTAS instead. cc @rdblue for confirmation.

That said, the change here LGTM. We should only get the relation for Append mode.

Furthermore, I think in the future we can't simply proxy the old SaveMode write APIs to the new write APIs, as the behavior can differ; e.g. currently we can write data to a non-existent path with append mode for file sources, but the append operator cannot.

I'm not sure this should block 2.4. The Data Source V2 API is unstable, so breaking changes are allowed, and we won't treat Data Source V2 bugs as blockers. We should merge this PR into 2.4, but it's not strong enough to fail an RC.

@@ -190,12 +190,13 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {

test("simple writable data source") {
  // TODO: java implementation.
  val writeOnlySource = classOf[SimpleWriteOnlyDataSource]
Contributor


can we create a new test case?

@HyukjinKwon
Member Author

+1 to dealing with this as a non-blocker. I understand Data Source V2 is under heavy development and unstable, but I strongly think we should backport this: it breaks a basic operation.


@dongjoon-hyun
Member

Retest this please.


@HyukjinKwon
Member Author

retest this please


@viirya
Member

viirya commented Oct 11, 2018

Seems like the same test failed?

@HyukjinKwon
Member Author

Hm, yeah, this passed locally, so I expected it was flaky, but it seems I should fix it.

@HyukjinKwon
Member Author

I have no idea why it passes locally. I've fixed the test.

withTempPath { file =>
  val cls = classOf[SimpleWriteOnlyDataSource]
  val path = file.getCanonicalPath
  val df = spark.range(5).select('id as 'i, -'id as 'j)
Member Author


The write path looks like it requires two columns:

out.writeBytes(s"${record.getLong(0)},${record.getLong(1)}\n")

  df.write.format(cls.getName).option("path", path).mode("ignore").save()
} catch {
  case e: SchemaReadAttemptException => fail("Schema read was attempted.", e)
}
Member


To validate the new code path at line 250, could you add intercept[SchemaReadAttemptException] and do an append, too?

Member Author


Yup
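
For reference, a minimal sketch of what that addition might look like, reusing cls, path, and df from the test above (intercept is ScalaTest's; the mode string and exception name come from the surrounding test code):

intercept[SchemaReadAttemptException] {
  // Append is the one mode that still creates a ReadSupport after this
  // change, so the write-only source is expected to throw here.
  df.write.format(cls.getName).option("path", path).mode("append").save()
}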


@viirya
Member

viirya commented Oct 11, 2018

retest this please.

@@ -116,7 +116,6 @@ class SimpleWritableDataSource extends DataSourceV2
      schema: StructType,
      mode: SaveMode,
      options: DataSourceOptions): Optional[BatchWriteSupport] = {
    assert(DataType.equalsStructurally(schema.asNullable, this.schema.asNullable))
Member

viirya commented Oct 11, 2018


For modes other than Append, I think we still need this assert, don't we?

Member Author


Yeah, but it's in test code and it's just a sanity check.
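
For what it's worth, a sketch of how the check could be kept for the other modes, as viirya suggests; this is hypothetical and simply guards the removed assert with the save mode passed to createBatchWriteSupport:

if (mode != SaveMode.Append) {
  // In non-Append modes the schema comes from the DataFrame being written,
  // so the structural sanity check can still run against the fixed schema.
  assert(DataType.equalsStructurally(schema.asNullable, this.schema.asNullable))
}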


@viirya
Member

viirya commented Oct 11, 2018

retest this please.

@viirya
Member

viirya commented Oct 11, 2018

LGTM

@SparkQA

SparkQA commented Oct 11, 2018

Test build #97246 has finished for PR 22688 at commit 2a42253.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you, @HyukjinKwon, @cloud-fan, and @viirya.

Merged to master.

asfgit closed this in 83e19d5 Oct 11, 2018
@HyukjinKwon
Member Author

Thanks all!

asfgit pushed a commit that referenced this pull request Oct 15, 2018
…n Data Source V2

## What changes were proposed in this pull request?

This PR proposes to partially revert 5fef6e3 so that it does not create a ReadSupport or read the schema at write time in branch-2.4, since that is too breaking a change.

5fef6e3 happened to create a ReadSupport in the write path, which ended up reading the schema from the ReadSupport during writes.

For instance, this breaks the `spark.range(1).format("source").write.save("non-existent-path")` case, since there is no way to read a schema from "non-existent-path".

See also #22009 (comment)
See also #22688
See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html

## How was this patch tested?

Unit test and manual tests.

Closes #22697 from HyukjinKwon/append-revert-2.4.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon deleted the append-revert-2 branch October 16, 2018 12:41
bavardage pushed a commit to palantir/spark that referenced this pull request Oct 25, 2018
…urce V2 write path

## What changes were proposed in this pull request?

This PR proposes to avoid creating a ReadSupport and reading the schema at write time in save modes other than Append.

apache@5fef6e3 happened to create a ReadSupport in the write path, which ended up reading the schema from the ReadSupport during writes.

This breaks the `spark.range(1).format("source").write.save("non-existent-path")` case, since there is no way to read a schema from "non-existent-path".

See also apache#22009 (comment)
See also apache#22697
See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html

## How was this patch tested?

Unit test and manual tests.

Closes apache#22688 from HyukjinKwon/append-revert-2.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
asfgit pushed a commit that referenced this pull request Jan 15, 2019
## What changes were proposed in this pull request?

Adjust the batch write API to match the read API refactor after #23086

The doc with high-level ideas:
https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing

Basically it renames `BatchWriteSupportProvider` to `SupportsBatchWrite` and makes it extend `Table`, and renames `WriteSupport` to `Write`. It also cleans up some code as the batch API is completed.

This PR also removes the test from #22688. Now a data source must return a table for read/write.

A few notes about future changes:
1. We will create `SupportsStreamingWrite` later for streaming APIs
2. We will create `SupportsBatchReplaceWhere`, `SupportsBatchAppend`, etc. for the new end-user write APIs. I think the streaming APIs will keep using `OutputMode`, and the new end-user write APIs will apply to batch only, at least in the near future.
3. We will remove `SaveMode` from data source API: https://issues.apache.org/jira/browse/SPARK-26356

## How was this patch tested?

existing tests

Closes #23208 from cloud-fan/refactor-batch.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…urce V2 write path

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019