[SPARK-25700][SQL] Creates ReadSupport in only Append Mode in Data Source V2 write path #22688
Conversation
cc @cloud-fan and @rdblue, this is a more conservative fix, but I would prefer to revert it rather than exposing append mode in 2.4. I don't think it's a great idea that read code paths are executed in the write path.
Do we have users complaining about it? In the new write API design, a data source must provide a schema. I don't think it's practical for people to need a write-only data source that can accept data of any schema.
The point is not a write-only data source, @cloud-fan. For instance:
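(The code that followed is elided here; judging from the PR description below, it was most likely this case. A minimal sketch, with "source" standing in for any DataSourceV2 implementation:)

```scala
// With a ReadSupport created in the write path, this fails: the schema
// cannot be read back from a path that does not exist yet.
spark.range(1).write.format("source").save("non-existent-path")
```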
Other data sources such as Parquet, ORC, and JDBC won't be adaptable to the current design, since we can't read the schema from the information given in the code I provided above.
Another concern: it doesn't seem straightforward to me that a ReadSupport is created and executed in the write path. So, at the least, we should read the schema only when it's needed (when the save mode is Append). Even then the problem remains: if the target path does not exist in append mode, there is again no way to read the schema. If we disallow creating the target path (or target table) and throw an exception, that might make sense, but I think it's a breaking change to save-mode behavior. Another option we should consider is to add an option to control this, or to add an interface on the write side. I hope we can at least take this out of 2.4 and bring it to 3.0 after more discussion.
Ah, good point. I think the original design of the append operator assumes the table already exists, so a schema should be provided. If we treat a file path as a table, then append should fail in your case because the path does not exist, and we should use CTAS instead. cc @rdblue for confirmation.

That said, the change here LGTM. We should only get the relation for append mode. Furthermore, I think in the future we can't simply proxy the old

I'm not sure this should block 2.4. The data source v2 API is unstable, so breaking changes are allowed, and we won't treat data source v2 bugs as blockers. We should merge this PR to 2.4, but it's not strong enough to fail an RC.
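A rough, self-contained sketch of the decision being agreed on here — only the append branch should touch the read path. Stand-in types only; the real logic lives in `DataFrameWriter.save`, and this is not the actual patch:

```scala
object WritePathSketch {
  sealed trait WritePlan
  case class AppendWithSchemaRead(path: String) extends WritePlan
  case class DirectWrite(mode: String) extends WritePlan

  // "Only get the relation for append mode": append is the sole branch that
  // needs the existing schema, hence the sole branch that would create a
  // ReadSupport (and therefore read the schema).
  def planV2Write(mode: String, path: String): WritePlan =
    if (mode == "append") AppendWithSchemaRead(path)
    else DirectWrite(mode) // overwrite / errorifexists / ignore: write-only
}
```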
@@ -190,12 +190,13 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {

   test("simple writable data source") {
     // TODO: java implementation.
+    val writeOnlySource = classOf[SimpleWriteOnlyDataSource]
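For context, the write-only source used by this test is, roughly, a writable source whose schema lookup throws, so any schema read in the write path fails the test loudly. A sketch under that assumption — the real class lives in Spark's test sources and the overridden method name is approximate:

```scala
import org.apache.spark.sql.types.StructType

// Thrown whenever something tries to read this source's schema.
class SchemaReadAttemptException(msg: String) extends RuntimeException(msg)

class SimpleWriteOnlyDataSource extends SimpleWritableDataSource {
  // Any schema resolution against this source surfaces as a test failure.
  override def fullSchema(): StructType =
    throw new SchemaReadAttemptException("read is not supported")
}
```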
can we create a new test case?
+1 to dealing with this as a non-blocker. I understand data source v2 is under heavy development and unstable, but I strongly think we should backport this: it breaks a basic operation.
Retest this please.
retest this please
Seems the same test failed?
Hm, yeah, this passed in my local environment, so I expected it was flaky, but it seems I should fix it.
I have no idea why it passes locally. I fixed the test.
Force-pushed from 9377bc3 to fa69f9c.
    withTempPath { file =>
      val cls = classOf[SimpleWriteOnlyDataSource]
      val path = file.getCanonicalPath
      val df = spark.range(5).select('id as 'i, -'id as 'j)
The write path looks like it requires two columns:

spark/sql/core/src/test/scala/org/apache/spark/sql/sources/v2/SimpleWritableDataSource.scala (line 214 in e06da95):

out.writeBytes(s"${record.getLong(0)},${record.getLong(1)}\n")
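In other words, the writer emits one text line per row from exactly two long columns, which is why the test DataFrame above selects two columns. A standalone illustration of the mapping (not the suite itself; paths and names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// The two selected columns line up with record.getLong(0) and
// record.getLong(1) in the writer above.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val rows = spark.range(5).select('id as 'i, -'id as 'j).collect()
// e.g. the row (3, -3) would be written as the text line "3,-3"
rows.foreach(r => println(s"${r.getLong(0)},${r.getLong(1)}"))
```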
      try {
        df.write.format(cls.getName).option("path", path).mode("ignore").save()
      } catch {
        case e: SchemaReadAttemptException => fail("Schema read was attempted.", e)
      }
To validate the new code path at line 250, could you add `intercept[SchemaReadAttemptException]` and do `append`, too?
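A sketch of what the suggested addition could look like, following the shape of the existing test (hypothetical; the final patch may differ):

```scala
// Append is the one mode that still resolves the existing schema, so the
// write-only source's schema lookup should be attempted -- and observable.
intercept[SchemaReadAttemptException] {
  df.write.format(cls.getName).option("path", path).mode("append").save()
}
```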
Yup
retest this please.
@@ -116,7 +116,6 @@ class SimpleWritableDataSource extends DataSourceV2

       schema: StructType,
       mode: SaveMode,
       options: DataSourceOptions): Optional[BatchWriteSupport] = {
-    assert(DataType.equalsStructurally(schema.asNullable, this.schema.asNullable))
For modes other than Append, I think we still need this assert, don't we?
Yeah, but it's in test code and it's just a sanity check.
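For reference, a sketch of what keeping the check for the other modes could look like, per the suggestion above (hypothetical; the merged change simply removed the assert):

```scala
// Sketch: keep the structural schema check for non-append modes only.
if (mode != SaveMode.Append) {
  assert(DataType.equalsStructurally(schema.asNullable, this.schema.asNullable))
}
```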
retest this please.
LGTM
Test build #97246 has finished for PR 22688 at commit
Thank you, @HyukjinKwon, @cloud-fan, and @viirya. Merged to master.
Thanks all!
…n Data Source V2

## What changes were proposed in this pull request?

This PR proposes to partially revert 5fef6e3 so that it does not create a ReadSupport and read the schema on write in branch-2.4, since that would be too breaking a change. 5fef6e3 happened to create a ReadSupport in the write path, which ended up reading the schema from the ReadSupport at write time. For instance, this breaks the `spark.range(1).write.format("source").save("non-existent-path")` case, since there's no way to read the schema from "non-existent-path".

See also #22009 (comment)
See also #22688
See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html

## How was this patch tested?

Unit test and manual tests.

Closes #22697 from HyukjinKwon/append-revert-2.4.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…urce V2 write path

## What changes were proposed in this pull request?

This PR proposes to avoid creating a ReadSupport and reading the schema on write in save modes other than Append.

apache@5fef6e3 happened to create a ReadSupport in the write path, which ended up reading the schema from the ReadSupport at write time. This breaks the `spark.range(1).write.format("source").save("non-existent-path")` case, since there's no way to read the schema from "non-existent-path".

See also apache#22009 (comment)
See also apache#22697
See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html

## How was this patch tested?

Unit test and manual tests.

Closes apache#22688 from HyukjinKwon/append-revert-2.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?

Adjust the batch write API to match the read API refactor after #23086.

The doc with the high-level ideas: https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing

Basically, it renames `BatchWriteSupportProvider` to `SupportsBatchWrite` and makes it extend `Table`, and renames `WriteSupport` to `Write`. It also cleans up some code now that the batch API is complete. This PR also removes the test from #22688; a data source must now return a table for read/write.

A few notes about future changes:
1. We will create `SupportsStreamingWrite` later for the streaming APIs.
2. We will create `SupportsBatchReplaceWhere`, `SupportsBatchAppend`, etc. for the new end-user write APIs. I think the streaming APIs will keep using `OutputMode`, and the new end-user write APIs will apply to batch only, at least in the near future.
3. We will remove `SaveMode` from the data source API: https://issues.apache.org/jira/browse/SPARK-26356

## How was this patch tested?

Existing tests.

Closes #23208 from cloud-fan/refactor-batch.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?
This PR proposes to avoid creating a ReadSupport and reading the schema on write in save modes other than Append.

5fef6e3 happened to create a ReadSupport in the write path, which ended up reading the schema from the ReadSupport at write time.

This breaks the

spark.range(1).write.format("source").save("non-existent-path")

case, since there's no way to read the schema from "non-existent-path".

See also #22009 (comment)
See also #22697
See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html
How was this patch tested?
Unit test and manual tests.