
[SPARK-25700][SQL] Creates ReadSupport in only Append Mode in Data Source V2 write path #22688

Closed
wants to merge 5 commits into master from HyukjinKwon:append-revert-2

Conversation

HyukjinKwon
Member

HyukjinKwon commented Oct 10, 2018

What changes were proposed in this pull request?

This PR proposes to avoid creating a ReadSupport and reading the schema at write time in save modes other than Append.

5fef6e3 happened to create a ReadSupport in the write path, which ended up reading the schema from the ReadSupport during writes.

This breaks the `spark.range(1).format("source").write.save("non-existent-path")` case, since there is no way to read a schema from "non-existent-path".

See also #22009 (comment)
See also #22697
See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html
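
For concreteness, here is a minimal sketch of the failing pattern. The format name "source" is a stand-in for any user DataSourceV2 implementation on the classpath, and the snippet assumes a local SparkSession; only the Spark calls shown are real APIs.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Works on 2.3. On the 2.4 RC the write path first creates a ReadSupport
// to infer a schema, and fails because "non-existent-path" cannot be read.
spark.range(1).write.format("source").save("non-existent-path")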

How was this patch tested?

Unit test and manual tests.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 10, 2018

cc @cloud-fan and @rdblue, this is a more conservative fix, but I would prefer to revert it rather than exposing append mode in 2.4. I don't think it's a great idea that read code paths are executed in the write path.


@cloud-fan
Contributor

Do we have users complaining about it? In the new write API design, a data source must provide a schema.

I don't think it's practical for people to need a write-only data source that can accept data of any schema.

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 10, 2018

The point is not a write-only data source, @cloud-fan. For instance, take spark.range(1).format("source").write.save("non-existent-path"): there's no way to read the schema. There have been no complaints so far because this problem appears only in the RC; it works fine in 2.3. I faced this problem myself and was surprised that the read code path is executed in the write path.

@HyukjinKwon
Member Author

Other data sources such as Parquet, ORC, and JDBC won't be able to adopt the current design, since we can't read the schema from the information given in the code I provided above.

@HyukjinKwon
Copy link
Member Author

Another concern is that it doesn't seem straightforward to me that a ReadSupport is created and executed in the write path.

So, at least, we should only read the schema when it's needed (when the save mode is Append). Even then the problem still exists: if the target path does not exist in Append mode, there is again no way to read the schema.

If we instead disallow creating the target path (or target table) and throw an exception, that might make sense, but I think that would be a breaking change in save-mode behavior.

Another option we should consider is to add an option to control this, or to add an interface on the write side.

I hope we can at least take this out of 2.4 and bring it back in 3.0 after more discussion.
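
To make the proposal concrete, here is a self-contained, hypothetical sketch of the control flow being argued for. The Source trait and save helper below are made up to mirror the DSv2 ReadSupport/WriteSupport split; they are not Spark's API.

sealed trait Mode
case object Append extends Mode
case object Overwrite extends Mode
case object ErrorIfExists extends Mode

trait Source {
  def createReadSupport(): Unit   // may fail when the target does not exist
  def createWriteSupport(): Unit
}

def save(source: Source, mode: Mode): Unit = mode match {
  case Append =>
    // Only Append needs the existing schema, so only Append should
    // touch the read path.
    source.createReadSupport()
    source.createWriteSupport()
  case _ =>
    source.createWriteSupport()
}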

@cloud-fan
Contributor

Ah, good point. I think the original design of the append operator assumes the table already exists, so a schema should be provided. If we treat a file path as a table, then append should fail in your case because the path does not exist, and we should use CTAS instead. cc @rdblue for confirmation.

That said, the change here LGTM. We should only get the relation for Append mode.

Furthermore, I think in the future we can't simply proxy the old SaveMode write APIs to the new write APIs, as the behavior can differ; e.g. currently we can write data to a non-existent path with append mode for file sources, but the append operator cannot.

I'm not sure this should block 2.4. The Data Source V2 API is unstable, so breaking changes are allowed, and we won't treat Data Source V2 bugs as blockers. We should merge this PR into 2.4, but it's not strong enough to fail an RC.

@@ -190,12 +190,13 @@ class DataSourceV2Suite extends QueryTest with SharedSQLContext {

test("simple writable data source") {
  // TODO: java implementation.
  val writeOnlySource = classOf[SimpleWriteOnlyDataSource]
Contributor


can we create a new test case?

@HyukjinKwon
Member Author

+1 to dealing with this as a non-blocker. I understand Data Source V2 is under heavy development and unstable, but I strongly think we should backport this: it breaks a basic operation.


@dongjoon-hyun
Member

Retest this please.


@HyukjinKwon
Member Author

retest this please


@viirya
Member

viirya commented Oct 11, 2018

Seems like the same test failed?

@HyukjinKwon
Member Author

Hm, yeah, this passed locally, so I expected it was flaky, but it seems I should fix it.

@HyukjinKwon
Member Author

I have no idea why it passes locally. I've fixed the test.

withTempPath { file =>
  val cls = classOf[SimpleWriteOnlyDataSource]
  val path = file.getCanonicalPath
  val df = spark.range(5).select('id as 'i, -'id as 'j)
Member Author


The write path looks like it requires two columns:

out.writeBytes(s"${record.getLong(0)},${record.getLong(1)}\n")

  df.write.format(cls.getName).option("path", path).mode("ignore").save()
} catch {
  case e: SchemaReadAttemptException => fail("Schema read was attempted.", e)
}
Member


To validate the new code path at line 250, could you add intercept[SchemaReadAttemptException] and do an append, too?

Member Author


Yup
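
For reference, a minimal sketch of what that addition might look like, reusing cls, path, and df from the test above (intercept is ScalaTest's; the mode string and exception name come from the surrounding test code):

intercept[SchemaReadAttemptException] {
  // Append is the one mode that still creates a ReadSupport after this
  // change, so the write-only source is expected to throw here.
  df.write.format(cls.getName).option("path", path).mode("append").save()
}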


@viirya
Member

viirya commented Oct 11, 2018

retest this please.

@@ -116,7 +116,6 @@ class SimpleWritableDataSource extends DataSourceV2
      schema: StructType,
      mode: SaveMode,
      options: DataSourceOptions): Optional[BatchWriteSupport] = {
    assert(DataType.equalsStructurally(schema.asNullable, this.schema.asNullable))
Member

viirya commented Oct 11, 2018


For modes other than Append, I think we still need this assert, don't we?

Member Author


Yeah, but it's in test code and it's just a sanity check.
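
For what it's worth, a sketch of how the check could be kept for the other modes, as viirya suggests; this is hypothetical and simply guards the removed assert with the save mode passed to createBatchWriteSupport:

if (mode != SaveMode.Append) {
  // In non-Append modes the schema comes from the DataFrame being written,
  // so the structural sanity check can still run against the fixed schema.
  assert(DataType.equalsStructurally(schema.asNullable, this.schema.asNullable))
}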


@viirya
Member

viirya commented Oct 11, 2018

retest this please.

@viirya
Member

viirya commented Oct 11, 2018

LGTM

@SparkQA

SparkQA commented Oct 11, 2018

Test build #97246 has finished for PR 22688 at commit 2a42253.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you, @HyukjinKwon, @cloud-fan, and @viirya.

Merged to master.

asfgit closed this in 83e19d5 Oct 11, 2018
@HyukjinKwon
Member Author

Thanks all!

asfgit pushed a commit that referenced this pull request Oct 15, 2018
…n Data Source V2

## What changes were proposed in this pull request?

This PR proposes to partially revert 5fef6e3 so that it does not create a ReadSupport or read the schema at write time in branch-2.4, since that is too breaking a change.

5fef6e3 happened to create a ReadSupport in the write path, which ended up reading the schema from the ReadSupport during writes.

For instance, this breaks the `spark.range(1).format("source").write.save("non-existent-path")` case, since there is no way to read a schema from "non-existent-path".

See also #22009 (comment)
See also #22688
See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html

## How was this patch tested?

Unit test and manual tests.

Closes #22697 from HyukjinKwon/append-revert-2.4.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon deleted the append-revert-2 branch October 16, 2018 12:41
bavardage pushed a commit to palantir/spark that referenced this pull request Oct 25, 2018
…urce V2 write path

## What changes were proposed in this pull request?

This PR proposes to avoid creating a ReadSupport and reading the schema at write time in save modes other than Append.

apache@5fef6e3 happened to create a ReadSupport in the write path, which ended up reading the schema from the ReadSupport during writes.

This breaks the `spark.range(1).format("source").write.save("non-existent-path")` case, since there is no way to read a schema from "non-existent-path".

See also apache#22009 (comment)
See also apache#22697
See also http://apache-spark-developers-list.1001551.n3.nabble.com/Possible-bug-in-DatasourceV2-td25343.html

## How was this patch tested?

Unit test and manual tests.

Closes apache#22688 from HyukjinKwon/append-revert-2.

Authored-by: hyukjinkwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
asfgit pushed a commit that referenced this pull request Jan 15, 2019
## What changes were proposed in this pull request?

Adjust the batch write API to match the read API refactor after #23086

The doc with high-level ideas:
https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing

Basically it renames `BatchWriteSupportProvider` to `SupportsBatchWrite` and makes it extend `Table`, and renames `WriteSupport` to `Write`. It also cleans up some code as the batch API is completed.

This PR also removes the test from #22688. Now a data source must return a table for read/write.

A few notes about future changes:
1. We will create `SupportsStreamingWrite` later for streaming APIs
2. We will create `SupportsBatchReplaceWhere`, `SupportsBatchAppend`, etc. for the new end-user write APIs. I think the streaming APIs will keep using `OutputMode`, and the new end-user write APIs will apply to batch only, at least in the near future.
3. We will remove `SaveMode` from data source API: https://issues.apache.org/jira/browse/SPARK-26356

## How was this patch tested?

existing tests

Closes #23208 from cloud-fan/refactor-batch.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…urce V2 write path

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019