Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-26716][SQL] FileFormat: the supported types of read/write should be consistent #23639

Closed
wants to merge 4 commits into from

Conversation

gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Jan 24, 2019

What changes were proposed in this pull request?

  1. Remove parameter isReadPath. The supported types of read/write should be the same.

  2. Disallow reading NullType for ORC data source. In [SPARK-24691][SQL]Dispatch the type support check in FileFormat implementation #21667 and [SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFileFormat #21389, it was supposed that ORC supports reading NullType, but can't write it. This doesn't make sense. I read docs and did some tests. ORC doesn't support NullType.

How was this patch tested?

Unit tset

@gengliangwang
Copy link
Member Author

@SparkQA
Copy link

SparkQA commented Jan 24, 2019

Test build #101631 has finished for PR 23639 at commit c6ab192.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member Author

retest this please.

@SparkQA
Copy link

SparkQA commented Jan 24, 2019

Test build #101637 has finished for PR 23639 at commit c6ab192.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

This comment has been minimized.

@@ -156,7 +156,7 @@ trait FileFormat {
* Returns whether this format supports the given [[DataType]] in read/write path.
* By default all data types are supported.
*/
def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
def supportsDataType(dataType: DataType): Boolean = true
Copy link
Member

@dongjoon-hyun dongjoon-hyun Jan 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but we also have supportBatch in this file. If this is just a cosmetic issue, I prefer to keep the existing consistency in this single class. Also, we still use other instances like supportCodegen in other classes, too.

  1. rename to supportsDataType, which is more consistent with other data source API(e.g. SupportsBatchRead, SupportsBatchWrite).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@gengliangwang
Copy link
Member Author

@HyukjinKwon @dongjoon-hyun @cloud-fan thanks for the suggestions. I have updated the code and PR description.

}.getMessage
assert(msg.toLowerCase(Locale.ROOT)
.contains(s"$format data source does not support null data type."))
withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> "orc") {
Copy link
Member

@dongjoon-hyun dongjoon-hyun Jan 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the intention, but DSv2 ORC should have this test coverage.
In general, this is a general issue for maintaining DSv1 and v2 test coverage, could you update this test suite to provide both DSv1 and DSv2 test coverage instead of this line?

Copy link
Member Author

@gengliangwang gengliangwang Jan 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, there is no such validation in V2. I promise I will implement it in this PR #23601 (or may a separated one) recently. Is that OK to you?

Copy link
Member

@dongjoon-hyun dongjoon-hyun Jan 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we revisit this in 3.0.0 timeframe, it sounds okay. Then, can we remove this line in this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK the corresponding check is not implemented in orc v2 source yet, if we don't disable v2 here, we will see runtime errors. Shall we leave a TODO here and say this check should be done in orc v2 source as well?

Copy link
Member

@dongjoon-hyun dongjoon-hyun Jan 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for adding TODO with Spark JIRA issue id.

@SparkQA
Copy link

SparkQA commented Jan 25, 2019

Test build #101685 has finished for PR 23639 at commit 1cc2b34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 26, 2019

Test build #101702 has finished for PR 23639 at commit 2552ba6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please

@SparkQA
Copy link

SparkQA commented Jan 26, 2019

Test build #101711 has finished for PR 23639 at commit 2552ba6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}.getMessage
assert(msg.toLowerCase(Locale.ROOT)
.contains(s"$format data source does not support null data type."))
// TODO(SPARK-26716): support data type validating in V2 data source, and test V2 as well.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur? @gengliangwang . We need to create a new JIRA issue ID here instead of this PR.
If we use this JIRA ID, it will be closed as Resolved status and nobody is going to take a look at that later.

@SparkQA
Copy link

SparkQA commented Jan 27, 2019

Test build #101725 has finished for PR 23639 at commit 442bb2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master.
Thank you, all!

@maropu
Copy link
Member

maropu commented Jan 28, 2019

We don't need to document this about the 2. behaviour change?

@cloud-fan
Copy link
Contributor

I think it's very unlikely users will specify the schema as null type when reading orc files, but it's safer to add one anyway.

@maropu
Copy link
Member

maropu commented Jan 28, 2019

ok, thanks for the check.

cloud-fan pushed a commit that referenced this pull request Jan 31, 2019
…methods and override toString method in Avro

## What changes were proposed in this pull request?

In #23639, the API `supportDataType` is refactored. We should also remove the method `verifyWriteSchema` and `verifyReadSchema` in `DataSourceUtils`.

Since the error message use `FileFormat.toString` to specify the data source naming,  this PR also overriding the `toString` method in `AvroFileFormat`.

## How was this patch tested?

Unit test.

Closes #23699 from gengliangwang/SPARK-26716-followup.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…ld be consistent

## What changes were proposed in this pull request?

1. Remove parameter `isReadPath`. The supported types of read/write should be the same.

2. Disallow reading `NullType` for ORC data source. In apache#21667 and apache#21389, it was supposed that ORC supports reading `NullType`, but can't write it. This doesn't make sense. I read docs and did some tests. ORC doesn't support `NullType`.

## How was this patch tested?

Unit tset

Closes apache#23639 from gengliangwang/supportDataType.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…methods and override toString method in Avro

## What changes were proposed in this pull request?

In apache#23639, the API `supportDataType` is refactored. We should also remove the method `verifyWriteSchema` and `verifyReadSchema` in `DataSourceUtils`.

Since the error message use `FileFormat.toString` to specify the data source naming,  this PR also overriding the `toString` method in `AvroFileFormat`.

## How was this patch tested?

Unit test.

Closes apache#23699 from gengliangwang/SPARK-26716-followup.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants