[SPARK-26716][SQL] FileFormat: the supported types of read/write should be consistent #23639

gengliangwang · 2019-01-24T14:08:11Z

What changes were proposed in this pull request?

1. Remove parameter isReadPath. The supported types of read/write should be the same.
2. Disallow reading NullType for ORC data source. In [SPARK-24691][SQL] Dispatch the type support check in FileFormat implementation #21667 and [SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFileFormat #21389, it was assumed that ORC supports reading NullType but can't write it. This doesn't make sense. I read the docs and ran some tests: ORC doesn't support NullType.
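A minimal sketch of the two changes, for illustration only. The types below are simplified stand-ins rather than Spark's own classes in org.apache.spark.sql.types, `OrcLikeFormat` is a hypothetical name, and the method name assumes the original `supportDataType` is kept.

```scala
// Stand-in data types; the real ones live in org.apache.spark.sql.types.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case object NullType extends DataType
final case class ArrayType(elementType: DataType) extends DataType
final case class StructType(fields: Seq[(String, DataType)]) extends DataType

// Simplified stand-in for the FileFormat trait: a single check with no
// isReadPath flag, so reads and writes cannot disagree.
trait FileFormat {
  /** Whether this format supports the given type. Everything is supported by default. */
  def supportDataType(dataType: DataType): Boolean = true
}

// Hypothetical ORC-like format: NullType is rejected, including when nested.
object OrcLikeFormat extends FileFormat {
  override def supportDataType(dataType: DataType): Boolean = dataType match {
    case NullType           => false
    case ArrayType(element) => supportDataType(element)
    case StructType(fields) => fields.forall { case (_, t) => supportDataType(t) }
    case _                  => true
  }
}
```

Because there is only one method, a type rejected on write is necessarily rejected on read as well.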

How was this patch tested?

Unit test

gengliangwang · 2019-01-24T14:08:38Z

@maropu @cloud-fan @dongjoon-hyun @HyukjinKwon

SparkQA · 2019-01-24T15:00:52Z

Test build #101631 has finished for PR 23639 at commit c6ab192.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-01-24T15:28:00Z

retest this please.

SparkQA · 2019-01-24T20:01:02Z

Test build #101637 has finished for PR 23639 at commit c6ab192.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala

dongjoon-hyun · 2019-01-25T05:19:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormat.scala

@@ -156,7 +156,7 @@ trait FileFormat {
   * Returns whether this format supports the given [[DataType]] in read/write path.
   * By default all data types are supported.
   */
-  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
+  def supportsDataType(dataType: DataType): Boolean = true


Sorry, but we also have supportBatch in this file. If this is just a cosmetic issue, I prefer to keep the existing naming consistent within this single class. We also still use names like supportCodegen in other classes.

Rename to supportsDataType, which is more consistent with other data source APIs (e.g. SupportsBatchRead, SupportsBatchWrite).

gengliangwang · 2019-01-25T17:19:30Z

@HyukjinKwon @dongjoon-hyun @cloud-fan thanks for the suggestions. I have updated the code and PR description.

dongjoon-hyun · 2019-01-25T17:34:55Z

sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala

-        }.getMessage
-        assert(msg.toLowerCase(Locale.ROOT)
-          .contains(s"$format data source does not support null data type."))
+    withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> "orc") {


I understand the intention, but DSv2 ORC should have this test coverage too.
Maintaining both DSv1 and DSv2 test coverage is a general issue. Could you update this test suite to cover both DSv1 and DSv2 instead of adding this line?

Currently, there is no such validation in V2. I promise I will implement it in PR #23601 (or maybe a separate one) soon. Is that OK with you?

If we revisit this in the 3.0.0 timeframe, it sounds okay. Then, can we remove this line in this PR?

AFAIK the corresponding check is not implemented in the ORC v2 source yet. If we don't disable v2 here, we will see runtime errors. Shall we leave a TODO here saying this check should be done in the ORC v2 source as well?

+1 for adding a TODO with a Spark JIRA issue ID.

SparkQA · 2019-01-25T21:30:49Z

Test build #101685 has finished for PR 23639 at commit 1cc2b34.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-01-26T08:05:01Z

Test build #101702 has finished for PR 23639 at commit 2552ba6.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-01-26T08:16:53Z

Retest this please

SparkQA · 2019-01-26T12:30:29Z

Test build #101711 has finished for PR 23639 at commit 2552ba6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-01-26T22:25:35Z

sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala

-        }.getMessage
-        assert(msg.toLowerCase(Locale.ROOT)
-          .contains(s"$format data source does not support null data type."))
+    // TODO(SPARK-26716): support data type validating in V2 data source, and test V2 as well.


Ur? @gengliangwang, we need to create a new JIRA issue ID here instead of using this PR's ID.
If we use this JIRA ID, it will be closed with Resolved status and nobody will look at it later.

SparkQA · 2019-01-27T12:46:43Z

Test build #101725 has finished for PR 23639 at commit 442bb2b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala

dongjoon-hyun

+1, LGTM. Merged to master.
Thank you, all!

maropu · 2019-01-28T01:12:35Z

Don't we need to document behaviour change 2 (disallowing reading NullType for ORC)?

cloud-fan · 2019-01-28T03:41:13Z

I think it's very unlikely users will specify the schema as null type when reading orc files, but it's safer to add one anyway.

maropu · 2019-01-28T03:46:26Z

ok, thanks for the check.

…methods and override toString method in Avro

## What changes were proposed in this pull request?

In #23639, the API `supportDataType` is refactored. We should also remove the methods `verifyWriteSchema` and `verifyReadSchema` in `DataSourceUtils`. Since the error message uses `FileFormat.toString` to name the data source, this PR also overrides the `toString` method in `AvroFileFormat`.

## How was this patch tested?

Unit test.

Closes #23699 from gengliangwang/SPARK-26716-followup.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
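The link between `toString` and the error message can be sketched as follows. This is an illustration under assumptions, not the code from either PR: the types are simplified stand-ins, `OrcLikeFormat`, `Field` and `SchemaCheck` are hypothetical names, and `IllegalArgumentException` is used only to keep the sketch free of Spark dependencies. The message shape follows the assertion in FileBasedDataSourceSuite ("$format data source does not support null data type.").

```scala
// Stand-ins for Spark's types; only what the sketch needs.
sealed trait DataType { def catalogString: String }
case object StringType extends DataType { val catalogString = "string" }
case object NullType extends DataType { val catalogString = "null" }
final case class Field(name: String, dataType: DataType)

trait FileFormat {
  def supportDataType(dataType: DataType): Boolean = true
}

// Hypothetical format. Without the toString override, the message below would
// start with a default object name instead of a readable source name.
object OrcLikeFormat extends FileFormat {
  override def supportDataType(dataType: DataType): Boolean = dataType != NullType
  override def toString: String = "ORC"
}

object SchemaCheck {
  // One check shared by the read and write paths, in place of separate
  // read-side and write-side verification helpers.
  def verifySchema(format: FileFormat, schema: Seq[Field]): Unit =
    schema.foreach { field =>
      if (!format.supportDataType(field.dataType)) {
        throw new IllegalArgumentException(
          s"$format data source does not support ${field.dataType.catalogString} data type.")
      }
    }
}
```

Calling `SchemaCheck.verifySchema(OrcLikeFormat, Seq(Field("c", NullType)))` throws with a message that begins with the format's `toString` value, which is why a format lacking a readable `toString` needs the override.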

…ld be consistent

## What changes were proposed in this pull request?

1. Remove parameter `isReadPath`. The supported types of read/write should be the same.
2. Disallow reading `NullType` for ORC data source. In apache#21667 and apache#21389, it was assumed that ORC supports reading `NullType` but can't write it. This doesn't make sense. I read the docs and ran some tests: ORC doesn't support `NullType`.

## How was this patch tested?

Unit test

Closes apache#23639 from gengliangwang/supportDataType.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

Commit c6ab192: refactor

HyukjinKwon reviewed Jan 25, 2019 (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala, outdated)

dongjoon-hyun reviewed Jan 25, 2019

Commit 1cc2b34: address comments

dongjoon-hyun reviewed Jan 25, 2019

Commit 2552ba6: add TODO comment

dongjoon-hyun reviewed Jan 26, 2019

Commit 442bb2b: update jira id

HyukjinKwon approved these changes Jan 27, 2019

srowen reviewed Jan 27, 2019 (sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala)

dongjoon-hyun approved these changes Jan 27, 2019

dongjoon-hyun mentioned this pull request Jan 27, 2019: [SPARK-26673][SQL] File source V2 writes: create framework and migrate ORC #23601 (Closed)

dongjoon-hyun closed this in 36a2e63 Jan 27, 2019

gengliangwang mentioned this pull request Jan 30, 2019: [SPARK-26716][SPARK-26765][FOLLOWUP][SQL] Clean up schema validation methods and override toString method in Avro #23699 (Closed)

gengliangwang mentioned this pull request Feb 14, 2019: [SPARK-26744][SQL] Support schema validation in FileDataSourceV2 framework #23714 (Closed)
