[SPARK-26716][SQL] FileFormat: the supported types of read/write should be consistent #23639
Conversation
Test build #101631 has finished for PR 23639 at commit
retest this please.
Test build #101637 has finished for PR 23639 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
@@ -156,7 +156,7 @@ trait FileFormat {
   * Returns whether this format supports the given [[DataType]] in read/write path.
   * By default all data types are supported.
   */
- def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean = true
+ def supportsDataType(dataType: DataType): Boolean = true
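The idea behind this hook is that each format answers, recursively for nested types, whether it can handle a given data type. The following is a simplified, self-contained sketch of that pattern — the `DataType` hierarchy and `OrcLikeFormat` below are stand-ins for illustration, not Spark's real API:

```scala
// Simplified stand-in for Spark's DataType hierarchy; not the real API.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case object NullType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType

// A format that, like ORC after this PR, rejects NullType anywhere in a
// schema by recursing through nested types.
object OrcLikeFormat {
  def supportDataType(dataType: DataType): Boolean = dataType match {
    case NullType        => false
    case ArrayType(elem) => supportDataType(elem)
    case StructType(fs)  => fs.forall(f => supportDataType(f.dataType))
    case _               => true
  }
}
```

Note the recursion into `ArrayType` and `StructType`: a `NullType` buried inside a nested field is rejected just like a top-level one.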
Sorry, but we also have supportBatch in this file. If this is just a cosmetic issue, I prefer to keep the existing consistency in this single class. Also, we still use other instances like supportCodegen in other classes, too.
- rename to supportsDataType, which is more consistent with other data source APIs (e.g. SupportsBatchRead, SupportsBatchWrite).
+1
@HyukjinKwon @dongjoon-hyun @cloud-fan thanks for the suggestions. I have updated the code and PR description.
}.getMessage
assert(msg.toLowerCase(Locale.ROOT)
  .contains(s"$format data source does not support null data type."))
withSQLConf(SQLConf.USE_V1_SOURCE_READER_LIST.key -> "orc") {
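The `withSQLConf` test helper used here temporarily overrides a configuration value for the body of the block and restores the previous value afterwards. A minimal self-contained sketch of that set/run/restore pattern, using a plain mutable map as a stand-in for Spark's `SQLConf` (not the real API):

```scala
import scala.collection.mutable

// Plain-map stand-in for Spark's SQLConf; illustrates the set/run/restore
// pattern behind the withSQLConf test helper. Not the real API.
object MiniConf {
  val settings = mutable.Map[String, String]()

  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    // Remember the previous values (None if the key was unset).
    val saved = pairs.map { case (k, _) => k -> settings.get(k) }
    pairs.foreach { case (k, v) => settings(k) = v }
    try body
    finally saved.foreach {
      case (k, Some(old)) => settings(k) = old   // restore previous value
      case (k, None)      => settings.remove(k)  // key was unset before
    }
  }
}
```

The `try`/`finally` guarantees the old values come back even if the body throws, which is what makes such helpers safe to nest in test suites.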
I understand the intention, but DSv2 ORC should have this test coverage.
This is a general issue for maintaining DSv1 and DSv2 test coverage; could you update this test suite to provide both DSv1 and DSv2 coverage instead of this line?
Currently, there is no such validation in V2. I promise I will implement it in PR #23601 (or maybe a separate one) soon. Is that OK with you?
If we revisit this in the 3.0.0 timeframe, it sounds okay. Then, can we remove this line in this PR?
AFAIK the corresponding check is not implemented in orc v2 source yet, if we don't disable v2 here, we will see runtime errors. Shall we leave a TODO here and say this check should be done in orc v2 source as well?
+1 for adding a TODO with the Spark JIRA issue id.
Test build #101685 has finished for PR 23639 at commit
Test build #101702 has finished for PR 23639 at commit
Retest this please
Test build #101711 has finished for PR 23639 at commit
}.getMessage
assert(msg.toLowerCase(Locale.ROOT)
  .contains(s"$format data source does not support null data type."))
// TODO(SPARK-26716): support data type validating in V2 data source, and test V2 as well.
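The error message asserted in this test comes from a schema-verification step that scans every field before a read or write and fails fast on the first unsupported type. A simplified, self-contained sketch of that check — the `Field` type, `verifySchema` helper, and exception choice here are stand-ins for illustration, not Spark's actual `DataSourceUtils` code:

```scala
// Simplified sketch of pre-read/write schema verification; not Spark's
// actual implementation, but it produces the same style of error message.
case class Field(name: String, dataType: String)

def verifySchema(format: String, fields: Seq[Field],
                 supported: String => Boolean): Unit =
  fields.find(f => !supported(f.dataType)).foreach { f =>
    throw new IllegalArgumentException(
      s"$format data source does not support ${f.dataType} data type.")
  }
```

Because the check runs once over the declared schema, the failure surfaces as a clear analysis-time error instead of a confusing runtime error deep inside the writer.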
Ur? @gengliangwang . We need to create a new JIRA issue ID here instead of this PR's. If we use this JIRA ID, it will be closed as Resolved status and nobody is going to take a look at that later.
Test build #101725 has finished for PR 23639 at commit
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala
+1, LGTM. Merged to master.
Thank you, all!
We don't need to document behavior change 2?
I think it's very unlikely users will specify the schema as null type when reading ORC files, but it's safer to add one anyway.
ok, thanks for the check.
…methods and override toString method in Avro

## What changes were proposed in this pull request?

In #23639, the API `supportDataType` was refactored. We should also remove the methods `verifyWriteSchema` and `verifyReadSchema` in `DataSourceUtils`. Since the error message uses `FileFormat.toString` to specify the data source name, this PR also overrides the `toString` method in `AvroFileFormat`.

## How was this patch tested?

Unit test.

Closes #23699 from gengliangwang/SPARK-26716-followup.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
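Overriding `toString` so that shared error-message code can name the concrete format is a small but useful pattern. A self-contained sketch under stand-in names (`FileFormatLike` and `AvroLikeFormat` are illustrations, not Spark's real classes, and the "interval" rule is an assumed example):

```scala
// Stand-ins for Spark's FileFormat/AvroFileFormat; illustrates using
// toString to supply the format name in shared error messages.
trait FileFormatLike {
  def supportDataType(dataType: String): Boolean

  // The shared message interpolates `this`, i.e. the subclass's toString.
  def verify(dataType: String): Unit =
    require(supportDataType(dataType),
      s"$this data source does not support $dataType data type.")
}

class AvroLikeFormat extends FileFormatLike {
  def supportDataType(dataType: String): Boolean = dataType != "interval"
  override def toString: String = "Avro"
}
```

Without the override, the message would show a default class name like `AvroLikeFormat@1a2b3c`; with it, the user sees a readable format name.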
…ld be consistent

## What changes were proposed in this pull request?

1. Remove the parameter `isReadPath`. The supported types for read and write should be the same.
2. Disallow reading `NullType` for the ORC data source. In apache#21667 and apache#21389, it was assumed that ORC supports reading `NullType` but can't write it. This doesn't make sense. I read the docs and did some tests: ORC doesn't support `NullType`.

## How was this patch tested?

Unit test.

Closes apache#23639 from gengliangwang/supportDataType.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?

1. Remove the parameter isReadPath. The supported types for read and write should be the same.
2. Disallow reading NullType for the ORC data source. In [SPARK-24691][SQL] Dispatch the type support check in FileFormat implementation #21667 and [SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFileFormat #21389, it was assumed that ORC supports reading NullType but can't write it. This doesn't make sense. I read the docs and did some tests: ORC doesn't support NullType.

How was this patch tested?

Unit test.