-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33593][SQL] Vector reader got incorrect data with binary partition value #30824
Conversation
Kubernetes integration test starting |
Kubernetes integration test status success |
} else if (t == DataTypes.BinaryType) { | ||
col.putByteArray(0, row.getBinary(fieldIdx)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm? If there is no case for BinaryType, seems there is also no final else
block, so it just leaves as unpopulate previously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add else
block, and throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add
else
block, and throw an exception?
Updated and add UT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm? If there is no case for BinaryType, seems there is also no final
else
block, so it just leaves as unpopulate previously?
It looks like it does, so that value is empty
Test build #132958 has finished for PR 30824 at commit
|
@@ -3745,6 +3745,21 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark | |||
} | |||
} | |||
} | |||
|
|||
test("SPARK-33593: Parquet vector reader incorrect with binary partition value") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case in f7d2143#diff-8e4e74b1869fecc81b75575bed45b183e79cbaa97578f42c51984c9d89ece45aR694 too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case in f7d2143#diff-8e4e74b1869fecc81b75575bed45b183e79cbaa97578f42c51984c9d89ece45aR694 too?
Thanks for your advise, I am looking where to add UT for this since not familiar with this part.
|
||
test("SPARK-33593: Parquet vector reader incorrect with binary partition value") { | ||
Seq(true).foreach(tag => { | ||
withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {
?
Miss one point, for debug I remove false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good otherwise.
cc @dongjoon-hyun as this is for correctness. |
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ColumnVectorUtils
is a general library used by ORC/Parquet. I guess this can happen in ORC, too. Could you add a test case for ORC and check that too?
If it affects both Parquet/ORC, please update PR title.
Yea, checked, have same issue. Updated in SQLQuerySuite. Seems there is no UT to test partition value with OrcColumnarBatchReaderSuite, should I add one? |
Thank you for checking. Yes. Please add one, @AngersZhuuuu . |
Yea, update later. |
Kubernetes integration test starting |
Kubernetes integration test status success |
FYI @dongjoon-hyun UT added . |
.../test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM (except one minor test case renaming comment).
Thank you, @AngersZhuuuu .
Test build #132977 has finished for PR 30824 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
Test build #132981 has finished for PR 30824 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #132983 has finished for PR 30824 at commit
|
Kubernetes integration test starting |
Test build #132989 has finished for PR 30824 at commit
|
Kubernetes integration test status success |
…tion value Currently when enable parquet vectorized reader, use binary type as partition col will return incorrect value as below UT ```scala test("Parquet vector reader incorrect with binary partition value") { Seq(false, true).foreach(tag => { withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) { withTable("t1") { sql( """CREATE TABLE t1(name STRING, id BINARY, part BINARY) | USING PARQUET PARTITIONED BY (part)""".stripMargin) sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')") if (tag) { checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"), Row("a", "Spark SQL", "")) } else { checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"), Row("a", "Spark SQL", "Spark SQL")) } } } }) } ``` Fix data incorrect issue No Added UT Closes #30824 from AngersZhuuuu/SPARK-33593. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 0603913) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Merged to master/3.1. Could you make backporting PRs to branch-3.0/branch-2.4, please? |
Sure, will ping you when PR ready. |
Test build #132995 has finished for PR 30824 at commit
|
What changes were proposed in this pull request?
Currently when enable parquet vectorized reader, use binary type as partition col will return incorrect value as below UT
Why are the changes needed?
Fix data incorrect issue
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added UT