[SPARK-33593][SQL] Vector reader got incorrect data with binary partition value #30824

AngersZhuuuu · 2020-12-17T15:15:25Z

What changes were proposed in this pull request?

Currently, when the Parquet vectorized reader is enabled, a partition column of binary type returns an incorrect value, as the UT below shows:

test("Parquet vector reader incorrect with binary partition value") {
  Seq(false, true).foreach(tag => {
    withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
      withTable("t1") {
        sql(
          """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
            | USING PARQUET PARTITIONED BY (part)""".stripMargin)
        sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
        if (tag) {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", ""))
        } else {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", "Spark SQL"))
        }
      }
    }
  })
}
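For readers checking the expected rows: the hex literal written to id spells the same bytes as the partition value 'Spark SQL', so both casts should yield the same string when the reader is correct. A minimal standalone Java sketch that decodes the literal follows; the class and helper names are illustrative and not part of this PR.

```java
import java.nio.charset.StandardCharsets;

public class HexLiteralDemo {
    // Decode a hex string such as "537061726B2053514C" into raw bytes.
    static byte[] decodeHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            // Each pair of hex digits is one byte.
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // The payload of X'537061726B2053514C' from the INSERT statement.
        byte[] id = decodeHex("537061726B2053514C");
        System.out.println(new String(id, StandardCharsets.UTF_8)); // prints "Spark SQL"
    }
}
```

This is why the non-vectorized branch of the test expects Row("a", "Spark SQL", "Spark SQL"), while the vectorized branch documents the empty partition value.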

Why are the changes needed?

Fixes a data correctness issue.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a UT.

AngersZhuuuu · 2020-12-17T15:15:44Z

FYI @viirya @cloud-fan @wangyum

SparkQA · 2020-12-17T16:05:47Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37562/

SparkQA · 2020-12-17T16:35:04Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37562/

viirya · 2020-12-17T17:27:48Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java

+      } else if (t == DataTypes.BinaryType) {
+        col.putByteArray(0, row.getBinary(fieldIdx));


Hm? If there is no case for BinaryType and there is also no final else block, was the value previously just left unpopulated?

Should we add an else block and throw an exception?

> Should we add an else block and throw an exception?

Updated, and added a UT.

> Hm? If there is no case for BinaryType and there is also no final else block, was the value previously just left unpopulated?

It looks like it was, which is why the value is empty.
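The failure mode discussed in this thread can be illustrated with a small standalone Java sketch. It is not the Spark code: Type, Column, populateBefore and populateAfter are hypothetical stand-ins, assumed only to mirror the shape of a type-dispatch chain that lacks a binary case and a final else.

```java
import java.nio.charset.StandardCharsets;

public class PartitionPopulateSketch {
    // Hypothetical stand-ins for the real types; not Spark's API.
    enum Type { INT, STRING, BINARY, MAP }

    static final class Column {
        byte[] bytes = new byte[0]; // starts empty, like an unpopulated vector
        int intValue;
    }

    // Before: no BINARY case and no final else, so BINARY is silently skipped.
    static void populateBefore(Column col, Type t, Object value) {
        if (t == Type.INT) {
            col.intValue = (Integer) value;
        } else if (t == Type.STRING) {
            col.bytes = ((String) value).getBytes(StandardCharsets.UTF_8);
        }
        // Any other type falls through here and col keeps its empty default.
    }

    // After: handle BINARY explicitly and fail loudly on anything unhandled.
    static void populateAfter(Column col, Type t, Object value) {
        if (t == Type.INT) {
            col.intValue = (Integer) value;
        } else if (t == Type.STRING) {
            col.bytes = ((String) value).getBytes(StandardCharsets.UTF_8);
        } else if (t == Type.BINARY) {
            col.bytes = (byte[]) value;
        } else {
            throw new UnsupportedOperationException("Unsupported partition type: " + t);
        }
    }

    public static void main(String[] args) {
        byte[] part = "Spark SQL".getBytes(StandardCharsets.UTF_8);

        Column before = new Column();
        populateBefore(before, Type.BINARY, part);
        System.out.println("before: [" + new String(before.bytes, StandardCharsets.UTF_8) + "]");

        Column after = new Column();
        populateAfter(after, Type.BINARY, part);
        System.out.println("after:  [" + new String(after.bytes, StandardCharsets.UTF_8) + "]");
    }
}
```

In this sketch the "before" column stays empty, which matches the empty string the vectorized test branch observed, and the final else turns any future unhandled type into an immediate error rather than silently wrong data.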

SparkQA · 2020-12-17T19:22:40Z

Test build #132958 has finished for PR 30824 at commit 4784edd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-12-18T01:53:07Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

@@ -3745,6 +3745,21 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
      }
    }
  }
+
+  test("SPARK-33593: Parquet vector reader incorrect with binary partition value") {


Can you add a test case in f7d2143#diff-8e4e74b1869fecc81b75575bed45b183e79cbaa97578f42c51984c9d89ece45aR694 too?

> Can you add a test case in f7d2143#diff-8e4e74b1869fecc81b75575bed45b183e79cbaa97578f42c51984c9d89ece45aR694 too?

Thanks for your advice. I am looking for where to add a UT for this, since I am not familiar with this part.

HyukjinKwon · 2020-12-18T01:53:39Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+
+  test("SPARK-33593: Parquet vector reader incorrect with binary partition value") {
+    Seq(true).foreach(tag => {
+      withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {


withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {?

> withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {?

I missed this: I had removed false while debugging.

HyukjinKwon

Looks good otherwise.

viirya · 2020-12-18T02:39:32Z

cc @dongjoon-hyun as this is for correctness.

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

dongjoon-hyun

ColumnVectorUtils is a general library used by ORC/Parquet. I guess this can happen in ORC, too. Could you add a test case for ORC and check that too?

If it affects both Parquet/ORC, please update PR title.

AngersZhuuuu · 2020-12-18T03:02:49Z

> ColumnVectorUtils is a general library used by ORC. I guess this can happen in ORC, too. Could you add a test case for ORC and check that too?

> If it affects both Parquet/ORC, please update PR title.

Yes, I checked, and ORC has the same issue. Updated in SQLQuerySuite. There seems to be no UT that tests partition values in OrcColumnarBatchReaderSuite; should I add one?

dongjoon-hyun · 2020-12-18T03:20:05Z

Thank you for checking. Yes. Please add one, @AngersZhuuuu .

AngersZhuuuu · 2020-12-18T03:21:01Z

> Thank you for checking. Yes. Please add one, @AngersZhuuuu .

Yes, I will update it later.

SparkQA · 2020-12-18T04:10:38Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37582/

SparkQA · 2020-12-18T04:41:21Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37582/

AngersZhuuuu · 2020-12-18T04:44:38Z

> Thank you for checking. Yes. Please add one, @AngersZhuuuu .

FYI @dongjoon-hyun, UT added.

.../test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

dongjoon-hyun

+1, LGTM (except one minor test case renaming comment).
Thank you, @AngersZhuuuu .

SparkQA · 2020-12-18T05:02:32Z

Test build #132977 has finished for PR 30824 at commit eeaf38f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan

Good catch!

SparkQA · 2020-12-18T06:01:06Z

Test build #132981 has finished for PR 30824 at commit e237541.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-18T06:09:15Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37589/

SparkQA · 2020-12-18T06:38:44Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37589/

SparkQA · 2020-12-18T06:48:50Z

Test build #132983 has finished for PR 30824 at commit 74b59ff.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-18T07:09:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37594/

SparkQA · 2020-12-18T07:14:31Z

Test build #132989 has finished for PR 30824 at commit 3b1b896.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-18T07:39:31Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37594/

Closes #30824 from AngersZhuuuu/SPARK-33593.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 0603913)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2020-12-18T08:04:15Z

Merged to master/3.1. Could you make backporting PRs to branch-3.0/branch-2.4, please?

AngersZhuuuu · 2020-12-18T08:07:11Z

> Merged to master/3.1. Could you make backporting PRs to branch-3.0/branch-2.4, please?

Sure, I will ping you when the PRs are ready.

SparkQA · 2020-12-18T14:39:12Z

Test build #132995 has finished for PR 30824 at commit 5d55b38.

This patch fails from timeout after a configured wait of 500m.
This patch merges cleanly.
This patch adds no public classes.

[SPARK-33593][SQL] Parquet vector reader incorrect with binary partition value (4784edd)

github-actions bot added the SQL label Dec 17, 2020

viirya reviewed Dec 17, 2020

Update ColumnVectorUtils.java (8001fa2)

HyukjinKwon reviewed Dec 18, 2020

HyukjinKwon approved these changes Dec 18, 2020

add UT (eeaf38f)

viirya approved these changes Dec 18, 2020

dongjoon-hyun reviewed Dec 18, 2020: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java (outdated, resolved)

dongjoon-hyun reviewed Dec 18, 2020: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (outdated, resolved)

dongjoon-hyun reviewed Dec 18, 2020: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (outdated, resolved)

follow comment (e237541)

dongjoon-hyun requested changes Dec 18, 2020

Update SQLQuerySuite.scala (74b59ff)

AngersZhuuuu changed the title from "[SPARK-33593][SQL] Parquet vector reader incorrect with binary partition value" to "[SPARK-33593][SQL] Vector reader got incorrect data with binary partition value" Dec 18, 2020

Update OrcColumnarBatchReaderSuite.scala (ed0928c)

dongjoon-hyun reviewed Dec 18, 2020: .../test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala (outdated, resolved)

dongjoon-hyun reviewed Dec 18, 2020: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala (resolved)

dongjoon-hyun approved these changes Dec 18, 2020

Update OrcColumnarBatchReaderSuite.scala (3b1b896)

cloud-fan approved these changes Dec 18, 2020

Update SQLQuerySuite.scala (5d55b38)

dongjoon-hyun closed this in 0603913 Dec 18, 2020
