Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33593][SQL] Vector reader got incorrect data with binary partition value #30824

Closed
wants to merge 8 commits into from

Conversation

AngersZhuuuu
Copy link
Contributor

@AngersZhuuuu AngersZhuuuu commented Dec 17, 2020

What changes were proposed in this pull request?

Currently when enable parquet vectorized reader, use binary type as partition col will return incorrect value as below UT

test("Parquet vector reader incorrect with binary partition value") {
  Seq(false, true).foreach(tag => {
    withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
      withTable("t1") {
        sql(
          """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
            | USING PARQUET PARTITIONED BY (part)""".stripMargin)
        sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
        if (tag) {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", ""))
        } else {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", "Spark SQL"))
        }
      }
    }
  })
}

Why are the changes needed?

Fix data incorrect issue

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

@AngersZhuuuu
Copy link
Contributor Author

FY @viirya @cloud-fan @wangyum

@github-actions github-actions bot added the SQL label Dec 17, 2020
@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37562/

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37562/

Comment on lines +57 to +58
} else if (t == DataTypes.BinaryType) {
col.putByteArray(0, row.getBinary(fieldIdx));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm? If there is no case for BinaryType, seems there is also no final else block, so it just leaves as unpopulate previously?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add else block, and throw an exception?

Copy link
Contributor Author

@AngersZhuuuu AngersZhuuuu Dec 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add else block, and throw an exception?

Updated and add UT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm? If there is no case for BinaryType, seems there is also no final else block, so it just leaves as unpopulate previously?

It looks like it does, so that value is empty

@SparkQA
Copy link

SparkQA commented Dec 17, 2020

Test build #132958 has finished for PR 30824 at commit 4784edd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -3745,6 +3745,21 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
}
}
}

test("SPARK-33593: Parquet vector reader incorrect with binary partition value") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test case in f7d2143#diff-8e4e74b1869fecc81b75575bed45b183e79cbaa97578f42c51984c9d89ece45aR694 too?

Thanks for your advise, I am looking where to add UT for this since not familiar with this part.


test("SPARK-33593: Parquet vector reader incorrect with binary partition value") {
Seq(true).foreach(tag => {
withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true") {?

Miss one point, for debug I remove false

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good otherwise.

@viirya
Copy link
Member

viirya commented Dec 18, 2020

cc @dongjoon-hyun as this is for correctness.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColumnVectorUtils is a general library used by ORC/Parquet. I guess this can happen in ORC, too. Could you add a test case for ORC and check that too?

If it affects both Parquet/ORC, please update PR title.

@AngersZhuuuu
Copy link
Contributor Author

AngersZhuuuu commented Dec 18, 2020

ColumnVectorUtils is a general library used by ORC. I guess this can happen in ORC, too. Could you add a test case for ORC and check that too?

If it affects both Parquet/ORC, please update PR title.

Yea, checked, have same issue. Updated in SQLQuerySuite. Seems there is no UT to test partition value with OrcColumnarBatchReaderSuite, should I add one?

@dongjoon-hyun
Copy link
Member

Thank you for checking. Yes. Please add one, @AngersZhuuuu .

@AngersZhuuuu
Copy link
Contributor Author

Thank you for checking. Yes. Please add one, @AngersZhuuuu .

Yea, update later.

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-33593][SQL] Parquet vector reader incorrect with binary partition value [SPARK-33593][SQL] Vector reader got incorrect data with binary partition value Dec 18, 2020
@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37582/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37582/

@AngersZhuuuu
Copy link
Contributor Author

Thank you for checking. Yes. Please add one, @AngersZhuuuu .

FYI @dongjoon-hyun UT added .

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (except one minor test case renaming comment).
Thank you, @AngersZhuuuu .

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #132977 has finished for PR 30824 at commit eeaf38f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #132981 has finished for PR 30824 at commit e237541.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37589/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37589/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #132983 has finished for PR 30824 at commit 74b59ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37594/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #132989 has finished for PR 30824 at commit 3b1b896.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37594/

dongjoon-hyun pushed a commit that referenced this pull request Dec 18, 2020
…tion value

Currently when enable parquet vectorized reader, use binary type as partition col will return incorrect value as below UT
```scala
test("Parquet vector reader incorrect with binary partition value") {
  Seq(false, true).foreach(tag => {
    withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
      withTable("t1") {
        sql(
          """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
            | USING PARQUET PARTITIONED BY (part)""".stripMargin)
        sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
        if (tag) {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", ""))
        } else {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", "Spark SQL"))
        }
      }
    }
  })
}
```

Fix data incorrect issue

No

Added UT

Closes #30824 from AngersZhuuuu/SPARK-33593.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 0603913)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Copy link
Member

Merged to master/3.1. Could you make backporting PRs to branch-3.0/branch-2.4, please?

@AngersZhuuuu
Copy link
Contributor Author

Merged to master/3.1. Could you make backporting PRs to branch-3.0/branch-2.4, please?

Sure, will ping you when PR ready.

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #132995 has finished for PR 30824 at commit 5d55b38.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants