Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21344][SQL] BinaryType comparison does signed byte array comparison #18571

Closed
wants to merge 7 commits into from

Conversation

kiszk
Copy link
Member

@kiszk kiszk commented Jul 8, 2017

What changes were proposed in this pull request?

This PR fixes a wrong comparison for BinaryType. This PR enables unsigned comparison and unsigned prefix generation for an array for BinaryType. Previous implementations uses signed operations.

How was this patch tested?

Added a test suite in OrderingSuite.

@SparkQA
Copy link

SparkQA commented Jul 8, 2017

Test build #79381 has finished for PR 18571 at commit ac86eed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jul 8, 2017

Aren't byte comparisons signed though elsewhere? as well as other integer type comparison? this would become inconsistent. Maybe I'm missing the reason this should be different.

@kiszk
Copy link
Member Author

kiszk commented Jul 8, 2017

IIUC, BinaryType should be handled as unsigned byte while Java and Scala have only signed byte array.
That is why we have to do unsigned comparison of BinaryType.
The places where byte comparison is used rather than for BinaryType, I think that we do not change comparisons.

What do you think?
cc: @gatorsmile @maropu

@maropu
Copy link
Member

maropu commented Jul 9, 2017

I checked other dbms (dbms-like) systems handled binary as unassigned;

postgres=# SELECT E'\\x00'::BYTEA < E'\\xff'::BYTEA;
 ?column? 
----------
 t
(1 row)

hive> SELECT unhex('00') < unhex('ff');
OK
true

mysql> SELECT x'00' < x'ff';
+---------------+
| x'00' < x'ff' |
+---------------+
|             1 |
+---------------+

In some parts of the Spark implementation, they also handle binary as unassigned (https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java#L36). I feel, based on the principle of least astonishment, we'd be better to handle binary as unassigned in user-facing pieces.

@kiszk
Copy link
Member Author

kiszk commented Jul 9, 2017

@maropu, thank you for your comment.


test("SPARK-21344: BinaryType comparison does signed byte array comparison") {
val b1 = Array[Byte](1) // 0x01
val b2 = Array[Byte](-1) // 0xff
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @kiszk
Can we have another test case with more than one byte for code coverage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment, done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it looks better to check the boundary values like -128 and 127.

@SparkQA
Copy link

SparkQA commented Jul 9, 2017

Test build #79425 has finished for PR 18571 at commit 0fea784.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -44,7 +44,7 @@ public static long getPrefix(byte[] bytes) {
final int minLen = Math.min(bytes.length, 8);
long p = 0;
for (int i = 0; i < minLen; ++i) {
p |= (128L + Platform.getByte(bytes, Platform.BYTE_ARRAY_OFFSET + i))
p |= ((long)Platform.getByte(bytes, Platform.BYTE_ARRAY_OFFSET + i) & 0xff)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe we need a single space like ((long) Platform....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. done.

@SparkQA
Copy link

SparkQA commented Jul 10, 2017

Test build #79436 has finished for PR 18571 at commit 814d4c4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 10, 2017

Test build #79439 has finished for PR 18571 at commit a84776b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jul 10, 2017

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jul 10, 2017

Test build #79450 has finished for PR 18571 at commit a84776b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Jul 10, 2017

@gatorsmile could you please review this?

@gatorsmile
Copy link
Member

SELECT x'00' < x'0f'
SELECT x'00' < x'ff'

Could you add the above queries to a new comparator.sql file?

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@SparkQA
Copy link

SparkQA commented Jul 14, 2017

Test build #79600 has finished for PR 18571 at commit e33ec8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Jul 14, 2017

@gatorsmile thank you for your comment, done

(Array[Byte](1, 1, 1, 1, 1, 1, 1, 1, 1), Array[Byte](1, 1, 1, 1, 1, 1, 1, 1, -1))
)
data.foreach { case (b1, b2) =>
val rowOrdering = InterpretedOrdering.forSchema(Seq(BinaryType, BinaryType))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, don't you just have one binary field for each row? Why you specify two fields to compare?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, updated.

BoundReference(1, BinaryType, nullable = true).asc :: Nil)
val rowType = StructType(
StructField("b1", BinaryType, nullable = true) ::
StructField("b2", BinaryType, nullable = true) :: Nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You specify two binary fields as row schema. But actually your input just has one binary field (i.e., Row(b1)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood, updated.

@@ -44,7 +44,7 @@ public static long getPrefix(byte[] bytes) {
final int minLen = Math.min(bytes.length, 8);
long p = 0;
for (int i = 0; i < minLen; ++i) {
p |= (128L + Platform.getByte(bytes, Platform.BYTE_ARRAY_OFFSET + i))
p |= ((long) Platform.getByte(bytes, Platform.BYTE_ARRAY_OFFSET + i) & 0xff)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we don't have unit test for this. Shall we add a ByteArraySuite?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the unit test for this. I added some cases.

@SparkQA
Copy link

SparkQA commented Jul 14, 2017

Test build #79612 has finished for PR 18571 at commit 08776e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

It sounds like all the comments from @viirya have been resolved. Will merge it tonight. Thanks!

@viirya
Copy link
Member

viirya commented Jul 14, 2017

LGTM

asfgit pushed a commit that referenced this pull request Jul 15, 2017
…rison

## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d7)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
asfgit pushed a commit that referenced this pull request Jul 15, 2017
…rison

## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d7)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
asfgit pushed a commit that referenced this pull request Jul 15, 2017
…rison

## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d7)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@gatorsmile
Copy link
Member

Thanks! Merging to master/2.2/2.1/2.0

@asfgit asfgit closed this in ac5d5d7 Jul 15, 2017
asfgit pushed a commit that referenced this pull request Jul 16, 2017
… representation

## What changes were proposed in this pull request?
SPARK 2.0 does not support hex literal. Thus, the test case failed after backporting #18571

## How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #18643 from gatorsmile/fixTestFailure2.0.
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…rison

## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes apache#18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d7)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
…rison

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes apache#18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d7)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants