Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types #24722

Closed
wants to merge 2 commits into from

Conversation

gcmerz
Copy link
Contributor

@gcmerz gcmerz commented May 27, 2019

What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance ["string", "null", "int"]) whose the deserialization to a DataFrame would throw a java.lang.ArrayIndexOutOfBoundsException. The issue was that the fieldWriter relied on the index from the avro schema before nulls were filtered out.

How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running testOnly org.apache.spark.sql.avro.AvroSuite

@gcmerz gcmerz changed the title Fix for avro deserialization on union types with multiple non-null types [External/Avro] Fix for avro deserialization on union types with multiple non-null types May 27, 2019
@dongjoon-hyun
Copy link
Member

Thank you for your first contribution, @gcmerz !

@dongjoon-hyun
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented May 28, 2019

Test build #105850 has finished for PR 24722 at commit ee97960.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [External/Avro] Fix for avro deserialization on union types with multiple non-null types [SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types May 28, 2019
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left two minor comments on the test case. Could you update them? Also, I updated the PR title and description a little bit.

@dongjoon-hyun
Copy link
Member

cc @dbtsai and @gengliangwang

@gcmerz
Copy link
Contributor Author

gcmerz commented May 28, 2019

Applied the tweaks--thank you so much for the quick review!

@dongjoon-hyun
Copy link
Member

You're welcome. Thank you for swift update.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@SparkQA
Copy link

SparkQA commented May 28, 2019

Test build #105855 has finished for PR 24722 at commit 4edbe09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Merged to master and branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request May 28, 2019
…ultiple non-null types

## What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out.

## How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite`

Closes #24722 from gcmerz/master.

Authored-by: Gabbi Merz <gmerz@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 29e154b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 28, 2019

@gcmerz . What is your id in Apache JIRA? If you don't have, please create one. Then, I can assign that issue to you.

And, FYI, in GitHub personal setting, you can additionally add your Palantir email (used in this PR). Then, your commit with Palantir ID also will show your GitHub profile.

@gengliangwang
Copy link
Member

A late LGTM! Thanks for the fix @gcmerz

@gcmerz
Copy link
Contributor Author

gcmerz commented May 28, 2019

@dongjoon-hyun Made a Jira account with username "gcmerz" and added palantir email. Thanks again!

@dongjoon-hyun
Copy link
Member

It's assigned to you, @gcmerz . Welcome to the Apache Spark community!

@dongjoon-hyun
Copy link
Member

Thank you for review, @gengliangwang .

vinooganesh pushed a commit to palantir/spark that referenced this pull request May 28, 2019
…ultiple non-null types

## What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out.

## How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite`

Closes apache#24722 from gcmerz/master.

Authored-by: Gabbi Merz <gmerz@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
…ultiple non-null types

## What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out.

## How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite`

Closes apache#24722 from gcmerz/master.

Authored-by: Gabbi Merz <gmerz@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 29e154b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
…ultiple non-null types

## What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out.

## How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite`

Closes apache#24722 from gcmerz/master.

Authored-by: Gabbi Merz <gmerz@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 29e154b)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…ultiple non-null types

## What changes were proposed in this pull request?

This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out.

## How was this patch tested?

A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite`

Closes apache#24722 from gcmerz/master.

Authored-by: Gabbi Merz <gmerz@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 29e154b)

RB=2633258
G=spark-reviewers
R=ekrogen,wyzhang,wmoustaf
A=ekrogen
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants