-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27858][SQL] Fix for avro deserialization on union types with multiple non-null types #24722
Conversation
Thank you for your first contribution, @gcmerz ! |
ok to test |
Test build #105850 has finished for PR 24722 at commit
|
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
Outdated
Show resolved
Hide resolved
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left two minor comments on the test case. Could you update them? Also, I updated the PR title and description a little bit.
cc @dbtsai and @gengliangwang |
Applied the tweaks--thank you so much for the quick review! |
You're welcome. Thank you for swift update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
Test build #105855 has finished for PR 24722 at commit
|
Merged to |
…ultiple non-null types ## What changes were proposed in this pull request? This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out. ## How was this patch tested? A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite` Closes #24722 from gcmerz/master. Authored-by: Gabbi Merz <gmerz@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 29e154b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@gcmerz . What is your id in Apache JIRA? If you don't have, please create one. Then, I can assign that issue to you. And, FYI, in GitHub personal setting, you can additionally add your Palantir email (used in this PR). Then, your commit with Palantir ID also will show your GitHub profile. |
A late LGTM! Thanks for the fix @gcmerz |
@dongjoon-hyun Made a Jira account with username "gcmerz" and added palantir email. Thanks again! |
It's assigned to you, @gcmerz . Welcome to the Apache Spark community! |
Thank you for review, @gengliangwang . |
…ultiple non-null types ## What changes were proposed in this pull request? This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out. ## How was this patch tested? A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite` Closes apache#24722 from gcmerz/master. Authored-by: Gabbi Merz <gmerz@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ultiple non-null types ## What changes were proposed in this pull request? This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out. ## How was this patch tested? A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite` Closes apache#24722 from gcmerz/master. Authored-by: Gabbi Merz <gmerz@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 29e154b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ultiple non-null types ## What changes were proposed in this pull request? This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out. ## How was this patch tested? A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite` Closes apache#24722 from gcmerz/master. Authored-by: Gabbi Merz <gmerz@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 29e154b) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…ultiple non-null types ## What changes were proposed in this pull request? This PR aims to fix an issue on a union avro type with more than one non-null value (for instance `["string", "null", "int"]`) whose the deserialization to a DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was that the `fieldWriter` relied on the index from the avro schema before nulls were filtered out. ## How was this patch tested? A test for the case of multiple non-null values was added and the tests were run using sbt by running `testOnly org.apache.spark.sql.avro.AvroSuite` Closes apache#24722 from gcmerz/master. Authored-by: Gabbi Merz <gmerz@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com> (cherry picked from commit 29e154b) RB=2633258 G=spark-reviewers R=ekrogen,wyzhang,wmoustaf A=ekrogen
What changes were proposed in this pull request?
This PR aims to fix an issue on a union avro type with more than one non-null value (for instance
["string", "null", "int"]
) whose the deserialization to a DataFrame would throw ajava.lang.ArrayIndexOutOfBoundsException
. The issue was that thefieldWriter
relied on the index from the avro schema before nulls were filtered out.How was this patch tested?
A test for the case of multiple non-null values was added and the tests were run using sbt by running
testOnly org.apache.spark.sql.avro.AvroSuite