-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow: add support for null vectors #10953
base: main
Are you sure you want to change the base?
Conversation
Fix NullPointerException when trying to add the vector's class name to the message for an UnsupportedOperationException
This test more closely follows the reproduction steps described in issue apache#10275
…eaderTest.java Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
…eaderTest.java Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
This solution hacks in a VectorHolder instance built specifically for the missing column. Implementing this hack allowed me to explore what would be needed to support vectorized reading of null columns
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sl255051 I spent some time understanding the problem and going through the code and fixing it myself (without first looking at this PR). I think you're on the right track here to handle the root cause and we should have a NullAccessor
that internally tracks a NullVector
.
I would suggest to squash all the commits to a single one and then go through my comments
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
...src/test/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactoryTest.java
Outdated
Show resolved
Hide resolved
Thank you, @nastra, for your help. I will be on vacation next week. I will pick this up again when I return on September 3.
|
Update unit test to write a row after the schema has been altered. The test will then verify that all rows written both before and after the schema change can be correctly read.
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
Adding a second row was creating test complexity. The order in which the two rows are read asynchronously was creating randomness thus making it hard to predict the expected values. I'm not sure adding a second row of data was really adding any benefit anyway.
Those two test helper methods are highly tuned for a specific schema, a schema that does not exist in this test.
@amogh-jahagirdar This PR is ready for your review |
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
These unit tests, particularly `testIsDummy1` and `testIsDummy2`, exposed a bug in the code where the `isDummy` method no longer returned the expected value.
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/VectorHolderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/VectorHolderTest.java
Outdated
Show resolved
Hide resolved
arrow/src/test/java/org/apache/iceberg/arrow/vectorized/VectorHolderTest.java
Outdated
Show resolved
Hide resolved
@amogh-jahagirdar I am not planning to make any additional changes. Please review |
Add new class NullAccessor to aid reading null columns. A null column is a column that exists in the Iceberg schema, but does not exist in the parquet schema. A column would exist in Iceberg, but not in Parquet when a column is added to the Iceberg schema, but no new rows have been added to the table. See ArrowReadetTest.testReadColumnThatDoesNotExistInParquetSchema for an example.