Arrow: add support for null vectors #10953

slessard · 2024-08-16T19:53:57Z

Add a new class, NullAccessor, to aid reading null columns. A null column is a column that exists in the Iceberg schema but not in the Parquet schema. This happens when a column is added to the Iceberg schema but no new rows have been written to the table since. See ArrowReaderTest.testReadColumnThatDoesNotExistInParquetSchema for an example.
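The situation described above can be sketched in plain Java. This is only an illustration, not code from this PR: `MissingColumnSketch` and `missingColumns` are hypothetical names, and schemas are modeled as lists of column names. Iceberg itself resolves columns by field ID; names are used here only to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: schemas are modeled as plain lists of column names.
final class MissingColumnSketch {
  private MissingColumnSketch() {}

  // Returns the columns that are present in the table schema but absent from
  // the data file schema. A reader has to produce all-null values for these.
  static List<String> missingColumns(List<String> tableSchema, List<String> fileSchema) {
    Set<String> inFile = new HashSet<>(fileSchema);
    List<String> missing = new ArrayList<>();
    for (String column : tableSchema) {
      if (!inFile.contains(column)) {
        missing.add(column);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // The data file was written before the schema change.
    List<String> fileSchema = List.of("a", "b");
    // Column "z" was added to the table afterwards, and no rows were written since.
    List<String> tableSchema = List.of("a", "b", "z");
    System.out.println(missingColumns(tableSchema, fileSchema));
  }
}
```

In this sketch, column `z` is the "null column": it is part of the table schema, yet the only data file on disk has no values for it.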

Fix NullPointerException when trying to add the vector's class name to the message for an UnsupportedOperationException
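The kind of failure this commit message describes can be illustrated with a small sketch. It assumes the exception message was built by calling `getClass()` on a vector reference that may be null; `UnsupportedVectorMessages`, `describe`, and `unsupported` are hypothetical names and not the code changed in this PR.

```java
// Hypothetical helper, not the actual Iceberg code.
final class UnsupportedVectorMessages {
  private UnsupportedVectorMessages() {}

  // Calling vector.getClass() directly throws a NullPointerException when the
  // vector is null, which hides the UnsupportedOperationException we meant to raise.
  static String describe(Object vector) {
    return vector == null ? "null" : vector.getClass().getSimpleName();
  }

  // Builds the intended exception without dereferencing a null vector.
  static UnsupportedOperationException unsupported(Object vector) {
    return new UnsupportedOperationException("Unsupported vector class: " + describe(vector));
  }
}
```

With a guard like this, the caller sees the intended UnsupportedOperationException even when no vector was ever allocated for the column.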

This test more closely follows the reproduction steps described in issue apache#10275

This solution hacks in a VectorHolder instance built specifically for the missing column. Implementing this hack allowed me to explore what would be needed to support vectorized reading of null columns.

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java

nastra

@sl255051 I spent some time understanding the problem and going through the code and fixing it myself (without first looking at this PR). I think you're on the right track here to handle the root cause and we should have a NullAccessor that internally tracks a NullVector.

I would suggest squashing all the commits into a single one and then going through my comments.
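The idea of a NullAccessor that internally tracks a NullVector can be sketched as follows. This is a plain-Java illustration under stated assumptions, not the implementation in this PR: `AllNullVector` is a hypothetical stand-in for Arrow's NullVector, and `ColumnAccessor` and `NullAccessorSketch` are hypothetical names with a deliberately tiny interface.

```java
// Hypothetical stand-in for an Arrow vector whose values are all null.
final class AllNullVector {
  private final int valueCount;

  AllNullVector(int valueCount) {
    if (valueCount < 0) {
      throw new IllegalArgumentException("valueCount must be >= 0: " + valueCount);
    }
    this.valueCount = valueCount;
  }

  int valueCount() {
    return valueCount;
  }

  // Every in-range position is null; out-of-range positions are rejected.
  boolean isNull(int index) {
    if (index < 0 || index >= valueCount) {
      throw new IndexOutOfBoundsException("index " + index + ", valueCount " + valueCount);
    }
    return true;
  }
}

// Hypothetical, minimal accessor interface for this sketch.
interface ColumnAccessor {
  boolean isNullAt(int rowId);

  long getLong(int rowId);
}

// Accessor for a column that has no data in the file: it reports null for
// every row and refuses to hand out typed values.
final class NullAccessorSketch implements ColumnAccessor {
  private final AllNullVector vector;

  NullAccessorSketch(AllNullVector vector) {
    this.vector = vector;
  }

  @Override
  public boolean isNullAt(int rowId) {
    return vector.isNull(rowId);
  }

  @Override
  public long getLong(int rowId) {
    throw new UnsupportedOperationException("Row " + rowId + " has no value: column is entirely null");
  }
}
```

Callers are expected to check `isNullAt` before asking for a typed value, so the typed getter only needs to fail loudly.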

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedReaderBuilder.java

arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java

...src/test/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactoryTest.java

sl255051 · 2024-08-24T01:02:14Z

Thank you, @nastra, for your help. I will be on vacation next week. I will pick this up again when I return on September 3.

Update unit test to write a row after the schema has been altered. The test will then verify that all rows written both before and after the schema change can be correctly read.

arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java

Adding a second row made the test more complex. Because the two rows are read asynchronously, their order is nondeterministic, which makes the expected values hard to predict. I'm not sure the second row of data was really adding any benefit anyway.

Those two test helper methods are highly tuned for a specific schema, a schema that does not exist in this test.

slessard · 2024-09-20T22:51:53Z

@amogh-jahagirdar This PR is ready for your review

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java

These unit tests, particularly `testIsDummy1` and `testIsDummy2`, exposed a bug in the code where the `isDummy` method no longer returned the expected value.

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java

arrow/src/test/java/org/apache/iceberg/arrow/vectorized/VectorHolderTest.java

slessard · 2024-09-26T16:33:35Z

@amogh-jahagirdar I am not planning to make any additional changes. Please review

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java

sl255051 and others added 23 commits May 7, 2024 18:21

apache#10275 - fix NullPointerException

ac6440a

Fix NullPointerException when trying to add the vector's class name to the message for an UnsupportedOperationException

Change how the unit test asserts the correct exception is thrown

becf6f7

Remove test dependency on Apache Spark

4e2cb86

Merge branch 'main' into issue-10275

1193d02

Add new unit test

12bc3de

This test more closely follows the reproduction steps described in issue apache#10275

Merge branch 'apache:main' into issue-10275

d8f3e13

Add comments to unit test

bb4e010

Merge branch 'issue-10275' of https://github.com/slessard/iceberg int…

6e7a1aa

…o issue-10275

Update arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowR…

28451a5

…eaderTest.java Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>

Update arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowR…

24a9932

…eaderTest.java Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>

Address code review comments

9bcb2b1

Merge branch 'apache:main' into issue-10275

7a25b52

Merge branch 'main' into issue-10275

a31bf94

Merge branch 'main' into issue-10275

44a7f91

Merge branch 'apache:main' into issue-10275

c2eaf24

DRAFT: alternate solution 2: hack in support for NullVector

e323db7

This solution hacks in a VectorHolder instance built specifically for the missing column. Implementing this hack allowed me to explore what would be needed to support vectorized reading of null columns.

Merge branch 'apache:main' into issue-10275-alt2

061ab02

Issue 10275 - Add rough draft vector support for null columns

bf0c905

Merge branch 'issue-10275-alt2' into issue-10275-alt3

5610dd4

Merge branch 'main' into issue-10275-alt3

a13415d

Merge branch 'main' into issue-10275-alt3

2eaa63f

remove obsolete comment; adapt unit test to match new functionality

62108da

Merge branch 'apache:main' into issue-10275-alt3

7115e93

github-actions bot added the arrow label Aug 16, 2024

nastra reviewed Aug 22, 2024

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java (outdated)

nastra reviewed Aug 22, 2024

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java (outdated)

nastra reviewed Aug 22, 2024

arrow/src/main/java/org/apache/iceberg/arrow/vectorized/GenericArrowVectorAccessorFactory.java (outdated)

nastra reviewed Aug 22, 2024

Address code review feedback

08bb07c

nastra changed the title ~~DRAFT - Issue 10275 - Add support for null vectors~~ Arrow: add support for null vectors Sep 13, 2024

slessard added 2 commits September 17, 2024 10:21

Adopt changes suggested by @nastra in code review

cda0423

Update unit test to add a second row to the table being tested

9aec9e5

Update unit test to write a row after the schema has been altered. The test will then verify that all rows written both before and after the schema change can be correctly read.

nastra reviewed Sep 18, 2024

arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java (outdated)

nastra reviewed Sep 18, 2024

arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java (outdated)

slessard added 4 commits September 20, 2024 11:22

Code cleanup

0c87dc7

Expand calls to checkAllVectorTypes and checkAllVectorValues

fe60793

Those two test helper methods are highly tuned for a specific schema, a schema that does not exist in this test.

replace hard-coded magic values with descriptively named variables

1a3896b

slessard marked this pull request as ready for review September 20, 2024 22:51

slessard commented Sep 24, 2024

slessard added 4 commits September 24, 2024 11:30

Add unit tests for VectorHolder

5c3b460

These unit tests, particularly `testIsDummy1` and `testIsDummy2`, exposed a bug in the code where the `isDummy` method no longer returned the expected value.

Update isDummy method to remove one condition that would never be r…

a2df95c

…eached

Fix code style issues

bbc776d

Update VectorHolder unit tests for isDummy method

2bf5b2f