Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow: add support for null vectors #10953

Open
wants to merge 45 commits into
base: main
Choose a base branch
from

Conversation

slessard
Copy link

@slessard slessard commented Aug 16, 2024

Add new class NullAccessor to aid reading null columns. A null column is a column that exists in the Iceberg schema, but does not exist in the parquet schema. A column would exist in Iceberg, but not in Parquet when a column is added to the Iceberg schema, but no new rows have been added to the table. See ArrowReadetTest.testReadColumnThatDoesNotExistInParquetSchema for an example.

sl255051 and others added 23 commits May 7, 2024 18:21
Fix NullPointerException when trying to add the vector's class name to the message for an UnsupportedOperationException
This test more closely follows the reproduction steps described in issue apache#10275
…eaderTest.java

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
…eaderTest.java

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
This solution hacks in a VectorHolder instance built specifically for the missing column. Implementing this hack allowed me to explore what would be needed to support vectorized reading of null columns
@github-actions github-actions bot added the arrow label Aug 16, 2024
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sl255051 I spent some time understanding the problem and going through the code and fixing it myself (without first looking at this PR). I think you're on the right track here to handle the root cause and we should have a NullAccessor that internally tracks a NullVector.

I would suggest to squash all the commits to a single one and then go through my comments

@sl255051
Copy link
Contributor

sl255051 commented Aug 24, 2024 via email

@nastra nastra changed the title DRAFT - Issue 10275 - Add support for null vectors Arrow: add support for null vectors Sep 13, 2024
Update unit test to write a row after the schema has been altered. The test will then verify that all rows written both before and after the schema change can be correctly read.
Adding a second row was creating test complexity. The order in which the two rows are read asynchronously was creating randomness thus making it hard to predict the expected values. I'm not sure adding a second row of data was really adding any benefit anyway.
Those two test helper methods are highly tuned for a specific schema, a schema that does not exist in this test.
@slessard slessard marked this pull request as ready for review September 20, 2024 22:51
@slessard
Copy link
Author

@amogh-jahagirdar This PR is ready for your review

These unit tests, particularly `testIsDummy1` and `testIsDummy2`, exposed a bug in the code where the `isDummy` method no longer returned the expected value.
@slessard
Copy link
Author

@amogh-jahagirdar I am not planning to make any additional changes. Please review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants