Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reading string array values #282

Merged
merged 5 commits into from
Jun 12, 2022

Conversation

adamreeve
Copy link
Contributor

We read strings as ByteArray physical values from the Arrow library, and these values are just a view (pointer + length) into an internal buffer in the Arrow library. When reading arrays of string values, we don't immediately convert these to dotnet strings, but buffer these ByteArray values and then convert them to an array of dotnet strings once we have read all the values for the array.

This causes issues when Arrow internally reads a new page of data before we've converted all our currently buffered values. A new data page read can be triggered just by calling ColumnReader.HasNext. In that case we're no longer guaranteed that the previously read values are pointing to the correct data. The memory they're pointing at could have been overwritten so we end up reading garbage, or it could have been freed, so we get an access violation exception.

This PR fixes reading string array values by always converting data to logical values immediately after reading the physical values in the BufferedReader.

This was a little bit awkward to implement as the behaviour of the converter delegate depends on whether the leaf values are nullable or not. For nullable values we read definition levels and will include null values in the converted data, but for non-nullable leaf values (even when within a nullable group or nested lists) we ignore the definition levels during conversion and just return a list of the defined values. Previously this was fine as the ReadArrayLeafLevel method would build up a list of definition levels itself corresponding to all (possibly null) values in the leaf level array, but now the BufferedReader needs to know about this distinction and handle getting values differently.

I don't think it would make sense to change this behaviour so that converters would always behave consistently. That would probably reduce performance compared to being able to naively copy values when the physical and logical type have the same representation in memory, and the LogicalReadConverterFactory is part of the public API so it would be a breaking change for anyone that has implemented their own conversion logic.

@adamreeve adamreeve requested a review from philjdf June 10, 2022 05:04
We need to convert buffered physical values immediately, otherwise
we can get AcessViolationExceptions trying to convert ByteArray values
that point to freed memory.
@adamreeve adamreeve merged commit 1bc7bb8 into G-Research:master Jun 12, 2022
@adamreeve adamreeve deleted the string_array_fix branch June 12, 2022 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants