Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We read strings as
ByteArray
physical values from the Arrow library, and these values are just a view (pointer + length) into an internal buffer in the Arrow library. When reading arrays of string values, we don't immediately convert these to dotnet strings, but buffer theseByteArray
values and then convert them to an array of dotnet strings once we have read all the values for the array.This causes issues when Arrow internally reads a new page of data before we've converted all our currently buffered values. A new data page read can be triggered just by calling
ColumnReader.HasNext
. In that case we're no longer guaranteed that the previously read values are pointing to the correct data. The memory they're pointing at could have been overwritten so we end up reading garbage, or it could have been freed, so we get an access violation exception.This PR fixes reading string array values by always converting data to logical values immediately after reading the physical values in the
BufferedReader
.This was a little bit awkward to implement as the behaviour of the converter delegate depends on whether the leaf values are nullable or not. For nullable values we read definition levels and will include null values in the converted data, but for non-nullable leaf values (even when within a nullable group or nested lists) we ignore the definition levels during conversion and just return a list of the defined values. Previously this was fine as the
ReadArrayLeafLevel
method would build up a list of definition levels itself corresponding to all (possibly null) values in the leaf level array, but now theBufferedReader
needs to know about this distinction and handle getting values differently.I don't think it would make sense to change this behaviour so that converters would always behave consistently. That would probably reduce performance compared to being able to naively copy values when the physical and logical type have the same representation in memory, and the
LogicalReadConverterFactory
is part of the public API so it would be a breaking change for anyone that has implemented their own conversion logic.