Fix reading string array values #282

adamreeve · 2022-06-10T05:04:29Z

We read strings as ByteArray physical values from the Arrow library, and these values are just a view (pointer + length) into an internal buffer in the Arrow library. When reading arrays of string values, we don't immediately convert these to dotnet strings, but buffer these ByteArray values and then convert them to an array of dotnet strings once we have read all the values for the array.

This causes issues when Arrow internally reads a new page of data before we've converted all our currently buffered values. A new data page read can be triggered just by calling ColumnReader.HasNext. In that case we're no longer guaranteed that the previously read values are pointing to the correct data. The memory they're pointing at could have been overwritten so we end up reading garbage, or it could have been freed, so we get an access violation exception.

This PR fixes reading string array values by always converting data to logical values immediately after reading the physical values in the BufferedReader.

This was a little bit awkward to implement as the behaviour of the converter delegate depends on whether the leaf values are nullable or not. For nullable values we read definition levels and will include null values in the converted data, but for non-nullable leaf values (even when within a nullable group or nested lists) we ignore the definition levels during conversion and just return a list of the defined values. Previously this was fine as the ReadArrayLeafLevel method would build up a list of definition levels itself corresponding to all (possibly null) values in the leaf level array, but now the BufferedReader needs to know about this distinction and handle getting values differently.

I don't think it would make sense to change this behaviour so that converters would always behave consistently. That would probably reduce performance compared to being able to naively copy values when the physical and logical type have the same representation in memory, and the LogicalReadConverterFactory is part of the public API so it would be a breaking change for anyone that has implemented their own conversion logic.

…Exception

We need to convert buffered physical values immediately, otherwise we can get AcessViolationExceptions trying to convert ByteArray values that point to freed memory.

adamreeve requested a review from philjdf June 10, 2022 05:04

adamreeve added 4 commits June 10, 2022 17:05

Add test for reading nested string arrays that throws AccessViolation…

47f2804

…Exception

Replace test with one that generates data at runtime

08ade67

Fix reading nested strings

d511a27

We need to convert buffered physical values immediately, otherwise we can get AcessViolationExceptions trying to convert ByteArray values that point to freed memory.

Tidy up unused parameter and fix formatting

2d6d797

adamreeve force-pushed the string_array_fix branch from 56aa92e to 2d6d797 Compare June 10, 2022 05:05

Attempt to fix test build failure in CI

372d646

philjdf approved these changes Jun 10, 2022

View reviewed changes

adamreeve merged commit 1bc7bb8 into G-Research:master Jun 12, 2022

adamreeve deleted the string_array_fix branch June 12, 2022 21:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix reading string array values #282

Fix reading string array values #282

adamreeve commented Jun 10, 2022

Fix reading string array values #282

Fix reading string array values #282

Conversation

adamreeve commented Jun 10, 2022