Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of range when extending on a slice of string array imported through FFI #5896

Closed
viirya opened this issue Jun 15, 2024 · 3 comments · Fixed by #5895
Closed

Out of range when extending on a slice of string array imported through FFI #5896

viirya opened this issue Jun 15, 2024 · 3 comments · Fixed by #5895
Labels
arrow Changes to the arrow crate bug

Comments

@viirya
Copy link
Member

viirya commented Jun 15, 2024

Describe the bug

We move buffer pointer of offset buffer when slicing a string array and keep data buffer pointer unchanged. When exporting it through FFI, we simply export the moved pointer of the offset buffer.

When importing the array, we calculate the length of data buffer by taking the difference of last offset and first offset in the (slice) offset buffer. Note that the calculated length is not correct.

For example, the original string array's data buffer is 346536 bytes, last offset is 346536. We take a slice of 8192 strings from it, the slice of offsets are [147456, ..., 294912]. The calculated length is 294912 - 147456 = 147456. But actually the length of data buffer is 346536. So the data buffer of the imported array has incorrect length.

It doesn't cause issues so far because we access imported data buffer using pointers at most time (and we don't actually check the range). But for some cases where we access the data as slice (i.e., []), for example when we extend it, it will cause runtime panic like:

---- ffi::tests_from_ffi::test_extend_imported_string_slice stdout ----
thread 'ffi::tests_from_ffi::test_extend_imported_string_slice' panicked at arrow-data/src/transform/variable_size.rs:38:29:
range end index 10890 out of range for slice of length 5500

Note test_extend_imported_string_slice is new test I added in #5895.

It also makes the exported array possibly cannot be imported to be used in other Arrow implementations like Java Arrow: apache/arrow#42156. Because in Java Arrow, its buffer implementation ArrowBuf possibly checks index range.

To Reproduce

Expected behavior

Additional context

The bug was found during debugging apache/datafusion-comet#540.

@tustvold
Copy link
Contributor

I think this is a bug introduced by #5741

@viirya
Copy link
Member Author

viirya commented Jun 26, 2024

Yes. This how Arrow Java calculates the length of data buffer which is incorrect. I wrongly matched arrow-rs to it in the previous patch.

@alamb alamb added the arrow Changes to the arrow crate label Jul 2, 2024
@alamb
Copy link
Contributor

alamb commented Jul 2, 2024

label_issue.py automatically added labels {'arrow'} from #5895

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants