Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] Handle offset field from ArrowArray when BufferImportTypeVisitor imports offset buffer #42156

Open
viirya opened this issue Jun 15, 2024 · 8 comments

Comments

@viirya
Copy link
Member

viirya commented Jun 15, 2024

Describe the bug, including details regarding any error messages, version, and platform.

This bug is found during debugging the issue apache/datafusion-comet#540.

We found that some string arrays' offsets are out of the range of their value buffer. I.e., a string array's value buffer is only 147456 bytes, but the offsets of the last string is (294894, 294912).

The string is output from DataFusion aggregation operator AggregateExec. When producing the output batch, the operator will possibly slice the output batch if it is larger than a configured size. The slice of a string array in arrow-rs, keeps original value buffer and moves the pointer of offset buffer so it is a zero-copy slice.

During importOffsets call in BufferImportTypeVisitor, the slice of offset can be imported correctly as it uses the moved pointer and calculates the offset buffer correctly.

But when it goes to import value buffer, it calculates the capacity of it by using the difference between imported last offset and first offset. Because the imported offsets are from the slice, the calculated capacity is only for the certain slice of the value buffer.

For example, the original string array's value buffer is 346536 bytes, last offset is 346536. We take a slice of 8192 strings from it. The slice array's last offset is 294912 but the value buffer is the same (346536 bytes).

When BufferImportTypeVisitor imports the slice, the imported offsets are [147456, ..., 294912]. It calculates the length of value buffer is 294912 - 147456 = 147456. But actually the length of value buffer is 346536.

Obviously the offsets are now out of range of the incorrect value buffer size 147456.

To be clear, the source of the issue comes from apache/arrow-rs#5896 where it exports moved pointer of offer buffer of a slice of string array. Instead, we should use offset field in ArrowArray for it. We are going to fix it in arrow-rs for that.

But actually seems either ArrayImporter or BufferImportTypeVisitor also doesn't consider offset in ArrowArray.

Component(s)

Java

@lidavidm
Copy link
Member

Ah, thanks for the explanation.

In general Arrow Java does not support offset in the same way that C++ and Rust do (this is purely a fiction of their implementations and not part of IPC, though now C Data forces Java to deal with it). But we can correct the problem here at least

CC @vibhatha

@vibhatha
Copy link
Collaborator

I will take a look.

@viirya
Copy link
Member Author

viirya commented Jun 15, 2024

To be clear, the source of the issue comes from apache/arrow-rs#5896 where it exports moved pointer of offer buffer of a slice of string array. Instead, we should use offset field in ArrowArray for it. We are going to fix it in arrow-rs for that.

But actually seems either ArrayImporter or BufferImportTypeVisitor also doesn't consider offset in ArrowArray. So we still need to make BufferImportTypeVisitor handle offset correctly.

@viirya viirya changed the title [Java] Incorrect calculation of data buffer length in BufferImportTypeVisitor for slice of string array [Java] Handle offset field from ArrowArray when BufferImportTypeVisitor` imports offset buffer Jun 15, 2024
@viirya viirya changed the title [Java] Handle offset field from ArrowArray when BufferImportTypeVisitor` imports offset buffer [Java] Handle offset field from ArrowArray when BufferImportTypeVisitor imports offset buffer Jun 15, 2024
@lidavidm
Copy link
Member

Ah, thanks. Yes, either way, we need to consider offset in Java and figure out how best to deal with it.

@viirya
Copy link
Member Author

viirya commented Jun 17, 2024

Ah, thanks. Yes, either way, we need to consider offset in Java and figure out how best to deal with it.

Yes. I'm going to fix arrow-rs to have correct offset field and we also need to handle offset in Java to make it work.

@vibhatha
Copy link
Collaborator

vibhatha commented Aug 8, 2024

@lidavidm @bkietz I have been working on a solution for this issue in the Java side, but I later realized that we don't have any C data interface tests to evaluate something like slicing.

@bkietz
Copy link
Member

bkietz commented Aug 8, 2024

One easy method to support importing sliced arrays in Java would be to import all arrays as if they had offset 0, then use splitAndTransfer() to effect the slicing after import is complete. IIUC, since importing is zero copy and splitAndTransfer() is too (when possible) this should be about as efficient as handling the slice offset directly

@joellubi
Copy link
Member

joellubi commented Aug 8, 2024

If implementations can handle offset buffers that do not start at zero, then C data arrays with offsets should be possible to import following one of these two cases:

  • For fixed-width types, slice the value buffer starting at offset*width bytes.
  • For binary or list types, instead slice the offset buffer at offset bytes.

The latter means that offset buffers won't necessarily start at zero, which not all implementations support right now. This is something we fixed recently in Go, also related to datafusion's shared buffer usage: #41993.

If implementations can support this case, then implementing C data offsets should become simpler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants