-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Java] Handle offset
field from ArrowArray
when BufferImportTypeVisitor
imports offset buffer
#42156
Comments
Ah, thanks for the explanation. In general Arrow Java does not support CC @vibhatha |
I will take a look. |
To be clear, the source of the issue comes from apache/arrow-rs#5896 where it exports moved pointer of offer buffer of a slice of string array. Instead, we should use But actually seems either |
BufferImportTypeVisitor
for slice of string arrayoffset field from
ArrowArray when
BufferImportTypeVisitor` imports offset buffer
offset field from
ArrowArray when
BufferImportTypeVisitor` imports offset bufferoffset
field from ArrowArray
when BufferImportTypeVisitor
imports offset buffer
Ah, thanks. Yes, either way, we need to consider offset in Java and figure out how best to deal with it. |
Yes. I'm going to fix arrow-rs to have correct offset field and we also need to handle offset in Java to make it work. |
One easy method to support importing sliced arrays in Java would be to import all arrays as if they had offset 0, then use |
If implementations can handle offset buffers that do not start at zero, then C data arrays with offsets should be possible to import following one of these two cases:
The latter means that offset buffers won't necessarily start at zero, which not all implementations support right now. This is something we fixed recently in Go, also related to datafusion's shared buffer usage: #41993. If implementations can support this case, then implementing C data offsets should become simpler. |
Describe the bug, including details regarding any error messages, version, and platform.
This bug is found during debugging the issue apache/datafusion-comet#540.
We found that some string arrays' offsets are out of the range of their value buffer. I.e., a string array's value buffer is only 147456 bytes, but the offsets of the last string is (294894, 294912).
The string is output from DataFusion aggregation operator
AggregateExec
. When producing the output batch, the operator will possibly slice the output batch if it is larger than a configured size. The slice of a string array in arrow-rs, keeps original value buffer and moves the pointer of offset buffer so it is a zero-copy slice.During
importOffsets
call inBufferImportTypeVisitor
, the slice of offset can be imported correctly as it uses the moved pointer and calculates the offset buffer correctly.But when it goes to import value buffer, it calculates the capacity of it by using the difference between imported last offset and first offset. Because the imported offsets are from the slice, the calculated capacity is only for the certain slice of the value buffer.
For example, the original string array's value buffer is 346536 bytes, last offset is 346536. We take a slice of 8192 strings from it. The slice array's last offset is 294912 but the value buffer is the same (346536 bytes).
When
BufferImportTypeVisitor
imports the slice, the imported offsets are [147456, ..., 294912]. It calculates the length of value buffer is294912 - 147456 = 147456
. But actually the length of value buffer is 346536.Obviously the offsets are now out of range of the incorrect value buffer size 147456.
To be clear, the source of the issue comes from apache/arrow-rs#5896 where it exports moved pointer of offer buffer of a slice of string array. Instead, we should use
offset
field inArrowArray
for it. We are going to fix it in arrow-rs for that.But actually seems either ArrayImporter or
BufferImportTypeVisitor
also doesn't consideroffset
inArrowArray
.Component(s)
Java
The text was updated successfully, but these errors were encountered: