IPC big endian offsets are not translated #859

alamb · 2021-10-24T12:26:25Z

Describe the bug
../arrow-ipc-stream/integration/1.0.0-bigendian/generated_dictionary.arrow_file contains a UTF8 Arrow array somewhere encoded in big endian.

When this file is read by the arrow-rs implementation, the offsets buffer remains big endian, even though the code assumes the offsets buffer holds values in native endianness (e.g. the offsets of the created arrow-rs buffer are incorrect on little endian machines like x86).

To Reproduce
See test read_dictionary_be_not_implemented #810

It fails with Length spanned by offsets in Utf8 (687865856) is larger than the values array size (41)
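The number in that error is consistent with a byte-order mix-up: 687865856 is 0x29000000, which is 41 (0x29) with its four bytes reversed. A minimal sketch of the misread, using only the standard library (the function name is hypothetical and is not part of arrow-rs):

```rust
/// Reinterpret a 4-byte offset the wrong way round, as a reader that ignores
/// the file's byte order would on a little-endian host.
/// Hypothetical helper for illustration only; not an arrow-rs API.
fn misread_be_offset_as_le(offset: i32) -> i32 {
    // Bytes as a big-endian writer would lay them out in the file.
    let on_disk = offset.to_be_bytes();
    // The same bytes decoded as little endian, with no swap applied.
    i32::from_le_bytes(on_disk)
}
```

Calling `misread_be_offset_as_le(41)` yields 687865856, the length reported in the error above, while the values array really is 41 bytes long.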

Expected behavior
The test should pass (likely by translating offsets from big endian to native endianness)
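As a rough illustration of what "translating offsets" could mean, the sketch below decodes a raw buffer of big-endian 32-bit offsets (the offset width used by Utf8 arrays) into native-endian values. It assumes the buffer is available as a plain byte slice; the function name is hypothetical and this is not how arrow-rs itself is structured.

```rust
/// Decode a buffer of big-endian i32 offsets into native-endian values.
/// Returns `None` if the buffer length is not a multiple of 4 bytes.
/// Hypothetical helper for illustration only; not an arrow-rs API.
fn offsets_from_big_endian(raw: &[u8]) -> Option<Vec<i32>> {
    if raw.len() % 4 != 0 {
        return None;
    }
    Some(
        raw.chunks_exact(4)
            // Each chunk is exactly 4 bytes, so the indexing cannot panic.
            .map(|c| i32::from_be_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

Because `i32::from_be_bytes` always interprets its input as big endian, the same code gives correct results on both little-endian and big-endian hosts.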

Additional context
Found while adding validation in #810


tustvold · 2022-04-16T16:28:09Z

The reference here (https://arrow.apache.org/docs/format/Columnar.html#byte-order-endianness) suggests it is acceptable for implementations to simply not support files with non-native byte order:

"At first we will return an error when trying to read a Schema with an endianness that does not match the underlying system. The reference implementation is focused on Little Endian and provides tests for it. Eventually we may provide automatic conversion via byte swapping."

I therefore think an acceptable approach would be to return an error if attempting to read a file with a non-native byte order
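A minimal sketch of such a check, assuming the reader has already extracted the byte order declared in the file's schema. The `ByteOrder` enum, the `EndiannessMismatch` error, and `check_byte_order` are hypothetical names for illustration and do not reflect the actual arrow-rs types.

```rust
use std::fmt;

/// Byte order declared by a file or used by the host (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ByteOrder {
    Little,
    Big,
}

impl ByteOrder {
    /// Byte order of the machine this code was compiled for.
    fn native() -> Self {
        if cfg!(target_endian = "little") {
            ByteOrder::Little
        } else {
            ByteOrder::Big
        }
    }
}

/// Error returned when the file and host byte orders differ (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
struct EndiannessMismatch {
    file: ByteOrder,
    host: ByteOrder,
}

impl fmt::Display for EndiannessMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "file byte order {:?} does not match host byte order {:?}",
            self.file, self.host
        )
    }
}

impl std::error::Error for EndiannessMismatch {}

/// Reject files whose declared byte order differs from the host's,
/// instead of building arrays from unswapped buffers.
fn check_byte_order(file: ByteOrder) -> Result<(), EndiannessMismatch> {
    let host = ByteOrder::native();
    if file == host {
        Ok(())
    } else {
        Err(EndiannessMismatch { file, host })
    }
}
```

Running the check before any buffers are decoded means a mismatched file fails early with a clear message rather than later with a confusing offsets validation error.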

alamb · 2022-04-17T11:25:26Z

I agree an error would be better than an invalid array. Later on, if it is important, we can add support for non-native byte order.


alamb added the bug label Oct 24, 2021

viirya mentioned this issue Apr 16, 2022

Use littleendian arrow files for projection_should_work #1573

Merged

tustvold added the good first issue Good for newcomers label Apr 16, 2022

alamb mentioned this issue Jan 5, 2023

Support Arrow IPC Big Endian / Cross Endian #3459

Closed

pangiole mentioned this issue Jan 13, 2024

Result into error in case of endianness mismatches #5301

Merged

tustvold closed this as completed in #5301 Jan 15, 2024
