[r/python] Python `str`, R `character`, and Python `bytes` are cast to large variants #3507

mojaveazure · 2024-12-20T19:58:51Z

When writing a metadata column consisting of str or bytes in Python, or character in R, the types are cast to large variants when read back (Arrow Large UTF8 for str and character, Arrow Large Binary for bytes). I don't know if this is intentional, but if it is, this behavior should be documented somewhere.

R example:

> df <- data.frame(soma_joinid = bit64::seq.integer64(0L, 99L), str = character(100L))
> tbl <- arrow::as_arrow_table(df)
> sdf <- SOMADataFrameCreate("df-raw", schema = tbl$schema, domain = list(soma_joinid = c(0L, 100L)))
> sdf$write(tbl)
> sdf$close()
> sdf <- SOMADataFrameOpen("df-raw")
> sdf$read()$concat()
[1] Table
100 rows x 3 columns
$soma_joinid <int64 not null>
$str <large_string>


johnkerl · 2024-12-20T20:23:44Z

@mojaveazure this is 100% intentional and is documented at least in various comments around the codebase (which I can collect for you at some point). Agreed the documentation needs to be clearer / more centralized.

More info later, but for now:

- Arrow has string / binary, which have 32-bit offsets, and large_string / large_binary, which have 64-bit offsets
- TileDB Embedded only has 64-bit offsets
- We decided as a team that on outgest we would only produce the large variants

mojaveazure · 2024-12-20T22:13:42Z

> this is 100% intentional

Great, as long as it's intentional and not a bug, I'm 100% on board. But yeah, more documentation would be great, especially user-facing documentation, so that users aren't confused about why one type goes in and a different type comes out.

mojaveazure mentioned this issue Dec 20, 2024

[r/python] Standardize mappings of native, Arrow, and TileDB types #3501

Open

johnkerl added the documentation label Dec 20, 2024
