[r/python] Python str, R character, and Python bytes are cast to large variants #3507

Open
mojaveazure opened this issue Dec 20, 2024 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@mojaveazure
Member

When writing a metadata column consisting of str or bytes in Python, or character in R, the types are cast to their large variants when read back (Arrow Large UTF8 for str and character, Arrow Large Binary for bytes). I don't know if this is intentional, but if it is, this behavior should be documented somewhere.

R example:

> library(tiledbsoma)
> uri <- "df-raw"
> df <- data.frame(soma_joinid = bit64::seq.integer64(0L, 99L), str = character(100L))
> tbl <- arrow::as_arrow_table(df)
> sdf <- SOMADataFrameCreate(uri, schema = tbl$schema, domain = list(soma_joinid = c(0L, 100L)))
> sdf$write(tbl)
> sdf$close()
> sdf <- SOMADataFrameOpen(uri)
> sdf$read()$concat()
[1] Table
100 rows x 3 columns
$soma_joinid <int64 not null>
$str <large_string>
@johnkerl
Member

@mojaveazure this is 100% intentional and is documented at least in various comments around the codebase (which I can collect for you at some point). Agreed the documentation needs to be clearer / more centralized.

More info later, but for now:

  • Arrow has string / binary, which use 32-bit offsets, and large_string / large_binary, which use 64-bit offsets
  • TileDB Embedded only has 64-bit offsets
  • We decided as a team that on outgest we would only produce the large variants
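
To make the offset-width point above concrete, here is a rough sketch in plain Python (standard library only, no Arrow or TileDB involved). It mimics the offsets buffer of a variable-length string column at both widths. The helper names `pack_offsets` and `fits` are made up for illustration and are not part of any Arrow or TileDB-SOMA API.

```python
import struct

def pack_offsets(values, code):
    """Build an Arrow-style offsets buffer for a list of strings.

    code is a struct format character: "i" = 32-bit offsets (string / binary),
    "q" = 64-bit offsets (large_string / large_binary).
    """
    offsets = [0]
    for v in values:
        # Each offset marks where the next value starts in the data buffer
        offsets.append(offsets[-1] + len(v.encode("utf-8")))
    # "<" means little-endian with standard sizes (4 bytes for i, 8 for q)
    return struct.pack(f"<{len(offsets)}{code}", *offsets)

def fits(offset, code):
    """Return True if a single offset value fits in the given width."""
    try:
        struct.pack(f"<{code}", offset)
        return True
    except struct.error:
        return False

values = ["soma", "", "tiledb"]
small = pack_offsets(values, "i")   # 32-bit offsets
large = pack_offsets(values, "q")   # 64-bit offsets

print(len(small), len(large))              # 16 32
print(fits(2**31, "i"), fits(2**31, "q"))  # False True
```

The same three values need twice the offset bytes at 64 bits, but a 32-bit offset cannot address data past 2 GiB. That is presumably why a store with only 64-bit offsets hands back the large variants rather than guessing whether a narrowing cast is safe.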

@johnkerl added the documentation label Dec 20, 2024
@mojaveazure
Member Author

this is 100% intentional

Great, as long as it's intentional and not a bug, I'm 100% on board. But yeah, more documentation would be great, especially user-facing docs, so that users aren't confused about why one type goes in and a different type comes out.
