Correct nullable string validity map treatment #517

Merged
merged 14 commits into from
Feb 20, 2023

Contributor

@eddelbuettel eddelbuettel commented Feb 9, 2023

Work on the TileDB SOMA package and its independent implementation of the read path (relying on Arrow buffers) revealed that we encoded missingness incorrectly for nullable strings. (The same error was present on both read and write, so round-trip tests passed. Numeric values were encoded correctly; only strings are affected.)

This PR corrects this. It also tightens one test on deletion by switching to a triple || condition.
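
To see why round-trip tests could not catch this, here is a minimal sketch in plain Python. It assumes, purely for illustration, that the error was an inverted validity flag; the actual encoding mistake may have differed. The helpers `buggy_write`, `buggy_read`, and `independent_read` are hypothetical names, not TileDB or Arrow APIs.

```python
# Hypothetical illustration only: none of these helpers exist in TileDB or Arrow.

def to_validity(values):
    """Correct convention: 1 marks a present value, 0 marks a missing one."""
    return [0 if v is None else 1 for v in values]

def buggy_write(values):
    """Assumed bug: the flags are stored inverted (1 = missing)."""
    data = ["" if v is None else v for v in values]
    flags = [1 - f for f in to_validity(values)]
    return data, flags

def buggy_read(data, flags):
    """Reader sharing the inverted assumption, so the two errors cancel out."""
    return [None if f == 1 else d for d, f in zip(data, flags)]

def independent_read(data, flags):
    """Reader following the correct convention (1 = present)."""
    return [d if f == 1 else None for d, f in zip(data, flags)]

values = ["a", "b", None, "d"]
stored = buggy_write(values)

round_trip = buggy_read(*stored)        # same code path on both sides
cross_read = independent_read(*stored)  # a second, independent implementation

print(round_trip == values)  # True: the round trip hides the error
print(cross_read == values)  # False: the independent reader exposes it
```

Under that assumption, only a reader that does not share the writer's convention can detect the problem, which matches how the issue surfaced here.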

Example from https://app.shortcut.com/tiledb-inc/story/25027/r-unable-to-write-arrow-field-with-nullable-strings below the fold.

#!/usr/bin/env Rscript

suppressMessages({
    library(arrow)
    library(tiledbsoma)
})

if (getwd() == "/home/edd") setwd("~/git/tiledb-adhoc/soma/sc25027")

#uri <- tempfile()
uri <- "r_soma_2"
if (dir.exists(uri)) unlink(uri, recursive=TRUE)
sch <- arrow::schema(arrow::field("soma_joinid", arrow::int64(), nullable = FALSE),
                     arrow::field("str_not_nullable", arrow::utf8(), nullable = FALSE),
                     arrow::field("str_nullable", arrow::utf8(), nullable = TRUE))

tbl0 <- arrow::arrow_table(soma_joinid = seq_along(letters[1:10]),
                           str_not_nullable = letters[1:10],
                           str_nullable = c(letters[1:4], NA, letters[6:8], NA, letters[10]),
                           schema = sch)
cat("== Before write:\n")
print(dplyr::collect(tbl0))
sdf <- SOMADataFrame$new(uri)
sdf$create(sch, index_column_names = "soma_joinid")
sdf$write(tbl0)
tbl1 <- sdf$read()
cat("== After read:\n")
print(dplyr::collect(tbl1))
all.equal(tbl1$str_not_nullable$as_vector(), tbl0$str_not_nullable$as_vector())
all.equal(tbl1$str_nullable$as_vector(), tbl0$str_nullable$as_vector())
print(tbl0$str_nullable$as_vector())
print(tbl1$str_nullable$as_vector())

generates

> suppressMessages({
+ library(arrow)
+ library(tiledbsoma)
+ })
> if (getwd() == "/home/edd") setwd("~/git/tiledb-adhoc/soma/sc25027")
> uri <- "r_soma_2"
> if (dir.exists(uri)) unlink(uri, recursive=TRUE)
> sch <- arrow::schema(arrow::field("soma_joinid", arrow::int64(), nullable = FALSE),
+ arrow::field("str_not_nullable", arrow::utf8(), nullable = FALSE),
+ arrow::field("str_nullable", arrow::utf8(), nullable = TRUE))
> tbl0 <- arrow::arrow_table(soma_joinid = seq_along(letters[1:10]),
+ str_not_nullable = letters[1:10],
+ str_nullable = c(letters[1:4], NA, letters[6:8], NA, letters[10]),
+ schema = sch)
> cat("== Before write:\n")
== Before write:
> print(dplyr::collect(tbl0))
# A tibble: 10 × 3
   soma_joinid str_not_nullable str_nullable
         <int> <chr>            <chr>       
 1           1 a                a           
 2           2 b                b           
 3           3 c                c           
 4           4 d                d           
 5           5 e                NA          
 6           6 f                f           
 7           7 g                g           
 8           8 h                h           
 9           9 i                NA          
10          10 j                j           
> sdf <- SOMADataFrame$new(uri)
> sdf$create(sch, index_column_names = "soma_joinid")
> sdf$write(tbl0)
> tbl1 <- sdf$read()
> cat("== After read:\n")
== After read:
> print(dplyr::collect(tbl1))
# A tibble: 10 × 3
   soma_joinid str_not_nullable str_nullable
         <int> <chr>            <chr>       
 1           1 a                a           
 2           2 b                b           
 3           3 c                c           
 4           4 d                d           
 5           5 e                NA          
 6           6 f                f           
 7           7 g                g           
 8           8 h                h           
 9           9 i                NA          
10          10 j                j           
> all.equal(tbl1$str_not_nullable$as_vector(), tbl0$str_not_nullable$as_vector())
[1] TRUE
> all.equal(tbl1$str_nullable$as_vector(), tbl0$str_nullable$as_vector())
[1] TRUE
> print(tbl0$str_nullable$as_vector())
 [1] "a" "b" "c" "d" NA  "f" "g" "h" NA  "j"
> print(tbl1$str_nullable$as_vector())
 [1] "a" "b" "c" "d" NA  "f" "g" "h" NA  "j"
> 

and Python is happy too:

$ ./arrow_nullable.py
Reading R back...
pyarrow.Table
soma_joinid: int64
str_not_nullable: large_string
str_nullable: large_string
----
soma_joinid: [[1,2,3,4,5,6,7,8,9,10]]
str_not_nullable: [["a","b","c","d","e","f","g","h","i","j"]]
str_nullable: [["a","b","c","d",null,"f","g","h",null,"j"]]
$
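
For reference, the Arrow columnar format stores validity as a bitmap with least-significant-bit numbering, where a set bit means the value is present. The following sketch, in plain Python without pyarrow, packs the bitmap for the `str_nullable` column above. The helpers `pack_validity` and `is_valid` are illustrative names, and the sketch omits the buffer padding that real Arrow buffers carry.

```python
def pack_validity(values):
    """Pack per-element validity into bytes, least-significant bit first (1 = valid)."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

def is_valid(bitmap, i):
    """Return True when bit i of the bitmap is set, i.e. element i is present."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1

# Same content as str_nullable in the R example: rows 5 and 9 are missing,
# which are indices 4 and 8 when counting from zero.
letters = "abcdefghij"
str_nullable = [None if i in (4, 8) else letters[i] for i in range(10)]

bitmap = pack_validity(str_nullable)
null_positions = [i for i in range(len(str_nullable)) if not is_valid(bitmap, i)]

print(bitmap.hex())     # ef02
print(null_positions)   # [4, 8]
```

The first byte `0xef` has every bit set except bit 4, and the second byte `0x02` has bit 1 set for the tenth element while bit 0 stays clear for the ninth.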

@shortcut-integration

This pull request has been linked to Shortcut Story #25027: R unable to write arrow field with nullable strings.

Member

@ihnorton ihnorton left a comment

Moved to draft pending discussion.

@ihnorton ihnorton marked this pull request as draft February 9, 2023 11:18
@eddelbuettel eddelbuettel force-pushed the de/sc-25027/nullable_string_validity branch from b5bfc23 to 9373b19 Compare February 17, 2023 02:17
@eddelbuettel eddelbuettel marked this pull request as ready for review February 17, 2023 14:46
@eddelbuettel eddelbuettel force-pushed the de/sc-25027/nullable_string_validity branch from 42094c9 to 00df835 Compare February 17, 2023 15:02
Member

@ihnorton ihnorton left a comment

Per discussion, let's

  • remove use of the copy-then-delete in the conversion script
  • add copying of metadata from the old to new array

Otherwise LGTM. Let's get an ACK from @Shelnutt2 as well.

@eddelbuettel eddelbuettel force-pushed the de/sc-25027/nullable_string_validity branch from 3fa3de3 to b70ced5 Compare February 19, 2023 23:42

3 participants