Skip to content

Commit

Permalink
Correct nullable string validity map treatment (#517)
Browse files Browse the repository at this point in the history
* Correct nullable string validity map treatment

* Add NEWS entry about breaking change [ci skip]

* Reads can now account for legacy validity mode on string buffer

* Add a test for legacy validity using an legacy array

* Adjust .Rbuildignore to allow sample data in

* Also set for shmen case

* Support legacy_validity writes

* Add converter function, script wrapper and test

* Updated news, moved script [ci skip]

* Conversion from (or to) legacy_validity copies with metadata

* Small metadata documentation edit

* Tweak bspm option for r-ci

* Identify dimension names and use as new dimensions

* Set bspm to version 0.3.10 for now
  • Loading branch information
eddelbuettel authored Feb 20, 2023
1 parent 6abcd6e commit e0fea76
Show file tree
Hide file tree
Showing 17 changed files with 280 additions and 32 deletions.
2 changes: 1 addition & 1 deletion .Rbuildignore
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
^\.Rproj\.user$
^\.travis\.yml$
^_pgkdown.yml
^.*\.tar\.gz$
^tiledb.*\.tar\.gz$
^docs
^_pkgdown.yml$
^inst/examples/quickstart_dense$
Expand Down
10 changes: 4 additions & 6 deletions .github/r-ci.sh
Original file line number Diff line number Diff line change
Expand Up @@ -192,9 +192,11 @@ BootstrapLinuxOptions() {
# InstallPandoc 'linux/debian/x86_64'
#fi
if [[ "${USE_BSPM}" != "FALSE" ]]; then
sudo Rscript --vanilla -e 'install.packages("bspm", repos="https://cran.r-project.org")'
#sudo Rscript --vanilla -e 'install.packages("bspm", repos="https://cran.r-project.org")'
sudo Rscript --vanilla -e 'remotes::install_url("https://cloud.r-project.org/src/contrib/Archive/bspm/bspm_0.3.10.tar.gz")'
echo "suppressMessages(bspm::enable())" | sudo tee --append /etc/R/Rprofile.site >/dev/null
echo "options(bspm.sudo=TRUE)" | sudo tee --append /etc/R/Rprofile.site >/dev/null
##--not needed with 0.3.10 echo "options(bspm.version.check=FALSE)" | sudo tee --append /etc/R/Rprofile.site >/dev/null
##--not needed here echo "options(bspm.sudo=TRUE)" | sudo tee --append /etc/R/Rprofile.site >/dev/null
fi
}

Expand Down Expand Up @@ -319,16 +321,12 @@ RBinaryInstall() {

InstallGithub() {
#EnsureDevtools

#echo "Installing GitHub packages: $@"
# Install the package.
#Rscript -e 'library(devtools); library(methods); install_github(commandArgs(TRUE), build_vignettes = FALSE)' "$@"
sudo Rscript -e 'remotes::install_github(commandArgs(TRUE))' "$@"
}

InstallDeps() {
#EnsureDevtools
#Rscript -e 'library(devtools); library(methods); install_deps(dependencies = TRUE)'
sudo Rscript -e 'remotes::install_deps(".")'
}

Expand Down
4 changes: 4 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,10 @@
* This release of the R package builds against [TileDB 2.14.1](https://github.com/TileDB-Inc/TileDB/releases/tag/2.14.1), and has also been tested against earlier releases as well as the development version (#502).
* Safer checking of `NAs` in `tiledb_config()` to support R 4.2 conditional lengths (#518, #519)

## Breaking Changes

* The validity map coding of nullable strings has been corrected: validity map values of one are now interpreted as valid/non-null for full compatibility with other TileDB projects. Previously written arrays with nullable strings can be read by setting the config option `r.legacy_validity_mode` to `true`; the option also permits to write to an older installation. A conversion helper script is provided in `scripts/legacy_validity_convert.r`. (#517)

## Improvements

* Attributes can now be created, written and read from in (explicit) UTF8 types (and CHAR and ASCII already behaved correctly with respect to utf8 data) (#510)
Expand Down
6 changes: 3 additions & 3 deletions R/Metadata.R
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# MIT License
#
# Copyright (c) 2017-2021 TileDB Inc.
# Copyright (c) 2017-2023 TileDB Inc.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -77,10 +77,10 @@ tiledb_put_metadata <- function(arr, key, val) {
}


##' Return a TileDB Array Metadata object given by key
##' Return all TileDB Array Metadata objects as a named list
##'
##' @param arr A TileDB Array object, or a character URI describing one
##' @return A object stored in the Metadata under the given key
##' @return A named list with all Metadata objects indexed by the given key
##' @export
tiledb_get_all_metadata <- function(arr) {
stopifnot(`Argument 'arr' must be a (dense or sparse) TileDB array` = .isArray(arr),
Expand Down
12 changes: 10 additions & 2 deletions R/RcppExports.R
Original file line number Diff line number Diff line change
Expand Up @@ -568,8 +568,16 @@ libtiledb_query_buffer_var_char_alloc_direct <- function(szoffsets, szdata, null
.Call(`_tiledb_libtiledb_query_buffer_var_char_alloc_direct`, szoffsets, szdata, nullable, cols)
}

libtiledb_query_buffer_var_char_create <- function(vec, nullable) {
.Call(`_tiledb_libtiledb_query_buffer_var_char_create`, vec, nullable)
libtiledb_query_buffer_var_char_get_legacy_validity_value <- function(ctx, validity_override = FALSE) {
.Call(`_tiledb_libtiledb_query_buffer_var_char_get_legacy_validity_value`, ctx, validity_override)
}

libtiledb_query_buffer_var_char_legacy_validity_mode <- function(ctx, buf, validity_override = FALSE) {
.Call(`_tiledb_libtiledb_query_buffer_var_char_legacy_validity_mode`, ctx, buf, validity_override)
}

libtiledb_query_buffer_var_char_create <- function(vec, nullable, legacy_validity = FALSE) {
.Call(`_tiledb_libtiledb_query_buffer_var_char_create`, vec, nullable, legacy_validity)
}

libtiledb_query_set_buffer_var_char <- function(query, attr, bufptr) {
Expand Down
6 changes: 4 additions & 2 deletions R/TileDBArray.R
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# MIT License
#
# Copyright (c) 2017-2022 TileDB Inc.
# Copyright (c) 2017-2023 TileDB Inc.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -836,6 +836,7 @@ setMethod("[", "tiledb_array",
if (type %in% c("CHAR", "ASCII", "UTF8")) {
spdl::debug("[getBuffer] '{}' allocating 'char' {} rows given budget of {}", name, resrv, memory_budget)
buf <- libtiledb_query_buffer_var_char_alloc_direct(resrv, memory_budget, nullable)
buf <- libtiledb_query_buffer_var_char_legacy_validity_mode(ctx@ptr, buf)
qryptr <- libtiledb_query_set_buffer_var_char(qryptr, name, buf)
buf
} else {
Expand Down Expand Up @@ -1242,14 +1243,15 @@ setMethod("[<-", "tiledb_array",
else { if (sparse) "UNORDERED" else "COL_MAJOR" })

buflist <- vector(mode="list", length=nc)
legacy_validity <- libtiledb_query_buffer_var_char_get_legacy_validity_value(ctx@ptr)

for (colnam in allnames) {
## when an index column is use this may be unordered to remap to position in 'nm' names
k <- match(colnam, nm)
if (alltypes[k] %in% c("CHAR", "ASCII", "UTF8")) { # variable length
txtvec <- as.character(value[[k]])
spdl::debug("[tiledb_array] '[<-' alloc char buffer {} '{}': {}", k, colnam, alltypes[k])
buflist[[k]] <- libtiledb_query_buffer_var_char_create(txtvec, allnullable[k])
buflist[[k]] <- libtiledb_query_buffer_var_char_create(txtvec, allnullable[k], legacy_validity)
qryptr <- libtiledb_query_set_buffer_var_char(qryptr, colnam, buflist[[k]])
} else {
col <- value[[k]]
Expand Down
76 changes: 75 additions & 1 deletion R/Utils.R
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# MIT License
#
# Copyright (c) 2017-2022 TileDB Inc.
# Copyright (c) 2017-2023 TileDB Inc.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -223,3 +223,77 @@ r_to_tiledb_type <- function(x) {
.assertArray <- function(arr) {
stopifnot(is(arr, "tiledb_sparse") || is(arr, "tiledb_dense") || is(arr, "tiledb_array"))
}

## conversion helper from (and to) legacy validity map for nullable strings
.legacy_validity <- function(inuri,
outdir = NULL,
fromlegacy = TRUE,
tolegacy = FALSE,
usetmp = FALSE,
verbose = FALSE,
debug = FALSE) {

stopifnot("'inuri' must be an existing directory" = dir.exists(inuri))

if (verbose)
cat("Running with tiledb R package version", format(packageVersion("tiledb")),
"and TileDB Core version", format(tiledb_version(TRUE)), "\n")

array <- basename(inuri)
if (debug) print(summary(tiledb_array(inuri, strings_as_factors=TRUE)[]))

newdir <- ""
if (isTRUE(usetmp)) newdir <- tempfile()
if (!is.null(outdir)) newdir <- outdir
if (newdir == "")
stop("If '--usetmp' is not given then '--out OUT' must be given.", call. = FALSE)

if (!dir.exists(newdir)) dir.create(newdir)
#res <- file.copy(inuri, newdir, recursive=TRUE)
newuri <- file.path(newdir, array)

arr <- tiledb_array(inuri)
attrlst <- attrs(schema(arr))
is_nullable_string <- function(x) datatype(x) %in% c("ASCII", "CHAR", "UTF8") &&
tiledb_attribute_get_nullable(x)
stringcols <- Filter(is_nullable_string, attrlst)
if (length(stringcols) == 0) {
stop("No string columns in array so nothing to do. Exiting.\n", call. = FALSE)
}
dimnames <- sapply(dimensions(domain(schema(arr))), name)

oldcfg <- cfg <- tiledb_config()
cfg["r.legacy_validity_mode"] <- if (fromlegacy) "true" else "false"
ctx <- tiledb_ctx(cfg)
dat <- tiledb_array(inuri, return_as="data.frame", strings_as_factors=TRUE)[]
if (debug) print(summary(dat))

arr <- tiledb_array(inuri)
arr <- tiledb_array_open(arr, "READ")
nmd <- tiledb_num_metadata(arr)
if (nmd > 0) metadatalist <- tiledb_get_all_metadata(arr)
if (debug) print(metadatalist)

cfg["r.legacy_validity_mode"] <- if (tolegacy) "true" else "false"
ctx <- tiledb_ctx(cfg)
fromDataFrame(dat, newuri, col_index=dimnames)

if (nmd > 0) {
arr <- tiledb_array(newuri)
arr <- tiledb_array_open(arr, "WRITE")
for (nm in names(metadatalist)) {
invisible(tiledb_put_metadata(arr, nm, metadatalist[[nm]]))
if (debug) print(metadatalist[[nm]])
}
invisible(tiledb_array_close(arr))
}

chk <- tiledb_array(newuri, strings_as_factors=TRUE)[]
if (debug) {
cat("Written back.\n")
print(summary(chk))
}
if (verbose) cat("Done.\n")
ctx <- tiledb_ctx(oldcfg) # reset
invisible()
}
1 change: 1 addition & 0 deletions inst/include/tiledb.h
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ struct var_length_char_buffer {
int32_t rows, cols; // dimension from subarray
bool nullable; // flag
std::vector<uint8_t> validity_map; // for nullable vectors
bool legacy_validity; // for legacy validity mode
};
typedef struct var_length_char_buffer vlc_buf_t;

Expand Down
Binary file added inst/sampledata/legacy_validity.tar.gz
Binary file not shown.
Binary file added inst/sampledata/legacy_write.tar.gz
Binary file not shown.
29 changes: 29 additions & 0 deletions inst/scripts/legacy_validity_convert.r
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
#!/usr/bin/env Rscript

## load docopt and tiledb
suppressMessages({
library(docopt) # for command-line argument parsing and help generation
library(tiledb)
})

## configuration for docopt
doc <- "Usage: convert.t [-h] [-v] [-d] [-t] [-o OUTDIR | -u] INDIR
-o --out OUTDIR write converted array into OUTDIR, if not given conversion in place [default: ]
-u --usetmp do not require output directory and use a temporary directory [default: FALSE]
-t --tolegacy convert to (instead of from) legacy validity mode
-v --verbose show extra output while processing
-d --debug show extra debug information
-h --help show this help tex
"

opt <- docopt(doc) # docopt parsing
#if (opt$debug) print(opt)

tiledb:::.legacy_validity(inuri=opt$INDIR,
outdir=opt$out,
fromlegacy=!opt$tolegacy,
tolegacy=opt$tolegacy,
usetmp=opt$usetmp,
verbose=opt$verbose,
debug=opt$debug)
7 changes: 4 additions & 3 deletions inst/tinytest/test_query.R
Original file line number Diff line number Diff line change
Expand Up @@ -263,12 +263,13 @@ uri <- tempfile()
pp <- palmerpenguins::penguins
fromDataFrame(pp, uri, sparse = TRUE, col_index = c("species", "year"))

qc <- parse_query_condition(body_mass_g > 4000 && sex == "male")
qc <- parse_query_condition(body_mass_g > 4000 || island == "Biscoe" || sex == "male")
arr <- tiledb_array(uri)
qry <- tiledb_query(arr, "DELETE")
qry <- tiledb_query_set_condition(qry, qc)
tiledb_query_submit(qry)
tiledb_query_finalize(qry)

oo <- tiledb_array(uri, return_as="data.frame")[]
expect_equal(nrow(oo), 177) # instead of 344 pre-deletion
oo <- tiledb_array(uri, return_as="data.frame", strings_as_factors=TRUE)[]

expect_equal(nrow(oo), 84) # instead of 344 pre-deletion
68 changes: 68 additions & 0 deletions inst/tinytest/test_tiledbarray.R
Original file line number Diff line number Diff line change
Expand Up @@ -1472,3 +1472,71 @@ oo <- penguins
expect_equal(sum(is.na(oo$sex)), sum(is.na(pp$sex)))
expect_equal(sum(oo$sex == "male"), sum(pp$sex == "male"))
expect_equal(sum(oo$sex == "female"), sum(pp$sex == "female"))


## [214] legacy validity mode
tdir <- tempfile()
tgzfile <- system.file("sampledata", "legacy_validity.tar.gz", package="tiledb")
untar(tarfile = tgzfile, exdir = tdir)
uri <- file.path(tdir, "legacy_validity")
cfg <- tiledb_config()
oldcfg <- cfg
cfg["r.legacy_validity_mode"] <- "true"
ctx <- tiledb_ctx(cfg)
arr <- tiledb_array(uri, strings_as_factors=FALSE, return_as="data.frame")[]
expect_equal(dim(arr)[1], 10)
expect_equal(dim(arr)[2], 3)
expect_equivalent(arr, data.frame(key=1:10,
val1=c(letters[1:4], NA, letters[6:7], NA, letters[9:10]),
val2=LETTERS[1:10]))
expect_equal(arr$val1, c(letters[1:4], NA, letters[6:7], NA, letters[9:10]))
ctx <- tiledb_ctx(oldcfg) # reset config

## [218] test conversion with metadata
outdir <- tempfile()
dir.create(outdir)
tiledb:::.legacy_validity(uri, outdir, fromlegacy=TRUE)
outuri <- file.path(outdir, "legacy_validity")
chk <- tiledb_array(outuri, return_as="data.frame")[]
expect_equal(dim(arr)[1], 10)
expect_equal(dim(arr)[2], 3)
expect_equivalent(arr, data.frame(key=1:10,
val1=c(letters[1:4], NA, letters[6:7], NA, letters[9:10]),
val2=LETTERS[1:10]))
expect_equal(arr$val1, c(letters[1:4], NA, letters[6:7], NA, letters[9:10]))
arr <- tiledb_array(outuri)
arr <- tiledb_array_open(arr, "READ")
expect_equal(tiledb_num_metadata(arr), 2) # two sets of meta data
mdlst <- tiledb_get_all_metadata(arr)
expect_equal(mdlst[["data"]], c(123L, 456L, 789L))
expect_equal(mdlst[["text"]], "the quick brown fox")


## [225] test conversion: larger penguins example
tdir <- tempfile()
tgzfile <- system.file("sampledata", "legacy_write.tar.gz", package="tiledb")
untar(tarfile = tgzfile, exdir = tdir)
inuri <- file.path(tdir, "legacy_write", "penguins")

outdir <- tempfile()
dir.create(outdir)
cfg["r.legacy_validity_mode"] <- "false" # reset to no conversion to read 'before'
ctx <- tiledb_ctx(cfg)
before <- tiledb_array(inuri, strings_as_factors=TRUE)[]
expect_equal(sum(is.na(before$sex)), 333)

tiledb:::.legacy_validity(inuri, outdir, fromlegacy=TRUE)
outuri <- file.path(outdir, "penguins")
after <- tiledb_array(outuri, strings_as_factors=TRUE)[]
expect_equal(sum(is.na(after$sex)), 11)
for (col in colnames(before)[-c(1,8)]) # exclude __tiledb_rows and sex
expect_equal(before[[col]], after[[col]])

newout <- tempfile()
dir.create(newout)
tiledb:::.legacy_validity(outuri, newout, tolegacy=TRUE)
rvturi <- file.path(newout, "penguins")
revert <- tiledb_array(rvturi, strings_as_factors=TRUE)[]
expect_equal(sum(is.na(revert$sex)), 333)
for (col in colnames(before)[-c(1,8)]) # exclude __tiledb_rows
expect_equal(before[[col]], revert[[col]])
6 changes: 3 additions & 3 deletions man/tiledb_get_all_metadata.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit e0fea76

Please sign in to comment.