Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't use index when selfref detects prior copy by another package #5084

Merged
merged 3 commits into from
Jul 28, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -149,7 +149,9 @@

26. `melt()` now outputs scalar logical `NA` instead of `NULL` in rows corresponding to missing list columns, for consistency with non-list columns when using `na.rm=TRUE`, [#5053](https://github.com/Rdatatable/data.table/pull/5053). Thanks to Toby Dylan Hocking for the PR.

27. `as.data.frame(DT)`, `setDF(DT)` and `as.list(DT)` now remove the `"index"` attribute which contains any indices (a.k.a. secondary keys), as they already did for other `data.table`-only attributes such as the primary key stored in the `"sorted"` attribute. When indices were left intact, a subsequent subset or reorder of the `data.frame` by `data.frame`-code in base R or other packages would not update the indices, causing incorrect results if then converted back to `data.table`, [#4889](https://github.com/Rdatatable/data.table/issues/4889) [#5042](https://github.com/Rdatatable/data.table/issues/5042). Thanks @OfekShilon for the report and the PR.
27. `as.data.frame(DT)`, `setDF(DT)` and `as.list(DT)` now remove the `"index"` attribute which contains any indices (a.k.a. secondary keys), as they already did for other `data.table`-only attributes such as the primary key stored in the `"sorted"` attribute. When indices were left intact, a subsequent subset, assign, or reorder of the `data.frame` by `data.frame`-code in base R or other packages would not update the indices, causing incorrect results if then converted back to `data.table`, [#4889](https://github.com/Rdatatable/data.table/issues/4889). Thanks @OfekShilon for the report and the PR.

28. `dplyr::arrange(DT)` uses `vctrs::vec_slice` which retains `data.table`'s class but uses C to bypass `[` method dispatch and does not adjust `data.table`'s attributes containing the index row numbers, [#5042](https://github.com/Rdatatable/data.table/issues/5042). `data.table`'s long-standing `.internal.selfref` mechanism to detect such operations by other packages was not being checked by `data.table` when using indexes, causing `data.table` filters and joins to use invalid indexes and return incorrect results after a `dplyr::arrange(DT)`. Thanks to @Waldi73 for reporting; @avimallu, @tlapak, @MichaelChirico, @jangorecki and @hadley for investigating and suggestions; and @mattdowle for the PR. The intended way to use `data.table` is `data.table::setkey(DT, col1, col2, ...)` which reorders `DT` by reference in parallel, sets the primary key for automatic use by subsequent `data.table` queries, and permits rowname-like usage such as `DT["foo",]` which returns the now-contiguous-in-memory block of rows where the first column of `DT`'s key contains `"foo"`. Multi-column-rownames (i.e. a primary key of more than one column) can be looked up using `DT[.("foo",20210728L), ]`. Using `==` in `i` is also optimized to use the key or indices, if you prefer using column names explicitly and `==`. An alternative to `setkey(DT)` is returning a new ordered result using `DT[order(col1, col2, ...), ]`.

## NOTES

Expand Down
2 changes: 1 addition & 1 deletion R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -2484,7 +2484,7 @@ copy = function(x) {
shallow = function(x, cols=NULL) {
if (!is.data.table(x))
stopf("x is not a data.table. Shallow copy is a copy of the vector of column pointers (only), so is only meaningful for data.table")
ans = .shallow(x, cols=cols, retain.key = TRUE)
ans = .shallow(x, cols=cols, retain.key=selfrefok(x)) # selfrefok for #5042
ans
}

Expand Down
14 changes: 11 additions & 3 deletions inst/tests/other.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ if (exists("test.data.table",.GlobalEnv,inherits=FALSE) ||
test = data.table:::test
INT = data.table:::INT

pkgs = c("ggplot2", "hexbin", "plyr", "caret", "xts", "gdata", "zoo", "nlme", "bit64", "knitr", "parallel")
pkgs = c("ggplot2", "hexbin", "plyr", "dplyr", "caret", "xts", "gdata", "zoo", "nlme", "bit64", "knitr", "parallel")
if (any(duplicated(pkgs))) stop("Packages defined to be loaded for integration tests in 'inst/tests/other.Rraw' contains duplicates.")

is.require = function(pkg) suppressWarnings(suppressMessages(isTRUE(require(pkg, character.only=TRUE, quietly=TRUE, warn.conflicts=FALSE))))
Expand Down Expand Up @@ -54,10 +54,18 @@ if (loaded[["ggplot2"]]) {
}

if (loaded[["plyr"]]) {
# Test key and indices are dropped when non-dt-aware packages (here, plyr) reorders rows of data.table.
# Test key and indices are dropped when non-dt-aware packages reorders rows using `[`
DT = data.table(a=1:10,b=1:2,key="a")
setindex(DT, b)
test(2, arrange(DT,b), data.table(a=INT(1,3,5,7,9,2,4,6,8,10),b=INT(1,1,1,1,1,2,2,2,2,2)))
test(2.1, plyr::arrange(DT,b), data.table(a=INT(1,3,5,7,9,2,4,6,8,10),b=INT(1,1,1,1,1,2,2,2,2,2)))
}

if (loaded[["dplyr"]]) {
# dplyr::arrange uses vctrs::vec_slice which is implemented in C and bypasses `[` dispatch; #5042
DT = data.table(A=c("b","c","a"), B=10:12)
setindex(DT, A)
DT2 = dplyr::arrange(DT, A)
test(2.2, DT2[A=="c"], data.table(A="c", B=11L))
}

if (FALSE) { # loaded[["reshape"]]
Expand Down