Incorrect setkey result #3766

kirillmayantsev · 2019-08-13T22:55:26Z

Hello!

Minimal reproducible example:

library(data.table)
x = data.table(a = 1, b = c(1, 4, 2, 3), c = c(4, 3, 2, 1))
y = x[a == 1, .(d = paste(b, c, sep = "_"), b, b2 = b, c)]
y
setkey(y, d)
y

> y
     d b b2 c
1: 1_4 1  1 4
2: 4_3 4  4 3
3: 2_2 2  2 2
4: 3_1 3  3 1
> setkey(y, d)
> y
     d b b2 c
1: 1_4 1  1 4
2: 2_2 3  3 2
3: 3_1 4  4 1
4: 4_3 2  2 3

After setting a key the order of "b" and "b2" columns is incorrect.
There are 3 "workarounds" which help to keep "y" data.table consistent:

Remove filtration ("a == 1")
Remove column duplication ("b2 = b")
Assign to "b2" a copy of "b" (as in example "b2" has the same reference as "b" I guess)

But still the described behavior seems to be a bug.
This issue could be related to #3496

Output of sessionInfo():

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)

Matrix products: default

locale:
[1] LC_COLLATE=Russian_Russia.1251  LC_CTYPE=Russian_Russia.1251    LC_MONETARY=Russian_Russia.1251
[4] LC_NUMERIC=C                    LC_TIME=Russian_Russia.1251    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.3

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1    yaml_2.2.0

The text was updated successfully, but these errors were encountered:

MichaelChirico · 2019-08-14T03:23:59Z

I think the problem is b and b2 are the same memory:

x = data.table(a = 1, b = c(1, 4, 2, 3), c = c(4, 3, 2, 1))
y = x[a == 1, .(d = paste(b, c, sep = "_"), b, b2 = b, c)]

identical(address(y$b), address(y$b))
# [1] TRUE

So when sorting happens, it sorts b twice.

So this works:

y = x[a == 1, .(d = paste(b, c, sep = "_"), b, b2 = copy(b), c)]
setkey(y, d)

I'm not sure if this is expected behavior, or a bug in [, or a bug in setkey.

MichaelChirico · 2019-08-14T03:31:49Z

Therefore I think it's actually the same root issue as the one you referenced, #3496:

#3496 (comment)

kirillmayantsev · 2019-08-14T09:16:35Z

Yes, this makes sense. The only thing I can't understand is why the code without filtration works:

y = x[, .(d = paste(b, c, sep = "_"), b, b2 = b, c)]
setkey(y, d)

Probably the essence of the bug is in [.

MichaelChirico · 2019-08-15T14:41:07Z

OK I've found why the code without filtration works:

https://github.com/Rdatatable/data.table/blob/master/R/data.table.R#L1271-L1282

if (is.null(irows)) {
  if (!is.list(jval)) { # performance improvement when i-arg is S4, but not list, #1438, Thanks @DCEmilberg.
    jcpy = address(jval) %in% vapply_1c(SDenv$.SD, address) # %chin% errors when RHS is list()
    if (jcpy) jval = copy(jval)
  } else if (address(jval) == address(SDenv$.SD)) {
    jval = copy(jval)
  } else if ( length(jcpy <- which(vapply_1c(jval, address) %in% vapply_1c(SDenv, address))) ) {
    for (jidx in jcpy) jval[[jidx]] = copy(jval[[jidx]])
  } else if (is.call(jsub) && jsub[[1L]] == "get" && is.list(jval)) {
    jval = copy(jval) # fix for #1212
  }
}

If i is empty, some checks are performed to see if there needs to be a copy.

@mattdowle or @arunsrinivasan maybe you could help to understand why this check is not done the !is.null(i) case?

I see a few referenced issues and comments:

#1212
#921
#617

…g identical columns vestigial

shrektan · 2019-08-16T11:06:43Z

The column deep-copy / shallow-copy behavior does seem inconsistent... @MichaelChirico Do you think it's a bug or a desired?

library(data.table)
x <- data.table(col1 = 1:5, col2 = 1:5)

v <- x[col1 %in% c(1:5), .(A = col2, B = col2)]
identical(address(v$A), address(v$B))
#> [1] TRUE

v <- x[1, .(A = col2, B = col2)]
identical(address(v$A), address(v$B))
#> [1] TRUE

v <- x[TRUE, .(A = col2, B = col2)]
identical(address(v$A), address(v$B))
#> [1] FALSE

v <- x[, .(A = col2, B = col2)]
identical(address(v$A), address(v$B))
#> [1] FALSE

^{Created on 2019-08-16 by the reprex package (v0.2.1)}

MichaelChirico · 2019-08-16T11:10:57Z

I'm not sure; see #3768. Based on that I'm only surprised by the 3rd example. I guess i=TRUE is intercepted and turned to NULL somewhere else.

MichaelChirico added the bug label Aug 14, 2019

MichaelChirico mentioned this issue Aug 15, 2019

copy shared columns before setkey #3768

Merged

MichaelChirico pushed a commit that referenced this issue Aug 15, 2019

Closes #3496 and #3766 -- detect & block setkey from double-reorderin…

3b3239c

…g identical columns vestigial

mattdowle added this to the 1.12.4 milestone Aug 26, 2019

mattdowle closed this as completed in #3768 Aug 27, 2019

mattdowle mentioned this issue Sep 18, 2019

Assignment operator := changes the body of a wrapper function #3890

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect setkey result #3766

Incorrect setkey result #3766

kirillmayantsev commented Aug 13, 2019

MichaelChirico commented Aug 14, 2019 •

edited

Loading

MichaelChirico commented Aug 14, 2019

kirillmayantsev commented Aug 14, 2019

MichaelChirico commented Aug 15, 2019 •

edited

Loading

shrektan commented Aug 16, 2019

MichaelChirico commented Aug 16, 2019

Incorrect setkey result #3766

Incorrect setkey result #3766

Comments

kirillmayantsev commented Aug 13, 2019

MichaelChirico commented Aug 14, 2019 • edited Loading

MichaelChirico commented Aug 14, 2019

kirillmayantsev commented Aug 14, 2019

MichaelChirico commented Aug 15, 2019 • edited Loading

shrektan commented Aug 16, 2019

MichaelChirico commented Aug 16, 2019

MichaelChirico commented Aug 14, 2019 •

edited

Loading

MichaelChirico commented Aug 15, 2019 •

edited

Loading