Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect setkey result #3766

Closed
kirillmayantsev opened this issue Aug 13, 2019 · 6 comments · Fixed by #3768
Closed

Incorrect setkey result #3766

kirillmayantsev opened this issue Aug 13, 2019 · 6 comments · Fixed by #3768
Labels
Milestone

Comments

@kirillmayantsev
Copy link

Hello!

Minimal reproducible example:

library(data.table)
x = data.table(a = 1, b = c(1, 4, 2, 3), c = c(4, 3, 2, 1))
y = x[a == 1, .(d = paste(b, c, sep = "_"), b, b2 = b, c)]
y
setkey(y, d)
y

> y
     d b b2 c
1: 1_4 1  1 4
2: 4_3 4  4 3
3: 2_2 2  2 2
4: 3_1 3  3 1
> setkey(y, d)
> y
     d b b2 c
1: 1_4 1  1 4
2: 2_2 3  3 2
3: 3_1 4  4 1
4: 4_3 2  2 3

After setting a key the order of "b" and "b2" columns is incorrect.
There are 3 "workarounds" which help to keep "y" data.table consistent:

  1. Remove filtration ("a == 1")
  2. Remove column duplication ("b2 = b")
  3. Assign to "b2" a copy of "b" (as in example "b2" has the same reference as "b" I guess)

But still the described behavior seems to be a bug.
This issue could be related to #3496

Output of sessionInfo():

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)

Matrix products: default

locale:
[1] LC_COLLATE=Russian_Russia.1251  LC_CTYPE=Russian_Russia.1251    LC_MONETARY=Russian_Russia.1251
[4] LC_NUMERIC=C                    LC_TIME=Russian_Russia.1251    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.3

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1    yaml_2.2.0    
@MichaelChirico
Copy link
Member

MichaelChirico commented Aug 14, 2019

I think the problem is b and b2 are the same memory:

x = data.table(a = 1, b = c(1, 4, 2, 3), c = c(4, 3, 2, 1))
y = x[a == 1, .(d = paste(b, c, sep = "_"), b, b2 = b, c)]

identical(address(y$b), address(y$b))
# [1] TRUE

So when sorting happens, it sorts b twice.

So this works:

y = x[a == 1, .(d = paste(b, c, sep = "_"), b, b2 = copy(b), c)]
setkey(y, d)

I'm not sure if this is expected behavior, or a bug in [, or a bug in setkey.

@MichaelChirico
Copy link
Member

Therefore I think it's actually the same root issue as the one you referenced, #3496:

#3496 (comment)

@kirillmayantsev
Copy link
Author

Yes, this makes sense. The only thing I can't understand is why the code without filtration works:

y = x[, .(d = paste(b, c, sep = "_"), b, b2 = b, c)]
setkey(y, d)

Probably the essence of the bug is in [.

@MichaelChirico
Copy link
Member

MichaelChirico commented Aug 15, 2019

OK I've found why the code without filtration works:

https://github.com/Rdatatable/data.table/blob/master/R/data.table.R#L1271-L1282

if (is.null(irows)) {
  if (!is.list(jval)) { # performance improvement when i-arg is S4, but not list, #1438, Thanks @DCEmilberg.
    jcpy = address(jval) %in% vapply_1c(SDenv$.SD, address) # %chin% errors when RHS is list()
    if (jcpy) jval = copy(jval)
  } else if (address(jval) == address(SDenv$.SD)) {
    jval = copy(jval)
  } else if ( length(jcpy <- which(vapply_1c(jval, address) %in% vapply_1c(SDenv, address))) ) {
    for (jidx in jcpy) jval[[jidx]] = copy(jval[[jidx]])
  } else if (is.call(jsub) && jsub[[1L]] == "get" && is.list(jval)) {
    jval = copy(jval) # fix for #1212
  }
}

If i is empty, some checks are performed to see if there needs to be a copy.

@mattdowle or @arunsrinivasan maybe you could help to understand why this check is not done the !is.null(i) case?

I see a few referenced issues and comments:

#1212
#921
#617

@shrektan
Copy link
Member

The column deep-copy / shallow-copy behavior does seem inconsistent... @MichaelChirico Do you think it's a bug or a desired?

library(data.table)
x <- data.table(col1 = 1:5, col2 = 1:5)

v <- x[col1 %in% c(1:5), .(A = col2, B = col2)]
identical(address(v$A), address(v$B))
#> [1] TRUE

v <- x[1, .(A = col2, B = col2)]
identical(address(v$A), address(v$B))
#> [1] TRUE

v <- x[TRUE, .(A = col2, B = col2)]
identical(address(v$A), address(v$B))
#> [1] FALSE

v <- x[, .(A = col2, B = col2)]
identical(address(v$A), address(v$B))
#> [1] FALSE

Created on 2019-08-16 by the reprex package (v0.2.1)

@MichaelChirico
Copy link
Member

I'm not sure; see #3768. Based on that I'm only surprised by the 3rd example. I guess i=TRUE is intercepted and turned to NULL somewhere else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants