Secondary key (key2, set2key) is interfering with subsetting (e.g. data_table[A == "a", A]) and table() results #1734

Closed
m-dz opened this issue Jun 8, 2016 · 5 comments

m-dz commented Jun 8, 2016

Description

First, I am not sure whether this is intended behaviour; if it is, it is surprising and not clearly documented.

While doing some data cleansing I ran into a strange situation in which base::table() returned a completely unexpected result when used on a data.table subsetted with i. After some (long) time I tracked the issue down to the secondary keys; please see code example 1 below.

It looks like removing missing values with data.table::na.omit() preserves the secondary key, whereas subsetting with !is.na() sets the secondary key to NULL and preserves only the primary one, and this difference changes the result of the subset passed to base::table(). I tried to reproduce the same behaviour without the missing values (code example 2 below) by manually setting the primary and secondary keys, but this time the base::table() results were as expected, so the problem is (probably) caused by something more hidden.

Code example 1 (with the error)

require(data.table)

sessionInfo()
# R version 3.2.5 (2016-04-14)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
# 
# locale:
# [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.9.6
# 
# loaded via a namespace (and not attached):
# [1] rsconnect_0.4.3 tools_3.2.5     chron_2.3-47   

options(datatable.verbose = TRUE)



### Create a data.table with some values in column C missing

set.seed(2016)
data_table <- data.table(A = letters[sample(5,10000,replace = TRUE)],
                   B = letters[sample(5,10000,replace = TRUE)],
                   C = ifelse(runif(10000) < 0.05, NA, "ignore"))

### Set keys (to reproduce the error)

setkey(data_table, B)
set2key(data_table, A)

### Remove missing values

data_table_naomit <- na.omit(data_table, cols = "C")
data_table_isna <- data_table[!is.na(C), ]
tracemem(data_table_naomit)
tracemem(data_table_isna)

all.equal(data_table_naomit, data_table_isna)

### Check keys, secondary keys (indexes) differ

key(data_table_naomit)   # "B"
key(data_table_isna)   # "B"
key2(data_table_naomit)  # "A"
key2(data_table_isna)  # NULL

### Get completely different selection and table outcomes

identical(data_table_naomit[A == "a", A], data_table_isna[A == "a", A])

table(data_table_naomit[A == "a", A])
# Using existing index 'A'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: A 
# 
#  a   b   c   d   e 
#99  87  93 107  79 

table(data_table_isna[A == "a", A])
# Using existing index 'A'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: A 
# 
#    a 
#1855

Code example 2 (everything as expected)

require(data.table)

sessionInfo()
# R version 3.2.5 (2016-04-14)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
# 
# locale:
# [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.9.6
# 
# loaded via a namespace (and not attached):
# [1] rsconnect_0.4.3 tools_3.2.5     chron_2.3-47   

options(datatable.verbose = TRUE)



### Create a data.table (this time without column C, so no missing values)

set.seed(2016)
data_table <- data.table(A = letters[sample(5,10000,replace = TRUE)],
                         B = letters[sample(5,10000,replace = TRUE)])
data_table_2 <- copy(data_table)
tracemem(data_table)
tracemem(data_table_2)

### Set keys (to reproduce the error)

setkey(data_table, B)
setkey(data_table_2, B)
set2key(data_table, A)
set2key(data_table_2, NULL)

### Check keys and compare selection and table outcomes (identical this time)

key(data_table)   # "B"
key(data_table_2)   # "B"
key2(data_table)  # "A"
key2(data_table_2)  # NULL

table(data_table[A == "a", A])
# Using existing index 'A'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: A 
# 
#    a 
#1971 
table(data_table_2[A == "a", A])
# Creating new index 'A'
# forder took 0 sec
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: A 
# 
#    a 
#1971 

If there is anything else I can provide please let me know.

m-dz changed the title from "Secondary key (key2, set2key) is interfering with subsetting (e.g. data_table[A == "a", A])" to "Secondary key (key2, set2key) is interfering with subsetting (e.g. data_table[A == "a", A]) and table() results" on Jun 8, 2016
jangorecki (Member) commented Jun 8, 2016

  1. You should use a recent development version when reporting issues
  2. The first chunk of code does not raise an error; I believe you wanted stopifnot(identical(.)) here
  3. It appears data.table:::na.omit.data.table is not removing the index, but it should
  4. Thanks for the report

Simplified code:

library(data.table)
set.seed(2016)
data_table <- data.table(A = letters[sample(5,10000,replace = TRUE)],
                         B = letters[sample(5,10000,replace = TRUE)],
                         C = ifelse(runif(10000) < 0.05, NA, "ignore"))
setindex(data_table, A) # v1.9.7
data_table_naomit <- na.omit(data_table, cols = "C")
data_table_naomit[A == "a", .N, A]
#   A  N
#1: a  8
#2: d 13
#3: c 11
#4: b  9
#5: e 15
data_table_naomit[(A == "a"), .N, A] # force vector scan
#   A    N
#1: a 1855

jangorecki added the bug label Jun 8, 2016
jangorecki self-assigned this Jun 8, 2016

jangorecki (Member) commented Jun 8, 2016

Using the latest HEAD (it should be available in the devel repo in around 15 minutes, once CI finishes):

library(data.table)
set.seed(2016)
data_table <- data.table(A = letters[sample(5,10000,replace = TRUE)],
                         B = letters[sample(5,10000,replace = TRUE)],
                         C = ifelse(runif(10000) < 0.05, NA, "ignore"))
setindex(data_table, A) # v1.9.7
data_table_naomit <- na.omit(data_table, cols = "C")
data_table_naomit[A == "a", .N, A]
#   A    N
#1: a 1855

m-dz (Author) commented Jun 10, 2016

Thank you @jangorecki for sorting this out and apologies for any inconvenience with my report and code examples.

May I ask you what HEAD stands for?

jangorecki (Member) commented Jun 10, 2016

@m-dz no problem, just pointing out good practices. HEAD is the latest change in a git repository, on the master branch in this case. More in "What is HEAD in Git?"

m-dz (Author) commented Jun 10, 2016

Thank you!
