Secondary key (key2, set2key) is interfering with subsetting (e.g. data_table[A == "a", A]) and `table()` results #1734

m-dz · 2016-06-08T15:31:41Z

Description

First, I am not sure whether this is intended behaviour, but if it is, it is a bit surprising and not clearly explained.

While doing some data cleansing I ran into a strange situation in which base::table() returned a completely unexpected result when used on a data.table subsetted with i. After a (long) while I tracked the issue down to the secondary keys; please see code example 1 below.

It looks like removing missing values with data.table::na.omit() preserves the secondary key, whereas subsetting with !is.na() sets the secondary key to NULL and preserves only the primary one, and this difference interferes with the base::table() results. I tried to reproduce the same behaviour without the missing values by manually setting the primary and secondary keys (please see code example 2 below), but this time the base::table() results were as expected, so the problem is (probably) caused by something more hidden.

Code example 1 (with the error)

require(data.table)

sessionInfo()
# R version 3.2.5 (2016-04-14)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
# 
# locale:
# [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.9.6
# 
# loaded via a namespace (and not attached):
# [1] rsconnect_0.4.3 tools_3.2.5     chron_2.3-47   

options(datatable.verbose = TRUE)



### Create a data.table with some values in column C missing

set.seed(2016)
data_table <- data.table(A = letters[sample(5,10000,replace = TRUE)],
                   B = letters[sample(5,10000,replace = TRUE)],
                   C = ifelse(runif(10000) < 0.05, NA, "ignore"))

### Set keys (to reproduce the error)

setkey(data_table, B)
set2key(data_table, A)

### Remove missing values

data_table_naomit <- na.omit(data_table, cols = "C")
data_table_isna <- data_table[!is.na(C), ]
tracemem(data_table_naomit)
tracemem(data_table_isna)

all.equal(data_table_naomit, data_table_isna)

### Check keys, secondary keys (indexes) differ

key(data_table_naomit)   # "B"
key(data_table_isna)   # "B"
key2(data_table_naomit)  # "A"
key2(data_table_isna)  # NULL

### Get completely different selection and table outcomes

identical(data_table_naomit[A == "a", A], data_table_isna[A == "a", A])

table(data_table_naomit[A == "a", A])
# Using existing index 'A'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: A 
# 
#  a   b   c   d   e 
#99  87  93 107  79 

table(data_table_isna[A == "a", A])
# Using existing index 'A'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: A 
# 
#    a 
#1855

Code example 2 (everything as expected)

require(data.table)

sessionInfo()
# R version 3.2.5 (2016-04-14)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
# 
# locale:
# [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.9.6
# 
# loaded via a namespace (and not attached):
# [1] rsconnect_0.4.3 tools_3.2.5     chron_2.3-47   

options(datatable.verbose = TRUE)



### Create a data.table (no column C and no missing values this time)

set.seed(2016)
data_table <- data.table(A = letters[sample(5,10000,replace = TRUE)],
                         B = letters[sample(5,10000,replace = TRUE)])
data_table_2 <- copy(data_table)
tracemem(data_table)
tracemem(data_table_2)

### Set keys manually (attempt to reproduce the error)

setkey(data_table, B)
setkey(data_table_2, B)
set2key(data_table, A)
set2key(data_table_2, NULL)

### Check keys, secondary keys (indexes) differ

key(data_table)   # "B"
key(data_table_2)   # "B"
key2(data_table)  # "A"
key2(data_table_2)  # NULL

### Selection and table outcomes are the same this time

table(data_table[A == "a", A])
# Using existing index 'A'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: A 
# 
#    a 
#1971 
table(data_table_2[A == "a", A])
# Creating new index 'A'
# forder took 0 sec
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: A 
# 
#    a 
#1971

If there is anything else I can provide please let me know.


jangorecki · 2016-06-08T16:28:01Z

You should use a recent development version when reporting issues.
The first chunk of code does not produce an error; I believe you wanted stopifnot(identical(.)) here.
It appears data.table:::na.omit.data.table is not removing the index, but it should.
Thanks for the report.

Simplified code:

library(data.table)
set.seed(2016)
data_table <- data.table(A = letters[sample(5,10000,replace = TRUE)],
                         B = letters[sample(5,10000,replace = TRUE)],
                         C = ifelse(runif(10000) < 0.05, NA, "ignore"))
setindex(data_table, A) # v1.9.7
data_table_naomit <- na.omit(data_table, cols = "C")
data_table_naomit[A == "a", .N, A]
#   A  N
#1: a  8
#2: d 13
#3: c 11
#4: b  9
#5: e 15
data_table_naomit[(A == "a"), .N, A] # force vector scan
#   A    N
#1: a 1855
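
On an affected version, one possible workaround is to drop the index explicitly after na.omit(), so that the next subset cannot use the stale one. This is only a sketch: it assumes a data.table version that provides setindex() and indices() (v1.9.7 or later), and the names DT and DT_naomit are made up for illustration.

```r
library(data.table)

set.seed(2016)
DT <- data.table(A = letters[sample(5, 10000, replace = TRUE)],
                 C = ifelse(runif(10000) < 0.05, NA, "ignore"))
setindex(DT, A)

DT_naomit <- na.omit(DT, cols = "C")

# Drop any index carried over from DT (a harmless no-op if there is none)
setindex(DT_naomit, NULL)

# The optimised subset and the forced vector scan should now agree
n_auto <- DT_naomit[A == "a", .N]
n_scan <- DT_naomit[(A == "a"), .N]
stopifnot(n_auto == n_scan)
```

Wrapping the condition in parentheses, as in the vector scan above, avoids the index lookup altogether and is the quicker fix for a single query.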

jangorecki · 2016-06-08T16:56:34Z

Using the latest HEAD (it should be available in the devel repo in around 15 minutes, once CI finishes)

library(data.table)
set.seed(2016)
data_table <- data.table(A = letters[sample(5,10000,replace = TRUE)],
                         B = letters[sample(5,10000,replace = TRUE)],
                         C = ifelse(runif(10000) < 0.05, NA, "ignore"))
setindex(data_table, A) # v1.9.7
data_table_naomit <- na.omit(data_table, cols = "C")
data_table_naomit[A == "a", .N, A]
#   A    N
#1: a 1855
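
To confirm that an installed version includes the fix, a check along the following lines could be used. It is a sketch rather than the test added to the package: it compares na.omit() against the !is.na() subset from the original report, and the names by_naomit and by_subset are made up for illustration.

```r
library(data.table)

set.seed(2016)
DT <- data.table(A = letters[sample(5, 10000, replace = TRUE)],
                 C = ifelse(runif(10000) < 0.05, NA, "ignore"))
setindex(DT, A)

by_naomit <- na.omit(DT, cols = "C")  # path that used to keep a stale index
by_subset <- DT[!is.na(C)]            # path that behaved correctly

# Both ways of removing NAs must select the same rows afterwards
stopifnot(identical(by_naomit[A == "a", A], by_subset[A == "a", A]))

# The optimised subset must match a forced vector scan
stopifnot(by_naomit[A == "a", .N] == by_naomit[(A == "a"), .N])
```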

m-dz · 2016-06-10T09:10:03Z

Thank you @jangorecki for sorting this out and apologies for any inconvenience with my report and code examples.

May I ask you what HEAD stands for?

jangorecki · 2016-06-10T09:21:03Z

@m-dz no problem, just pointing out good practices. HEAD is the latest change in a git repository, on the master branch in this case. More in "What is HEAD in Git?"

m-dz · 2016-06-10T10:39:55Z

Thank you!

m-dz changed the title ~~Secondary key (key2, set2key) is interfering with subsetting (e.g. data_table[A == "a", A])~~ Secondary key (key2, set2key) is interfering with subsetting (e.g. data_table[A == "a", A]) and table() results Jun 8, 2016

jangorecki added the bug label Jun 8, 2016

jangorecki self-assigned this Jun 8, 2016

jangorecki closed this as completed in b79de43 Jun 8, 2016

mattdowle mentioned this issue Feb 6, 2018

Speedup of unique.data.table #2474

Merged
