Closes #891. Subset handles duplicate cols consistently.
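
A minimal base-R sketch of why the removed line caused trouble, assuming that lookup by name resolves to the first matching column (as the README entry in this commit states). `match()` is only a stand-in for data.table's internal name lookup, and `nms`, `vars`, `via_names` and `via_positions` are illustrative names, not part of the patch.

```r
# Hypothetical stand-ins: `nms` plays the role of names(x),
# `vars` the evaluated `select` argument.
nms  <- c("V1", "V2", "V1")   # same duplicate names as in the new tests
vars <- c(3L, 2L)             # select = c(3L, 2L)

# Old path: positions -> names -> positions again. match() returns the
# first hit, so the duplicated "V1" resolves to column 1, not column 3.
via_names <- match(nms[vars], nms)

# New path: keep the positions exactly as given.
via_positions <- vars

print(via_names)       # 1 2
print(via_positions)   # 3 2
```

Under that assumption, the round trip through names silently swaps column 3 for column 1, which is why the patch leaves numeric `vars` alone and uses `names(x)[vars]` only to filter the key columns.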

Rdatatable · Oct 15, 2014 · 8701d5a
1 parent 54071d6

Showing 3 changed files with 10 additions and 2 deletions.
diff --git a/R/data.table.R b/R/data.table.R
@@ -1944,8 +1944,8 @@ subset.data.table <- function (x, subset, select, ...)
         nl <- as.list(seq_len(ncol(x)))
         setattr(nl,"names",names(x))
         vars <- eval(substitute(select), nl, parent.frame())  # e.g.  select=colF:colP
-        if (is.numeric(vars)) vars=names(x)[vars]
-        key.cols <- intersect(key.cols, vars) ## Only keep key.columns found in the select clause
+        # #891 fix - don't convert numeric vars to column names - will break when there are duplicate columns
+        key.cols <- intersect(key.cols, names(x)[vars]) ## Only keep key.columns found in the select clause
     }
 
     ans <- x[r, vars, with = FALSE]

diff --git a/README.md b/README.md
@@ -45,6 +45,8 @@
 
   13. `DT[, LHS := RHS]` with RHS is of the form `eval(parse(text = foo[1]))` referring to columns in `DT` is now handled properly. Closes [#880](https://github.com/Rdatatable/data.table/issues/880). Thanks to tyner.
 
+  14. `subset` handles extracting duplicate columns in consistency with data.table's rule - if a column name is duplicated, then accessing that column using column number should return that column, whereas accessing by column name (due to ambiguity) will always extract the first column. Closes [#891](https://github.com/Rdatatable/data.table/issues/891). Thanks to @jjzz.
+
 #### NOTES
 
   1. Clearer explanation of what `duplicated()` does (borrowed from base). Thanks to @matthieugomez for pointing out. Closes [#872](https://github.com/Rdatatable/data.table/issues/872).

diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
@@ -5397,6 +5397,12 @@ DT2 = data.table(start=tt[2], end=tt[2])
 setkey(DT2)
 test(1390.5, foverlaps(DT1, DT2, which=TRUE), data.table(xid=1:3, yid=as.integer(c(NA, 1, NA))))
 
+# Fix for #891. 'subset' and duplicate names.
+# duplicate column names rule - if column numbers, extract the right column. If names, extract always the first column
+DT = data.table(V1=1:5, V2=6:10, V3=11:15)
+setnames(DT, c("V1", "V2", "V1"))
+test(1391.1, subset(DT, select=c(3L,2L)), DT[, c(3L, 2L), with=FALSE])
+test(1391.2, subset(DT, select=c("V2", "V1")), DT[, c("V2", "V1"), with=FALSE])
 
 ##########################