tbl_df[, i] shouldn't coerce to vector #587

hadley · 2014-09-08T19:07:25Z

No description provided.

kevinushey · 2014-09-08T21:01:49Z

Would it be sufficient to just delegate to [.data.frame w/ drop = FALSE and coerce back to tbl_df at the end? E.g.

`[.tbl_df` <- function(x, i, j, drop = FALSE) {
  as.tbl(`[.data.frame`(x, i, j, drop = drop))
}

hadley · 2014-09-09T00:04:18Z

Yes exactly. I think we might already has a method for grouped objects.

hadley · 2014-09-12T17:05:03Z

Ugh, I give up for now. Here are some test cases

test_that("tbl_df doesn't drop by default", {
  df <- data_frame(x = 1:4)

  expect_equal(df[], df)
  expect_equal(df[1], df)
  expect_equal(df[, 1], df)
  expect_equal(df[1:2, ], df)
})

the obvious stuff like @kevinushey's idea doesn't work because of the special weirdness of [.

kevinushey · 2014-09-12T18:00:38Z

Yuck. This is pretty much because of these being different:

df <- data.frame(x = 1:4)
`[.data.frame`(df, 1)
`[.data.frame`(df, 1, )

Is there any way to differentiate between those two cases in a sensible way?

kevinushey · 2014-09-12T18:22:44Z

Here's something gross:

#' @export
`[.tbl_df` <- function(x, i, j, drop = FALSE) {
  ## Needed to distinguish between x[i] and x[i, ]
  num_args <- nargs()

  ## If we have 3 or more arguments, can use regular dispatch
  if (num_args >= 3L) {
    return(as.tbl(`[.data.frame`(x, i, j, drop)))
  }

  ## Otherwise, dispatch depending on missingness
  if (missing(j)) {
    if (missing(i)) {
      return(x)
    } else {
      return(as.tbl(`[.data.frame`(x, , i, drop)))
    }
  }
  as.tbl(`[.data.frame`(x, i, j, drop))
}

It passes these tests:

library(dplyr)
library(testthat)
tbl <- data_frame(x = 1:4)
df <- data.frame(x = 1:4)

check <- function(x, y) {
  expect_equal(as.data.frame(x), y)
}

check(tbl[], df[])
check(tbl[1], df[1])
check(tbl[, 1], df[, 1, drop = FALSE])
check(tbl[1:2, ], df[1:2, , drop = FALSE])

Part of me thinks we should just roll our own subsetting methods though, rather than delegating to [.data.frame.

hadley · 2014-09-12T19:16:05Z

Does your code work with an explicit drop?

Reimplementing subsetting for data frames from scratch would be a lot of work :(

kevinushey · 2014-09-12T20:52:07Z

Iteration two (maybe I'll put this in a PR if this seems like the route we want to go):

maybe_tbl_df <- function(object) {
  if (is.data.frame(object)) {
    tbl_df(object)
  } else {
    object
  }
}

#' @export
`[.tbl_df` <- function(x, i, j, drop = FALSE) {

  num_args <- nargs() - !missing(drop)
  if (num_args >= 3L) {
    return(maybe_tbl_df(`[.data.frame`(x, i, j, drop)))
  }

  if (missing(j)) {
    if (missing(i)) {
      return(x)
    } else {
      return(maybe_tbl_df(`[.data.frame`(x, , i, drop)))
    }
  }
  maybe_tbl_df(`[.data.frame`(x, i, j, drop))
}

passes:

library(dplyr)
library(testthat)
tbl <- data_frame(x = 1:4)
df <- data.frame(x = 1:4)

check <- function(x, y) {
  expect_equal(as.data.frame(x), y)
}

check(tbl, df)

check(tbl[], df[])
check(tbl[1], df[1])
check(tbl[, 1], df[, 1, drop = FALSE])
check(tbl[1:2, ], df[1:2, , drop = FALSE])

check(tbl[drop = FALSE], df[drop = FALSE])
check(tbl[1, drop = FALSE], df[1, drop = FALSE])
check(tbl[, 1, drop = FALSE], df[, 1, drop = FALSE])
check(tbl[1:2, , drop = FALSE], df[1:2, , drop = FALSE])

check(tbl[drop = TRUE], df[drop = TRUE])
check(tbl[1, drop = TRUE], df[1, drop = TRUE])
expect_equal(tbl[, 1, drop = TRUE], df[, 1, drop = TRUE]) ## because we don't want as.df
expect_equal(tbl[1:2, , drop = TRUE], df[1:2, , drop = TRUE]) ## don't want as.df

hadley · 2014-09-12T20:54:40Z

One more test case: tbl[,] - it's horrifying how complicated this code has to be :/

kevinushey · 2014-09-12T20:58:16Z

Works, but with the different default for drop, ie:

> tbl[,]
Source: local data frame [4 x 1]

  x
1 1
2 2
3 3
4 4
> df[,]
[1] 1 2 3 4
> df[, drop = FALSE]
  x
1 1
2 2
3 3
4 4
Warning message:
In `[.data.frame`(df, , drop = FALSE) : 'drop' argument will be ignored
> df[,, drop = FALSE]
  x
1 1
2 2
3 3
4 4

I was forced to laugh a bit at the warning...

hadley · 2014-09-12T21:00:08Z

The last one really should be df[,,drop = FALSE].

trinker · 2014-10-01T03:48:42Z

May I ask why this is a desirable behavior? If tbl_df is meant to be used in all the same ways data.frame is used then it may be confusing and cause errors in functions that rely on the behavior of df[, index] rather than df[[index]].

Here's an example that used to be fine with tbl_df because indexing a single column resulted in a vector.

FUN <- function(x, y) {
    if (is.data.frame(x)) stop("supply vector")
    cor(x, y)
}

x <- tbl_df(mtcars)
FUN(x[, "mpg"], x[, "hp"])

Now to get the same effect the user would have to use FUN(x[["mpg"]], x[["hp"]]).

It's likely people use a similar check of a formal throughout many R add on packages. This means dplyr''s tbl_df would cause an error with those functions.

paste comes to mind as a place this would cause issue:

x <- tbl_df(mtcars)
paste(x[, 1], collapse="|")
## vs.
paste(mtcars[, 1], collapse="|")

Similarly this means S3 classed functions that are serching for a vector of some sort, if you index as df[, index] then the incorrect method may be called:

x <- tbl_df(mtcars)

FUN2 <- function(x, ...) {
    UseMethod("FUN2")
}

FUN2.numeric <- function(x, ...) {
    message("A vector can't do `cor`")
}

FUN2.data.frame <- function(x, ...) {
    message(if(ncol(x) == 1) warning("You need 2 vectors for a correlation"))
    cor(x)
}

FUN2(x[, 1])
## You need 2 vectors for a correlation
##     mpg
## mpg   1
## Warning message:
## In message(if (ncol(x) == 1) warning("You need 2 vectors for a correlation")) :
##   You need 2 vectors for a correlation

FUN2(mtcars[, 1])
## A vector can't do `cor`

gvfarns · 2014-10-10T22:23:24Z

This feature is very problematic. It breaks existing code of mine that depends on the drop=TRUE default of data frames. When I convert code over to dplyr, I unexpectedly get errors later in the code. In fact, [.tbl_df does not appear to have the option of drop=TRUE so I can't make the direct change to make things work as expected. As a result, I end up stripping "tbl_df" from the class of my dataframes/tbl_df's all over the place.

With respect, one of the biggest advantages dplyr has over other approaches, such as those of data.table, is that it doesn't overload basic operators like [ in such a way that existing R code breaks.

Unless there is a very critical reason to depart from the default behavior of data frames, I would respectfully encourage you to reconsider this decision.

hadley · 2014-10-10T23:12:23Z

@gvfarns regardless of whether or not we rollback this change you can always use [[

kevinushey · 2014-10-13T19:43:36Z

If enough people are uncomfortable with this it could be worth rolling back.

Honestly, I think df[, i] should be discouraged for data.frames (it is of course okay if you're living in a data.table world). But in general:

Subset a data.frame by columns: df[c(...)],
Get a particular column: df[[i]]
Get rows: df[i, ]

and the df[, i] syntax should be mostly avoided.

kevinushey · 2014-10-13T19:47:36Z

Put another way: dplyr shouldn't touch the [ function; it should discourage its use (and maybe provide a couple more verbs for row-wise or column-wise subsetting by index)

Fablepongiste · 2015-02-25T11:04:08Z

So is this definitive or a rollback still studied ?

will this code always give different results :

df <- data.frame(a = c(1:3), b = c(2:4))
df[,'a'] -> returns vector numeric
# 
tbl_df(df)[,'a']  -> returns a local data frame with one column a ?

And if yes, you advise to use df[['a']] and tbl_df(df)[['a']] for consistent behaviour between both ?

hadley · 2015-02-25T13:25:22Z

I'm pretty sure this behaviour is going to stay. Use [ to return a data frame, use [[ to return a column.

Closes tidyverse#587. Closes tidyverse#610

hadley added the feature a feature request or enhancement label Sep 11, 2014

hadley added this to the 0.3 milestone Sep 11, 2014

hadley self-assigned this Sep 11, 2014

hadley closed this as completed in 6e78f57 Sep 22, 2014

dickoa mentioned this issue Mar 5, 2015

dplyr data_frame breaks performance mlr-org/mlr#253

Closed

zmjones mentioned this issue Mar 31, 2015

dplyr drop generates bug in predict() mlr-org/mlr#273

Closed

ijlyttle mentioned this issue Aug 25, 2015

advertize that tbl_df subsetting behavior is different from data.frame #1345

Closed

dannguyen mentioned this issue Sep 12, 2015

mutate_geocode() is incompatible with the behavior of tbl_df-type data frames dkahle/ggmap#68

Merged

krlmlr pushed a commit to krlmlr/dplyr that referenced this issue Mar 2, 2016

Better subsetting for tbls.

e647288

Closes tidyverse#587. Closes tidyverse#610

lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tbl_df[, i] shouldn't coerce to vector #587

tbl_df[, i] shouldn't coerce to vector #587

hadley commented Sep 8, 2014

kevinushey commented Sep 8, 2014

hadley commented Sep 9, 2014

hadley commented Sep 12, 2014

kevinushey commented Sep 12, 2014

kevinushey commented Sep 12, 2014

hadley commented Sep 12, 2014

kevinushey commented Sep 12, 2014

hadley commented Sep 12, 2014

kevinushey commented Sep 12, 2014

hadley commented Sep 12, 2014

trinker commented Oct 1, 2014

gvfarns commented Oct 10, 2014

hadley commented Oct 10, 2014

kevinushey commented Oct 13, 2014

kevinushey commented Oct 13, 2014

Fablepongiste commented Feb 25, 2015

hadley commented Feb 25, 2015

tbl_df[, i] shouldn't coerce to vector #587

tbl_df[, i] shouldn't coerce to vector #587

Comments

hadley commented Sep 8, 2014

kevinushey commented Sep 8, 2014

hadley commented Sep 9, 2014

hadley commented Sep 12, 2014

kevinushey commented Sep 12, 2014

kevinushey commented Sep 12, 2014

hadley commented Sep 12, 2014

kevinushey commented Sep 12, 2014

hadley commented Sep 12, 2014

kevinushey commented Sep 12, 2014

hadley commented Sep 12, 2014

trinker commented Oct 1, 2014

gvfarns commented Oct 10, 2014

hadley commented Oct 10, 2014

kevinushey commented Oct 13, 2014

kevinushey commented Oct 13, 2014

Fablepongiste commented Feb 25, 2015

hadley commented Feb 25, 2015