Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tbl_df[, i] shouldn't coerce to vector #587

Closed
hadley opened this issue Sep 8, 2014 · 17 comments
Closed

tbl_df[, i] shouldn't coerce to vector #587

hadley opened this issue Sep 8, 2014 · 17 comments
Assignees
Labels
feature a feature request or enhancement
Milestone

Comments

@hadley
Copy link
Member

hadley commented Sep 8, 2014

No description provided.

@kevinushey
Copy link
Contributor

Would it be sufficient to just delegate to [.data.frame w/ drop = FALSE and coerce back to tbl_df at the end? E.g.

`[.tbl_df` <- function(x, i, j, drop = FALSE) {
  as.tbl(`[.data.frame`(x, i, j, drop = drop))
}

@hadley
Copy link
Member Author

hadley commented Sep 9, 2014

Yes exactly. I think we might already has a method for grouped objects.

@hadley hadley added the feature a feature request or enhancement label Sep 11, 2014
@hadley hadley added this to the 0.3 milestone Sep 11, 2014
@hadley hadley self-assigned this Sep 11, 2014
@hadley
Copy link
Member Author

hadley commented Sep 12, 2014

Ugh, I give up for now. Here are some test cases

test_that("tbl_df doesn't drop by default", {
  df <- data_frame(x = 1:4)

  expect_equal(df[], df)
  expect_equal(df[1], df)
  expect_equal(df[, 1], df)
  expect_equal(df[1:2, ], df)
})

the obvious stuff like @kevinushey's idea doesn't work because of the special weirdness of [.

@kevinushey
Copy link
Contributor

Yuck. This is pretty much because of these being different:

df <- data.frame(x = 1:4)
`[.data.frame`(df, 1)
`[.data.frame`(df, 1, )

Is there any way to differentiate between those two cases in a sensible way?

@kevinushey
Copy link
Contributor

Here's something gross:

#' @export
`[.tbl_df` <- function(x, i, j, drop = FALSE) {
  ## Needed to distinguish between x[i] and x[i, ]
  num_args <- nargs()

  ## If we have 3 or more arguments, can use regular dispatch
  if (num_args >= 3L) {
    return(as.tbl(`[.data.frame`(x, i, j, drop)))
  }

  ## Otherwise, dispatch depending on missingness
  if (missing(j)) {
    if (missing(i)) {
      return(x)
    } else {
      return(as.tbl(`[.data.frame`(x, , i, drop)))
    }
  }
  as.tbl(`[.data.frame`(x, i, j, drop))
}

It passes these tests:

library(dplyr)
library(testthat)
tbl <- data_frame(x = 1:4)
df <- data.frame(x = 1:4)

check <- function(x, y) {
  expect_equal(as.data.frame(x), y)
}

check(tbl[], df[])
check(tbl[1], df[1])
check(tbl[, 1], df[, 1, drop = FALSE])
check(tbl[1:2, ], df[1:2, , drop = FALSE])

Part of me thinks we should just roll our own subsetting methods though, rather than delegating to [.data.frame.

@hadley
Copy link
Member Author

hadley commented Sep 12, 2014

Does your code work with an explicit drop?

Reimplementing subsetting for data frames from scratch would be a lot of work :(

@kevinushey
Copy link
Contributor

Iteration two (maybe I'll put this in a PR if this seems like the route we want to go):

maybe_tbl_df <- function(object) {
  if (is.data.frame(object)) {
    tbl_df(object)
  } else {
    object
  }
}

#' @export
`[.tbl_df` <- function(x, i, j, drop = FALSE) {

  num_args <- nargs() - !missing(drop)
  if (num_args >= 3L) {
    return(maybe_tbl_df(`[.data.frame`(x, i, j, drop)))
  }

  if (missing(j)) {
    if (missing(i)) {
      return(x)
    } else {
      return(maybe_tbl_df(`[.data.frame`(x, , i, drop)))
    }
  }
  maybe_tbl_df(`[.data.frame`(x, i, j, drop))
}

passes:

library(dplyr)
library(testthat)
tbl <- data_frame(x = 1:4)
df <- data.frame(x = 1:4)

check <- function(x, y) {
  expect_equal(as.data.frame(x), y)
}

check(tbl, df)

check(tbl[], df[])
check(tbl[1], df[1])
check(tbl[, 1], df[, 1, drop = FALSE])
check(tbl[1:2, ], df[1:2, , drop = FALSE])

check(tbl[drop = FALSE], df[drop = FALSE])
check(tbl[1, drop = FALSE], df[1, drop = FALSE])
check(tbl[, 1, drop = FALSE], df[, 1, drop = FALSE])
check(tbl[1:2, , drop = FALSE], df[1:2, , drop = FALSE])

check(tbl[drop = TRUE], df[drop = TRUE])
check(tbl[1, drop = TRUE], df[1, drop = TRUE])
expect_equal(tbl[, 1, drop = TRUE], df[, 1, drop = TRUE]) ## because we don't want as.df
expect_equal(tbl[1:2, , drop = TRUE], df[1:2, , drop = TRUE]) ## don't want as.df

@hadley
Copy link
Member Author

hadley commented Sep 12, 2014

One more test case: tbl[,] - it's horrifying how complicated this code has to be :/

@kevinushey
Copy link
Contributor

Works, but with the different default for drop, ie:

> tbl[,]
Source: local data frame [4 x 1]

  x
1 1
2 2
3 3
4 4
> df[,]
[1] 1 2 3 4
> df[, drop = FALSE]
  x
1 1
2 2
3 3
4 4
Warning message:
In `[.data.frame`(df, , drop = FALSE) : 'drop' argument will be ignored
> df[,, drop = FALSE]
  x
1 1
2 2
3 3
4 4

I was forced to laugh a bit at the warning...

@hadley
Copy link
Member Author

hadley commented Sep 12, 2014

The last one really should be df[,,drop = FALSE].

@hadley hadley closed this as completed in 6e78f57 Sep 22, 2014
@trinker
Copy link

trinker commented Oct 1, 2014

May I ask why this is a desirable behavior? If tbl_df is meant to be used in all the same ways data.frame is used then it may be confusing and cause errors in functions that rely on the behavior of df[, index] rather than df[[index]].

Here's an example that used to be fine with tbl_df because indexing a single column resulted in a vector.

FUN <- function(x, y) {
    if (is.data.frame(x)) stop("supply vector")
    cor(x, y)
}

x <- tbl_df(mtcars)
FUN(x[, "mpg"], x[, "hp"])

Now to get the same effect the user would have to use FUN(x[["mpg"]], x[["hp"]]).

It's likely people use a similar check of a formal throughout many R add on packages. This means dplyr''s tbl_df would cause an error with those functions.

paste comes to mind as a place this would cause issue:

x <- tbl_df(mtcars)
paste(x[, 1], collapse="|")
## vs.
paste(mtcars[, 1], collapse="|")

Similarly this means S3 classed functions that are serching for a vector of some sort, if you index as df[, index] then the incorrect method may be called:

x <- tbl_df(mtcars)

FUN2 <- function(x, ...) {
    UseMethod("FUN2")
}

FUN2.numeric <- function(x, ...) {
    message("A vector can't do `cor`")
}

FUN2.data.frame <- function(x, ...) {
    message(if(ncol(x) == 1) warning("You need 2 vectors for a correlation"))
    cor(x)
}

FUN2(x[, 1])
## You need 2 vectors for a correlation
##     mpg
## mpg   1
## Warning message:
## In message(if (ncol(x) == 1) warning("You need 2 vectors for a correlation")) :
##   You need 2 vectors for a correlation

FUN2(mtcars[, 1])
## A vector can't do `cor`

@gvfarns
Copy link

gvfarns commented Oct 10, 2014

This feature is very problematic. It breaks existing code of mine that depends on the drop=TRUE default of data frames. When I convert code over to dplyr, I unexpectedly get errors later in the code. In fact, [.tbl_df does not appear to have the option of drop=TRUE so I can't make the direct change to make things work as expected. As a result, I end up stripping "tbl_df" from the class of my dataframes/tbl_df's all over the place.

With respect, one of the biggest advantages dplyr has over other approaches, such as those of data.table, is that it doesn't overload basic operators like [ in such a way that existing R code breaks.

Unless there is a very critical reason to depart from the default behavior of data frames, I would respectfully encourage you to reconsider this decision.

@hadley
Copy link
Member Author

hadley commented Oct 10, 2014

@gvfarns regardless of whether or not we rollback this change you can always use [[

@kevinushey
Copy link
Contributor

If enough people are uncomfortable with this it could be worth rolling back.

Honestly, I think df[, i] should be discouraged for data.frames (it is of course okay if you're living in a data.table world). But in general:

Subset a data.frame by columns: df[c(...)],
Get a particular column: df[[i]]
Get rows: df[i, ]

and the df[, i] syntax should be mostly avoided.

@kevinushey
Copy link
Contributor

Put another way: dplyr shouldn't touch the [ function; it should discourage its use (and maybe provide a couple more verbs for row-wise or column-wise subsetting by index)

@Fablepongiste
Copy link

So is this definitive or a rollback still studied ?

will this code always give different results :

df <- data.frame(a = c(1:3), b = c(2:4))
df[,'a'] -> returns vector numeric
# 
tbl_df(df)[,'a']  -> returns a local data frame with one column a ?

And if yes, you advise to use df[['a']] and tbl_df(df)[['a']] for consistent behaviour between both ?

@hadley
Copy link
Member Author

hadley commented Feb 25, 2015

I'm pretty sure this behaviour is going to stay. Use [ to return a data frame, use [[ to return a column.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

5 participants