dplyr data_frame breaks performance #253

zmjones · 2015-03-04T23:33:57Z

Code to reproduce the bug is below. I only tested it with cforest so I could be wrong that dplyr is the only cause. I think a better way to handle this would be to strip the attributes when making the task, as this could be introducing other bugs. Is assertDataFrame supposed to check for this sort of thing? I looked for it but couldn't find it.

library(mlr)
library(dplyr)

data(swiss)

task <- makeRegrTask("swiss", swiss, "Fertility")
learner <- makeLearner("regr.cforest", "swiss", ntree = 1000)
fit <- train(learner, task)
pred <- predict(fit, task)
## colnames(pred$data)[2] <- "truth"
performance(pred, measures = list(mse, mae))

task <- makeRegrTask("swiss", as_data_frame(swiss), "Fertility")
learner <- makeLearner("regr.cforest", "swiss", ntree = 1000)
fit <- train(learner, task)
pred <- predict(fit, task)
performance(pred, mse)
## Error in FUN(X[[1L]], ...) : 
##   You need to have a 'truth' column in your pred object for measure mse!
colnames(pred$data)[2] <- "truth"
performance(pred, mse)
##      mse 
## 87.87757

berndbischl · 2015-03-05T09:10:25Z

assertDataframe is function from @mllg very nice checkmate package

https://github.com/mllg/checkmate

berndbischl · 2015-03-05T09:48:19Z

The problem is actually sooner in mlr::predict

We always name the column "truth"

Which, as one can see in your example, does already fail,it is called "Fertility" in pred$data.

The reason is here in mlr::predict, where "truth" is not a vector, but a data.frame

  if (!all(is.na(t.col))) {
    if (length(t.col) > 1L && anyMissing(t.col))
      stop("Some but not all target columns found in data")
    truth = newdata[, t.col, drop = TRUE]
    newdata = newdata[, -t.col, drop = FALSE]
  } else {
    truth = NULL
  }

Actually ddplyr warns about this

Warning message:
drop ignored

Now we get into a "nice" argument: Do you, in the construction call actually pass a valid data.frame
here

makeRegrTask("swiss", as_data_frame(swiss), "Fertility")

My argument could be:
No, because the object you passed does not inherit the full interface of R::data.frame.
I guess: Hadley "downgraded" this in dplyr (by disabling "drop", what else?). Then still said "I am inheriting from the normal data.frame".
IMHO this is not really valid OO design?

berndbischl · 2015-03-05T09:49:51Z

PS:
and then dplyr - as ALWAYS in R - WARNS about it only. Making "type safe" programming again not really possible. Which is also uncool.

berndbischl · 2015-03-05T09:56:42Z

@zmjones
Is anywhere documented that "drop" is ignored by dplyr? And why that is?

dickoa · 2015-03-05T10:54:50Z

@berndbischl
You can always use the newdata[[t.col]] notation.
You will find more info about tbl_df behavior here: tidyverse/dplyr#587

berndbischl · 2015-03-05T11:00:01Z

@dickoa : Many thanks for the hint.

I will look this up later. I guess I would need to set options(warn = 2L), then see if all our unit tests run with a dplyr tbl_df.
The problem is now, that dplyr forces the burden on us, possibly implying multiple code line changes, so that this works. But I guess the other indexing notation anyway the better indexing way in general....

Convert data if necessary (Fixes Issue #714, #253, #273)

drops names from truth column

8bb37fa

zmjones changed the title ~~dplyr data_frame breaks performance~~ dplyr data_frame breaks performance Mar 5, 2015

zmjones closed this Mar 31, 2015

zmjones mentioned this pull request Mar 31, 2015

dplyr drop generates bug in predict() #273

Closed

zmjones deleted the dplyr_regr_bug branch March 31, 2015 17:41

berndbischl mentioned this pull request Feb 12, 2016

training regr.randomForest fails for data frame tbl. (dplyr) #714

Closed

jakob-r added a commit that referenced this pull request Feb 15, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

24cd84a

jakob-r added a commit that referenced this pull request Feb 16, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

385d974

jakob-r added a commit that referenced this pull request Feb 16, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

108f06d

jakob-r added a commit that referenced this pull request Feb 16, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

1690b6b

jakob-r assigned jakob-r and unassigned jakob-r Feb 17, 2016

jakob-r added a commit that referenced this pull request Feb 17, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

fe97c88

larskotthoff pushed a commit that referenced this pull request Mar 8, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

1f3afa4

larskotthoff added a commit that referenced this pull request Mar 8, 2016

Merge pull request #723 from mlr-org/fix_714_as_data_frame

4357354

Convert data if necessary (Fixes Issue #714, #253, #273)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dplyr data_frame breaks performance #253

dplyr data_frame breaks performance #253

zmjones commented Mar 4, 2015

berndbischl commented Mar 5, 2015

berndbischl commented Mar 5, 2015

berndbischl commented Mar 5, 2015

berndbischl commented Mar 5, 2015

dickoa commented Mar 5, 2015

berndbischl commented Mar 5, 2015

dplyr data_frame breaks performance #253

dplyr data_frame breaks performance #253

Conversation

zmjones commented Mar 4, 2015

berndbischl commented Mar 5, 2015

berndbischl commented Mar 5, 2015

berndbischl commented Mar 5, 2015

berndbischl commented Mar 5, 2015

dickoa commented Mar 5, 2015

berndbischl commented Mar 5, 2015