Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dplyr data_frame breaks performance #253

Closed
wants to merge 1 commit into from
Closed

dplyr data_frame breaks performance #253

wants to merge 1 commit into from

Conversation

zmjones
Copy link
Contributor

@zmjones zmjones commented Mar 4, 2015

Code to reproduce the bug is below. I only tested it with cforest so I could be wrong that dplyr is the only cause. I think a better way to handle this would be to strip the attributes when making the task, as this could be introducing other bugs. Is assertDataFrame supposed to check for this sort of thing? I looked for it but couldn't find it.

library(mlr)
library(dplyr)

data(swiss)

task <- makeRegrTask("swiss", swiss, "Fertility")
learner <- makeLearner("regr.cforest", "swiss", ntree = 1000)
fit <- train(learner, task)
pred <- predict(fit, task)
## colnames(pred$data)[2] <- "truth"
performance(pred, measures = list(mse, mae))

task <- makeRegrTask("swiss", as_data_frame(swiss), "Fertility")
learner <- makeLearner("regr.cforest", "swiss", ntree = 1000)
fit <- train(learner, task)
pred <- predict(fit, task)
performance(pred, mse)
## Error in FUN(X[[1L]], ...) : 
##   You need to have a 'truth' column in your pred object for measure mse!
colnames(pred$data)[2] <- "truth"
performance(pred, mse)
##      mse 
## 87.87757 

@zmjones zmjones changed the title dplyr data_frame breaks performance dplyr data_frame breaks performance Mar 5, 2015
@berndbischl
Copy link
Member

assertDataframe is function from @mllg very nice checkmate package

https://github.com/mllg/checkmate

@berndbischl
Copy link
Member

The problem is actually sooner in mlr::predict

We always name the column "truth"

Which, as one can see in your example, does already fail,it is called "Fertility" in pred$data.

The reason is here in mlr::predict, where "truth" is not a vector, but a data.frame

  if (!all(is.na(t.col))) {
    if (length(t.col) > 1L && anyMissing(t.col))
      stop("Some but not all target columns found in data")
    truth = newdata[, t.col, drop = TRUE]
    newdata = newdata[, -t.col, drop = FALSE]
  } else {
    truth = NULL
  }

Actually ddplyr warns about this

Warning message:
drop ignored

Now we get into a "nice" argument: Do you, in the construction call actually pass a valid data.frame
here

makeRegrTask("swiss", as_data_frame(swiss), "Fertility")

My argument could be:
No, because the object you passed does not inherit the full interface of R::data.frame.
I guess: Hadley "downgraded" this in dplyr (by disabling "drop", what else?). Then still said "I am inheriting from the normal data.frame".
IMHO this is not really valid OO design?

@berndbischl
Copy link
Member

PS:
and then dplyr - as ALWAYS in R - WARNS about it only. Making "type safe" programming again not really possible. Which is also uncool.

@berndbischl
Copy link
Member

@zmjones
Is anywhere documented that "drop" is ignored by dplyr? And why that is?

@dickoa
Copy link

dickoa commented Mar 5, 2015

@berndbischl
You can always use the newdata[[t.col]] notation.
You will find more info about tbl_df behavior here: tidyverse/dplyr#587

@berndbischl
Copy link
Member

@dickoa : Many thanks for the hint.

I will look this up later. I guess I would need to set options(warn = 2L), then see if all our unit tests run with a dplyr tbl_df.
The problem is now, that dplyr forces the burden on us, possibly implying multiple code line changes, so that this works. But I guess the other indexing notation anyway the better indexing way in general....

@zmjones zmjones closed this Mar 31, 2015
@zmjones zmjones deleted the dplyr_regr_bug branch March 31, 2015 17:41
@jakob-r jakob-r assigned jakob-r and unassigned jakob-r Feb 17, 2016
larskotthoff added a commit that referenced this pull request Mar 8, 2016
Convert data if necessary (Fixes Issue #714, #253, #273)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants