dplyr drop generates bug in predict() #273

zmjones · 2015-03-31T17:39:22Z

dplyr does not drop a one-column data.frame that also inherits from tbl_df to a vector, which causes a bug in predict().

See #253 and tidyverse/dplyr#587, as @dickoa mentioned. As @berndbischl noted, it is unclear whether this should be handled inside mlr or just checked for. The near-silent failure (only a warning) that the example below produces is not great, though.

library(dplyr)
test = data_frame(x = 0:5, y = 5:10)
str(test[, 1, drop = TRUE])
## Classes 'tbl_df' and 'data.frame':   6 obs. of  1 variable:
## $ x: int  0 1 2 3 4 5
## Warning message:
## drop ignored
str(test[[1]])
## [1] 0 1 2 3 4 5


berndbischl · 2015-06-09T04:42:34Z

I don't really care about this now. Reopen when it becomes important.

ghost · 2015-11-03T14:18:30Z

I just encountered this issue (an error from mlr that the truth column is not there) when tuning gbm on some data preprocessed with dplyr. It took me some time to figure out (mostly by chance) that the cause is dplyr. Calling as.data.frame(task_data) before creating the task solved the issue, but it would be nice if this were documented somewhere.

larskotthoff · 2015-11-03T17:15:00Z

@bkomboz Thanks for the report -- could you provide a complete example please? This may be something that we should check in mlr.

zmjones · 2015-11-03T18:12:21Z

I think this is the same issue that I documented above (and in #253). As I recall @berndbischl argued for not doing anything about it because a tbl_df is not (simply) a data.frame and all of the mlr infrastructure which uses the data element of the task (in particular performance, here) expects this element to behave like a data.frame. So the user should create tasks using a data.frame rather than something else.

berndbischl · 2015-11-03T18:17:30Z

Yes. But we should make life as simple as possible for users. We should really have an FAQ.
Also, dplyr is becoming so popular that we should either directly support it or add some check.

larskotthoff · 2015-11-03T18:22:11Z

It also sounds like the problem occurred somewhere during tuning, i.e. deep within mlr and that obscures the true cause of the problem.

berndbischl · 2015-11-03T18:23:37Z

As we don't really use and exploit dplyr in mlr (yet), shouldn't we simply call as.data.frame on the data at task creation if we are passed a dplyr object? Then everything should be fine.
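
A minimal sketch of that idea, assuming a hypothetical helper named coerceTaskData (not part of mlr's API, and not necessarily how the eventual fix was written): anything that inherits from data.frame but carries extra classes, such as a tbl_df, is converted to a plain data.frame, with a warning so the conversion is not silent. The tbl_df class vector is set by hand here so the example runs without dplyr installed.

```r
# Hypothetical helper for illustration only; not mlr's actual implementation.
coerceTaskData = function(data) {
  if (!is.data.frame(data))
    stop("'data' must be a data.frame or inherit from it")
  cl = class(data)
  if (!identical(cl, "data.frame")) {
    # Tell the user instead of converting silently.
    warning(sprintf("Converting data of class '%s' to a plain data.frame.", cl[1L]))
    data = as.data.frame(data)
  }
  data
}

# Stand-in for a dplyr tbl_df (class vector set by hand; an assumption for illustration).
tbl = structure(data.frame(x = 0:5, y = 5:10),
  class = c("tbl_df", "tbl", "data.frame"))

plain = suppressWarnings(coerceTaskData(tbl))
print(class(plain))
# With a plain data.frame, single-column indexing drops to a vector again.
print(is.vector(plain[, 1]))
```

Converting once at task creation would keep the rest of the code base free of class checks, at the cost of getTaskData returning a plain data.frame rather than the class the user passed in.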

larskotthoff · 2015-11-03T18:28:02Z

Sounds good to me.

ghost · 2015-11-04T08:32:37Z

Thanks for all the responses.

@larskotthoff: Here is a minimal example showing the problem:

library("mlr")
library("dplyr")

pid.df <- getTaskData(pid.task)
pid.tbl <- as.tbl(pid.df)
pid.task <- makeClassifTask(id = "pid.task", data = pid.tbl, target = "diabetes")

lrn <- makeLearner("classif.gbm", predict.type = "response", par.vals = list(distribution = "bernoulli"))
ps <- makeParamSet(
    makeDiscreteParam(id = "n.trees", values = c(100, 200)),
    makeDiscreteParam(id = "shrinkage", values = c(0.1, 0.01)))
ctrl <- makeTuneControlGrid()
rdesc.inner <- makeResampleDesc("Subsample", iters = 5)
rdesc.outer <- makeResampleDesc("CV", iters = 5)
lrn.tuned <- makeTuneWrapper(learner = lrn, resampling = rdesc.inner, par.set = ps, control = ctrl)

res <- resample(learner = lrn.tuned, task = pid.task, resampling = rdesc.outer)

I think @berndbischl's suggestion to call as.data.frame on the data at task creation solves the issue for now, especially if dplyr is not used internally.

Edit: It would be nice, though, if getTaskData returned a tbl when a tbl was used to create the task. For this, however, one would need to store this information somewhere with the task description.

larskotthoff · 2015-11-04T17:38:22Z

Ok, it looks to me like using as.data.frame is the best option here, together with documenting the conversion and warning about it.

I think there's a more general issue here on how to get mlr to play nice with dplyr, but this is a larger thing that will require some effort.

berndbischl · 2015-11-04T19:57:16Z

Yes, we can and really should use dplyr or data.table internally for speed reasons. But we cannot do this now; maybe this will be mlr 3.0.

berndbischl · 2015-11-04T19:58:06Z

I guess we open a new, clean issue for this now?

larskotthoff · 2015-11-04T19:58:56Z

Yep.

Convert data if necessary (Fixes Issue #714, #253, #273)

berndbischl closed this as completed Jun 9, 2015

berndbischl added the wontfix label Jun 9, 2015

berndbischl mentioned this issue Feb 12, 2016

training regr.randomForest fails for data frame tbl. (dplyr) #714

Closed

jakob-r added a commit that referenced this issue Feb 15, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

24cd84a

jakob-r added a commit that referenced this issue Feb 16, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

385d974

jakob-r added a commit that referenced this issue Feb 16, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

108f06d

jakob-r added a commit that referenced this issue Feb 16, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

1690b6b

jakob-r self-assigned this Feb 17, 2016

jakob-r added a commit that referenced this issue Feb 17, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

fe97c88

larskotthoff pushed a commit that referenced this issue Mar 8, 2016

Convert data if necessary (Fixes Issue #714, #253, #273)

1f3afa4

larskotthoff added a commit that referenced this issue Mar 8, 2016

Merge pull request #723 from mlr-org/fix_714_as_data_frame

4357354
