Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dplyr drop generates bug in predict() #273

Closed
zmjones opened this issue Mar 31, 2015 · 13 comments
Closed

dplyr drop generates bug in predict() #273

zmjones opened this issue Mar 31, 2015 · 13 comments
Assignees

Comments

@zmjones
Copy link
Contributor

zmjones commented Mar 31, 2015

dplyr does not drop a one_column data.frame which also inherits from tbl_df to a vector, which generates a bug in predict.

see #253 and tidyverse/dplyr#587 as @dickoa mentioned. as @berndbischl noted, it is unclear if this should be handled inside mlr or if it should just be checked for. the near silent failure (only the warning) it generates with the below example is not great though.

library(dplyr)
test = data_frame(x = 0:5, y = 5:10)
str(test[, 1, drop = TRUE])
## Classes 'tbl_df' and 'data.frame':   6 obs. of  1 variable:
## $ x: int  0 1 2 3 4 5
## Warning message:
## drop ignored 
> str(test[[1]])
## [1] 0 1 2 3 4 5
@berndbischl
Copy link
Member

I dont really care about this now. Reopen when it becomes important.

@ghost
Copy link

ghost commented Nov 3, 2015

I just encountered this issue (error from mlr that truth column is not there) when tuning gbm on some data preprocessed with dplyr. It took me some time to figure out (mostly by chance) that this is an issue due to dplyr. Using as.data.frame(task_data) before creating a task solved the issue but it would be nice if this is somewhere documented.

@larskotthoff
Copy link
Member

@bkomboz Thanks for the report -- could you provide a complete example please? This may be something that we should check in mlr.

@zmjones
Copy link
Contributor Author

zmjones commented Nov 3, 2015

I think this is the same issue that I documented above (and in #253). As I recall @berndbischl argued for not doing anything about it because a tbl_df is not (simply) a data.frame and all of the mlr infrastructure which uses the data element of the task (in particular performance, here) expects this element to behave like a data.frame. So the user should create tasks using a data.frame rather than something else.

@berndbischl
Copy link
Member

Yes. But we should make life as simple as possible for users. We should really have an FAQ.
Also, dplyr is becoming so popular that we should either directly support it or add some check.

@larskotthoff
Copy link
Member

It also sounds like the problem occurred somewhere during tuning, i.e. deep within mlr and that obscures the true cause of the problem.

@berndbischl
Copy link
Member

As we dont really use and exploit dplyr in mlr (yet), shouldn't we simply "as.data.frame" the data on task creation, if we are passed a dplyr object? and everything is fine?

@larskotthoff
Copy link
Member

Sounds good to me.

@ghost
Copy link

ghost commented Nov 4, 2015

Thanks for all the responses.

@larskotthoff: Here is a minimal example showing the problem:

library("mlr")
library("dplyr")

pid.df <- getTaskData(pid.task)
pid.tbl <- as.tbl(pid.df)
pid.task <- makeClassifTask(id = "pid.task", data = pid.tbl, target = "diabetes")

lrn <- makeLearner("classif.gbm", predict.type = "response", par.vals = list(distribution = "bernoulli"))
ps <- makeParamSet(
    makeDiscreteParam(id = "n.trees", values = c(100, 200)),
    makeDiscreteParam(id = "shrinkage", values = c(0.1, 0.01)))
ctrl <- makeTuneControlGrid()
rdesc.inner <- makeResampleDesc("Subsample", iters = 5)
rdesc.outer <- makeResampleDesc("CV", iters = 5)
lrn.tuned <- makeTuneWrapper(learner = lrn, resampling = rdesc.inner, par.set = ps, control = ctrl)

res <- resample(learner = lrn.tuned, task = pid.task, resampling = rdesc.outer)

I think @berndbischl suggestion to do as.data.frame on the data by task creation solves the issue for now especially if dplyr is not used internally.

Edit: But it would be nice that if a tbl is used to create a task, getTaskData would also give a tbl back. For this, however, one would need to store this information somewhere with the task description.

@larskotthoff
Copy link
Member

Ok, it looks to me like using as.data.frame is the best option here, and document/warn.

I think there's a more general issue here on how to get mlr to play nice with dplyr, but this is a larger thing that will require some effort.

@berndbischl
Copy link
Member

yes, we can / should really use dplyr or datatable internally for speed reasons. but we cannot do this now, maybe this might be mlr 3.0

@berndbischl
Copy link
Member

i guess we open a new clean issue for this now?

@larskotthoff
Copy link
Member

Yep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants