Question: Condition Trigger to affect dependencies using cross #1041
Maybe the problem itself was already related to the target not existing beforehand, as in #616. If so, what would be the alternative to using triggers? I am devising a pipeline that will deal with data in different conditions, so for some of them I would like to run all parts of the basic plan, and for others I would prefer to skip some of those parts. Think of this as a general data treatment pipeline, where data in different conditions would be treated accordingly. The intent is to avoid creating a new plan for each new dataset.
I think the
library(drake)
plan <- drake_plan(
  x = target(w, transform = map(w = !!seq_len(1e2))),
  y = target(x, transform = map(x, .id = w)),
  z = target(y, transform = map(y, .id = w))
)
nrow(plan) # lots of targets
#> [1] 300
plot(plan) # dev version only
make(plan, targets = c("z_1L", "z_2L")) # only run some
#> target x_1L
#> target x_2L
#> target y_1L
#> target y_2L
#> target z_1L
#> target z_2L
Created on 2019-10-30 by the reprex package (v0.3.0)
The issue with triggers is they always require the target to exist before any kind of skipping.
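A minimal sketch of that limitation, assuming a throwaway cache in a temporary directory and a made-up two-target plan (the names high_hp and analysis are illustrative and are not taken from the original MRE). The condition trigger is set to FALSE with mode = "condition", yet the first make() is still expected to build the target because it does not exist in the cache yet; only later calls have something to skip.

```r
library(drake)

# Hypothetical plan: the target names are placeholders for illustration.
plan <- drake_plan(
  high_hp = mtcars[mtcars$hp > 150, ],
  analysis = target(
    summary(high_hp),
    # Ask drake to skip this target whenever the condition is FALSE.
    trigger = trigger(condition = FALSE, mode = "condition")
  )
)

# Use a temporary cache so the sketch does not touch a real project.
cache <- new_cache(tempfile())

# First run: "analysis" is missing from the cache, so it is built anyway.
make(plan, cache = cache)

# Second run: the target now exists, so the FALSE condition can skip it.
make(plan, cache = cache)
```

If the goal is to avoid building a target at all on a fresh cache, leaving it out of the targets argument of make() is the route discussed in this thread.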
Thanks Will! That solved it for me. Maybe, given #616, #685, and #935, you would like to think about a more flexible way to handle drake's cache and target updates going forward. For example, triggers and targets could be unified in an eventual refactoring, and other rules not attached to the name of the target could be created. That way, the problem described in #935 would be solved. Not that it is an easy task; I am just giving you food for thought. Tell me if I can help with anything.
Is it just the issue of renaming targets? What other kinds of problems are you thinking about exactly? There is a way to rename a target without necessarily incurring a rebuild under certain conditions. If you keep the RNG seed constant ( For complete freedom to rename targets with impunity,
plan <- drake_plan(
  data = get_data(file_in("https://example.com")),
  munge = munge_data(data),
  analysis = analyze(munge)
)
When it comes time to run
No - the names themselves are not important. The issue is with changing the plan's parameters or the function calls, which makes Drake rebuild the targets unnecessarily. I think it is a common way to program:
Each of those steps would make drake update all the targets and dependencies from that point onwards, even if the parameters evaluate to the same function specification. If the targets are slow to build, this creates a big overhead. Of course, it would be impossible to create a parser that understands each and every modification you make to a function call, but some simple cases, such as the one provided in #935, are not so complex to implement, I think. If we could do this at least with the commands of
An MRE to make what I am saying clearer:
library(xgboost)
library(drake)
plan_1 <-
  drake_plan(
    my_data = mtcars,
    model = target(
      xgboost(
        data = my_data,
        eta = 0.1,
        max_depth = 10,
        nround = nround,
        subsample = 0.5,
        colsample_bytree = 0.5,
        seed = 1,
        eval_metric = "auc",
        verbose = 0,
        nthread = 1,
        objective = "binary:logistic"),
      transform = map(nround = 25)
    )
  )
plan_2 <-
  drake_plan(
    my_data = mtcars,
    model = target(
      xgboost(
        data = my_data,
        eta = 0.1,
        max_depth = 10,
        nround = nround,
        subsample = subsample,
        colsample_bytree = 0.5,
        seed = 1,
        eval_metric = "auc",
        verbose = 0,
        nthread = 1,
        objective = "binary:logistic"),
      transform = cross(nround = !!c(25, 50, 75), subsample = !!c(0.3, 0.5))
    )
  )
If you couple this with the triggers feature, which cannot stop drake from building targets that have not been built before, and the targets argument, which forces you to select the items you want to run manually, there seems to be a certain overlap of features that may be hard to maintain and yet is not very flexible. I think this overall discussion enters the scope of #685. A more flexible way to evaluate the plan would decide what to build and create in real time, based on the specification of the targets built before. A sketch of the structure could be, instead of the current one, something that prioritizes, from high to low:
That would create some overhead, since drake would have to hash all functions in the global environment in real time. And of course, I still don't have a deep view of drake's internals. You are in a much better position to evaluate what is feasible and how much effort would be needed.
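On the point that the targets argument has to be filled in by hand: the selection can be computed from a set of flags instead. The sketch below is only an illustration in base R; select_targets and the enabled vector are hypothetical names, not part of drake, and the target names are borrowed from the z_1L, z_2L example earlier in the thread.

```r
# Hypothetical helper: keep only the target names whose suffix is enabled.
# `target_names` would normally come from plan$target; `enabled` is a named
# logical vector keyed by a suffix that appears in the target names.
select_targets <- function(target_names, enabled) {
  on <- names(enabled)[enabled]
  keep <- vapply(
    target_names,
    function(name) any(endsWith(name, on)),
    logical(1)
  )
  unname(target_names[keep])
}

target_names <- c("z_1L", "z_2L", "z_3L")
enabled <- c("_1L" = TRUE, "_2L" = TRUE, "_3L" = FALSE)

wanted <- select_targets(target_names, enabled)
print(wanted)
```

The result could then be passed along as make(plan, targets = wanted), in which case drake builds the requested targets together with whatever they depend on upstream, as in the output shown earlier.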
If you add new hyperparameter combos without changing the commands of old targets, you can use data recovery to rename old targets. Below,
library(drake)
mock_xgboost <- function(...) {
  NULL
}
plan_1 <- drake_plan(
  my_data = mtcars,
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = 0.5),
    transform = map(nround = 25)
  )
)
make(plan_1)
#> target my_data
#> target model_25
plan_2 <- drake_plan(
  my_data = mtcars,
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = subsample),
    transform = cross(nround = !!c(25, 50, 75), subsample = !!c(0.3, 0.5))
  )
)
config <- drake_config(plan_2)
vis_drake_graph(config)
plan_2$seed <- NA
plan_2$seed[plan_2$target == "model_25_0.5"] <- diagnose(model_25)$seed
config <- drake_config(plan_2)
recoverable(config)
#> [1] "model_25_0.5"
make(plan_2, recover = TRUE)
#> target model_75_0.5
#> target model_50_0.5
#> recover model_25_0.5
#> target model_75_0.3
#> target model_50_0.3
#> target model_25_0.3
Created on 2019-11-01 by the reprex package (v0.3.0)
In more complicated situations, the command might change. You might have to set hyperparameters you forgot about previously. That brings us to #705. If we were to analyze all the parameters of all the function calls in a command, the performance penalty in the general case would be too severe. The existing triggers are deliberately rigid to enforce reproducibility. I hesitate to add the kind of flexibility you describe because the whole purpose of
For the general case, here is an easier workaround: append new targets to the current plan with a custom
library(drake)
library(tidyverse)
mock_xgboost <- function(...) {
  NULL
}
plan <- drake_plan(
  my_data = mtcars,
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = 0.5),
    transform = map(nround = 25)
  )
)
make(plan)
#> target my_data
#> target model_25
# Custom grid of settings that avoids nround = 25 with subsample = 0.5
grid <- expand_grid(nround = c(25, 50, 75), subsample = c(0.3, 0.5)) %>%
  filter(!(nround == 25 & subsample == 0.5))
grid
#> # A tibble: 5 x 2
#> nround subsample
#> <dbl> <dbl>
#> 1 25 0.3
#> 2 50 0.3
#> 3 50 0.5
#> 4 75 0.3
#> 5 75 0.5
addendum <- drake_plan(
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = subsample),
    transform = map(.data = !!grid)
  )
)
plan <- bind_plans(plan, addendum)
config <- drake_config(plan)
outdated(config) # not model_25
#> [1] "model_25_0.3" "model_50_0.3" "model_50_0.5" "model_75_0.3"
#> [5] "model_75_0.5"
# model_25 is in the plan and up to date
drake_ggraph(config)
make(plan) # only 5 targets get built
#> target model_75_0.5
#> target model_75_0.3
#> target model_50_0.5
#> target model_50_0.3
#> target model_25_0.3
Created on 2019-11-01 by the reprex package (v0.3.0)
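When more than one combination has already been built, the single filter() condition above can be generalized into a set difference. This is a sketch in base R under the assumption that the already-built settings are tracked in a small data frame; the names built, key, and new_combos are illustrative only.

```r
# Full grid of hyperparameter settings we eventually want.
all_combos <- expand.grid(nround = c(25, 50, 75), subsample = c(0.3, 0.5))

# Settings that already have an up-to-date target (assumed to be tracked by hand).
built <- data.frame(nround = 25, subsample = 0.5)

# Build a simple string key per row so rows can be compared across data frames.
key <- function(df) paste(df$nround, df$subsample, sep = "_")

# Keep only the combinations that have not been built yet.
new_combos <- all_combos[!key(all_combos) %in% key(built), ]
print(new_combos)
```

A data frame like new_combos could then play the role of grid in the map(.data = !!grid) call shown above.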
Question
Hi Will, long time no see! Hope everything is well.
My question:
What is the correct way to use triggers to prevent drake from running dependencies in the plan? I am not sure whether this is related to #685, but it appears to be simpler than that.
What I would like to achieve is to be able to decide beforehand to run certain targets and their dependencies without having to change the plan.
So, for example, if I decide that I don't want analysis X, I set a condition that tells drake to skip that target and, most importantly, all its dependencies.
I made an MRE to show what I have tried so far, which was not successful:
In this plan, drake correctly configures the targets analysis_mtcars_high_hp and analysis_mtcars_low_hp as depending on mtcars_high_hp and mtcars_low_hp. However, it builds the targets even if the conditions evaluate to FALSE. So I am not sure what I am doing wrong here.