Question: Condition Trigger to affect dependencies using cross #1041

nettoyoussef · 2019-10-30T15:16:41Z

Prework

Read and abide by drake's code of conduct.
Search for duplicates among the existing issues, both open and closed.
Consider instead posting to Stack Overflow under the drake-r-package tag.

Question

Hi Will, long time no see! Hope everything is well.

My question:
What is the correct way to use triggers to prevent drake from running dependencies in the plan?
I am not sure whether this is related to #685, but it appears to be simpler than that.

What I would like to achieve is to be able to decide beforehand to run certain targets and their dependencies without having to change the plan.

So, for example, if I decide that I don't want analysis X, I set a flag that tells Drake to skip that target and, most importantly, all its dependencies.

I made an MRE to show what I have tried so far, which was not successful:

# libraries
library(dplyr)
library(here)
library(drake)

cache_path <- here::here('.drake')
cache <- storr::storr_rds(cache_path, compress = FALSE)

# These are the preconditions
high_hp = FALSE
low_hp = FALSE

# starts the plan
plan <- drake_plan(
  mtcars = mtcars,
  
  mtcars_high_hp =
    target(
      command = dplyr::filter(mtcars, hp > 150),
      trigger = trigger(condition = isTRUE(!!high_hp), mode = "condition")
    ),
  
  mtcars_low_hp =
    target(
      command = dplyr::filter(mtcars, hp < 110),
      trigger = trigger(condition = isTRUE(!!low_hp), mode = "condition")
    ),
  
  #Perform analysis
  analysis =
    target(
      command = mutate(data, performance = hp / cyl),
      transform =
        cross(data = c(mtcars, mtcars_high_hp, mtcars_low_hp))
    )
)

# Configure plan

config <- drake_config(plan, verbose = 2, cache = cache)
make(config = config)

# The targets are built even if the condition evaluates to false, though
cached(path = cache_path)
readd(mtcars_low_hp, path = cache_path)
readd(analysis_mtcars_low_hp, path = cache_path)

In this plan, Drake correctly configures the targets analysis_mtcars_high_hp and analysis_mtcars_low_hp as depending on mtcars_high_hp and mtcars_low_hp, respectively.

However, it builds the targets even if the conditions evaluate to FALSE.
So I am not sure what I am doing wrong here.


nettoyoussef · 2019-10-30T15:24:28Z

Maybe the problem is simply that the targets did not exist beforehand, as in #616.

If so, what would be the alternative to using triggers? I am devising a pipeline that will deal with data in different conditions, so for some of them I would like to run all parts of the basic plan, and for others I would prefer to skip some of those parts.

Think of this as a general data treatment pipeline, where data in different conditions would be treated accordingly. But the intent would be to avoid creating a new plan for each new dataset.

wlandau · 2019-10-30T16:46:53Z

I think the targets argument of make() covers what you describe, especially since you know what you want in advance.

library(drake)
plan <- drake_plan(
  x = target(w, transform = map(w = !!seq_len(1e2))),
  y = target(x, transform = map(x, .id = w)),
  z = target(y, transform = map(y, .id = w))          
)
nrow(plan) # lots of targets
#> [1] 300
plot(plan) # dev version only

make(plan, targets = c("z_1L", "z_2L")) # only run some
#> target x_1L
#> target x_2L
#> target y_1L
#> target y_2L
#> target z_1L
#> target z_2L

^{Created on 2019-10-30 by the reprex package (v0.3.0)}

wlandau · 2019-10-30T16:47:32Z

The issue with triggers is they always require the target to exist before any kind of skipping.
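
To connect the precondition flags from the original question to the targets argument, one option is to compute the vector of wanted target names before calling make(). This is a minimal sketch rather than a prescribed drake idiom: the helper wanted_targets() is hypothetical, the target names are the ones cross() generated for the plan in the question, and treating analysis_mtcars as always wanted is an assumption.

```r
# Hypothetical helper: translate precondition flags into target names.
wanted_targets <- function(high_hp, low_hp) {
  c(
    "analysis_mtcars", # assumed to be wanted in every run
    if (isTRUE(high_hp)) "analysis_mtcars_high_hp",
    if (isTRUE(low_hp)) "analysis_mtcars_low_hp"
  )
}

wanted <- wanted_targets(high_hp = FALSE, low_hp = TRUE)

# With the plan from the question in scope, make() would then build only the
# requested targets and whatever they depend on upstream:
# make(plan, targets = wanted)
```

Because a skipped analysis target is never requested, its upstream filter target (for example mtcars_high_hp) is not requested either, and the plan itself stays unchanged.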

nettoyoussef · 2019-10-31T13:13:47Z

Thanks Will! That solved it for me.

Maybe, given #616, #685, and #935, you would like to consider a more flexible way to handle Drake's cache and target updates going forward.

For example, triggers and targets could be unified in an eventual refactoring, and other rules not attached to the name of the target could be created. That way, the problem described in #935 would be solved.

Not that it is an easy task, I am just giving you food for thought. Tell me if I can help with anything.

wlandau · 2019-10-31T20:33:55Z

Is it just the issue of renaming targets? What other kinds of problems are you thinking about exactly?

There is a way to rename a target without necessarily incurring a rebuild under certain conditions. If you keep the RNG seed constant (target(..., seed = same_as_last_time)) you can call make(recover = TRUE). But this solution will still invalidate targets downstream.

For complete freedom to rename targets with impunity, drake would need to create stable internal names. Dependency hashes are a natural choice, but we cannot know them until make() is actually running. Take this plan:

plan <- drake_plan(
  data = get_data(file_in("https://example.com")),
  munge = munge_data(data),
  analysis = analyze(munge)
)

When it comes time to run munge, drake would need to alias it with a name like target_fc2f64e0 and then substitute the alias in all the downstream commands. The aliases would create extra key files in the cache, exacerbating an existing performance issue (#1025 etc.) to say nothing of all the heavy refactoring this approach would require. Currently, I do not think it is worth the effort because it is a lot of work and a lot of risk for something that does not come up very much.
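
The aliasing idea can be illustrated with a small sketch. It is purely hypothetical and is not how drake works: hash_string() and stable_alias() are made-up helpers, and the strings standing in for the contents of data are placeholders. The point is that the alias depends on the command and on the hashes of the dependencies, so it cannot be computed until the upstream target has actually been built.

```r
# Hash a string using base R only: write it to a temp file and take the MD5.
hash_string <- function(x) {
  path <- tempfile()
  on.exit(unlink(path))
  writeLines(x, path)
  unname(tools::md5sum(path))
}

# Hypothetical alias: derived from the command and the dependency hashes,
# never from the user-facing target name.
stable_alias <- function(command, dep_hashes) {
  key <- paste(c(deparse(command), sort(dep_hashes)), collapse = "\n")
  paste0("target_", substr(hash_string(key), 1, 8))
}

# Placeholder hashes for two different versions of the upstream target `data`.
data_hash_v1 <- hash_string("contents of data, first download")
data_hash_v2 <- hash_string("contents of data, second download")

alias_v1 <- stable_alias(quote(munge_data(data)), data_hash_v1)
alias_v2 <- stable_alias(quote(munge_data(data)), data_hash_v2)
```

Renaming munge in the plan would leave alias_v1 unchanged, while a change in data yields a different alias (alias_v2), and every such alias would need its own key in the cache.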

nettoyoussef · 2019-11-01T12:14:14Z

wlandau wrote: "Is it just the issue of renaming targets? What other kinds of problems are you thinking about exactly?"

No - the names themselves are not important. The issue is with changing the plan's parameters or the function calls, which makes Drake rebuild the targets unnecessarily.

I think it is a common way to program:

You start with some simple functions that receive some parameters.
You add a couple of features and change the function arguments.
You decide you don't need some of the features anymore and change the signature back, etc.

Each of those steps would make Drake update all the targets and dependencies from that point onwards, even if the parameters evaluate to the same function specification. If the targets are slow to build, this creates a big overhead.

Of course, it would be impossible to create a parser that understands each and every modification you make to a function call, but some simple cases, such as the one in #935, are not so complex to implement, I think.

If we could do this at least for the commands of cross, map and combine, it would go a long way for hyperparameter optimization:

you create a first exploration of the space of parameters using map
the results are unsatisfactory
you decide to expand the range of the parameters explored - this adds more targets to build, but you don't need to rebuild the previous ones.
notice that the function body is precisely the same - you are only adding targets.

An MRE to make what I am saying more clear:

library(xgboost)
library(drake)

plan_1 <- drake_plan(
  my_data = mtcars,
  model = target(
    xgboost(
      data = my_data,
      eta = 0.1,
      max_depth = 10,
      nround = nround,
      subsample = 0.5,
      colsample_bytree = 0.5,
      seed = 1,
      eval_metric = "auc",
      verbose = 0,
      nthread = 1,
      objective = "binary:logistic"
    ),
    transform = map(nround = 25)
  )
)

plan_2 <- drake_plan(
  my_data = mtcars,
  model = target(
    xgboost(
      data = my_data,
      eta = 0.1,
      max_depth = 10,
      nround = nround,
      subsample = subsample,
      colsample_bytree = 0.5,
      seed = 1,
      eval_metric = "auc",
      verbose = 0,
      nthread = 1,
      objective = "binary:logistic"
    ),
    transform = cross(nround = !!c(25, 50, 75), subsample = !!c(0.3, 0.5))
  )
)

If you couple this with the triggers feature, which cannot stop specific targets from being built if they have not been built before, and the targets argument, which forces you to manually select the items you want to run, there seems to be a certain overlap of features that may be hard to maintain and, at the same time, is not very flexible.

I think this overall discussion enters the scope of #685.

A more flexible way to evaluate the plan would decide what to build in real time, based on the specification of the targets built before.

A sketch of the structure could be, instead of the current one, something that prioritizes, from high to low:

what the user wants (triggers)
modification in the timestamp of the input data
modification in the function body (discarding things such as white space etc)
modification in the function parameters (call)
missing

That would create some overhead since Drake would have to hash all functions in the global environment in real time. And of course, I still don't have a deep view of Drake internals.
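
The priority order listed above could be sketched as a first-match rule. This is only an illustration of the proposal, under the assumption that each check is already available as a logical flag; build_reason() and all of its arguments are hypothetical and not part of drake.

```r
# Hypothetical decision rule: the first matching rule wins, from high to low priority.
build_reason <- function(user_choice = NA, input_changed = FALSE,
                         body_changed = FALSE, call_changed = FALSE,
                         is_missing = FALSE) {
  if (isTRUE(user_choice)) return("user: build")   # explicit user request
  if (isFALSE(user_choice)) return("user: skip")   # explicit user veto
  if (input_changed) return("input data changed")
  if (body_changed) return("function body changed")
  if (call_changed) return("function call changed")
  if (is_missing) return("missing")
  "up to date"
}

# A user veto outranks everything, even a target that was never built.
skip_case <- build_reason(user_choice = FALSE, is_missing = TRUE)

# With no user preference, the highest-priority change is reported.
call_case <- build_reason(call_changed = TRUE, is_missing = TRUE)
```

Under this ordering, a condition flag could skip a target that does not exist yet, which is the behavior the current triggers do not allow.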

You are in a much better position to evaluate what is feasible and how much effort would be needed.

wlandau · 2019-11-01T15:17:27Z

If you add new hyperparameter combos without changing the commands of old targets, you can use data recovery to rename old targets. Below, model_25 gets assigned to model_25_0.5 using make(plan_2, recover = TRUE). Note that we have to explicitly get the old seed from model_25.

library(drake)

mock_xgboost <- function(...) {
  NULL
}

plan_1 <- drake_plan(
  my_data = mtcars,
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = 0.5),
    transform = map(nround = 25)
  )
)

make(plan_1)
#> target my_data
#> target model_25

plan_2 <- drake_plan(
  my_data = mtcars,
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = subsample),
    transform = cross(nround = !!c(25, 50, 75), subsample = !!c(0.3, 0.5))
  )
)

config <- drake_config(plan_2)
vis_drake_graph(config)

plan_2$seed <- NA
plan_2$seed[plan_2$target == "model_25_0.5"] <- diagnose(model_25)$seed

config <- drake_config(plan_2)
recoverable(config)
#> [1] "model_25_0.5"

make(plan_2, recover = TRUE)
#> target model_75_0.5
#> target model_50_0.5
#> recover model_25_0.5
#> target model_75_0.3
#> target model_50_0.3
#> target model_25_0.3

^{Created on 2019-11-01 by the reprex package (v0.3.0)}

In more complicated situations, the command might change. You might have to set hyperparameters you forgot about previously. That brings us to #705. If we were to analyze all the parameters of all the function calls in a command, the performance penalty in the general case would be too severe.

The existing triggers are deliberately rigid to enforce reproducibility. I hesitate to add the kind of flexibility you describe because the whole purpose of drake is to reduce human decisions about what is up to date.

For the general case, here is an easier workaround: append new targets to the current plan with a custom .data grid in map(). That way, you can avoid hyperparameter combos you already tried. The plan may look less elegant, but you avoid rerunning old targets.

library(drake)
library(tidyverse)

mock_xgboost <- function(...) {
  NULL
}

plan <- drake_plan(
  my_data = mtcars,
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = 0.5),
    transform = map(nround = 25)
  )
)

make(plan)
#> target my_data
#> target model_25

# Custom grid of settings that avoids nround = 25 with subsample = 0.5
grid <- expand_grid(nround = c(25, 50, 75), subsample = c(0.3, 0.5)) %>%
  filter(!(nround == 25 & subsample == 0.5))
grid
#> # A tibble: 5 x 2
#>   nround subsample
#>    <dbl>     <dbl>
#> 1     25       0.3
#> 2     50       0.3
#> 3     50       0.5
#> 4     75       0.3
#> 5     75       0.5

addendum <- drake_plan(
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = subsample),
    transform = map(.data = !!grid)
  )
)

plan <- bind_plans(plan, addendum)

config <- drake_config(plan)

outdated(config) # not model_25
#> [1] "model_25_0.3" "model_50_0.3" "model_50_0.5" "model_75_0.3"
#> [5] "model_75_0.5"

# model_25 is in the plan and up to date
drake_ggraph(config)

make(plan) # only 5 targets get built
#> target model_75_0.5
#> target model_75_0.3
#> target model_50_0.5
#> target model_50_0.3
#> target model_25_0.3

^{Created on 2019-11-01 by the reprex package (v0.3.0)}
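
When the set of combos already tried grows, the hard-coded filter() in the grid above can be replaced by a comparison against a record of previous settings. The following is a minimal base R sketch under that assumption: the tried data frame and the combo_key() helper are hypothetical and would have to be maintained by the user.

```r
# Settings already built in earlier runs (assumed to be tracked by the user).
tried <- data.frame(nround = 25, subsample = 0.5)

# Full grid of settings wanted now.
wanted <- expand.grid(nround = c(25, 50, 75), subsample = c(0.3, 0.5))

# Hypothetical helper: one string key per row, used to compare the two grids.
combo_key <- function(d) paste(d$nround, d$subsample, sep = "_")

# Keep only the combos that have not been tried yet.
grid <- wanted[!combo_key(wanted) %in% combo_key(tried), ]
rownames(grid) <- NULL
```

The resulting grid has the same five rows as the tibble above (possibly in a different order) and could be supplied to map(.data = !!grid) in the same way.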

nettoyoussef added the type: question label Oct 30, 2019

nettoyoussef changed the title ~~Condition Trigger to affect dependencies using cross~~ Question: Condition Trigger to affect dependencies using cross Oct 30, 2019

wlandau closed this as completed Oct 30, 2019

hansvancalster mentioned this issue Nov 14, 2019

conditional execution of dynamic subtargets #1066

Closed

3 tasks
