Question: Condition Trigger to affect dependencies using cross #1041

Closed
nettoyoussef opened this issue Oct 30, 2019 · 7 comments
Comments

@nettoyoussef

Question

Hi Will, long time no see! Hope everything is well.

My question:
What is the correct way to use triggers to keep drake from running dependencies in the plan?
I am not sure whether this is related to #685, but it appears to be simpler than that.

What I would like to achieve is to be able to decide beforehand to run certain targets and their dependencies without having to change the plan.

So, for example, if I decide that I don't want analysis X, I set a flag that tells Drake to skip that target and, most importantly, all its dependencies.

I made an MRE to show what I have tried so far, which was not successful:

# libraries
library(dplyr)
library(here)
library(drake)

cache_path <- here::here('.drake')
cache <- storr::storr_rds(cache_path, compress = FALSE)

# These are the preconditions
high_hp = FALSE
low_hp = FALSE

# starts the plan
plan <- drake_plan(
  mtcars = mtcars,
  
  mtcars_high_hp =
    target(
      command = dplyr::filter(mtcars, hp > 150),
      trigger = trigger(condition = isTRUE(!!high_hp), mode = "condition")
    ),
  
  mtcars_low_hp =
    target(
      command = dplyr::filter(mtcars, hp < 110),
      trigger = trigger(condition = isTRUE(!!low_hp), mode = "condition")
    ),
  
  #Perform analysis
  analysis =
    target(
      command = mutate(data, performance = hp / cyl),
      transform =
        cross(data = c(mtcars, mtcars_high_hp, mtcars_low_hp))
    )
)

# Configure plan

config <- drake_config(plan, verbose = 2, cache = cache)
make(config = config)

# The targets are built even if the condition evaluates to false, though
cached(path = cache_path)
readd(mtcars_low_hp, path = cache_path)
readd(analysis_mtcars_low_hp, path = cache_path)

In this plan, Drake correctly configures the targets analysis_mtcars_high_hp and analysis_mtcars_low_hp as depending on mtcars_high_hp and mtcars_low_hp.

However, it builds the targets even if the conditions evaluate to FALSE.
So I am not sure what I am doing wrong here.

@nettoyoussef nettoyoussef changed the title Condition Trigger to affect dependencies using cross Question: Condition Trigger to affect dependencies using cross Oct 30, 2019
@nettoyoussef
Author

nettoyoussef commented Oct 30, 2019

Maybe the problem is simply that the targets do not exist beforehand, as in #616.

If so, what would be the alternative to using triggers? I am devising a pipeline that will deal with data in different conditions. For some datasets I would like to run all parts of the basic plan, and for others I would prefer to skip some of them.

Think of this as a general data treatment pipeline, where data in different conditions would be treated accordingly. The intent is to avoid creating a new plan for each new dataset.

@wlandau
Member

wlandau commented Oct 30, 2019

I think the targets argument of make() covers what you describe, especially since you know what you want in advance.

library(drake)
plan <- drake_plan(
  x = target(w, transform = map(w = !!seq_len(1e2))),
  y = target(x, transform = map(x, .id = w)),
  z = target(y, transform = map(y, .id = w))          
)
nrow(plan) # lots of targets
#> [1] 300
plot(plan) # dev version only

make(plan, targets = c("z_1L", "z_2L")) # only run some
#> target x_1L
#> target x_2L
#> target y_1L
#> target y_2L
#> target z_1L
#> target z_2L

Created on 2019-10-30 by the reprex package (v0.3.0)

@wlandau wlandau closed this as completed Oct 30, 2019
@wlandau
Member

wlandau commented Oct 30, 2019

The issue with triggers is they always require the target to exist before any kind of skipping.
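
For the flags in the original question, one way to apply the targets argument is to build the vector of target names from the switches. This is only a sketch: run_high_hp, run_low_hp, the cars_* target names, and the wanted vector are made up for illustration, and base R subsetting stands in for dplyr.

```r
library(drake)

# Hypothetical switches, playing the role of high_hp and low_hp above
run_high_hp <- FALSE
run_low_hp <- FALSE

plan <- drake_plan(
  cars_all = datasets::mtcars,
  cars_high_hp = cars_all[cars_all$hp > 150, ],
  cars_low_hp = cars_all[cars_all$hp < 110, ],
  analysis = target(
    cbind(data, performance = data$hp / data$cyl),
    transform = cross(data = c(cars_all, cars_high_hp, cars_low_hp))
  )
)

# Request only the analyses whose switch is on.
# `if` without `else` returns NULL, which c() silently drops.
wanted <- c(
  "analysis_cars_all",
  if (run_high_hp) "analysis_cars_high_hp",
  if (run_low_hp) "analysis_cars_low_hp"
)

# drake builds the requested targets and whatever they depend on upstream
make(plan, targets = wanted)
```

With both switches off, only analysis_cars_all is requested, so the high-hp and low-hp branches are never asked for and the plan itself stays unchanged.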

@nettoyoussef
Author

nettoyoussef commented Oct 31, 2019

Thanks Will! That solved it for me.

Given #616, #685, and #935, maybe you would like to consider a more flexible way to handle Drake's cache and target updates going forward.

For example, triggers and targets could be unified in an eventual refactoring, and other rules not tied to the name of the target could be created. That way, the problem described in #935 would be solved.

Not that it is an easy task, I am just giving you food for thought. Tell me if I can help with anything.

@wlandau
Member

wlandau commented Oct 31, 2019

Is it just the issue of renaming targets? What other kinds of problems are you thinking about exactly?

There is a way to rename a target without necessarily incurring a rebuild under certain conditions. If you keep the RNG seed constant (target(..., seed = same_as_last_time)) you can call make(recover = TRUE). But this solution will still invalidate targets downstream.

For complete freedom to rename targets with impunity, drake would need to create stable internal names. Dependency hashes are a natural choice, but we cannot know them until make() is actually running. Take this plan:

plan <- drake_plan(
  data = get_data(file_in("https://example.com")),
  munge = munge_data(data),
  analysis = analyze(munge)
)

When it comes time to run munge, drake would need to alias it with a name like target_fc2f64e0 and then substitute the alias in all the downstream commands. The aliases would create extra key files in the cache, exacerbating an existing performance issue (#1025 etc.) to say nothing of all the heavy refactoring this approach would require. Currently, I do not think it is worth the effort because it is a lot of work and a lot of risk for something that does not come up very much.
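
To illustrate why such aliases can only be computed while make() is running, here is a rough sketch using the digest package. It is not how drake works internally: alias_for() is a hypothetical helper, and the choice of xxhash32 and of hashing the command text together with upstream values are assumptions made for the example.

```r
library(digest)

# Hypothetical helper: derive a name from the command text and the
# hashes of the upstream values the command depends on.
alias_for <- function(command, upstream_values = list()) {
  upstream_hashes <- vapply(
    upstream_values, digest, character(1), algo = "xxhash32"
  )
  key <- digest(
    list(deparse(command), unname(upstream_hashes)),
    algo = "xxhash32"
  )
  paste0("target_", key)
}

# Same command, two different versions of the upstream data
command <- quote(munge_data(data))
alias_old <- alias_for(command, list(data = head(mtcars)))
alias_new <- alias_for(command, list(data = tail(mtcars)))

alias_old
alias_new
```

The two aliases differ even though the command is identical, because the alias depends on the value of data, and that value is only available once the upstream target has been built.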

@nettoyoussef
Author

> Is it just the issue of renaming targets? What other kinds of problems are you thinking about exactly?

No - the names themselves are not important. The issue is with changing the plan's parameters or the function calls, which makes Drake rebuild the targets unnecessarily.

I think it is a common way to program:

  • you start with some simple functions that receive some parameters.
  • you add a couple of features and change the function arguments.
  • you decide you don't need some of the features anymore and change the signature back, etc.

Each of those steps would make Drake update all the targets and dependencies from that point onwards, even if the parameters evaluate to the same function specification. If the targets are slow to build, this creates a big overhead.

Of course, it would be impossible to create a parser that understands each and every modification you make to a function call, but some simple cases, such as the one in #935, are not so complex to implement, I think.

If we could do this at least for commands generated by cross, map, and combine, it would go a long way toward supporting hyperparameter optimization:

  • you create a first exploration of the space of parameters using map
  • the results are unsatisfactory
  • you decide to expand the range of the parameters explored - this adds more targets to build, but you don't need to rebuild the previous ones.
  • notice that the function body is precisely the same - you are only adding targets.

An MRE to make this clearer:

library(xgboost)
library(drake)

plan_1 <- drake_plan(
  my_data = mtcars,
  model = target(
    xgboost(
      data = my_data,
      eta = 0.1,
      max_depth = 10,
      nround = nround,
      subsample = 0.5,
      colsample_bytree = 0.5,
      seed = 1,
      eval_metric = "auc",
      verbose = 0,
      nthread = 1,
      objective = "binary:logistic"
    ),
    transform = map(nround = 25)
  )
)

plan_2 <- drake_plan(
  my_data = mtcars,
  model = target(
    xgboost(
      data = my_data,
      eta = 0.1,
      max_depth = 10,
      nround = nround,
      subsample = subsample,
      colsample_bytree = 0.5,
      seed = 1,
      eval_metric = "auc",
      verbose = 0,
      nthread = 1,
      objective = "binary:logistic"
    ),
    transform = cross(nround = !!c(25, 50, 75), subsample = !!c(0.3, 0.5))
  )
)

Combine this with triggers, which cannot stop a target from building if it has never been built, and with the targets argument, which forces you to pick the targets to run by hand, and you get overlapping features that may be hard to maintain and yet are still not very flexible.

I think this overall discussion enters the scope of #685.

A more flexible way to evaluate the plan would decide what to build in real time, based on the specification of the targets built before.

Instead of the current structure, Drake could check rules in priority order, from high to low:

  • what the user wants (triggers)
  • modification in the timestamp of the input data
  • modification in the function body (discarding things such as white space etc)
  • modification in the function parameters (call)
  • whether the target is missing

That would create some overhead since Drake would have to hash all functions in the global environment in real time. And of course, I still don't have a deep view of Drake internals.
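
One possible reading of the priority list above, written as a base R sketch. Nothing here is drake API: the rules list, decide(), and the fields of the state list are hypothetical names, and the convention that a rule returns NA when it has no opinion is an assumption of the example.

```r
# Hypothetical rule chain, checked from highest to lowest priority.
# Each rule returns TRUE (build), FALSE (skip), or NA (no opinion).
rules <- list(
  user_trigger    = function(s) s$user_wants,
  input_timestamp = function(s) if (s$input_changed) TRUE else NA,
  function_body   = function(s) if (s$body_changed) TRUE else NA,
  function_call   = function(s) if (s$call_changed) TRUE else NA,
  missing         = function(s) if (!s$exists) TRUE else NA
)

# The first rule with an opinion decides; otherwise the target is skipped.
decide <- function(state) {
  for (rule in names(rules)) {
    verdict <- rules[[rule]](state)
    if (!is.na(verdict)) {
      return(list(build = verdict, rule = rule))
    }
  }
  list(build = FALSE, rule = "up_to_date")
}

# A target that has never been built, but that the user switched off
switched_off <- list(
  user_wants = FALSE, input_changed = FALSE,
  body_changed = FALSE, call_changed = FALSE, exists = FALSE
)
skipped <- decide(switched_off)

# The same target when the user expresses no preference
no_preference <- modifyList(switched_off, list(user_wants = NA))
built <- decide(no_preference)
```

Under this ordering the user's choice wins over the missing rule, so a target that was switched off is skipped even though it has never been built, which is the behaviour the original question asked for.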

You are in a much better position to evaluate what is feasible, and the effort that would be needed.

@wlandau
Member

wlandau commented Nov 1, 2019

If you add new hyperparameter combos without changing the commands of old targets, you can use data recovery to rename old targets. Below, model_25 gets assigned to model_25_0.5 using make(plan_2, recover = TRUE). Note that we have to explicitly get the old seed from model_25.

library(drake)

mock_xgboost <- function(...) {
  NULL
}

plan_1 <- drake_plan(
  my_data = mtcars,
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = 0.5),
    transform = map(nround = 25)
  )
)

make(plan_1)
#> target my_data
#> target model_25

plan_2 <- drake_plan(
  my_data = mtcars,
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = subsample),
    transform = cross(nround = !!c(25, 50, 75), subsample = !!c(0.3, 0.5))
  )
)

config <- drake_config(plan_2)
vis_drake_graph(config)

plan_2$seed <- NA
plan_2$seed[plan_2$target == "model_25_0.5"] <- diagnose(model_25)$seed

config <- drake_config(plan_2)
recoverable(config)
#> [1] "model_25_0.5"

make(plan_2, recover = TRUE)
#> target model_75_0.5
#> target model_50_0.5
#> recover model_25_0.5
#> target model_75_0.3
#> target model_50_0.3
#> target model_25_0.3

Created on 2019-11-01 by the reprex package (v0.3.0)

In more complicated situations, the command might change. You might have to set hyperparameters you forgot about previously. That brings us to #705. If we were to analyze all the parameters of all the function calls in a command, the performance penalty in the general case would be too severe.

The existing triggers are deliberately rigid to enforce reproducibility. I hesitate to add the kind of flexibility you describe because the whole purpose of drake is to reduce human decisions about what is up to date.

For the general case, here is an easier workaround: append new targets to the current plan with a custom .data grid in map(). That way, you can avoid hyperparameter combos you already tried. The plan may look less elegant, but you avoid rerunning old targets.

library(drake)
library(tidyverse)

mock_xgboost <- function(...) {
  NULL
}

plan <- drake_plan(
  my_data = mtcars,
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = 0.5),
    transform = map(nround = 25)
  )
)

make(plan)
#> target my_data
#> target model_25

# Custom grid of settings that avoids nround = 25 with subsample = 0.5
grid <- expand_grid(nround = c(25, 50, 75), subsample = c(0.3, 0.5)) %>%
  filter(!(nround == 25 & subsample == 0.5))
grid
#> # A tibble: 5 x 2
#>   nround subsample
#>    <dbl>     <dbl>
#> 1     25       0.3
#> 2     50       0.3
#> 3     50       0.5
#> 4     75       0.3
#> 5     75       0.5

addendum <- drake_plan(
  model = target(
    mock_xgboost(data = my_data, nround = nround, subsample = subsample),
    transform = map(.data = !!grid)
  )
)

plan <- bind_plans(plan, addendum)

config <- drake_config(plan)

outdated(config) # not model_25
#> [1] "model_25_0.3" "model_50_0.3" "model_50_0.5" "model_75_0.3"
#> [5] "model_75_0.5"

# model_25 is in the plan and up to date
drake_ggraph(config)

make(plan) # only 5 targets get built
#> target model_75_0.5
#> target model_75_0.3
#> target model_50_0.5
#> target model_50_0.3
#> target model_25_0.3

Created on 2019-11-01 by the reprex package (v0.3.0)
