replicates within tar_map() [help] #173

mrguyperson · 2024-04-22T12:39:21Z

Help

I understand and agree to https://books.ropensci.org/targets/help.html.

Description

I have built a pipeline for processing data, training/evaluating models and making predictions. It makes heavy use of tar_map() because the pipeline runs for several different datasets, and then builds a set of models for each of those datasets, etc. I would now like to replicate the pipeline a number of times so that I can calculate average performance metrics for the test set. I thought I could make use of tar_rep() or tar_map_rep() in my existing pipeline, but I seem to not understand how these functions are to be used in a pipeline like mine. Suppose the following simple example:

library(targets)
library(tarchetypes)
library(tibble)
library(dplyr)

tar_option_set(
  packages = c("tidymodels"),
  seed = 1
)

make_data <- function(){
    tibble(
        a = runif(100),
        b = runif(100),
        outcome = rbinom(n = 100, size = 1, prob = 0.75)
    )
}

datasets <- tibble(
  data = syms(c("data_a", "data_b")),
  stratum = "outcome",
  names = c("a", "b")
)

data_pipeline <- tar_map(
  values = datasets,
  names = "names",
  tar_target(
    data_splits,
    initial_split(data, strata = stratum)
  ),
  tar_target(
    training_data,
    training(data_splits)
  ),
  tar_target(
    testing_data,
    testing(data_splits)
  ),
  tar_target(
    folds,
    vfold_cv(training_data)
  )
)

list(
  tar_target(
    data_a,
    make_data()
  ),
  tar_target(
    data_b,
    make_data()
  ),
  data_pipeline
)

I tried using tar_rep() on the first target of tar_map() (data_splits) and then pattern = map(*) for the remaining targets. However, the data_splits targets end up as nested lists, and the commands in the later targets fail on them. I also tried breaking tar_map() apart, starting with an initial tar_map_rep() for data_splits, but that did not seem to work either. Am I approaching this problem entirely wrong? I tried using tar_manifest(), but I honestly wasn't sure how to make use of the output for my various flavors of data_splits. Thanks for any clarity you can offer (and for this set of tools, which has greatly improved my workflow).

Answered by wlandau

wlandau · 2024-04-23T20:55:44Z

Maintainer

It's tough when there are different targets for so many different versions of the data: the raw data, splits, training, testing, and folds. For simulation studies, I often recommend making each branch its own end-to-end simulation replication: generate the data, run a model, and report compact metrics that can be summarized across reps. If needed, the command supplied to tar_map_rep() can share an upstream raw data object, and each simulation rep can split into a different set of folds and training/testing data. An example for clinical trial simulation is at https://github.com/wlandau/rpharma2023-pipeline.
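
As an illustration of the "one branch, one end-to-end replication" idea, here is a minimal sketch of what such a function might look like. Everything in it is an assumption made for illustration: the name run_one_rep() is hypothetical, base R (sample.int() and glm()) stands in for the tidymodels steps in the question, and the split is a simple random split rather than a stratified one. The point is only the shape: take the shared raw data, do the split and the fit internally, and return a small one-row data frame of metrics.

```r
# Hypothetical helper: one end-to-end replication on a shared raw dataset.
# Base R stands in for the tidymodels steps to keep the sketch self-contained.
run_one_rep <- function(data, prop = 0.75) {
  n <- nrow(data)

  # Random (unstratified) train/test split, redrawn on every call.
  train_rows <- sample.int(n, size = floor(prop * n))
  training_data <- data[train_rows, , drop = FALSE]
  testing_data <- data[-train_rows, , drop = FALSE]

  # Fit a simple logistic regression on the training portion.
  fit <- glm(outcome ~ a + b, family = binomial(), data = training_data)

  # Evaluate on the held-out portion.
  prob <- predict(fit, newdata = testing_data, type = "response")
  predicted <- as.integer(prob > 0.5)

  # Return compact metrics only, one row per replication.
  data.frame(
    n_train = nrow(training_data),
    n_test = nrow(testing_data),
    accuracy = mean(predicted == testing_data$outcome)
  )
}

# Example call on data shaped like make_data() in the question.
set.seed(1)
example_data <- data.frame(
  a = runif(100),
  b = runif(100),
  outcome = rbinom(n = 100, size = 1, prob = 0.75)
)
metrics <- run_one_rep(example_data)
```

Because each call returns a one-row data frame, the rows from many replications can be stacked and averaged afterwards.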

3 replies

mrguyperson Apr 24, 2024
Author

Thank you for the prompt reply and the example. If I am understanding correctly, you recommend putting all the steps such as splitting data, fitting the model, and testing the model into a single function, and then using tar_map_rep()? For example:

datasets <-
  tibble(
    data = syms(c("data_a", "data_b"))
  )

list(
  tar_target(
    data_a,
    make_data()
  ),
  tar_target(
    data_b,
    make_data()
  ),
  tar_map_rep(
    sims,
    simulation(data),
    values = datasets,
    names = all_of(c("data")),
    columns = all_of(c("data")),
    batches = 8,
    reps = 5
  )
)

That seems to solve the problem, though with slightly less transparency than I had before (but I mostly only used that granularity for debugging anyway).

A quick question on batches: is the main advantage of splitting into batches and then reps that batches can be split among workers and then each worker goes through the reps, or is there something else that I am missing?

wlandau Apr 24, 2024
Maintainer

Yes, this is exactly the kind of pattern that I have seen work well for large simulations.

> though with slightly less transparency than I had before (but I mostly only used that granularity for debugging anyway).

In your sketch, simulation() could proactively return any compact metrics you might need in case there is a problem with individual simulations and you need to inspect them.
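
One possible way to make individual reps inspectable is sketched below. It is an assumption for illustration, not something prescribed in this thread: the wrapper name with_diagnostics() and the columns it returns are hypothetical. It catches an error inside a rep and records the message in the returned row, so a failed rep still produces a row with the same columns as a successful one.

```r
# Hypothetical wrapper: run one rep and always return a one-row data frame,
# recording the error message instead of failing when the rep breaks.
with_diagnostics <- function(data, run_rep) {
  tryCatch(
    data.frame(
      accuracy = run_rep(data),
      n_rows = nrow(data),
      error = NA_character_
    ),
    error = function(condition) {
      data.frame(
        accuracy = NA_real_,
        n_rows = nrow(data),
        error = conditionMessage(condition)
      )
    }
  )
}

# A rep that succeeds and a rep that fails yield rows with identical columns.
ok <- with_diagnostics(data.frame(x = 1:4), function(data) 0.8)
failed <- with_diagnostics(
  data.frame(x = 1:4),
  function(data) stop("model did not converge")
)
```

Keeping the columns identical in both cases means the rows can still be stacked, and the failed reps can be found later by filtering on the error column.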

> A quick question on batches: is the main advantage of splitting into batches and then reps that batches can be split among workers and then each worker goes through the reps, or is there something else that I am missing?

In tar_map_rep(), each batch is a dynamic branch target which runs multiple reps. This allows you to find the right tradeoff between granularity and overhead. You can effectively disable the batching scheme by setting reps = 1, which would make each dynamic branch / batch only one simulation rep, but that could create extra overhead on the targets side if you have tens of thousands of total simulation replications.
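
The arithmetic behind that tradeoff can be sketched in plain R, using the batches = 8 and reps = 5 settings from the example above. The helper rep_layout() is hypothetical, the accuracy values are random placeholders rather than real results, and the tar_batch and tar_rep column names are an assumption about how the batch and rep indices appear in the output.

```r
# Hypothetical helper: how many dynamic branches and total reps a setting
# implies for one dataset.
rep_layout <- function(batches, reps) {
  c(branches = batches, total_reps = batches * reps)
}

batched <- rep_layout(batches = 8, reps = 5)    # 8 branches, 40 reps
unbatched <- rep_layout(batches = 40, reps = 1) # 40 branches, 40 reps

# Mock of the stacked output: one row per (batch, rep) pair.
results <- expand.grid(tar_rep = seq_len(5), tar_batch = seq_len(8))
set.seed(1)
results$accuracy <- runif(nrow(results), min = 0.6, max = 0.9) # placeholder

# Averaging across all reps ignores how they were grouped into batches.
mean_accuracy <- mean(results$accuracy)
```

Both settings run the same 40 reps; they differ only in how many targets have to be tracked, which is where the overhead comes from.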

mrguyperson Apr 25, 2024
Author

Great, thank you! This has been very helpful. I appreciate your time.
