[help] Dynamically link functions used in a `do.call()` as dependencies for target branches #1344

lindsayplatt · 2024-10-08T22:26:49Z

lindsayplatt
Oct 8, 2024

Help

I understand and agree to https://books.ropensci.org/targets/help.html.

Description

I am working on architecting a pipeline where we have a number of data files with a variety of potential parsers. There may end up being tons of data files, so declaring which parser each data file should get needs to happen dynamically. I would like to match a parser function to the data file in a dynamic branching target (I can do this) AND have that branch depend on the function it will apply (I have not figured this part out). I will separately handle the situation where a file does not have a matching parser, so please ignore that scenario in this use case.

I have read a discussion thread that talks about setting dependencies when using a do.call() approach (see #831), but the solution provided was to write the dependency within the target command. As I am doing this for an unknown number of target branches which will use different functions, this solution will not work. Below is a reprex to give you a sense of what I am trying to do. Looking forward to thinking through this issue and how targets may or may not be able to help! I drew what I was hoping the DAG would look like with red lines on top of what it is now.

tar_dir({
  tar_script({
    tar_option_set()
    
    parser_typeA <- function(in_file) readRDS(in_file)[4,]
    parser_typeB <- function(in_file) readRDS(in_file)[1,]
    apply_parser <- function(parser_xwalk) {
      fxn <- parser_xwalk$parser_fxn
      args <- list(in_file = parser_xwalk[, 'in_file'])
      do.call(fxn, args)
    }
    
    list(
      
      # 1. Declare files to parse
      tar_target(files_to_parse, 
                 c('my_data_typeA.rds', 'my_data_typeB.rds'),
                 format = 'file'),
      
      # 2. Create a crosswalk between the files and the parser they should use
      tar_target(parser_xwalk, data.frame(in_file = files_to_parse, 
                                          parser_fxn = c('parser_typeA', 'parser_typeB'))),
      
      # 3. Apply each parser to the files based on the crosswalk 
      # *The problem I am having* is that the parser functions are not dependencies
      # Because I don't know how many files I will end up with, I want to dynamically 
      # match a file to a parser and have that branch depend on the function it uses.
      tar_target(parsed_files, 
                 # I've tried `as.symbol(parser_xwalk$parser_fxn)` here but that didn't work.
                 apply_parser(parser_xwalk), 
                 pattern = map(parser_xwalk))
    )
  })
  
  saveRDS(tibble(col1a = c(1:5), col2a = letters[1:5]), 'my_data_typeA.rds')
  saveRDS(tibble(col1b = c(1:5), col2b = letters[1:5]), 'my_data_typeB.rds')
  tar_make()
  
  tar_visnetwork() 
})
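A possible reason the red edges are missing, offered as a sketch rather than a statement about targets internals: targets detects dependencies through static analysis of the code, and a function that is only named inside a string never shows up as a symbol. The check below runs `codetools::findGlobals()` (codetools ships with R) on a copy of `apply_parser()` from the reprex. The variable names `uses_do_call` and `sees_parser` are only for illustration.

```r
library(codetools)

parser_typeA <- function(in_file) readRDS(in_file)[4, ]

# Same shape as apply_parser() in the reprex: the parser is looked up
# by name at run time, so its name never appears as a symbol in the code.
apply_parser <- function(parser_xwalk) {
  fxn <- parser_xwalk$parser_fxn
  args <- list(in_file = parser_xwalk[, "in_file"])
  do.call(fxn, args)
}

# Global symbols that static analysis can see in apply_parser()
globals <- findGlobals(apply_parser, merge = TRUE)
print(globals)

uses_do_call <- "do.call" %in% globals      # do.call() is visible
sees_parser <- "parser_typeA" %in% globals  # the parser itself is not
```

Under this reading, the branch has nothing to link to unless the function, or something derived from it, is part of the data the branch maps over.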

joelnitta · 2024-10-09T12:13:39Z

joelnitta
Oct 9, 2024

so declaring what parser each data file should get needs to happen dynamically

In your example, you hard-code the parser type in the parser_xwalk target. Would it be possible to determine the needed parser dynamically, say from the file extension?

1 reply

lindsayplatt Oct 9, 2024
Author

Sorry, yes in this reprex I do have parser_xwalk hard-coded. In my actual code, this is dynamically generated based on known file-naming conventions (not extension because they are mostly CSVs). Regardless of how parser_xwalk is generated, I still have the challenge of getting the functions defined in parser_xwalk to show up as dependencies in the branches.

wlandau · 2024-10-09T15:17:01Z

wlandau
Oct 9, 2024
Maintainer

It's tricky to dynamically branch over functions such that a change to one parser function does not invalidate all the branches of parsed_files. It's not elegant, but I think it will work if the actual function body becomes part of parser_xwalk, as opposed to the function name. Since functions can have brittle internals that change hashes unpredictably, the following sketch deparses them to text. This could lose information in the function closure injected by Vectorize(), purrr::safely(), etc., but it might work in your case if your parsers are simple enough.

tar_option_set()

parsers <- list(
  parser_typeA = function(in_file) readRDS(in_file)[4,],
  parser_typeB = function(in_file) readRDS(in_file)[1,]
)

apply_parser <- function(parser_xwalk) {
  fxn <- eval(parse(text = unlist(parser_xwalk$parser_fxn)))
  args <- list(in_file = parser_xwalk[, "in_file"])
  do.call(fxn, args)
}

list(
  # 1. Declare files to parse
  tar_target(
    files_to_parse, 
    c("my_data_typeA.rds", "my_data_typeB.rds"),
    format = "file"
  ),
  
  # 2. Create a crosswalk between the files and the parser they should use
  tar_target(
    parser_xwalk,
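    # tibble() keeps parser_fxn as a list column; data.frame() would
    # spread the list into one column per parser
    tibble::tibble(
      in_file = files_to_parse, 
      parser_fxn = lapply(parsers, deparse)
    )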
  ),
  
  # 3. Apply each parser to the files based on the crosswalk 
  tar_target(
    parsed_files, 
    apply_parser(parser_xwalk), 
    pattern = map(parser_xwalk)
  )
)
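To make the caveat about closures concrete, here is a small base R sketch of the deparse/parse round trip. It uses in-memory vectors instead of files, and `make_parser()` and `keep_row` are hypothetical names introduced only for this illustration.

```r
# A simple parser survives the round trip through text.
parser_simple <- function(x) x[4]
simple_text <- deparse(parser_simple)
simple_rebuilt <- eval(parse(text = simple_text))
round_trip_ok <- identical(simple_rebuilt(letters), parser_simple(letters))

# Hypothetical factory: the returned parser relies on `keep_row`,
# which lives in the enclosing environment, not in the function text.
make_parser <- function(keep_row) {
  force(keep_row)
  function(x) x[keep_row]
}
parser_closure <- make_parser(4)

# deparse() keeps only the code, so the rebuilt function has lost `keep_row`.
closure_rebuilt <- eval(parse(text = deparse(parser_closure)))

original_result <- parser_closure(letters)
rebuilt_result <- tryCatch(
  closure_rebuilt(letters),
  error = function(e) "error"
)
```

The simple parser behaves the same after rebuilding, while the rebuilt closure fails because `keep_row` is no longer in scope. That is the kind of information wrappers such as `Vectorize()` or `purrr::safely()` keep in the closure.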

3 replies

lindsayplatt Oct 9, 2024
Author

Ahhh yes, OK putting the function contents directly into the crosswalk makes sense. Since I am dynamically creating the crosswalk (in my code, not in this reprex), I think I would want the crosswalk target to depend on a target with all the function content so that it rebuilds when there are any changes. Since this is a simple match of file name to fxn, it shouldn't be computationally expensive to rebuild.

Thank you!

I was able to get the following to run with appropriate dependencies (I tested with a change in parser_typeB() and when I rebuilt, only the second branch re-ran 👍):

tar_dir({
  tar_script({
    library(tarchetypes)
    tar_option_set(packages = 'tidyverse')
    
    parser_typeA <- function(in_file) readRDS(in_file)[4,]
    parser_typeB <- function(in_file) readRDS(in_file)[1,]
    
    apply_parser <- function(parser_xwalk) {
      fxn <- eval(parse(text = unlist(parser_xwalk$parser_fxn)))
      args <- list(in_file = parser_xwalk$in_file)
      do.call(fxn, args)
    }
    
    list(
      # 1. Declare files to parse
      tar_target(
        files_to_parse, 
        c("my_data_typeA.rds", "my_data_typeB.rds"),
        format = "file"
      ),
      
      # 2. Create a crosswalk between the files and the parser they should use
      tar_target(
        parser_xwalk,
        tibble(
          in_file = files_to_parse, 
          # Contents of the parser function exists *inside* of this xwalk
          # so that downstream targets rebuild when there are changes
          parser_fxn = list(deparse(parser_typeA), deparse(parser_typeB))
        )
      ),
      
      # 3. Apply each parser to the files based on the crosswalk 
      tar_target(
        parsed_files, 
        apply_parser(parser_xwalk), 
        pattern = map(parser_xwalk)
      )
    )
  })
  
  saveRDS(data.frame(col1a = c(1:5), col2a = letters[1:5]), 'my_data_typeA.rds')
  saveRDS(data.frame(col1b = c(1:5), col2b = letters[1:5]), 'my_data_typeB.rds')
  tar_make()
  
  tar_visnetwork() 
})
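A rough way to see why only the second branch re-ran, assuming each branch is invalidated only when its own row of the crosswalk changes (which matches the behavior observed above): compare the deparsed text row by row before and after an edit. `build_xwalk()` is a hypothetical stand-in for the `parser_xwalk` target, written in base R.

```r
# Hypothetical stand-in for the parser_xwalk target.
# I() keeps the deparsed functions together as a single list column.
build_xwalk <- function(parsers) {
  data.frame(
    in_file = c("my_data_typeA.rds", "my_data_typeB.rds"),
    parser_fxn = I(lapply(parsers, deparse))
  )
}

before <- build_xwalk(list(
  parser_typeA = function(in_file) readRDS(in_file)[4, ],
  parser_typeB = function(in_file) readRDS(in_file)[1, ]
))

# Same parsers, except parser_typeB now keeps a different row.
after <- build_xwalk(list(
  parser_typeA = function(in_file) readRDS(in_file)[4, ],
  parser_typeB = function(in_file) readRDS(in_file)[2, ]
))

# Which rows of the crosswalk differ between the two versions?
row_changed <- vapply(
  seq_len(nrow(before)),
  function(i) !identical(before$parser_fxn[[i]], after$parser_fxn[[i]]),
  logical(1)
)
print(row_changed)
```

Only the row for `parser_typeB` differs, so only the branch built from that row has new input.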

lindsayplatt Oct 9, 2024
Author

One last follow-up here, as I wanted to think more about your point about potential "brittleness" when converting functions to strings and back. Rather than treating the parsers as object dependencies in the pipeline, I am trying a method that treats them as files. Then I can source them into an environment and keep them as function objects rather than parse/deparse and potentially lose functionality. Here is a reprex showing this adjustment. I think I will move forward with this approach. I tested changes to one of the function files: the affected branch rebuilt when I changed the R code but did not rebuild when I only added a comment.

tar_dir({
  tar_script({
    
    tar_option_set(packages = 'tidyverse')
    
    load_parser <- function(fxn_file) {
      # Load the parser file into its own environment
      parser_env <- new.env()
      source(fxn_file, local = parser_env)
      parser_fxn_nm <- ls(envir = parser_env)
      # Require that the file only have one function
      stopifnot(length(parser_fxn_nm) == 1)
      # Return the function
      return(parser_env[[parser_fxn_nm]])
    }
    
    apply_parser <- function(parser_xwalk) {
      # `unlist()` wasn't working but `[[1]]` is fine 
      # since we already checked that there is only one fxn
      fxn <- parser_xwalk$parser_fxn[[1]]
      args <- list(in_file = parser_xwalk$in_file)
      do.call(fxn, args)
    }
    
    list(
      # 1. Declare files to parse
      tar_target(
        files_to_parse, 
        c("my_data_typeA.rds", "my_data_typeB.rds"),
        format = "file"
      ),
      
      # 2. Declare fxn files as targets
      tar_target(
        fxn_files, 
        c("parser_typeA.R", "parser_typeB.R"),
        format = "file"
      ),
      
      # 3. Create a crosswalk between the files and the parser they should use
      tar_target(
        parser_xwalk,
        data.frame(
          in_file = files_to_parse, 
          # Depend on the files with the single parser fxn to avoid brittleness
          parser_fxn_file = fxn_files
        ) %>% 
          rowwise() %>% 
          mutate(parser_fxn = list(load_parser(parser_fxn_file)))
      ),
      
      # 4. Apply each parser to the files based on the crosswalk 
      tar_target(
        parsed_files, 
        apply_parser(parser_xwalk), 
        pattern = map(parser_xwalk)
      )
    )
  })
  
  # One file per parser fxn
  writeLines('parser_typeA <- function(in_file) readRDS(in_file)[4,]', 'parser_typeA.R')
  writeLines('parser_typeB <- function(in_file) readRDS(in_file)[1,]', 'parser_typeB.R')
  saveRDS(data.frame(col1 = c(1:5), col2 = letters[1:5]), 'my_data_typeA.rds')
  saveRDS(data.frame(col1 = c(1:5), col2 = letters[1:5]), 'my_data_typeB.rds')
  
  tar_make()
  
  tar_visnetwork() 
})
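One plausible explanation for the comment-only edit not triggering a rebuild, sketched in base R: when a file is sourced without keeping source references, comments are not part of the resulting function object, so the loaded function comes out the same. The `load_parser()` below is a variant of the one in the reprex that sets `keep.source = FALSE` explicitly (an assumption made here so the result does not depend on session options), and `write_parser()` is a hypothetical helper.

```r
# Variant of load_parser() from the reprex with keep.source = FALSE
load_parser <- function(fxn_file) {
  parser_env <- new.env()
  source(fxn_file, local = parser_env, keep.source = FALSE)
  parser_fxn_nm <- ls(envir = parser_env)
  # Require that the file only have one function
  stopifnot(length(parser_fxn_nm) == 1)
  parser_env[[parser_fxn_nm]]
}

# Hypothetical helper: write some lines to a temporary .R file
write_parser <- function(lines) {
  path <- tempfile(fileext = ".R")
  writeLines(lines, path)
  path
}

original <- load_parser(write_parser("parser <- function(x) x[4]"))
commented <- load_parser(write_parser(c(
  "# pick the fourth element",
  "parser <- function(x) x[4]"
)))
edited <- load_parser(write_parser("parser <- function(x) x[1]"))

# Does the loaded function differ from the original?
comment_changes_fxn <- !identical(deparse(original), deparse(commented))
code_changes_fxn <- !identical(deparse(original), deparse(edited))
```

Here the comment-only file yields the same function text as the original, while the code edit does not.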

lindsayplatt Oct 10, 2024
Author

My only other challenge with this approach now is that I can't capture other function dependencies. For example, when a parser function calls another custom function, I would want the branch to rebuild when that other function is updated, but it does not. Here is a reprex of this with a function called helper_fxn().

tar_dir({
  tar_script({
    
    tar_option_set(packages = 'tidyverse')
    
    load_parser <- function(fxn_file) {
      # Load the parser file into its own environment
      parser_env <- new.env()
      source(fxn_file, local = parser_env)
      parser_fxn_nm <- ls(envir = parser_env)
      # Require that the file only have one function
      stopifnot(length(parser_fxn_nm) == 1)
      # Return the function
      return(parser_env[[parser_fxn_nm]])
    }
    
    apply_parser <- function(parser_xwalk) {
      # `unlist()` wasn't working but `[[1]]` is fine 
      # since we already checked that there is only one fxn
      fxn <- parser_xwalk$parser_fxn[[1]]
      args <- list(in_file = parser_xwalk$in_file)
      do.call(fxn, args)
    }
    
    source('helper_fxn.R')
    
    list(
      # 1. Declare files to parse
      tar_target(
        files_to_parse, 
        c("my_data_typeA.rds", "my_data_typeB.rds"),
        format = "file"
      ),
      
      # 2. Declare fxn files as targets
      tar_target(
        fxn_files, 
        c("parser_typeA.R", "parser_typeB.R"),
        format = "file"
      ),
      
      # 3. Create a crosswalk between the files and the parser they should use
      tar_target(
        parser_xwalk,
        data.frame(
          in_file = files_to_parse, 
          # Depend on the files with the single parser fxn to avoid brittleness
          parser_fxn_file = fxn_files
        ) %>% 
          rowwise() %>% 
          mutate(parser_fxn = list(load_parser(parser_fxn_file)))
      ),
      
      # 4. Apply each parser to the files based on the crosswalk 
      tar_target(
        parsed_files, 
        apply_parser(parser_xwalk), 
        pattern = map(parser_xwalk)
      )
    )
  })
  # Shared function stored elsewhere
  writeLines('helper_fxn <- function(i) i+1', 'helper_fxn.R')
  # One file per parser fxn
  writeLines('parser_typeA <- function(in_file) readRDS(in_file)[4,]', 'parser_typeA.R')
  writeLines('parser_typeB <- function(in_file) readRDS(in_file)[helper_fxn(1),]', 'parser_typeB.R')
  saveRDS(data.frame(col1 = c(1:5), col2 = letters[1:5]), 'my_data_typeA.rds')
  saveRDS(data.frame(col1 = c(1:5), col2 = letters[1:5]), 'my_data_typeB.rds')
  
  tar_make()
  
  tar_visnetwork() 
})

I think I may need to just implement a rule that parsers must be tiny and not call on other custom functions.
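If that rule turns out to be too strict, one direction that might be worth exploring (not something suggested in this thread, and untested inside a real pipeline) is to record the text of any custom helpers next to the parser. The sketch below uses `codetools::findGlobals()` to list the functions a parser calls directly. `custom_deps()`, `fingerprint()`, and `pipeline_env` are hypothetical names, and helpers called by other helpers are not followed.

```r
library(codetools)

# Stand-in for the environment where shared helpers are defined
pipeline_env <- new.env()
pipeline_env$helper_fxn <- function(i) i + 1

parser_typeB <- function(in_file) readRDS(in_file)[helper_fxn(1), ]

# Names of functions that `fxn` references and that are defined
# directly in `envir` (so base and package functions are skipped)
custom_deps <- function(fxn, envir) {
  globals <- findGlobals(fxn, merge = TRUE)
  is_custom <- vapply(globals, function(nm) {
    exists(nm, envir = envir, inherits = FALSE) &&
      is.function(get(nm, envir = envir, inherits = FALSE))
  }, logical(1))
  sort(globals[is_custom])
}

# Text of the parser plus the text of its direct custom helpers
fingerprint <- function(fxn, envir) {
  deps <- custom_deps(fxn, envir)
  dep_text <- unlist(lapply(deps, function(nm) {
    deparse(get(nm, envir = envir, inherits = FALSE))
  }))
  paste(c(deparse(fxn), dep_text), collapse = "\n")
}

deps <- custom_deps(parser_typeB, pipeline_env)
before <- fingerprint(parser_typeB, pipeline_env)

# Simulate an edit to the shared helper
pipeline_env$helper_fxn <- function(i) i + 2
after <- fingerprint(parser_typeB, pipeline_env)
```

Storing a string like this in an extra crosswalk column would, in principle, change the row for `parser_typeB` whenever `helper_fxn()` changes, while leaving the row for `parser_typeA` alone.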
