Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic branching and file_out() directories #1141

Closed
3 tasks done
wlandau opened this issue Jan 20, 2020 · 18 comments
Closed
3 tasks done

Dynamic branching and file_out() directories #1141

wlandau opened this issue Jan 20, 2020 · 18 comments

Comments

@wlandau
Copy link
Member

wlandau commented Jan 20, 2020

Prework

EDIT

Best solution yet: #1178

Description

When a file_out() directory is used in combination with dynamic branching, it is possible to fool drake into accepting an old set of files. file_out() is unique among all the triggers because it is part of the output of a target, not the input. That means if any time a target gets built, all its file-out() files are automatically declared up to date. I believe this is what causes the trouble.

There may be clever ways to work with hashes to invalidate sub-targets more readily. However, it is tricky because, again, a file_out() is an end product, and dynamic branching requires we compute sub-target hashes before building the sub-targets (the hash is part of the name).

I propose we throw an error if there are file_out() files for dynamic targets.

Related: #1140

Reproducible example

Provide a minimal reproducible example with code and output that demonstrates the problem. The reprex() function from the reprex package is extremely helpful for this.

To help us read your code, please try to follow the tidyverse style guide. The style_text() and style_file() functions from the styler package make it easier.

Expected result

library(drake)
fs::dir_create("all_figures")

write_file <- function(x, dir) {
  dir <- file.path(dir, x)
  writeLines("lines", dir)
}

plan <- drake_plan(
  file_names = c(1L, 2L, 3L, 4L),
  write_files = target(
    write_file(file_names, dir = file_out("all_figures")),
    dynamic = map(file_names)
  )
)

make(plan)
#> target file_names
#> dynamic write_files
#> subtarget write_files_0b3474bd
#> subtarget write_files_b2a5c9b8
#> subtarget write_files_71f311ad
#> subtarget write_files_98cf3c11
#> aggregate write_files

# Change what the file is supposed to contain
write_file <- function(x, dir) {
  dir <- file.path(dir, x)
  writeLines("lines2", dir)
}

plan <- drake_plan(
  file_names = c(1L, 2L),
  write_files = target(
    write_file(file_names, dir = file_out("all_figures")),
    dynamic = map(file_names)
  )
)

make(plan)
#> target file_names
#> dynamic write_files
#> subtarget write_files_0b3474bd
#> subtarget write_files_b2a5c9b8
#> aggregate write_files

# Go back to all the files.
plan <- drake_plan(
  file_names = c(1L, 2L, 3L, 4L),
  write_files = target(
    write_file(file_names, dir = file_out("all_figures")),
    dynamic = map(file_names)
  )
)

make(plan)
#> target file_names
#> dynamic write_files
#> aggregate write_files

readLines("all_figures/3")
#> [1] "lines"

Created on 2020-01-20 by the reprex package (v0.3.0)

Expected result

readLines("all_figures/3") should return "lines2". Better to avoid file_out() + dynamic branching entirely.

@wlandau
Copy link
Member Author

wlandau commented Jan 20, 2020

cc @jennysjaarda

@jennysjaarda
Copy link

Oh interesting, that seems risky then. I'll take your suggestion in #1140 and output to a single directory instead of multiple, but this could still result in problems. How best to get around this do you think if you files need to be written? Aggregate first and then create a function which writes by parsing over the aggregated target? Would this solve it?

@jennysjaarda
Copy link

Maybe something like this?


make_lines <- function(){
  output <- c("lines")
}

write_file <- function(content, file_names, dir) {
  for(i in 1:length(file_names)){
    dir_out <- file.path(dir, file_names[i])
    writeLines(content[[i]], dir_out)
  }
}

plan <- drake_plan(
  file_names = c(1L, 2L, 3L, 4L),
  lines = target(
    make_lines(), dynamic = map(file_names)
  ),
  write_lines = write_file(lines, file_names, dir = file_out("all_figures"))
)

@jennysjaarda
Copy link

I realize that won't work... as the output of write_file isn't lines but the subtarget name..

@wlandau
Copy link
Member Author

wlandau commented Jan 20, 2020

#1141 (comment) is on the right track. Those dynamic targets should do the expensive work behind the files but not write the files themselves. Then a downstream target writes all the files.

@jennysjaarda
Copy link

I can't quite get it to work, but if you have suggestions, let me know!
I think this could do the trick and be a bit simpler as well:

plan <- drake_plan(
  file_names = c(1L, 2L, 3L, 4L),
  write_lines = target(
    write_something(), dynamic = map(file_names)
  ),
  write_lines_out = {
    write_lines
    file_out("/output_folder")}
)

@wlandau
Copy link
Member Author

wlandau commented Jan 21, 2020

The file_out() declaration should actually be in the same target where you write the files. That's not always a bad thing since file writing is cheap compared to the scale of work drake is designed to handle (see https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets). Maybe something like this:

plan <- drake_plan(
  settings = c(1, 2, 3, 4), 
  datasets = target(
    do_expensive_computation(settings),
    dynamic = map(settings)
  ),
  external_files = target(
    write_all_datasets_quickly(datasets, settings)
  )
)

Or if everything is fast, maybe just

plan <- drake_plan(
  files = c(1, 2, 3, 4),
  external_files = target(
    write_all_files_quickly(files)
  )
)

@jennysjaarda
Copy link

Thanks for the suggestion, in your example I guess you would just add a file_out() command within the appropriate target then?

I'm just having a hard time with the write_all_datasets_quickly function. I think I'm missing a basic idea regarding the output of dynamic targets. In the example before (and also below) I gave the output of write function is just the name of the target. If I loadd(lines) and loadd(file_names) and then run function(lines, file_names, "temp_dir"), it works fine, but within the make() call I don't get the desired output.

make_lines <- function(){
  output <- c("lines")
}

write_file <- function(content, file_names, dir) {
  for(i in 1:length(file_names)){
    dir_out <- file.path(dir, file_names[i])
    writeLines(content[[i]], dir_out) # something needs to change here. 
  }
}

plan <- drake_plan(
  file_names = c(1L, 2L, 3L, 4L),
  lines = target(
    make_lines(), dynamic = map(file_names)
  ),
  write_lines = write_file(lines, file_names, dir = file_out("all_figures"))
)

@wlandau
Copy link
Member Author

wlandau commented Jan 21, 2020

in your example I guess you would just add a file_out() command within the appropriate target then?

Yes, I agree.

I'm just having a hard time with the write_all_datasets_quickly function. I think I'm missing a basic idea regarding the output of dynamic targets. In the example before (and also below) I gave the output of write function is just the name of the target. If I loadd(lines) and loadd(file_names) and then run function(lines, file_names, "temp_dir"), it works fine, but within the make() call I don't get the desired output.

What version of drake are you using? Dynamic branching actually works differently between versions 7.8.0 and 7.9.0. In the former, it was difficult to reference entire dynamic targets (aggregates of sub-targets). The latter has #1105 and #1106, which are technically breaking changes, but for a good cause. See https://books.ropensci.org/drake/dynamic.html#a-note-about-versions, https://books.ropensci.org/drake/dynamic.html#dynamic-targets.

@jennysjaarda
Copy link

I just updated, here's the results:

library(drake)
packageVersion("drake")
#‘7.9.0.9000’
mkdir("all_figures")
make_lines <- function(){
  output <- c("lines")
}

write_file <- function(content, file_names, dir) {
  for(i in 1:length(file_names)){
    dir_out <- paste0(dir,"/", file_names[i],".txt")
    writeLines(content[[i]], dir_out)
  }
}

plan <- drake_plan(
  file_names = c(1L, 2L, 3L, 4L),
  lines = target(
    make_lines(), dynamic = map(file_names)
  ),
  write_lines = write_file(lines, file_names, dir = file_out("all_figures"))
)

make(plan)
readLines("all_figures/1.txt")
#"fd205d580aec7543"

@jennysjaarda
Copy link

I'll just add one more bit of code that could help:

write_file(readd(lines), readd(file_names), "all_figures")
readLines("all_figures/1.txt")
# "lines"
make(plan) #runs again and returns to previous file contents 

@wlandau
Copy link
Member Author

wlandau commented Jan 21, 2020

I think that is the right approach. I am getting different results than you, but I think it is the right way to go.

library(drake)

dir.create("all_figures")

make_lines <- function() {
  output <- c("lines")
}

write_file <- function(content, file_names, dir) {
  for(i in seq_len(length(file_names))) {
    file_out <- paste0(dir, "/", file_names[i], ".txt")
    writeLines(content[[i]], file_out)
  }
}

plan <- drake_plan(
  file_names = c(1L, 2L, 3L, 4L),
  lines = target(
    make_lines(), dynamic = map(file_names)
  ),
  write_lines = write_file(lines, file_names, dir = file_out("all_figures"))
)

make(plan)
#> target file_names
#> dynamic lines
#> subtarget lines_0b3474bd
#> subtarget lines_b2a5c9b8
#> subtarget lines_71f311ad
#> subtarget lines_98cf3c11
#> aggregate lines
#> target write_lines

readLines("all_figures/1.txt")
#> [1] "lines"

make(plan)
#> All targets are already up to date.

write_file(readd(lines), readd(file_names), "all_figures")
readLines("all_figures/1.txt")
#> [1] "lines"

make(plan)
#> All targets are already up to date.

Created on 2020-01-21 by the reprex package (v0.3.0)

@jennysjaarda
Copy link

Strange - but I think I found something that could help. I just tested your example on a linux server and I got the same output as you. Previously I was just testing this simple case on my Mac in Rstudio. I don't run anything on my Mac so it's I consider this issue solved for me, are you aware of performance differences between the two platforms? If it's useful, On my mac I am running R version 3.6.1 (2019-07-05), on linux I am running R version 3.5.3 (2019-03-11).

@wlandau
Copy link
Member Author

wlandau commented Jan 21, 2020

I have noticed the occasional odd performance difference. Oddly enough, proc.time() is slow on Macs. If you want to look further, proffer can help, e.g.

proffer::pprof({
  make_lines <- function() {...}
  plan <- drake_plan(...)
  make(plan)
})

@jennysjaarda
Copy link

Thanks for the tip. Coincidentally, I just happened on another case where functionality is different on Mac vs Linux:

create_list <- function(i, a,b,c){
  return(list(a=a,b=b,c=c))
}

plan <- drake_plan(
  run_list = tibble(
    i = 1:10,
  ),
  output = target(
    create_list(run_list$i, "a", "b", "c"), dynamic = map(run_list)
  )
)
make(plan)
length(readd(output))

length(readd(output)) gives 10 on Mac (list of lists) and 30 on linux (one long list)

@wlandau
Copy link
Member Author

wlandau commented Jan 21, 2020

Would you open new issues for follow-up questions like this? I feel like we went off topic a few times.

Do you have drake 7.8.0 on one system and 7.9.0 on another? I ran you example on both Mac and Linux, and they both look like this for me:

library(drake)
library(tibble)

create_list <- function(i, a,b,c){
  return(list(a=a,b=b,c=c))
}

plan <- drake_plan(
  run_list = tibble(
    i = 1:10,
  ),
  output = target(
    create_list(run_list$i, "a", "b", "c"), dynamic = map(run_list)
  )
)

make(plan)
#> In drake, consider r_make() instead of make(). r_make() runs make() in a fresh R session for enhanced robustness and reproducibility.
#> target run_list
#> dynamic output
#> subtarget output_982d7b12
#> subtarget output_2c5f0700
#> subtarget output_623f5045
#> subtarget output_4dd68315
#> subtarget output_1ebcff01
#> subtarget output_346c3e69
#> subtarget output_f1d97afc
#> subtarget output_d480acf2
#> subtarget output_94395b4c
#> subtarget output_da98694d
#> aggregate output

str(readd(output))
#> List of 30
#>  $ a: chr "a"
#>  $ b: chr "b"
#>  $ c: chr "c"
#>  $ a: chr "a"
#>  $ b: chr "b"
#>  $ c: chr "c"
#>  $ a: chr "a"
#>  $ b: chr "b"
#>  $ c: chr "c"
#>  $ a: chr "a"
#>  $ b: chr "b"
#>  $ c: chr "c"
#>  $ a: chr "a"
#>  $ b: chr "b"
#>  $ c: chr "c"
#>  $ a: chr "a"
#>  $ b: chr "b"
#>  $ c: chr "c"
#>  $ a: chr "a"
#>  $ b: chr "b"
#>  $ c: chr "c"
#>  $ a: chr "a"
#>  $ b: chr "b"
#>  $ c: chr "c"
#>  $ a: chr "a"
#>  $ b: chr "b"
#>  $ c: chr "c"
#>  $ a: chr "a"
#>  $ b: chr "b"
#>  $ c: chr "c"

Created on 2020-01-21 by the reprex package (v0.3.0)

Because dynamic targets are vctrs, lists get concatenated together like this by design. If you prefer a list with one sub-target per list element, try readd(output, subtarget_list = TRUE) in the latest dev version, re #1139.

library(drake)
library(tibble)

create_list <- function(i, a,b,c){
  return(list(a=a,b=b,c=c))
}

plan <- drake_plan(
  run_list = tibble(
    i = 1:10,
  ),
  output = target(
    create_list(run_list$i, "a", "b", "c"), dynamic = map(run_list)
  )
)

make(plan)
#> target run_list
#> dynamic output
#> subtarget output_982d7b12
#> subtarget output_2c5f0700
#> subtarget output_623f5045
#> subtarget output_4dd68315
#> subtarget output_1ebcff01
#> subtarget output_346c3e69
#> subtarget output_f1d97afc
#> subtarget output_d480acf2
#> subtarget output_94395b4c
#> subtarget output_da98694d
#> aggregate output

str(readd(output, subtarget_list = TRUE))
#> List of 10
#>  $ output_982d7b12:List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ output_2c5f0700:List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ output_623f5045:List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ output_4dd68315:List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ output_1ebcff01:List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ output_346c3e69:List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ output_f1d97afc:List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ output_d480acf2:List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ output_94395b4c:List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ output_da98694d:List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"

Created on 2020-01-21 by the reprex package (v0.3.0)

Either that or wrap each sub-target's value in another list.

library(drake)
library(tibble)

create_list <- function(i, a,b,c){
  return(list(list(a=a,b=b,c=c)))
}

plan <- drake_plan(
  run_list = tibble(
    i = 1:10,
  ),
  output = target(
    create_list(run_list$i, "a", "b", "c"), dynamic = map(run_list)
  )
)

make(plan)
#> target run_list
#> dynamic output
#> subtarget output_982d7b12
#> subtarget output_2c5f0700
#> subtarget output_623f5045
#> subtarget output_4dd68315
#> subtarget output_1ebcff01
#> subtarget output_346c3e69
#> subtarget output_f1d97afc
#> subtarget output_d480acf2
#> subtarget output_94395b4c
#> subtarget output_da98694d
#> aggregate output

str(readd(output))
#> List of 10
#>  $ :List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ :List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ :List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ :List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ :List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ :List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ :List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ :List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ :List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"
#>  $ :List of 3
#>   ..$ a: chr "a"
#>   ..$ b: chr "b"
#>   ..$ c: chr "c"

Created on 2020-01-21 by the reprex package (v0.3.0)

@jennysjaarda
Copy link

Yes I will open a new issue - sorry!

@wlandau
Copy link
Member Author

wlandau commented Feb 22, 2020

Update: #1178 can combine dynamic branching with dynamic files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants