Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special format for files? #1127

Closed
2 tasks done
wlandau opened this issue Dec 31, 2019 · 27 comments
Closed
2 tasks done

Special format for files? #1127

wlandau opened this issue Dec 31, 2019 · 27 comments

Comments

@wlandau
Copy link
Member

wlandau commented Dec 31, 2019

Prework

Proposal

Related: #1126, https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets. cc @billdenney. The goal is to return pre-rendered images as targets. We want to avoid re-rendering plots unnecessarily, and we do not want to make the user invoke file_out() or mess with image files directly. Maybe target(format = "png") or target(format = "plotly") might accomplish this. Note: we would not be able to pass options like width and height etc. through the format itself, so this may not be feasible.

@billdenney
Copy link
Contributor

I like this in general, but I wonder if a more general solution like a new function called drake_tempfile(pattern, fileext) which has some form of implicit file_out() integration may be simpler.

In a case like #1126, all that would be needed is a temporary file which is tracked. It would also need to have a file_in() dependency tracked for the report.

Maybe even better would be to have tempfile_in() and tempfile_out() which behave like file_in() and file_out() but they do not accept paths just filenames and for tempfile_out() the filename must be unique within a single make(). The file itself would be stored within the cache directory (if that didn’t break things). A challenge is that I’d want to be able to write filenames as variables or code (e.g. sprintf() or paste()) to allow for looping to create sequences of files.

@wlandau wlandau changed the title Special formats for rendered images Special formats for rendered images? Dec 31, 2019
@wlandau
Copy link
Member Author

wlandau commented Dec 31, 2019

Are you picturing a multiple-plot-per-target situation?

I hesitate to require more file-oriented language in commands. As soon as we start having to think about files instead of in-memory objects, we start to lose the convenience and power of drake's R-oriented targets.

What would be the advantage of drake_tempfile(pattern, fileext) over just file_out()?

@billdenney
Copy link
Contributor

LaTeX allows for one plot per page in a .pdf and to index into that file by page.

The advantage is that the files I'm thinking of are just there for the convenience of speeding up drake and its pipeline. The files are not files that I otherwise want to generate or keep. I thought that file_out() didn't allow function calls or similar (like file_out(paste0("figure-", figure_num, ".pdf"))). Maybe I'm behind the times with dynamic targets, though.

@wlandau
Copy link
Member Author

wlandau commented Jan 1, 2020

LaTeX allows for one plot per page in a .pdf and to index into that file by page.

I do not believe I have ever used that feature of LaTeX. Would you mind sharing a reprex demo?

The advantage is that the files I'm thinking of are just there for the convenience of speeding up drake and its pipeline. The files are not files that I otherwise want to generate or keep.

To save computation time in the rendering, we do need to hold on to the plots we cache. (Otherwise, we would invalidate targets.) And if we keep the files, it is usually cleaner and more convenient to abstract them as true targets.

I thought that file_out() didn't allow function calls or similar (like file_out(paste0("figure-", figure_num, ".pdf")))

It requires tidy eval during the construction of the plan, e.g. file_out(!!paste0("figure-", figure_num, ".pdf")). There is no dynamic equivalent.

@wlandau-lilly
Copy link
Collaborator

I retract one of my previous statements (crossed out above). @billdenney, we might be able to use the existing drake_tempfile() to send the pdf to a temporary location and then let the cache move it to the permanent location. So if your target returns ".drake/drake/tmp/file609677bb2fb1" and you selected format = "pdf", then drake could move it to .drake/drake/return/7a3f077af6c8e425.

Even better: we could consider something like target(format = "file") for arbitrary files and directories.

save_plot <- function(data, path) {
  plot <- ggplot(data) + ...
  ggsave(path, plot)
  path
}

plan <- drake_plan(
  data = get_data(),
  plot_path = target(
    save_plot(data, path = drake_tempfile()),
    format = "file"
  )
)

For the plot_path target, drake should

  1. Build the target, which saves the file to a drake_tempfile().
  2. Move the file to the permanent location in the cache.
  3. Set the return value of the target to be the final path.

@wlandau-lilly
Copy link
Collaborator

At first glance, it seems like we're reinventing file_out(). But not really: it should be compatible with dynamic branching, and it does not require inserting literal pre-defined paths in file_out() statements.

@wlandau-lilly
Copy link
Collaborator

Problem: should we use full paths or relative paths to the final file location? I usually like full paths, but if the cache gets moved, it will be a problem. Perhaps it should just be the path relative to the cache path.

@billdenney
Copy link
Contributor

This is exactly like what I was thinking! Thanks!

I agree that it has a notable difference from file_out() as you describe. My first question is: could it be simplified by creating a new drake_cachedfile(file_ext) so that it could be called as:

plan <- drake_plan(
  data = get_data(),
  plot_path = save_plot(data, path = drake_cachedfile(file_ext=".pdf"))
)

And, drake_cachedfile(file_ext=".pdf") would return .drake/drake/return/7a3f077af6c8e425.pdf (where 7a3f077af6c8e425 is directly linked to the plot_path target), and when plot_path is invalidated, all files in .drake/drake/return/ matching the regexp ^7a3f077af6c8e425 would be removed.

As for the question of full path or relative, I think that the simplest would be to have it look just like .drake/drake/return/7a3f077af6c8e425.pdf. In my use cases, I tend to have the cache in the same directory (so it would work as-is). And, in the case that I don't have it in the same directory, as a relative path including the top cache directory should be obvious for the user for how to get to the correct location.

One other advantage of drake_cachedfile(file_ext=".pdf") (note that I included the period in the extension) is that multiple files could be created:

plan <- drake_plan(
  data = get_data(),
  plot_path =
    save_plot(
      data,
      path1 = drake_cachedfile(file_ext="-1.pdf"),
      path2 = drake_cachedfile(file_ext="-2.pdf")
    )
)

It is not immediately clear to me how the multiple file creation could be accomplished with drake_tempfile().

For the way that I'm using drake_cachedfile(), file_suffix may be clearer than file_ext as an argument name since the way I'm thinking of it, the argument is paste0()ed to the target hash value.

P.S. It's amusing to watch you noodle on the problem as I'm typing this. :)

@billdenney
Copy link
Contributor

As for the LaTeX multi-page .pdf figure used as a single-page figure, it has been a while since I've used it (almost all of my LaTeX these days is knitr/rmarkdown-generated or snippets that I use to augment reports).

https://tex.stackexchange.com/questions/7938/pdflatex-includegraphics-and-multi-page-pdf-files

@wlandau
Copy link
Member Author

wlandau commented Jan 3, 2020

I agree that it has a notable difference from file_out() as you describe. My first question is: could it be simplified by creating a new drake_cachedfile(file_ext)...

I am afraid not. The base name of the .drake/drake/return/... path is a hash, and that hash is computed from the contents of the file. So the file must exist in a temporary location, then it gets hashed, then it gets assigned a path that uses the hash.

And do we need file extensions? Files in the cache are typically not meant to be directly invoked by the user, but I guess this case is different. I am not sure it is necessary if literate programming is the only use case. But if we expect people to directly view files with, say, Adobe Reader, then extensions might matter.

Out of curiosity, are you aware that file_out() and file_in() can handle entire directories, e.g. file_out("folder_with_figures")?

@wlandau
Copy link
Member Author

wlandau commented Jan 3, 2020

Out of curiosity, are you aware that file_out() and file_in() can handle entire directories, e.g. file_out("folder_with_figures")?

Ah, I forgot that you prefer knitr's automatic formatting. Never mind.

@billdenney
Copy link
Contributor

I am aware that file_in() and file_out() can handle directories-- I use that feature often in a different context.

As you remind me of that, though, I could do something like:

plan <- drake_plan(
  data = get_data(),
  plot_path =
    save_plot(
      data,
      path = file_out("folder_with_figures")
    )
)

Then, on the knitr side, I could just have a file_in("folder_with_figures") that wouldn't actually be part of the code-- just an indicator for drake that I'm using the folder in the document-- as I'm loading the plots.

I've not tested it, but maybe that is the simple way to do things without the proposed complexity?

@billdenney
Copy link
Contributor

billdenney commented Jan 3, 2020

And for extensions, LaTeX can be fragile about requiring extensions. I don't know that it's required, but having the flexibility seems like it would prevent problems down the road.

For example, last I knew to load a .svg file you had to give the file name without the extension and the extension would be added (from section 2.4 of http://ctan.math.illinois.edu/graphics/svg/doc/svg.pdf, "where {svg filename} is the file name of the SVG file, where any given file extension will be replaced with .svg ruthlessly.").

Edit: I'm not planning to use .svg here; it is just one example of extension-fragile code.

@wlandau wlandau changed the title Special formats for rendered images? Special formats for files? Jan 4, 2020
@wlandau
Copy link
Member Author

wlandau commented Jan 4, 2020

Re #1127 (comment), file_in() inside R Markdown does appear to work. Nice idea!

library(drake)
library(rmarkdown)
library(webshot)

lines <- c(
  "---",
  "title: \"123\"",
  "output: html_document",
  "---",
  "",
  "```{r}",
  "readLines(file_in(\"xyz\"))",
  "```"
)

writeLines(lines, "123.Rmd")
  
plan <- drake_plan(
  x = writeLines("abc", file_out("xyz")),
  y = render(
    knitr_in("123.Rmd"),
    output_file = file_out("123.html"),
    quiet = TRUE
  )
)

config <- drake_config(plan)
vis_drake_graph(config)

deps_target(y, config) # includes file xyz
#> # A tibble: 3 x 2
#>   name     type    
#>   <chr>    <chr>   
#> 1 xyz      file_in 
#> 2 123.html file_out
#> 3 123.Rmd  knitr_in

make(plan)
#> target x
#> target y

webshot("123.html")

Created on 2020-01-04 by the reprex package (v0.3.0)

@wlandau
Copy link
Member Author

wlandau commented Jan 4, 2020

Also, for target(format = "file"), I think we should just let those files exist outside the cache and track them just like file_out(). It will take some work to combine file tracking with formats, but I think it will be worth it in the end. People struggle so much with the literal path requirement of file_out(), and we want some way to produce dynamic files.

@wlandau wlandau changed the title Special formats for files? Special format for files? Jan 4, 2020
@wlandau
Copy link
Member Author

wlandau commented Jan 5, 2020

Branch 1127 has a proof of concept of target(format = "path"), so we know it is possible. But the exercise of implementation made me realize that it's a bit of a hack. It's trying to use specialized formats without putting the data in the cache, and it's trying to do file tracking without file_out(). It overlaps with existing features and in a non-standard different way, which could cause confusion when it comes to maintaining the package and teaching users. One instance worth highlighting: it blurs the line between what is and is not possible in dynamic branching. Right now, triggers and file tracking are always apply to the static level, not to sub-targets. If we make these things fully dynamic, I think it is best not to go half way.

@wlandau
Copy link
Member Author

wlandau commented Jan 5, 2020

Maybe we can look at files differently. drake's approach in file_in() etc. is to check the state of a file and rebuild downstream targets accordingly, but file_in() and friends require literal fixed paths up front, which is not always possible. (Related: posts here and here). What if we declare and validate files independently of drake?

# Do some work...

# Save the results.
saveRDS(results, "results.rds")

# Declare the file.
file_set("results.rds")

# Do some more work that could possibly corrupt results.rds...

# Ensure results.rds remains unchanged.
file_val("results.rds")

In the thought experiment above, file_set() computes validation metadata and stores it in a hidden cache somewhere. That includes the file size, time stamp, and hash. file_val() checks the file and asserts that the current metadata agree with the cached metadata. (It checks the time stamp and size first (fast) then moves on to the hash only when necessary (slow)).

Here is what it might look like inside a drake workflow.

library(drake)
library(fileval)

write_data <- function(contents, file) {
  writeLines(contents, file)
  file_set(path) # Store metadata in a hidden cache.
  path
}

process_data <- function(file) {
  file_val(file) # Check the new metadata and error if it disagrees with the old metadata.
  do_processing(file)
}

plan <- drake_plan(
  contents = c("data", "in", "dynamic", "files"),
  file = paste0(seq_len(4), ".txt"),
  data = target(
    write_data(contents, file), # Create and declare the file.
    dynamic = map(contents, file)
  ),
  processing = target(
    process_data(data), # Validate and process the file.
    dynamic = map(data)
  )
)

# Build targets
make(plan)

# What if a file gets mangled?
unlink("1.txt")

# This make() will error out.
make(plan)

# Invalidate the dynamic sub-target that generated the file.
clean(data_*) # Just a sketch, "*" stands for a hash.

# Now, make() regenerates the file so the processing_* sub-targets can use it.
make(plan)

We could add functionality for directories and URLs too, and we could put it in a separate package.

@billdenney
Copy link
Contributor

First and most minor point: I don't think that the file date should be used for hashing. I wouldn't trust same size and date as indicating that the file is identical. jdupes (https://github.com/jbruchon/jdupes) has a good description of everything that could be checked to find identical files, and selecting the relevant points there should be the basis of that checking. That said, I'd think that functionality would better fit in a separate package as you indicated (that drake would use); digest may be a good fit, but I'm not sure that it's a great fit there (I'm sure that Dirk will give good, quick guidance for his opinion either way).

I like where this is going, and I may be saying the same thing here, but I think that I'm saying something subtly different. I'm going to use different names than file_set() and file_val() so that the subsequent discussion is easier, but I don't have an opinion on the function names.

What if there were file_cache() and file_check_cache() functions. file_cache() would take its argument and generate a hash. The argument to file_cache() would go into a new drake "file" namespace as a key with its value as the hash (and all other information required to rapidly determine if the file were identical or not, like size).

file_check_cache() would then add the dependency on that file and enable the hash/sameness check for the requirement to rebuild the cache.

In the drake file namespace, the key would be the filename (with the path however it was provided) and the value would be the hashing info.

The only added implementation detail I'd request is that whatever is stored in the hash for the filename is short for those of us who are on Windows to prevent long filename issues. That would likely require some degree of indirection like hash the filename to something short (e.g. ~16-32 characters) and then within that store the filename and its hashing info. I'd hope that the probability of hash collision would be sufficiently low that you wouldn't have to worry about multiple files with the same hash.

@wlandau
Copy link
Member Author

wlandau commented Jan 5, 2020

First and most minor point: I don't think that the file date should be used for hashing. I wouldn't trust same size and date as indicating that the file is identical. jdupes (jbruchon/jdupes) has a good description of everything that could be checked to find identical files, and selecting the relevant points there should be the basis of that checking.

I was thinking we would ultimately fall back to digest but use quick surrogates to check if the hash is even worth recomputing. drake has its own logic to do this for external files, and my first thought was to replicate it.

drake/R/drake_meta_.R

Lines 247 to 258 in e2ad59d

should_rehash_storage <- function(
size_threshold,
new_mtime,
old_mtime,
new_size,
old_size
) {
small <- (new_size < size_threshold) %|||NA% TRUE
touched <- (new_mtime > old_mtime) %|||NA% TRUE
resized <- (abs(new_size - old_size) > rehash_storage_size_tol) %|||NA% TRUE
small || touched || resized
}

I will have a look at https://github.com/jbruchon/jdupes. Where exactly does it list all its file comparison heuristics?

What if there were file_cache() and file_check_cache() functions. file_cache() would take its argument and generate a hash. The argument to file_cache() would go into a new drake "file" namespace as a key with its value as the hash (and all other information required to rapidly determine if the file were identical or not, like size)...

I would prefer to avoid such direct integration with drake, especially drake's internals. I prefer non-drake file validation because it lets non-drake users use it, it is much simpler to develop and maintain, and it does not overlap as much with existing functionality.

The only added implementation detail I'd request is that whatever is stored in the hash for the filename is short for those of us who are on Windows to prevent long filename issues.

Yeah, that's something we have a chance to address up front without breaking things. I will think about it.

@billdenney
Copy link
Contributor

billdenney commented Jan 6, 2020

I have another idea that may sidestep all of this. What if we assume that we must read the file in regardless and that is an acceptable cost. Then something like the following would work, I think:

add_hash_to_file <- function(filename) {
  structure(
    filename,
    hash=digest::digest(filename, file=TRUE)
  )
}

my_plan <-
  drake_plan(
    big_filename=
      target(
        command=add_hash_to_file("c:/git/prj/Marketing/Capabilities Presentation/Carta Marina/iceland_map_a002s.tif"),
        trigger=trigger(condition=TRUE)
      )
  )

make(my_plan)

The file is 163 MB, and it takes a median of ~1.2 seconds to load and hash (on my laptop with an NVMe hard drive). Then, wherever the file is needed, use that target name. It has the benefit that it sidesteps all the complexity of file_in(), file_out(), and the other proposals here. It is simply another target, and it is usable just like the original filename string. A similar function could easily be written to hash all files in a directory (optionally recursively).

Another benefit is that it sidesteps issues with static or dynamic branching since the filename could be anything that evaluates to a filename (i.e. the user could generate the filename however they want including with paste() or sprintf() or within any function).

Is there a way that triggers could be added as an attribute to a function or some other way that the plan could be simplified to the following?

my_plan <-
  drake_plan(
    big_filename=add_hash_to_file("c:/git/prj/Marketing/Capabilities Presentation/Carta Marina/iceland_map_a002s.tif")
  )

I think that something like this would cover effectively all of my needs.

Edit: I realized that readBin() by default only reads the first byte. When I revised to the full 163 MB file it took ~1.2 seconds. That's still OK for my use cases.

Edit 2: Updated the add_hash_to_file() function so that it uses the simplification and speed-up of file=TRUE.

@wlandau
Copy link
Member Author

wlandau commented Jan 6, 2020

I agree that we should check files even when targets are otherwise up to date. If file_val() from #1127 (comment) reported a change instead of erroring out, I think that would cover it. The reason I bring it up again is that we again find ourselves in a situation where surrogates like modification time could help us avoid re-hashing files all the time. (NB digest(file = TRUE) is a nice alternative to digest(readBin(...))).

@wlandau
Copy link
Member Author

wlandau commented Jan 6, 2020

But the more I see what those interfaces are like, the more I come back to target(format = "file") because it creates less work and fewer targets for the user. I need to reflect on the options some more.

@wlandau wlandau reopened this Jan 6, 2020
@billdenney
Copy link
Contributor

I could imagine target(format="file") could essentially be a shortcut to making two targets like I did above. Perhaps that would occur behind the scenes for the user, but otherwise, this method doesn't seem too complicated to me.

@billdenney
Copy link
Contributor

And, thanks for the pointer to digest(file), that sped it up by 4-fold to ~300 msec!

@wlandau
Copy link
Member Author

wlandau commented Jan 10, 2020

After more reflection, I am moving away from target(format = "file"). Reasons:

  1. It is an awkward combination of a trigger and a format. It is technically possible to implement, but it is neither a pure trigger nor a pure format, so it seems like kind of a hack. The ambiguous intent and overlap in functionality would add confusion on many levels.
  2. Because of the extra checking involved, it would really slow down dynamic branching.
  3. drake is fastest and easiest when we save artifacts as objects instead of files. drake's primary intent is to think about objects and functions, not scripts or files, and that is not going to change. Although this does create friction for new users, e.g. reports like this one, I think the better long term solution is to educate more users and steer them towards function-oriented programming.
  4. We are in no rush. I think we should take the time to explore more file validation systems. Special format for files? #1127 (comment) is not ideal, but maybe we can think of better ones. Special format for files? #1127 (comment) pretty much works for the static case, but since triggers are a static-branching-only feature, we need something different for dynamic branching. Because of (3), let's see if we can find or develop external solutions before we try to bake something into drake.

@wlandau wlandau closed this as completed Jan 10, 2020
@billdenney
Copy link
Contributor

Thanks for the thoughtful consideration.

@wlandau wlandau mentioned this issue Jan 10, 2020
2 tasks
@wlandau
Copy link
Member Author

wlandau commented Jan 10, 2020

Sure, thanks for chiming in so often.

Here is a different way we could approach the problem: #1131

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants