-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Special format for files? #1127
Comments
I like this in general, but I wonder if a more general solution like a new function called In a case like #1126, all that would be needed is a temporary file which is tracked. It would also need to have a Maybe even better would be to have |
Are you picturing a multiple-plot-per-target situation? I hesitate to require more file-oriented language in commands. As soon as we start having to think about files instead of in-memory objects, we start to lose the convenience and power of What would be the advantage of |
LaTeX allows for one plot per page in a .pdf and to index into that file by page. The advantage is that the files I'm thinking of are just there for the convenience of speeding up |
I do not believe I have ever used that feature of LaTeX. Would you mind sharing a reprex demo?
It requires tidy eval during the construction of the plan, e.g. |
I retract one of my previous statements (crossed out above). @billdenney, we might be able to use the existing Even better: we could consider something like save_plot <- function(data, path) {
plot <- ggplot(data) + ...
ggsave(path, plot)
path
}
plan <- drake_plan(
data = get_data(),
plot_path = target(
save_plot(data, path = drake_tempfile()),
format = "file"
)
) For the
|
At first glance, it seems like we're reinventing |
Problem: should we use full paths or relative paths to the final file location? I usually like full paths, but if the cache gets moved, it will be a problem. Perhaps it should just be the path relative to the cache path. |
This is exactly like what I was thinking! Thanks! I agree that it has a notable difference from plan <- drake_plan(
data = get_data(),
plot_path = save_plot(data, path = drake_cachedfile(file_ext=".pdf"))
) And, As for the question of full path or relative, I think that the simplest would be to have it look just like One other advantage of plan <- drake_plan(
data = get_data(),
plot_path =
save_plot(
data,
path1 = drake_cachedfile(file_ext="-1.pdf"),
path2 = drake_cachedfile(file_ext="-2.pdf")
)
) It is not immediately clear to me how the multiple file creation could be accomplished with For the way that I'm using P.S. It's amusing to watch you noodle on the problem as I'm typing this. :) |
As for the LaTeX multi-page .pdf figure used as a single-page figure, it has been a while since I've used it (almost all of my LaTeX these days is knitr/rmarkdown-generated or snippets that I use to augment reports). https://tex.stackexchange.com/questions/7938/pdflatex-includegraphics-and-multi-page-pdf-files |
I am afraid not. The base name of the And do we need file extensions? Files in the cache are typically not meant to be directly invoked by the user, but I guess this case is different. I am not sure it is necessary if literate programming is the only use case. But if we expect people to directly view files with, say, Adobe Reader, then extensions might matter.
|
Ah, I forgot that you prefer |
I am aware that As you remind me of that, though, I could do something like: plan <- drake_plan(
data = get_data(),
plot_path =
save_plot(
data,
path = file_out("folder_with_figures")
)
) Then, on the I've not tested it, but maybe that is the simple way to do things without the proposed complexity? |
And for extensions, LaTeX can be fragile about requiring extensions. I don't know that it's required, but having the flexibility seems like it would prevent problems down the road. For example, last I knew to load a .svg file you had to give the file name without the extension and the extension would be added (from section 2.4 of http://ctan.math.illinois.edu/graphics/svg/doc/svg.pdf, "where {svg filename} is the file name of the SVG file, where any given file extension will be replaced with .svg ruthlessly."). Edit: I'm not planning to use .svg here; it is just one example of extension-fragile code. |
Re #1127 (comment), library(drake)
library(rmarkdown)
library(webshot)
lines <- c(
"---",
"title: \"123\"",
"output: html_document",
"---",
"",
"```{r}",
"readLines(file_in(\"xyz\"))",
"```"
)
writeLines(lines, "123.Rmd")
plan <- drake_plan(
x = writeLines("abc", file_out("xyz")),
y = render(
knitr_in("123.Rmd"),
output_file = file_out("123.html"),
quiet = TRUE
)
)
config <- drake_config(plan)
vis_drake_graph(config) deps_target(y, config) # includes file xyz
#> # A tibble: 3 x 2
#> name type
#> <chr> <chr>
#> 1 xyz file_in
#> 2 123.html file_out
#> 3 123.Rmd knitr_in
make(plan)
#> target x
#> target y
webshot("123.html") Created on 2020-01-04 by the reprex package (v0.3.0) |
Also, for |
Branch 1127 has a proof of concept of |
Maybe we can look at files differently. # Do some work...
# Save the results.
saveRDS(results, "results.rds")
# Declare the file.
file_set("results.rds")
# Do some more work that could possibly corrupt results.rds...
# Ensure results.rds remains unchanged.
file_val("results.rds") In the thought experiment above, Here is what it might look like inside a library(drake)
library(fileval)
write_data <- function(contents, file) {
writeLines(contents, file)
file_set(path) # Store metadata in a hidden cache.
path
}
process_data <- function(file) {
file_val(file) # Check the new metadata and error if it disagrees with the old metadata.
do_processing(file)
}
plan <- drake_plan(
contents = c("data", "in", "dynamic", "files"),
file = paste0(seq_len(4), ".txt"),
data = target(
write_data(contents, file), # Create and declare the file.
dynamic = map(contents, file)
),
processing = target(
process_data(data), # Validate and process the file.
dynamic = map(data)
)
)
# Build targets
make(plan)
# What if a file gets mangled?
unlink("1.txt")
# This make() will error out.
make(plan)
# Invalidate the dynamic sub-target that generated the file.
clean(data_*) # Just a sketch, "*" stands for a hash.
# Now, make() regenerates the file so the processing_* sub-targets can use it.
make(plan) We could add functionality for directories and URLs too, and we could put it in a separate package. |
First and most minor point: I don't think that the file date should be used for hashing. I wouldn't trust same size and date as indicating that the file is identical. jdupes (https://github.com/jbruchon/jdupes) has a good description of everything that could be checked to find identical files, and selecting the relevant points there should be the basis of that checking. That said, I'd think that functionality would better fit in a separate package as you indicated (that I like where this is going, and I may be saying the same thing here, but I think that I'm saying something subtly different. I'm going to use different names than What if there were
In the The only added implementation detail I'd request is that whatever is stored in the hash for the filename is short for those of us who are on Windows to prevent long filename issues. That would likely require some degree of indirection like hash the filename to something short (e.g. ~16-32 characters) and then within that store the filename and its hashing info. I'd hope that the probability of hash collision would be sufficiently low that you wouldn't have to worry about multiple files with the same hash. |
I was thinking we would ultimately fall back to Lines 247 to 258 in e2ad59d
I will have a look at https://github.com/jbruchon/jdupes. Where exactly does it list all its file comparison heuristics?
I would prefer to avoid such direct integration with
Yeah, that's something we have a chance to address up front without breaking things. I will think about it. |
I have another idea that may sidestep all of this. What if we assume that we must read the file in regardless and that is an acceptable cost. Then something like the following would work, I think: add_hash_to_file <- function(filename) {
structure(
filename,
hash=digest::digest(filename, file=TRUE)
)
}
my_plan <-
drake_plan(
big_filename=
target(
command=add_hash_to_file("c:/git/prj/Marketing/Capabilities Presentation/Carta Marina/iceland_map_a002s.tif"),
trigger=trigger(condition=TRUE)
)
)
make(my_plan) The file is 163 MB, and it takes a median of ~1.2 seconds to load and hash (on my laptop with an NVMe hard drive). Then, wherever the file is needed, use that target name. It has the benefit that it sidesteps all the complexity of Another benefit is that it sidesteps issues with static or dynamic branching since the filename could be anything that evaluates to a filename (i.e. the user could generate the filename however they want including with Is there a way that triggers could be added as an attribute to a function or some other way that the plan could be simplified to the following? my_plan <-
drake_plan(
big_filename=add_hash_to_file("c:/git/prj/Marketing/Capabilities Presentation/Carta Marina/iceland_map_a002s.tif")
) I think that something like this would cover effectively all of my needs. Edit: I realized that Edit 2: Updated the |
I agree that we should check files even when targets are otherwise up to date. If |
But the more I see what those interfaces are like, the more I come back to |
I could imagine |
And, thanks for the pointer to |
After more reflection, I am moving away from
|
Thanks for the thoughtful consideration. |
Sure, thanks for chiming in so often. Here is a different way we could approach the problem: #1131 |
Prework
drake
's code of conduct.Proposal
Related: #1126, https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets. cc @billdenney. The goal is to return pre-rendered images as targets. We want to avoid re-rendering plots unnecessarily, and we do not want to make the user invoke
file_out()
or mess with image files directly. Maybetarget(format = "png")
ortarget(format = "plotly")
might accomplish this. Note: we would not be able to pass options like width and height etc. through the format itself, so this may not be feasible.The text was updated successfully, but these errors were encountered: