Fix #13, fix #18
wlandau-lilly committed Jul 16, 2018
1 parent 2e26ef6 commit 1e20845
Showing 8 changed files with 101 additions and 155 deletions.
6 changes: 3 additions & 3 deletions 02-example-main.Rmd
@@ -73,7 +73,7 @@ plan <- drake_plan(
    select(-X__1),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  rmarkdown::render(
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
@@ -97,7 +97,7 @@ So far, we have just been setting the stage. Use `make()` to do the real work. T
make(plan)
```

Except for files like `report.html`, your output is stored in a hidden `.drake/` folder. Reading it back is easy.
Except for output files like `report.html`, your output is stored in a hidden `.drake/` folder. Reading it back is easy.

```{r readddata1}
readd(data) # See also loadd().
@@ -133,7 +133,7 @@ create_plot <- function(data) {
vis_drake_graph(config)
```

The next `make()` just builds `hist` and `report.html`. No point in wasting time on the data or model.
The next `make()` just builds `hist` and `report`. No point in wasting time on the data or model.

```{r justhistetc}
make(plan)
92 changes: 81 additions & 11 deletions 03-plans.Rmd
@@ -3,12 +3,23 @@
```{r loaddrake14, echo = FALSE}
suppressPackageStartupMessages(library(drake))
pkgconfig::set_config("drake::strings_in_dots" = "literals")
tmp <- file.create("report.Rmd")
dat <- system.file(
file.path("examples", "main", "raw_data.xlsx"),
package = "drake",
mustWork = TRUE
)
tmp <- file.copy(from = dat, to = "raw_data.xlsx")
rmd <- system.file(
file.path("examples", "main", "report.Rmd"),
package = "drake",
mustWork = TRUE
)
tmp <- file.copy(from = rmd, to = "report.Rmd")
```

## What is a workflow plan data frame?

Your workflow plan data frame is the object where you declare all the objects and files you are going to produce when you run your project. It enumerates each output item, or *target*, and the R *command* that will produce it. Here is the workflow plan from our [previous example](#hpc).
Your workflow plan data frame is the object where you declare all the objects and files you are going to produce when you run your project. It enumerates each output R object, or *target*, and the *command* that will produce it. Here is the workflow plan from our [previous example](#hpc).

```{r firstexampleplan}
plan <- drake_plan(
@@ -18,7 +29,7 @@ plan <- drake_plan(
    select(-X__1),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  rmarkdown::render(
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
@@ -27,7 +38,7 @@
plan
```

When you run `make(plan)`, `drake` will produce targets `raw_data`, `data`, `hist`, `fit`, and `report.Rmd`.
When you run `make(plan)`, `drake` will produce targets `raw_data`, `data`, `hist`, `fit`, and `report`.
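
A minimal sketch of that run, assuming the plan above is defined in your session (the target names come straight from the plan):

```{r sketchmakeplan, eval = FALSE}
make(plan)  # Builds raw_data, data, hist, fit, and report in dependency order.
readd(fit)  # Retrieve a single target from the cache.
loadd(hist) # Or load it into your environment.
```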

## Rationale

@@ -39,14 +50,18 @@ As we saw in our [previous example](#hpc), repeated `make()`s skip work that is

This approach of declaring targets in advance has stood the test of time. The idea dates at least as far back as [GNU Make](https://www.gnu.org/software/make/), which uses `Makefile`s to declare targets and dependencies. `drake`'s predecessor [`remake`](https://github.com/richfitz/remake) uses [`YAML`](http://yaml.org/) files in a similar way.

### Data frames scale well.

`Makefile`s are successful for [Make](https://www.gnu.org/software/make/) because they accommodate software written in multiple languages. However, such external configuration files are not the best solution for R. Maintaining a `Makefile` or a [`remake`](https://github.com/richfitz/remake) [`YAML`](http://yaml.org/) file requires a lot of manual typing. But with `drake` plans, you can use the usual data frame manipulation tools to expand, generate, and piece together large projects. The [gsp example](#example-gsp) shows how to use `expand.grid()` and `rbind()` to automatically create plans with hundreds of targets. In addition, `drake` has a wildcard templating mechanism to generate large plans.
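
As a hedged sketch of that scaling (the `fit_model()` function and the `TUNING__` wildcard below are hypothetical placeholders, not part of the example project):

```{r sketchplanscaling, eval = FALSE}
# Write one template row, then expand it over a wildcard.
template <- drake_plan(
  analysis = fit_model(data, tuning = TUNING__)
)
evaluate_plan(template, wildcard = "TUNING__", values = c(0.1, 0.5, 1))
# Larger plans can be pieced together with rbind() or dplyr::bind_rows().
```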

### You do not need to worry about which targets run first.

When you call `make()` on the plan above, `drake` takes care of `"raw_data.xlsx"`, then `raw_data`, and then `data` in sequence. Once `data` completes, `fit` and `hist` can start in any order, and then `"report.md"` begins once everything else is done. Because `drake` analyzes your commands for dependencies, it always builds your targets in this correct order. That means you can rearrange the rows of the workflow plan in any way you want, which is not the case with lines in an R script or code chunks in a `knitr` report.
When you call `make()` on the plan above, `drake` takes care of `"raw_data.xlsx"`, then `raw_data`, and then `data` in sequence. Once `data` completes, `fit` and `hist` can start in any order, and then `report` begins once everything else is done. The execution does not depend on the order of the rows in your plan. In other words, the following plan is equivalent.

```{r firstexampleplan2}
drake_plan(
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  rmarkdown::render(
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
@@ -59,11 +74,65 @@ drake_plan(
)
```

## Automatic dependency detection

### Data frames scale well.
Why can you safely scramble the rows of a `drake` plan? Why is row order irrelevant to execution order? Because `drake` analyzes commands for dependencies, and `make()` processes those dependencies before moving on to downstream targets. To detect dependencies, `drake` walks through the [abstract syntax tree](http://adv-r.had.co.nz/Expressions.html#ast-funs) of every piece of code to find the objects and files relevant to the workflow pipeline.

`Makefile`s are successful for [Make](https://www.gnu.org/software/make/) because they accommodate software written in multiple languages. However, such external configuration files are not the best solution for R. Maintaining a `Makefile` or a [`remake`](https://github.com/richfitz/remake) [`YAML`](http://yaml.org/) file requires a lot of manual typing. But with `drake` plans, you can use the usual data frame manipulation tools to expand, generate, and piece together large projects. The [gsp example](#example-gsp) shows how to use `expand.grid()` and `rbind()` to automatically create plans with hundreds of targets. In addition, `drake` has a wildcard templating mechanism to generate large plans.
```{r depscode_plans}
create_plot <- function(data) {
  ggplot(data, aes_string(x = "Petal.Width", fill = "Species")) +
    geom_histogram()
}
deps_code(create_plot)
deps_code(
  quote({
    some_function_i_wrote(data)
    rmarkdown::render(
      knitr_in("report.Rmd"),
      output_file = file_out("report.html"),
      quiet = TRUE
    )
  })
)
```

`drake` detects dependencies without actually running the command.

```{r depscode_plans2}
file.exists("report.html")
```

R objects and functions are detected implicitly, and you can specify multiple file inputs and outputs per command with the `file_in()`, `knitr_in()`, and `file_out()` functions. For R Markdown reports declared with `knitr_in()`, such as `report.Rmd` ([online here](https://github.com/ropensci/drake/blob/master/inst/examples/main/report.Rmd)), `drake` scans active code chunks for targets mentioned with `loadd()` and `readd()`. So when `fit` or `hist` changes, `drake` rebuilds the `report` target to produce the file `report.html`.

Output files declared with `file_out()` do not appear in `vis_drake_graph()`, but targets can depend on one another through `file_in()`/`file_out()` connections.

```{r fileinfileout_plans}
saveRDS(1, "start.rds")
write_files <- function(){
  x <- readRDS(file_in("start.rds"))
  for (file in letters[1:3]){
    saveRDS(x, file)
  }
}
small_plan <- drake_plan(
  x = {
    write_files()
    file_out("a", "b", "c")
  },
  y = readRDS(file_in("a"))
)
config <- drake_config(small_plan)
vis_drake_graph(config)
```

So when target `x` changes any of the output files `"a"`, `"b"`, or `"c"`, `drake` knows to rebuild target `y`.

And remember, `drake` also takes into account imported functions and imported files. If there are nontrivial changes to `start.rds`, `letters`, `readRDS()`, or `saveRDS()`, `drake` will rebuild targets `x` and/or `y` as appropriate.
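
For example, here is a hedged sketch of how a change to `start.rds` would propagate, assuming the plan above has already been built once:

```{r sketchoutdated, eval = FALSE}
make(small_plan)                   # Afterwards, everything is up to date.
saveRDS(2, "start.rds")            # A nontrivial change to an imported file.
config <- drake_config(small_plan)
outdated(config)                   # x and y should now show up as outdated.
```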

## Generating large workflow plans

@@ -346,7 +415,8 @@ Besides the usual columns `target` and `command`, there are other columns you ca
- `worker`: for [parallel computing](#hpc), optionally name the preferred worker to assign to each target (see the sketch below).
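
A hedged sketch of such a column (`simulate()` is a placeholder function, and whether the column is honored depends on your parallel backend):

```{r sketchworkercolumn, eval = FALSE}
plan <- drake_plan(
  small = simulate(5),
  large = simulate(50)
)
plan$worker <- c(1, 2)  # Optional column: preferred worker for each target.
plan
```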


```{r enddrake14, echo = FALSE}
drake::clean(destroy = TRUE)
unlink("report.Rmd")
```{r endofline_plans, echo = FALSE}
clean(destroy = TRUE, verbose = FALSE)
unlink(
c("start.rds", "report.Rmd", "raw_data.xlsx", "STDIN.o*", "Thumbs.db"))
```
6 changes: 3 additions & 3 deletions 04-example-packages.Rmd
@@ -7,6 +7,7 @@ suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(ggplot2)))
suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(magrittr)))
pkgconfig::set_config("drake::strings_in_dots" = "literals")
clean(destroy = TRUE, verbose = FALSE)
unlink(c("Makefile", "report.Rmd", "shell.sh", "STDIN.o*", "Thumbs.db"))
knitr::opts_chunk$set(
@@ -82,8 +83,7 @@ data_plan <- drake_plan(
when = "last-month"
),
trigger = "always"
),
strings_in_dots = "literals"
)
)
data_plan
@@ -138,7 +138,7 @@ in a dynamic knitr report.

```{r reportplanpackages}
report_plan <- drake_plan(
  knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
  report = knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
)
report_plan
9 changes: 6 additions & 3 deletions 05-example-gsp.Rmd
@@ -5,6 +5,7 @@ suppressPackageStartupMessages(library(drake))
suppressPackageStartupMessages(library(Ecdat))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(knitr))
pkgconfig::set_config("drake::strings_in_dots" = "literals")
unlink(".drake", recursive = TRUE)
clean(destroy = TRUE, verbose = FALSE)
unlink(c("Makefile", "report.Rmd", "shell.sh", "STDIN.o*", "Thumbs.db"))
@@ -127,11 +128,13 @@ At the end, let's generate a pdf plot of the RMSPE scores and a [knitr](https://

```{r masterknitrreport}
output_plan <- drake_plan(
  ggsave(
  plot = ggsave(
    filename = file_out("rmspe.pdf"),
    plot = plot_rmspe(rmspe)
    plot = plot_rmspe(rmspe),
    width = 7,
    height = 7
  ),
  knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
  report = knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
)
head(output_plan)
121 changes: 0 additions & 121 deletions 07-best-practices.Rmd
@@ -124,127 +124,6 @@ Dangers:

In addition, this `source()`-based approach is simply inconvenient. `Drake` rebuilds `my_data` every time `get_data.R` changes, even when those changes are just extra comments or blank lines. On the other hand, in the previous plan that uses `my_data = get_data()`, `drake` does not trigger rebuilds when comments or whitespace in `get_data()` are modified. `Drake` is R-focused, not file-focused. If you embrace this viewpoint, your work will be easier.
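
A hedged sketch of the contrast (`get_data()` and `get_data.R` stand in for your own function and script):

```{r sketchsourcevsfunction, eval = FALSE}
# File-focused: any edit to get_data.R, even a comment, invalidates my_data.
plan_file_focused <- drake_plan(
  my_data = source(file_in("get_data.R"))
)
# R-focused: only meaningful changes to get_data() invalidate my_data.
plan_r_focused <- drake_plan(
  my_data = get_data()
)
```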

### File output targets

In your plan, the `file_out()` function tells `drake` that your target is an external file rather than an ordinary R object.

```{r fileplan}
plan <- drake_plan(
  writeLines(text = letters[1:6], con = file_out("file.txt"))
)
plan
```

Now, `make()` knows to expect a file called `file.txt`.

```{r fileplan2}
make(plan)
```

And if you manually mangle `file.txt` by accident, `make()` restores it to its reproducible state.

```{r fileplan3}
writeLines(text = "123", con = file_out("file.txt"))
make(plan)
make(plan)
```

But just because your command produces files does not mean you need to track them.

```{r fileplan5, eval = FALSE}
plan <- drake_plan(real_output = long_job())
make(plan)
list.files()
### [1] "date-time.log" "error.log" "console.log"
```

These log files probably have nothing to do with the objectives of your research. If that is the case, you can safely ignore them with no loss of reproducibility.

Generally speaking, `drake` was designed to be as R-focused as possible, which means you should treat targets as R objects most of the time. External files are really an afterthought. This might be an uncomfortable notion. You may be accustomed to generating lots of files.

```{r fileplan6}
drake_plan(
  write.csv(tabulate_results(data), file_out("results.csv")),
  ggsave(my_ggplot(data), file = file_out("plot.pdf"))
)
```

But R object targets are much more convenient in the long run. If you really want to display them, consolidate them all in an R Markdown report at the end of the pipeline to reduce the number of output files.

```{r rmdready, echo = FALSE}
invisible(file.create("report.Rmd"))
```

```{r fileplan7}
drake_plan(
  tab_results = tabulate_results(data),
  data_plot = my_ggplot(data),
  rmarkdown::render(
    knitr_in("report.Rmd"), # References tab_results and data_plot in active code chunks using loadd() or readd().
    output_file = file_out("report.html")
  )
)
```

But sometimes, you may unavoidably have multiple important files for each target. For example, maybe you work with spatial data and use the [`sf` package](https://github.com/r-spatial/sf).

```{r sf, eval = FALSE}
st_write(spatial_data, "spatial_data.shp", driver = "ESRI Shapefile")
### Creates:
### - "spatial_data.shp"
### - "spatial_data.shx"
### - "spatial_data.prj"
### - "spatial_data.dbf"
```

Later targets may depend on many of these files, but there can only be one output file per target. So what do we do? Spoof `drake`: pick one file to be the real target, and let the other files be targets that depend on it.

```{r sf2}
library(drake)
library(magrittr)
drake_plan(
  st_write(spatial_data, file_out("spatial_data.shp"), driver = "ESRI Shapefile"),
  c(file_out("spatial_data.EXTN"), file_in("spatial_data.shp")),
  out = process_shx(file_in("spatial_data.EXTN"))
) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
```

But be warned: if you manually mangle `spatial_data.shx`, `spatial_data.prj`, or `spatial_data.dbf` later on, `make()` will not restore them. Having lots of output files can also slow down the construction of workflow plan data frames (ref: [issue 366](https://github.com/ropensci/drake/issues/366)).

It may actually be safer to divide the workflow into two pipelines with separate caches and separate plans. That way, all the output files from the first pipeline, tracked or not tracked, become inputs to the second pipeline. An overarching R script can run both pipelines back to back.

```{r separate, eval = FALSE}
plan1 <- drake_plan(
  st_write(spatial_data, file_out("spatial_data.shp"), driver = "ESRI Shapefile")
)
plan2 <- drake_plan(out = process_shx(file_in("spatial_data.EXTN"))) %>%
  evaluate_plan(wildcard = "EXTN", values = c("shx", "prj", "dbf"))
cache1 <- new_cache(path = "cache1")
cache2 <- new_cache(path = "cache2")
make(plan1, cache = cache1)
make(plan2, cache = cache2)
```

See the [storage guide](#store) for more on caching, particularly functions `get_cache()` and `this_cache()`.
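
A hedged sketch of reading results back from one of those caches, assuming both pipelines above have already run:

```{r sketchcacheaccess, eval = FALSE}
cache2 <- this_cache(path = "cache2")  # Or get_cache() from inside the project.
readd(out, cache = cache2)
loadd(out, cache = cache2)
```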

### R Markdown and knitr reports

For a serious project, you should use `drake`'s `make()` function outside `knitr`. In other words, you should treat R Markdown reports and other `knitr` documents as targets and imports, not as a way to run `make()`. Viewed as targets, R Markdown reports and other [knitr](https://github.com/yihui/knitr) reports such as `*.Rmd` and `*.Rnw` files get special treatment from `drake`. Not every `drake` project needs them, but it is good practice to use them to summarize the final results of a project once all the other targets have already been built. The mtcars example, for instance, has an R Markdown report: `report.Rmd` is knitted to build `report.md`, which summarizes the final results.

To see where `report.md` will be built, look to the right of the dependency graph.

```{r revisitmtcarsgraph}
load_mtcars_example(overwrite = TRUE) # Get the code with drake_example("mtcars").
config <- drake_config(my_plan)
vis_drake_graph(config)
```

`Drake` treats [knitr](https://github.com/yihui/knitr) reports as special cases. Whenever `drake` sees `knit()` or `render()` ([rmarkdown](https://github.com/rstudio/rmarkdown)) mentioned in a command, it dives into the source file to look for dependencies. Consider `report.Rmd`, which you can view [here](https://github.com/ropensci/drake/blob/master/inst/examples/mtcars/report.Rmd). When `drake` sees `readd(small)` in an active code chunk, it knows [report.Rmd](https://github.com/ropensci/drake/blob/master/inst/examples/mtcars/report.Rmd) depends on the target called `small`, and it draws the appropriate arrow in the dependency graph above. And if `small` ever changes, `make(my_plan)` will re-process [report.Rmd](https://github.com/ropensci/drake/blob/master/inst/examples/mtcars/report.Rmd) to produce the target file `report.md`.

[knitr](https://github.com/yihui/knitr) reports are the only kind of file that `drake` analyzes for dependencies. It does not give R scripts the same special treatment.
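
A hedged check of that behavior, reusing `deps_code()` from the plans chapter (this assumes `report.Rmd` is present in the working directory):

```{r sketchknitrdeps, eval = FALSE}
deps_code(
  quote(
    knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
  )
)
# The output should include the targets mentioned with loadd()/readd()
# inside report.Rmd, such as small, plus report.Rmd and report.md.
```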

### Workflows as R packages

The R package structure is a great way to organize the files of your project. Writing your own package to contain your data science workflow is a good idea, but you will need to
2 changes: 1 addition & 1 deletion 08-vis.Rmd
@@ -88,7 +88,7 @@ Graphs can grow enormous for serious projects, so there are multiple ways to focu
```{r subsetgraph}
vis_drake_graph(
  config,
  subset = c("regression2_small", file_store("report.md"))
  subset = c("regression2_small", "large")
)
```

5 changes: 3 additions & 2 deletions 09-debug.Rmd
@@ -379,8 +379,9 @@ drake_examples()
To write the files for an example, use `drake_example()`.

```{r examplesdrake, eval = FALSE}
drake_example("mtcars")
drake_example("slurm")
drake_example("main")
drake_example("packages")
drake_example("gsp")
```

```{r rmfiles_debug, echo = FALSE}