Get dependencies of knitr reports automatically. #9

wlandau-lilly · 2017-03-12T03:24:55Z

It should be possible to:

Recognize a knitr report by the .Rmd or .Rnw file extension.
Extract all the code chunks, including evaluated inline code.
Get the objects read into the report by readd() or loadd().

But this would miss external files. On the other hand, scanning for any mention of any target in any code chunk might be too aggressive.
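
A minimal sketch of the first step, assuming only base R. The helper name is_knitr_report() is hypothetical and not part of drake, and it covers only the two extensions named above.

```r
# Hypothetical helper (not part of drake): flag files whose extension
# marks them as knitr reports. Only .Rmd and .Rnw are covered here.
is_knitr_report <- function(paths) {
  tolower(tools::file_ext(paths)) %in% c("rmd", "rnw")
}

flags <- is_knitr_report(c("report.Rmd", "paper.Rnw", "script.R", "notes.md"))
```

Matching on the lowercased extension keeps the check cheap, at the cost of missing reports with unusual extensions.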


wlandau-lilly · 2017-03-12T03:31:26Z

Maybe make(..., scan_report = "deep") would be a good compromise, along with a user-side function scan_report() so the user would know in advance what the dependencies will be. There could be a better name than scan_report().

wlandau-lilly · 2017-03-12T03:52:10Z

Interface: the knitr part of the workflow plan data frame would ideally look like

> plan(report.md = "report.Rmd", 
  file_targets = TRUE, strings_in_dots = "filenames")
         target      command
1   'report.md' 'report.Rmd'

then drake would do a preprocessing step to turn that into

                   target                                                  command
1             'report.md' drake::knit_drake('report.Rmd', report.Rmd_dependencies)
2 report.Rmd_dependencies          c("this_target", "that_target", "'other_file'")

The preprocessing step should happen inside config(), just before the dependency graph is constructed.

wlandau-lilly · 2017-03-13T15:47:06Z

The most reliable way to get the code chunks will probably be to dig into the internals of knitr, which will take time. After that, detecting dependencies (and deciding on a level of aggressiveness) is not trivial. I expect this issue to take a long time to solve cleanly.

wlandau-lilly · 2017-04-17T12:56:10Z

At the very least, capture.output(knitr::purl("report.Rmd", output = stdout())) extracts the code chunks from a knitr report. Development knitr can grab inline code with options(knitr.purl.inline = TRUE), but the CRAN version has yet to catch up. I think it would be best to delay this drake issue until then.

wlandau-lilly · 2017-06-14T14:33:00Z

CodeDepends could help with this too.

wlandau-lilly · 2017-06-14T15:15:02Z

In fact, I will most likely solve this with CodeDepends (CRAN, GitHub). This feature is now within reach.

wlandau-lilly · 2017-09-12T02:04:28Z

To be clear, I plan to use CodeDepends to scan dynamic reports for mentions of file targets, calls to readd(), and calls to loadd(). Dynamic reports are automatically identified by file extension: Rmd, Rnw, etc. (We need to find them all.) From there, we can gather up the targets and list them as dependencies. Adding this feature on to deps() and build_graph() should cover it. For the sake of backward compatibility, we should deactivate this feature for projects previously built with drake <= 4.1.0. Dynamic reports are so ubiquitous that we need to be careful.

wlandau-lilly · 2017-09-12T17:26:33Z

I forgot to mention: this is a funny situation where imports (e.g. 'report.Rmd') can have dependencies that are targets. Currently, drake does not support this behavior. This is not ideal, but we can work around it. We can simply forward the dependencies of 'report.Rmd' on to 'report.md':

plan(report.md = knit('report.Rmd'), strings_in_dots = "filenames", file_targets = TRUE)

##        target            command
## 1 'report.md' knit('report.Rmd')

In the workflow graph, 'report.Rmd' will have no dependencies (incoming edges). However, 'report.md' will depend on 'report.Rmd', knit(), any file targets mentioned in the code chunks of 'report.Rmd', and any targets loaded into the code chunks with readd() or loadd().
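
As a rough illustration of that forwarding, here is a sketch of the edges in plain base R. The helper report_edges() and the from/to column names are assumptions made for this example; they are not drake's internal graph representation.

```r
# Hypothetical helper: build graph edges (from -> to) for a knitted report.
# The source report gets no incoming edges; everything it needs is
# attached to the output file instead.
report_edges <- function(output, report, knitr_deps) {
  data.frame(
    from = c(report, "knit", knitr_deps),
    to = output,
    stringsAsFactors = FALSE
  )
}

edges <- report_edges("'report.md'", "'report.Rmd'", c("large", "small"))
```

Here 'report.Rmd' appears only in the from column, so it stays an import with no incoming edges.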

wlandau-lilly · 2017-09-12T18:02:06Z

Almost forgot: need to exclude code chunks with 'eval = FALSE'. I wonder if CodeDepends knows how.
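
A minimal sketch of that exclusion in base R, independent of CodeDepends. The helper active_chunk_lines() is hypothetical, and it assumes R Markdown chunk headers with a literal eval = FALSE (or eval = F); it does not evaluate chunk options, so eval set from a variable would be treated as active.

```r
# Hypothetical helper (not part of drake or knitr): return the code lines of
# active R Markdown chunks, skipping chunks whose header sets eval = FALSE.
ticks <- strrep("`", 3) # the three-backtick fence, built here for readability

active_chunk_lines <- function(lines) {
  header <- paste0("^", ticks, "\\{r[ ,}]") # opening fence of an R chunk
  fence <- paste0("^", ticks, "\\s*$")      # closing fence
  off <- "eval\\s*=\\s*(FALSE|F)\\b"        # literal eval = FALSE in the header
  out <- character(0)
  in_chunk <- FALSE
  keep <- FALSE
  for (line in lines) {
    if (!in_chunk && grepl(header, line, perl = TRUE)) {
      in_chunk <- TRUE
      keep <- !grepl(off, line, perl = TRUE)
    } else if (in_chunk && grepl(fence, line, perl = TRUE)) {
      in_chunk <- FALSE
    } else if (in_chunk && keep) {
      out <- c(out, line)
    }
  }
  out
}

doc <- c(
  paste0(ticks, "{r dry, eval = FALSE}"),
  "x <- readd(should_not_find)",
  ticks,
  paste0(ticks, "{r chunk}"),
  "f(readd(large) + var)",
  ticks
)
active <- active_chunk_lines(doc)
```

Plain markdown fences without braces are ignored as well, because they never match the chunk header pattern.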

wlandau-lilly · 2017-10-02T10:42:20Z

As I said, we may be able to solve this issue with CodeDepends. From the code chunks of dynamic reports, I want to

Detect all quoted strings.
Detect all symbols passed to the target argument (or just first argument) of readd().
Detect all symbols passed to the ... argument (first argument) of loadd().
Ignore everything else.

Strings are candidates for file targets, and inputs are candidates for other targets. For non-file targets, it is vitally important to only detect inputs to loadd() and readd(), ignoring all other inputs. Example code chunk in a file report.Rmd:

library(drake) # Should detect nothing.
var <- 10 # Should detect nothing

# Strings: "could_be_a_file", "so_could_this"
var2 <- list(17, var, "could_be_a_file", "so_could_this")

# Inputs: large.
f(readd(large) + var)

# Strings: "small".
print(drake::readd(target = "small", character_only = TRUE, path = "subdir") + var)

# Strings: "small".
print(drake:::readd("small", character_only = TRUE, cache = NULL) + var)

# Inputs: regression2_large, small. Strings: large
f(loadd(regression2_large, small, list = "large"), var)
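
For comparison, the four rules above can be sketched as a recursive walk over parsed code in base R. This is not drake's implementation and does not use CodeDepends; scan_code() is a hypothetical name, and the sketch does not handle calls with empty arguments such as x[1, ].

```r
# Hypothetical helper: collect (a) symbols passed unnamed or as `target`
# to readd()/loadd() and (b) every string literal (candidate file targets).
scan_code <- function(expr,
                      found = list(inputs = character(0), strings = character(0))) {
  if (is.character(expr)) {
    found$strings <- unique(c(found$strings, expr))
  } else if (is.call(expr)) {
    fn <- expr[[1]]
    # Strip a namespace prefix such as drake::readd or drake:::readd.
    if (is.call(fn) && is.symbol(fn[[1]]) &&
        as.character(fn[[1]]) %in% c("::", ":::")) {
      fn <- fn[[3]]
    }
    if (is.symbol(fn) && as.character(fn) %in% c("readd", "loadd")) {
      args <- as.list(expr)[-1]
      arg_names <- names(args)
      if (is.null(arg_names)) arg_names <- rep("", length(args))
      symbols <- Filter(is.symbol, args[arg_names %in% c("", "target")])
      new_inputs <- vapply(symbols, as.character, character(1), USE.NAMES = FALSE)
      found$inputs <- unique(c(found$inputs, new_inputs))
    }
    # Keep walking so nested calls and strings are also seen.
    for (piece in as.list(expr)) found <- scan_code(piece, found)
  } else if (is.expression(expr)) {
    for (piece in as.list(expr)) found <- scan_code(piece, found)
  }
  found
}

code <- parse(text = c(
  'var <- 10',
  'var2 <- list(17, var, "could_be_a_file", "so_could_this")',
  'f(readd(large) + var)',
  'print(drake::readd(target = "small", character_only = TRUE) + var)',
  'f(loadd(regression2_large, small, list = "large"), var)'
))
result <- scan_code(code)
```

Because symbols are only recorded inside readd() and loadd() calls, var never shows up among the inputs.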

First attempt at an input collector:

library(CodeDepends)
library(magrittr) # provides %>%, used below
drake_handler <- function(e, collector, ...){
  args <- as.list(e)[-1] # Arguments to readd() or loadd()
  
  # Arguments passed to ... in loadd()
  include <- !nchar(names(args))
  if(!any(include)){
    dots <- args[[1]]
  } else {
    dots <- args[include]
  }
  
  candidates <- c(args[["target"]], dots)
  collector$vars(as.character(args), input = TRUE)
}
string_handler <- function(name){
  browser()
  strings <<- c(strings, name)
}
col <- inputCollector(
  readd = drake_handler,
  loadd = drake_handler,
  string = string_handler
)
x <- readScript("report.Rmd") %>%
  getInputs(collector = col)

So far, there are too many inputs in too many places, particularly var. And sometimes other superfluous inputs like character_only appear (a non-target argument to readd()). @gmbecker, any idea what I am doing wrong?

wlandau-lilly · 2017-10-03T12:53:05Z

A first attempt without CodeDepends (but using CodeDepends:::getTangledFrags())

# From https://github.com/duncantl/CodeDepends/blob/master/R/sweave.R#L15
get_tangled_frags <- function(doc, txt = readLines(doc)) {
  in.con <- textConnection(txt)
  out.con <- textConnection("bob", "w", local = TRUE)
  on.exit({
    close(in.con)
    close(out.con)
  })
  knitr::knit(in.con, output = out.con, tangle = TRUE, quiet = TRUE)
  code <- textConnectionValue(out.con)
  parse(text = code)
}

wide_deparse <- function(x){
  paste(deparse(x), collapse = "")
}

library(magrittr)

find_targets <- function(expr, targets = character(0)){
  if (is.function(expr)){
    return(find_targets(body(expr), targets = targets))
  } else if (is.call(expr) & length(this_call <- as.list(expr)) > 1){
    if(deparse(this_call[[1]]) %in% c("readd", "loadd")){
      symbols <- Filter(this_call[-1], f = is.symbol)
      targets <- c(targets, deparse(symbols)) %>%
        unlist %>%
        unique
    }
    deepen_search <- sapply(this_call, function(x){
      grepl("readd|loadd", wide_deparse(x))
    })
    lapply(this_call[deepen_search], find_targets, targets = targets)
  } else if (is.recursive(expr)){
    v <- lapply(as.list(expr), find_targets, targets = targets)
    targets <- unique(c(targets, unlist(v)))
  } 
  targets
}

x <- get_tangled_frags("test.Rmd")
find_targets(x) # incorrectly returns `character(0)`

With test.Rmd:


---
title: "test"
author: "Will Landau"
date: "October 3, 2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```r
x <- readd(should_not_find)
```


```{r dry, eval = FALSE}
x <- readd(should_not_find)
```

```{r chunk}
library(drake) # Should detect nothing.
var <- 10 # Should detect nothing

# Strings: "could_be_a_file", "so_could_this"
var2 <- list(17, var, "could_be_a_file", "so_could_this")

# Inputs: large.
f(readd(large) + var)

# Inputs: large.
g <- function(){
  f(readd(large) + var)
}

# Strings: "small".
print(drake::readd(target = "small", character_only = TRUE, path = "subdir") + var)

# Strings: "small".
print(drake:::readd("small", character_only = TRUE, cache = NULL) + var)

# Inputs: regression2_large, small. Strings: large
f(loadd(regression2_large, small, list = "large"), var)
```

wlandau-lilly · 2017-10-03T16:39:30Z

I now have some code in the issue9 branch. It does not use CodeDepends, but I think it's a good enough start anyway. A sketch of the main heavy lifting is there, but I need to figure out how best to accommodate edge cases like loadd() with no arguments, loadd(..., imports_only = TRUE), etc.

wlandau-lilly · 2017-10-03T16:51:18Z

An easier approach than the current code:

In knitr_dependencies(), just grab all the calls to loadd() and readd() (as expressions) rather than a list of targets. (Maybe change the name of this function to knitr_loadd_readd_calls().)
Later, with config$graph in hand, parse all the calls to figure out exactly which targets and imports are referenced. This should be a final step of build_graph(), and it will require us to add more edges post hoc.
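
A sketch of the second step, assuming the calls were already collected as expressions. It uses base R's match.call() against stand-in function signatures; readd_stub(), loadd_stub(), and call_targets() are hypothetical names, and the stub signatures are simplified assumptions rather than drake's real ones.

```r
# Stand-in signatures, assumed for illustration only. drake's real readd()
# and loadd() accept more arguments than these.
readd_stub <- function(target, character_only = FALSE, ...) NULL
loadd_stub <- function(..., list = character(0), imports_only = FALSE) NULL

# Hypothetical helper: resolve one stored readd()/loadd() call against the
# names of known targets and imports.
call_targets <- function(call, known_targets) {
  fn <- call[[1]]
  if (is.call(fn)) fn <- fn[[length(fn)]] # drop a drake:: or drake::: prefix
  is_readd <- identical(fn, as.name("readd"))
  stub <- if (is_readd) readd_stub else loadd_stub
  args <- as.list(match.call(stub, call, expand.dots = FALSE))[-1]
  candidates <- if (is_readd) {
    args["target"]
  } else {
    c(as.list(args[["..."]]), args["list"])
  }
  # Keep bare symbols and string literals; skip anything more complex.
  labels <- unlist(lapply(candidates, function(x) {
    if (is.symbol(x) || is.character(x)) as.character(x) else character(0)
  }), use.names = FALSE)
  intersect(as.character(labels), known_targets)
}

calls <- list(
  quote(readd(large)),
  quote(drake::readd(target = "small", character_only = TRUE)),
  quote(loadd(regression2_large, small, list = "large")),
  quote(readd(not_in_plan))
)
known <- c("small", "large", "regression2_large")
resolved <- unique(unlist(lapply(calls, call_targets, known_targets = known)))
```

Deferring the lookup until the known names are available means a call such as readd(not_in_plan) simply contributes no edges.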

wlandau-lilly · 2017-10-04T06:32:09Z

The master branch now has a fix. Now, if knit() appears in a workflow plan command, drake knows that you're knitting a report, and it looks for mentions of targets in loadd()/readd() calls in active code chunks. The report can be just as easily compiled in isolation outside of a make() session. The basic example (load_basic_example(); my_plan) has this built in now, and the quickstart vignette is a verbose version of this. The caution vignette has important caveats. Armed with only static code analysis, I cannot accommodate every edge case, and it would be a messy pain to try. Anyway, super excited about this new development!

gmbecker · 2017-10-04T18:28:55Z

Sorry it took me a bit to look at this. One thing that jumps out is here:

  candidates <- c(args[["target"]], dots)
  collector$vars(as.character(args), input = TRUE)

So you created candidates, but you're using the full args when you register variables with the walker. It's hard to be more specific without the Rmd you want to operate on so that I can actually play with it.

wlandau-lilly · 2017-10-05T22:54:43Z

Thank you for your input, @gmbecker! I think that mistake is definitely part of the problem. However, I am still incorrectly seeing var as an input. I have included a new R script and report.Rmd below.

Just so you know, I am not in a rush. I do want to use this example to get better at CodeDepends, but I also ended up building custom code analysis into drake for this issue a couple days ago. I will also be gone on vacation next week.

library(CodeDepends)
library(magrittr)

drake_handler <- function(e, collector, ...){
  args <- as.list(e)[-1] # Arguments to readd() or loadd()
  # Arguments passed to ... in loadd()
  include <- !nchar(names(args))
  if(!length(include)){
    dots <- args[[1]]
  } else if(!any(include)){
    dots <- args[[1]]
  } else {
    dots <- args[include]
  }
  candidates <- as.character(c(args[["target"]], dots))
  collector$vars(candidates, input = TRUE)
}
string_handler <- function(name){
  strings <<- c(strings, name)
}
col <- inputCollector(
  readd = drake_handler,
  loadd = drake_handler,
  string = string_handler
)
x <- readScript("report.Rmd")
getInputs(x, collector = col)

---
title: "test"
author: "Will Landau"
date: "October 3, 2017"
output: html_document
---

```r
x <- readd(should_not_find)
```

```{r dry, eval = FALSE}
x <- readd(should_not_find)
```

```{r chunk}
library(drake) # Should detect nothing.
var <- 10 # Should detect nothing

# Strings: "could_be_a_file", "so_could_this"
var2 <- list(17, var, "could_be_a_file", "so_could_this")

# Inputs: large.
f(readd(large) + var)

# Inputs: large.
g <- function(){
  f(readd(large) + var)
}

# Strings: "small".
print(drake::readd(target = "small", character_only = TRUE, path = "subdir") + var)

# Strings: "small".
print(drake:::readd("small", character_only = TRUE, cache = NULL) + var)

# Inputs: regression2_large, small. Strings: large
f(loadd(regression2_large, small, list = "large"), var)
```

wlandau-lilly added the type: new feature label Mar 12, 2017

wlandau-lilly added the difficulty: advanced label Apr 3, 2017

wlandau-lilly changed the title ~~Get dependencies of knitr dynamic reports automatically.~~ Get dependencies of knitr reports automatically. Apr 17, 2017

wlandau-lilly added the waiting for dependencies label Apr 17, 2017

wlandau-lilly removed the waiting for dependencies label Jun 14, 2017

This was referenced Jun 14, 2017

Assess the feasibility of CodeDepends for all the static code analysis #41

Closed

CodeDepends as a backend for reproducible build systems duncantl/CodeDepends#14

Closed

wlandau-lilly modified the milestone: v3.1.0 Jun 15, 2017

wlandau-lilly modified the milestones: v4.2.0, v4.2.0 CRAN release Aug 11, 2017

wlandau-lilly mentioned this issue Aug 24, 2017

Reproducible random numbers #56

Closed

wlandau-lilly modified the milestones: v4.2.0 CRAN release, v4.3.0 CRAN release Sep 10, 2017

wlandau-lilly mentioned this issue Sep 12, 2017

Automatically generate a workflow plan data frame from arbitrary R code. #25

Closed

wlandau-lilly modified the milestones: v4.2.0 CRAN release, v4.3.0 CRAN release Sep 12, 2017

wlandau-lilly added the TOP PRIORITY label Sep 24, 2017

wlandau-lilly added status: priority and removed TOP PRIORITY status: priority labels Sep 24, 2017

wlandau-lilly removed this from the v4.2.0 CRAN release milestone Sep 29, 2017

wlandau-lilly added a commit that referenced this issue Oct 3, 2017

Start on #9

d1864c4

wlandau-lilly closed this as completed Oct 4, 2017

wlandau mentioned this issue Feb 14, 2018

Use language to mark input and output files #232

Closed
