Get dependencies of knitr reports automatically. #9

Closed
wlandau-lilly opened this issue Mar 12, 2017 · 16 comments

Comments

@wlandau-lilly
Collaborator

It should be possible to

  1. Recognize a knitr report by the .Rmd or .Rnw file extension.
  2. Extract all the code chunks, including evaluated inline code.
  3. Get the objects read into the report by readd() or loadd()

But this would miss external files. On the other hand, scanning for any mention of any target in any code chunk might be too aggressive.
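Step 1 could be sketched roughly as follows. The helper name `is_knitr_report()` is hypothetical (it is not a drake function), and the set of extensions is an assumption limited to the two mentioned above.

```r
# Hypothetical helper (not part of drake): flag knitr reports by file extension.
# Only the .Rmd and .Rnw extensions named above are recognized here.
is_knitr_report <- function(files) {
  tolower(tools::file_ext(files)) %in% c("rmd", "rnw")
}

is_knitr_report(c("report.Rmd", "paper.Rnw", "data.csv"))
```

Matching on the lowercased extension keeps the check tolerant of names like `report.RMD`.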

@wlandau-lilly
Collaborator Author

Maybe make(..., scan_report = "deep") would be a good compromise, along with a user-side function scan_report() so the user would know in advance what the dependencies will be. There could be a better name than scan_report().

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Mar 12, 2017

Interface: the knitr part of the workflow plan data frame would ideally look like

> plan(report.md = "report.Rmd", 
  file_targets = TRUE, strings_in_dots = "filenames")
         target      command
1   'report.md' 'report.Rmd'

then drake would do a preprocessing step to turn that into

                    target  command
1              'report.md'  drake::knit_drake('report.Rmd', report.Rmd_dependencies)
2  report.Rmd_dependencies  c("this_target", "that_target", "'other_file'")

The preprocessing step should happen inside config(), just before the dependency graph is constructed.

@wlandau-lilly
Collaborator Author

The most reliable way to get the code chunks will probably be to dig into the internals of knitr, which will take time. After that, detecting dependencies (and deciding on a level of aggressiveness) is not trivial. I expect this issue to take a long time to solve cleanly.

@wlandau-lilly
Collaborator Author

At the very least, capture.output(knitr::purl("report.Rmd", output = stdout())) extracts the code chunks from a knitr report. Development knitr can grab inline code with options(knitr.purl.inline = TRUE), but the CRAN version has yet to catch up. I think it would be best to delay this drake issue until then.
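As a rough illustration of that extraction step, the sketch below tangles a small in-memory report and parses the result. The report contents are made up for the example, and it assumes knitr is installed. It relies on purl() returning the tangled code as a string when `text` is supplied instead of a file.

```r
library(knitr)

# Three backticks, built here so the example is easy to paste anywhere.
fence <- strrep("`", 3)

# A tiny in-memory report; its contents are made up for illustration.
report <- c(
  "Some prose.",
  "",
  paste0(fence, "{r chunk}"),
  "x <- readd(large)",
  fence
)

# With `text` supplied, purl() returns the tangled code instead of writing a file.
code <- knitr::purl(text = report, quiet = TRUE)

# Parse the extracted code so it can be analyzed statically.
exprs <- parse(text = code)
```

Working from parsed expressions rather than raw text means comments and chunk headers in the tangled output are ignored automatically.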

@wlandau-lilly wlandau-lilly changed the title Get dependencies of knitr dynamic reports automatically. Get dependencies of knitr reports automatically. Apr 17, 2017
@wlandau-lilly
Collaborator Author

CodeDepends could help with this too.

@wlandau-lilly
Collaborator Author

In fact, I will most likely solve this with CodeDepends (CRAN, GitHub). This feature is now within reach.

@wlandau-lilly wlandau-lilly modified the milestone: v3.1.0 Jun 15, 2017
@wlandau-lilly wlandau-lilly modified the milestones: v4.2.0, v4.2.0 CRAN release Aug 11, 2017
@wlandau-lilly wlandau-lilly modified the milestones: v4.2.0 CRAN release, v4.3.0 CRAN release Sep 10, 2017
@wlandau-lilly wlandau-lilly modified the milestones: v4.2.0 CRAN release, v4.3.0 CRAN release Sep 12, 2017
@wlandau-lilly
Collaborator Author

wlandau-lilly commented Sep 12, 2017

To be clear, I plan to use CodeDepends to scan dynamic reports for mentions of file targets, calls to readd(), and calls to loadd(). Dynamic reports are automatically identified by file extension: Rmd, Rnw, etc. (We need to find them all.) From there, we can gather up the targets and list them as dependencies. Adding this feature to deps() and build_graph() should cover it. For the sake of backward compatibility, we should deactivate this feature for projects previously built with drake <= 4.1.0. Dynamic reports are so ubiquitous that we need to be careful.

@wlandau-lilly
Collaborator Author

I forgot to mention: this is a funny situation where imports (e.g. 'report.Rmd') can have dependencies that are targets. Currently, drake does not support this behavior. This is not ideal, but we can work around it. We can simply forward the dependencies of 'report.Rmd' on to 'report.md':

plan(report.md = knit('report.Rmd'), strings_in_dots = "filenames", file_targets = TRUE)

##        target            command
## 1 'report.md' knit('report.Rmd')

In the workflow graph, 'report.Rmd' will have no dependencies (incoming edges). However, 'report.md' will depend on 'report.Rmd', knit(), any file targets mentioned in the code chunks of 'report.Rmd', and any targets loaded into the code chunks with readd() or loadd().
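One way to picture the forwarding is as an edge list in which every dependency found in the report points at the output file rather than at the source file. This is only a sketch: `report_edges()` is a hypothetical helper, and the target names `large` and `small` are placeholders, not drake's internal representation.

```r
# Hypothetical helper: build the incoming edges of a knitted report.
# Dependencies found inside the source report are attached to the output file,
# so the source file itself gets no incoming edges.
report_edges <- function(output, report, report_deps) {
  data.frame(
    from = c(report, "knit", report_deps),
    to = output,
    stringsAsFactors = FALSE
  )
}

edges <- report_edges("'report.md'", "'report.Rmd'", c("large", "small"))
edges
```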

@wlandau-lilly
Collaborator Author

Almost forgot: need to exclude code chunks with 'eval = FALSE'. I wonder if CodeDepends knows how.

@wlandau-lilly
Collaborator Author

As I said, we may be able to solve this issue with CodeDepends. From the code chunks of dynamic reports, I want to

  • Detect all quoted strings.
  • Detect all symbols passed to the target argument (or just first argument) of readd().
  • Detect all symbols passed to the ... argument (first argument) of loadd().
  • Ignore everything else.

Strings are candidates for file targets, and inputs are candidates for other targets. For non-file targets, it is vitally important to only detect inputs to loadd() and readd(), ignoring all other inputs. Example code chunk in a file report.Rmd:

library(drake) # Should detect nothing.
var <- 10 # Should detect nothing

# Strings: "could_be_a_file", "so_could_this"
var2 <- list(17, var, "could_be_a_file", "so_could_this")

# Inputs: large.
f(readd(large) + var)

# Strings: "small".
print(drake::readd(target = "small", character_only = TRUE, path = "subdir") + var)

# Strings: "small".
print(drake:::readd("small", character_only = TRUE, cache = NULL) + var)

# Inputs: regression2_large, small. Strings: large
f(loadd(regression2_large, small, list = "large"), var)
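For a single call, the rules above can be sketched in base R without CodeDepends. The helper name `call_candidates()` is hypothetical, and the sketch makes simplifying assumptions: for readd() it looks only at the `target` argument (or the first unnamed argument), for loadd() it looks at unnamed arguments plus `list`, and it ignores compound values such as `list = c("a", "b")`.

```r
# Hypothetical helper: extract candidate targets from ONE readd()/loadd() call.
# Symbols are reported as inputs, quoted strings as strings.
call_candidates <- function(call) {
  args <- as.list(call)[-1]
  arg_names <- names(args)
  if (is.null(arg_names)) arg_names <- rep("", length(args))

  # Strip a `drake::` or `drake:::` prefix from the function name.
  fn <- sub("^.*:", "", paste(deparse(call[[1]]), collapse = ""))

  if (fn == "readd") {
    # Use `target` if named, otherwise the first unnamed argument.
    first_unnamed <- which(arg_names == "")[1]
    keep <- if ("target" %in% arg_names) {
      arg_names == "target"
    } else {
      seq_along(args) %in% first_unnamed
    }
  } else {
    # loadd(): everything passed through `...`, plus the `list` argument.
    keep <- arg_names %in% c("", "list")
  }

  picked <- args[keep]
  list(
    inputs = as.character(unlist(lapply(Filter(is.symbol, picked), as.character))),
    strings = as.character(unlist(Filter(is.character, picked)))
  )
}

call_candidates(quote(loadd(regression2_large, small, list = "large")))
```

Under these assumptions, other arguments such as `character_only` or `path` are never inspected, so they cannot show up as inputs.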

First attempt at an input collector:

library(CodeDepends)
library(magrittr) # for %>%
drake_handler <- function(e, collector, ...){
  args <- as.list(e)[-1] # Arguments to readd() or loadd()
  
  # Arguments passed to ... in loadd()
  include <- !nchar(names(args))
  if(!any(include)){
    dots <- args[[1]]
  } else {
    dots <- args[include]
  }
  
  candidates <- c(args[["target"]], dots)
  collector$vars(as.character(args), input = TRUE)
}
string_handler <- function(name){
  browser()
  strings <<- c(strings, name)
}
col <- inputCollector(
  readd = drake_handler,
  loadd = drake_handler,
  string = string_handler
)
x <- readScript("report.Rmd") %>%
  getInputs(collector = col)

So far, there are too many inputs in too many places, particularly var. And sometimes other superfluous inputs like character_only appear (a non-target argument to readd()). @gmbecker, any idea what I am doing wrong?

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Oct 3, 2017

A first attempt without CodeDepends (but using CodeDepends:::getTangledFrags())

# From https://github.com/duncantl/CodeDepends/blob/master/R/sweave.R#L15
get_tangled_frags <- function(doc, txt = readLines(doc)) {
  in.con <- textConnection(txt)
  out.con <- textConnection("bob", "w", local = TRUE)
  on.exit({
    close(in.con)
    close(out.con)
  })
  knitr::knit(in.con, output = out.con, tangle = TRUE, quiet = TRUE)
  code <- textConnectionValue(out.con)
  parse(text = code)
}

wide_deparse <- function(x){
  paste(deparse(x), collapse = "")
}

library(magrittr)

find_targets <- function(expr, targets = character(0)){
  if (is.function(expr)){
    return(find_targets(body(expr), targets = targets))
  } else if (is.call(expr) & length(this_call <- as.list(expr)) > 1){
    if(deparse(this_call[[1]]) %in% c("readd", "loadd")){
      symbols <- Filter(this_call[-1], f = is.symbol)
      targets <- c(targets, deparse(symbols)) %>%
        unlist %>%
        unique
    }
    deepen_search <- sapply(this_call, function(x){
      grepl("readd|loadd", wide_deparse(x))
    })
    lapply(this_call[deepen_search], find_targets, targets = targets)
  } else if (is.recursive(expr)){
    v <- lapply(as.list(expr), find_targets, targets = targets)
    targets <- unique(c(targets, unlist(v)))
  } 
  targets
}

x <- get_tangled_frags("test.Rmd")
find_targets(x) # incorrectly returns `character(0)`

With test.Rmd:


---
title: "test"
author: "Will Landau"
date: "October 3, 2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```r
x <- readd(should_not_find)
```


```{r dry, eval = FALSE}
x <- readd(should_not_find)
```

```{r chunk}
library(drake) # Should detect nothing.
var <- 10 # Should detect nothing

# Strings: "could_be_a_file", "so_could_this"
var2 <- list(17, var, "could_be_a_file", "so_could_this")

# Inputs: large.
f(readd(large) + var)

# Inputs: large.
g <- function(){
  f(readd(large) + var)
}

# Strings: "small".
print(drake::readd(target = "small", character_only = TRUE, path = "subdir") + var)

# Strings: "small".
print(drake:::readd("small", character_only = TRUE, cache = NULL) + var)

# Inputs: regression2_large, small. Strings: large
f(loadd(regression2_large, small, list = "large"), var)
```
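For comparison, here is a minimal base-R sketch of a recursive walker that collects the readd() and loadd() calls themselves, including namespaced calls and calls inside function bodies. The name `find_drake_calls()` is hypothetical, this is not the code in drake, and the input lines are made up. It walks every sub-expression and accumulates results through the return value, so nothing found in a nested call is discarded.

```r
# Hypothetical helper: collect every readd()/loadd() call in parsed R code.
find_drake_calls <- function(expr) {
  if (is.expression(expr)) {
    # A parsed script: search each top-level expression in turn.
    return(c(list(), do.call(c, lapply(expr, find_drake_calls))))
  }
  if (!is.call(expr)) {
    return(list()) # symbols, constants, etc. cannot contain calls
  }
  # Strip a `drake::` or `drake:::` prefix from the function name.
  fn <- sub("^.*:", "", paste(deparse(expr[[1]]), collapse = ""))
  found <- if (fn %in% c("readd", "loadd")) list(expr) else list()
  # Recurse into the function position and every argument.
  nested <- do.call(c, lapply(as.list(expr), find_drake_calls))
  c(found, nested)
}

code <- parse(text = c(
  "var <- 10",
  "f(readd(large) + var)",
  "g <- function() f(drake::readd(large))",
  "loadd(regression2_large, small)"
))
calls <- find_drake_calls(code)
vapply(calls, function(x) paste(deparse(x), collapse = ""), character(1))
```

Returning the calls rather than target names postpones the question of which arguments count as targets to a later step.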

wlandau-lilly added a commit that referenced this issue Oct 3, 2017
@wlandau-lilly
Collaborator Author

I now have some code in the issue9 branch. It does not use CodeDepends, but I think it's a good enough start anyway. A sketch of the main heavy lifting is there, but I need to figure out how best to accommodate edge cases like loadd() with no arguments, loadd(..., imports_only = TRUE), etc.

@wlandau-lilly
Collaborator Author

wlandau-lilly commented Oct 3, 2017

An easier approach than the current code:

  1. In knitr_dependencies(), just grab all the calls to loadd() and readd() (expressions) rather than a list of targets. (Maybe change the name of this function to knitr_loadd_readd_calls().)
  2. Later, with config$graph in hand, parse all the calls to figure out exactly which targets and imports are referenced. This should be a final step of build_graph(), and it will require us to add more edges post hoc.
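The second step might look something like the sketch below, which takes the collected calls plus the names already known from the graph and keeps only the names that are real targets. `resolve_calls()` is a hypothetical helper, the target names are placeholders, and treating loadd() with no arguments as "depends on every target" is an assumption made for illustration.

```r
# Hypothetical helper: map collected readd()/loadd() calls to known targets.
resolve_calls <- function(calls, known_targets) {
  deps <- lapply(calls, function(call) {
    args <- as.list(call)[-1]
    if (length(args) == 0) {
      # Assumption: a bare loadd() loads everything, so depend on all targets.
      return(known_targets)
    }
    # Gather quoted strings and symbol names mentioned in the arguments.
    mentioned <- unlist(lapply(args, function(arg) {
      if (is.character(arg)) arg else all.vars(arg)
    }))
    # Keep only names that the graph already knows about.
    intersect(known_targets, mentioned)
  })
  as.character(unique(unlist(deps)))
}

known <- c("small", "large", "regression2_large")
calls <- list(
  quote(readd(large)),
  quote(loadd(small, not_a_target)),
  quote(readd("small", character_only = TRUE))
)
resolve_calls(calls, known)
```

Filtering against the known names is what lets non-target arguments such as `character_only = TRUE` drop out without special handling.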

@wlandau-lilly
Collaborator Author

The master branch now has a fix. Now, if knit() appears in a workflow plan command, drake knows that you're knitting a report, and it looks for mentions of targets in loadd()/readd() calls in active code chunks. The report can be just as easily compiled in isolation outside of a make() session. The basic example (load_basic_example(); my_plan) has this built in now, and the quickstart vignette is a verbose version of this. The caution vignette has important caveats. Armed with only static code analysis, I cannot accommodate every edge case, and it would be a messy pain to try. Anyway, super excited about this new development!

@gmbecker

gmbecker commented Oct 4, 2017

Sorry it took me a bit to look at this. One thing that jumps out is here:

  candidates <- c(args[["target"]], dots)
  collector$vars(as.character(args), input = TRUE)

So you created candidates, but you're using the full args when you register variables with the walker. It's hard to be more specific without the Rmd you want to operate on so that I can actually play with it.

@wlandau-lilly
Collaborator Author

Thank you for your input, @gmbecker! I think that mistake is definitely part of the problem. However, I am still incorrectly seeing var as an input. I have included a new R script and report.Rmd below.

Just so you know, I am not in a rush. I do want to use this example to get better at CodeDepends, but I also ended up building custom code analysis into drake for this issue a couple days ago. I will also be gone on vacation next week.

library(CodeDepends)
library(magrittr)

drake_handler <- function(e, collector, ...){
  args <- as.list(e)[-1] # Arguments to readd() or loadd()
  # Arguments passed to ... in loadd()
  include <- !nchar(names(args))
  if(!length(include)){
    dots <- args[[1]]
  } else if(!any(include)){
    dots <- args[[1]]
  } else {
    dots <- args[include]
  }
  candidates <- as.character(c(args[["target"]], dots))
  collector$vars(candidates, input = TRUE)
}
string_handler <- function(name){
  strings <<- c(strings, name)
}
col <- inputCollector(
  readd = drake_handler,
  loadd = drake_handler,
  string = string_handler
)
x <- readScript("report.Rmd")
getInputs(x, collector = col)

With report.Rmd:

---
title: "test"
author: "Will Landau"
date: "October 3, 2017"
output: html_document
---

```r
x <- readd(should_not_find)
```

```{r dry, eval = FALSE}
x <- readd(should_not_find)
```

```{r chunk}
library(drake) # Should detect nothing.
var <- 10 # Should detect nothing

# Strings: "could_be_a_file", "so_could_this"
var2 <- list(17, var, "could_be_a_file", "so_could_this")

# Inputs: large.
f(readd(large) + var)

# Inputs: large.
g <- function(){
  f(readd(large) + var)
}

# Strings: "small".
print(drake::readd(target = "small", character_only = TRUE, path = "subdir") + var)

# Strings: "small".
print(drake:::readd("small", character_only = TRUE, cache = NULL) + var)

# Inputs: regression2_large, small. Strings: large
f(loadd(regression2_large, small, list = "large"), var)
```
