Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CodeDepends as a backend for reproducible build systems #14

Closed
wlandau-lilly opened this issue Jun 8, 2017 · 12 comments
Closed

CodeDepends as a backend for reproducible build systems #14

wlandau-lilly opened this issue Jun 8, 2017 · 12 comments

Comments

@wlandau-lilly
Copy link
Contributor

I just learned about CodeDepends, and I am glad to find such an active effort in static code analysis. R-focused reproducible build systems like drake and remake (and of course, knitr) could seriously benefit (richfitz/remake#172, ropensci/drake#41). Drake and remake rely on dependency graphs of the relationships among the "targets" of a data analysis project, and the dependencies for these graphs are resolved using code analysis. For drake in particular, I am considering switching the backend from codetools to CodeDepends, and I am wondering if you think the transition would be worth the trouble. In general, what would you say are the main advantages of CodeDepends relative to codetools?

@clarkfitzg
Copy link
Contributor

Drake looks like an interesting system, especially the parallel stuff. Can you point me to a small example analysis that uses it?

Here's an example where I used CodeDepends to generate a graph representing the dependencies between top level expressions in an R script. I use it to map this simulation script into this graph.

You may be able to do something like this to automatically generate the drake code from a simple R script.

From what I've seen CodeDepends has many more tools to resolve dependencies for the graphs you're interested in relative to codetools.

@wlandau-lilly
Copy link
Contributor Author

wlandau-lilly commented Jun 8, 2017

Thanks, @clarkfitzg! Drake has a small built-in canonical example analysis in the quickstart vignette on CRAN, and you can generate the underlying R script with drake::example_drake("basic").

I will look into your CodeDepends example. From your description, it sounds like that might even help with ropensci/drake#25.

There is plenty of room to improve how drake tracks dependencies (also at ropensci/drake#11), and I hope some of those extra tools in CodeDepends can help.

@gmbecker
Copy link
Collaborator

gmbecker commented Jun 8, 2017

Another thing to be aware of, depending on exactly how you intend to use codetools, is that there are corner cases that it currently gets wrong:

> f = function() {x = x + y; x}
> findGlobals(f)
[1] "{" "+" "=" "y"

(x is a global symbol in the above definition of f)
This can wreak havoc when you're doing careful computing on dependencies.

I think you may also be interested in some work I did during my thesis on input-aware caching via RCacheSuite (disclaimer, I haven't looked at that code in a while, though now that CodeDepends is on CRAN I intend to get back to it and push it out as well). Not exactly the same thing as make-like build systems, but related, I think. At least it does the same type of dependency calculation you probably want, using CodeDepends.

@wlandau-lilly
Copy link
Contributor Author

wlandau-lilly commented Jun 8, 2017

@gmbecker interesting. Both drake and remake use storr to cache both code and targets, and RCacheSuite looks like it makes the code part easier.

Incidentally, for drake, I would actually prefer findGlobals(f) to ignore x and include y. (For local variables like x, I can use codetools::findFuncLocals(formals(f), body(f))). But regarding your general point about the havoc of codetools, I totally agree. I am mainly concerned about corner cases like the following.

f <- function(){
  b = get("x", envir = globalenv())
  digest::digest(1234)
}
codetools::findGlobals(f) # Incorrectly ignores x and digest.

And I see that CodeDepends::getInputs(body(f)) does detect digest.

@wlandau-lilly
Copy link
Contributor Author

After digging into the package some more, I have decided against replacing codetools with CodeDepends in the preexisting internals of drake. However, I think CodeDepends will put the otherwise difficult issues ropensci/drake#9 and ropensci/drake#25 within reach, and I look forward to getting started on these new features.

@gmbecker
Copy link
Collaborator

gmbecker commented Jun 14, 2017 via email

@wlandau-lilly
Copy link
Contributor Author

wlandau-lilly commented Jun 14, 2017

I definitely think CodeDepends is more powerful and feature-rich than codetools, and I am still excited to make use of it. But for the specific purpose of building a network of targets in drake, I was originally looking for solutions to the edge cases in ropensci/drake#11. CodeDepends solves at least one of these by detecting functions referenced with :: and :::, but others remain unsolved:

  1. In expressions like get("x", envir = globalenv()), the CodeDepends detects the string "x", but it does not recognize that x is actually a global variable. There are other sneaky uses of strings as in eval(parse()) that I may want to consider.
  2. I like CodeDepends' string detection feature, but it I am not sure it smooths out my current custom code. In drake commands (but not imported functions), single-quoted strings are treated as file names, and double-quoted strings are treated as ordinary string literals. While convenient and flexible for users, this convention makes it hard to construct dependency graphs. CodeDepends replaces single quotes with double quotes in expressions like readRDS('my-file.rds') (which maybe cannot be helped).

I also thought CodeDepends might be trying to replace the entire existence of codetools, but I did see codetools listed as an import in the DESCRIPTION and calls to findGlobals() in the source.

In any case, depending on how CodeDepends changes, and depending on the landscape of code analysis tools in the future, I am open to the possibility of making the switch at some point. But at the moment, I do not think it is worth the time, effort, and real-world testing required to overhaul the internals of drake, and the slightly-altered dependency graphs may require some of my users to rerun their long projects from scratch.

The features in ropensci/drake#11 and ropensci/drake#25 just deal with analyzing knitr reports and importing new projects into drake from R scripts, and I think I can maintain behavior consistent with the current version of drake that is powered by codetools. Users will be able to enable or disable these new features freely, and the functionality will be modular and transparent. I really like that the analysis of scripts and reports is a major focus of CodeDepends.

@wlandau-lilly
Copy link
Contributor Author

wlandau-lilly commented Jun 14, 2017

...I didn't explain that very well, so maybe I should add a tl;dr and a clarification: drake already has mostly satisfactory functionality for tracking dependencies and graphing targets, and after looking at CodeDepends, I think the gains of reimplementing that specific piece would be small. And in my case, any loss of back compatibility is unusually risky. My users have computationally intense projects with extremely long runtimes, and I hesitate to make them re-execute their work from scratch for the sake of incremental improvements.

But I think CodeDepends still fills a large unmet need in drake. Those new features I mentioned could significantly improve the tool, and until now, they seemed like moonshots.

@wlandau-lilly
Copy link
Contributor Author

FYI: this SO post elaborates on a piece of what I am looking for.

@wlandau-lilly
Copy link
Contributor Author

Just found a nifty use for CodeDepends in ropensci/drake#232. I occasionally need to find the globals of expressions. With codetools::findGlobals(), I would first need to coerce each expression into a function, which is trouble if those expressions have multiple language objects.

expr <- parse(text = c("a <- x + y", "b <- v + w"), keep.source = FALSE)
expr

## expression(a <- x + y, b <- v + w)

f <- function(){}
body(f) <- expr

## Warning in `body<-`(`*tmp*`, value = expression(a <- x + y, b <- v + w)) :
##   using the first element of 'value' of type "expression"

codetools::findGlobals(f) # Does not include "a" or "b".

## [1] "<-" "+"  "x"  "y" 

With CodeDepends, this seems easier and safer.

inputs <- CodeDepends::getInputs(expr)
union(
  inputs@inputs,
  names(inputs@functions)
)

## [1] "x" "y" "v" "w" "+"

With the changes happening to drake right now, this is a good time to make the switch.

@wlandau
Copy link

wlandau commented Feb 17, 2018

Another thing I really like: CodeDepends separates out NSE symbols such as the arguments to ggplot2::aes().

@gmbecker
Copy link
Collaborator

gmbecker commented Feb 18, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants