A base R replacement for formatR::tidy_source() #562

wlandau · 2018-10-27T17:58:11Z

Background

I am trying to trim down drake's package dependencies. I removed 7 just yesterday, bringing the count down from 22 to 15. I am not too concerned about base packages like utils, packages in r-lib like withr, or tidyverse packages like dplyr. And it seems infeasible to remove storr, igraph, or codetools. So as I continue on, I hope to find a base R replacement for formatR::tidy_source().

formatR::tidy_source() provides a means of standardizing a target's commands. This standardization removes comments, strange indentation, etc. before drake decides if a command changed since last make(). Clearly, if all we do is add spaces or comments, we don't want drake to rebuild the target.

Requirements

A base R replacement should

1. Style the text's indentation and spacing in a consistent manner.
2. Remove comments.
3. Turn assignment =s and ->s into <-.

Challenges

R's default parser will take care of (1) and (2), but (3) is tricky.

library(magrittr)
  parse(text = "z = {f('#') # comment
      
x <- 5
    }",
    keep.source = FALSE
  )[[1]] %>%
    deparse() %>%
    cat(sep = "\n")
#> z = {
#>     f("#")
#>     x <- 5
#> }

^{Created on 2018-10-28 by the reprex package (v0.2.1)}
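
To make the gap in (3) concrete, here is a minimal base R sketch of what the parse/deparse round trip does and does not normalize. The helper name `normalize()` is hypothetical and only exists for this illustration.

```r
# Hypothetical helper: round-trip one expression through the parser.
normalize <- function(code) {
  deparse(parse(text = code, keep.source = FALSE)[[1]])
}

normalize("5 -> x  # comment")  # right arrow becomes a left arrow, comment is dropped
#> [1] "x <- 5"
normalize("x   <-   5")         # spacing is standardized
#> [1] "x <- 5"
normalize("x = 5")              # the equals sign survives
#> [1] "x = 5"
```

So the parser already covers arrows, comments, and whitespace; only the `=` assignment needs extra work.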

Rollout

This change will invalidate all targets in all workflows, so I am postponing the release until drake 7.0.0 (first half of 2019). We should also add a warning in assert_compatible_cache().

wlandau · 2018-10-28T18:46:21Z

To elaborate: this is drake's current behavior.

drake:::standardize_command
#> function (x) 
#> {
#>     x <- ignore_ignore(x) %>% language_to_text
#>     formatR::tidy_source(source = NULL, comment = FALSE, blank = FALSE, 
#>         arrow = TRUE, brace.newline = FALSE, indent = 4, output = FALSE, 
#>         text = as.character(x), width.cutoff = 119)$text.tidy %>% 
#>         paste(collapse = "\n") %>% braces
#> }
#> <bytecode: 0x4631438>
#> <environment: namespace:drake>

^{Created on 2018-10-28 by the reprex package (v0.2.1)}

wlandau · 2018-10-28T18:47:37Z

Also cc @lorenzwalthert. Do you have advice on turning assignment = into <- using base R?

wlandau · 2018-10-28T19:52:08Z

Relevant: https://stackoverflow.com/questions/53035126/turn-into-using-only-base-r?noredirect=1#comment92971001_53035126

s-fleck · 2018-10-28T20:41:02Z

Have you looked at base R's parse()? That's one step further than just reformatting the code.

hrbrmstr · 2018-10-28T20:51:41Z

library(magrittr)

raw_src <- "z = {f('#') # comment

x <- 5
y = 'test'
    }"

# so we can have some tasty parse data
first <- parse(text = raw_src, keep.source = TRUE)

# this makes a nice data frame of the tokenized R source including line and column positions of the source bits
src_info <- getParseData(first, TRUE)

# only care about those blasphemous = assignments
elements_with_equals_assignment <- subset(src_info, token == "EQ_ASSIGN")

# take the source and split it into lines
raw_src_lines <- strsplit(raw_src, "\n")[[1]]

# for as many instances in the data frame replace the = with <-
for (idx in seq_len(nrow(elements_with_equals_assignment))) {
  stringi::stri_sub(
    raw_src_lines[elements_with_equals_assignment[idx, "line1"]],
    elements_with_equals_assignment[idx, "col1"],
    elements_with_equals_assignment[idx, "col2"]
  ) <- "<-"
}

# put the lines back together and do the thing
parse(
  text = paste0(raw_src_lines, collapse="\n"),
  keep.source = FALSE
)[[1]] %>%
  deparse() %>%
  cat(sep = "\n")
## z <- {
##     f("#")
##     x <- 5
##     y <- "test"
## }

Base R's substring assignment won't layer in bigger width replacements but stringi's will. If stringi is one of those dependencies you're trying to get rid of (unlikely given your willingness to have the 57-package dependency tidyverse along for the ride) then you may want to convert that super-simple approach to some string slicing and dicing.

For folks who have put multiple = assignments on one line (via something like ;) you'll have to bulletproof this a bit by supplying all the positions on that line to stringi::stri_sub() (stringi::stri_sub() supports vectors or a two-column matrix for the from/to) or keep track of them yourself with more manual string slicing and dicing.
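
The width limitation of base R's substring assignment can be seen in a small sketch like the one below. The variable names and the hard-coded column are illustrative assumptions; in practice the column would come from the `col1` field of `getParseData()`.

```r
line <- "y = 'test'"
col <- 3  # column of the `=` token (assumed here; normally taken from col1)

# In-place replacement keeps the original width, so "<-" is cut down to "<".
clipped <- line
substr(clipped, col, col) <- "<-"
clipped
#> [1] "y < 'test'"

# Slicing around the token and pasting the pieces back allows a wider replacement.
widened <- paste0(
  substr(line, 1, col - 1),
  "<-",
  substr(line, col + 1, nchar(line))
)
widened
#> [1] "y <- 'test'"
```

Because each replacement lengthens the line by one character, several replacements on the same line are easiest to apply from right to left, so earlier column positions stay valid.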

wlandau · 2018-10-28T22:03:03Z

Thank you so much, @hrbrmstr! Your demo of utils::getParseData() (which I did not know about before) was exactly what I needed to see!

I actually prefer not to depend on stringi because it takes a long time to compile, among other reasons. And as for tidyverse packages, I do not mean the package called tidyverse, just a select few like dplyr and rlang that are already in the DESCRIPTION file.

Anyway, if we go through each line backwards, we can handle equals signs using only base and utils.

standardize_code <- function(x){
  x <- as.character(x)
  parsed <- parse(text = x, keep.source = TRUE)
  info <- utils::getParseData(parsed, includeText = TRUE)
  lines <- replace_equals(lines = strsplit(x, "\n")[[1]], info = info)
  out <- parse(
    text = paste0(lines, collapse="\n"),
    keep.source = FALSE
  )[[1]]
  paste0(deparse(out), collapse = "\n")
}

replace_equals <- function(lines, info){
  equals <- info[info$token == "EQ_ASSIGN", ]
  for (line in unique(equals$line1)){
    line_info <- equals[equals$line1 == line, ]
    for (col in sort(line_info$col1, decreasing = TRUE)){
      lines[line] <- paste0(
        substr(x = lines[line], start = 0, stop = col - 1),
        "<-",
        substr(x = lines[line], start = col + 1, stop = nchar(lines[line]))
      )
    }
  }
  lines
}

y <- standardize_code("z = {f('#'); w = 123; char = \"str\"  # comment

x = y = z
`=`(zz, yy)
x <- 5
5 -> x
  y = 'test'  
z <- function(a = 1){
  sqrt(x = a)
}
function(b = mtcars){
  lm(mpg ~ wt, data = b)
} -> f
}")
cat(y)
#> z <- {
#>     f("#")
#>     w <- 123
#>     char <- "str"
#>     x <- y <- z
#>     zz = yy
#>     x <- 5
#>     x <- 5
#>     y <- "test"
#>     z <- function(a = 1) {
#>         sqrt(x = a)
#>     }
#>     function(b = mtcars) f <- {
#>         lm(mpg ~ wt, data = b)
#>     }
#> }

^{Created on 2018-10-28 by the reprex package (v0.2.1)}

wlandau · 2018-10-28T22:05:51Z

So standardize_command() will look like this:

standardize_command <- function(x) {
  ignore_ignore(x) %>%
    language_to_text() %>%
    standardize_code()
}

wlandau · 2018-10-28T22:13:27Z

From this comment:

+1 but i could break it with '='(zz,yy) and zz = yy = 3

Above, x = y = z becomes x <- y <- z. We retain an equals sign for '='(zz, yy) but that special case is unlikely to crop up in user-side drake commands.

Another thing: the following code

function(b = mtcars){
  lm(mpg ~ wt, data = b)
} -> f

becomes

function(b = mtcars) f <- {
  lm(mpg ~ wt, data = b)
}

I am okay with that. For the purposes of drake:::standardize_command(), the code does not actually need to run. It just needs to be in a standardized format that only depends on what the return value is likely to be. Meaningful edits should change the standardized code, but purely stylistic edits should not.

s-fleck · 2018-10-29T07:00:44Z

I wonder if there is any way to utilise the byte compiler for this? something like compile the function, calculate the md5 sum of the byte code, see if it matches the md5 of the last known version. I tried playing around with this yesterday but I could find no way to retrieve the actual byte code (just its memory address)

lorenzwalthert · 2018-10-29T09:52:37Z

I am not sure I can follow everything, but thanks to everyone involved for working on this :-)

First, formatR has zero dependencies, so replacing functionality from a widely used package with an internal solution for a non-trivial task (see the problems mentioned throughout the thread) does not seem that beneficial to me.

Second, I am not sure if styling is the way to go if the problem can be understood as caching. I.e. why not hash the parse data? We already have digest as a first-level dependency, so here is my proposal:

library(magrittr)

assign_one <- "z = {f('#') # comment
x <- 5

y <- 'test'
z <- 4

x2 <- 'test2' 
}"

assign_two <- "z = {f('#') # comment X
x = 5

y <- 'test'
z = 4
'test2' -> x2
}"

hash_code <- function(code) {
  # reverse '->' to '<-' AND keep code valid
  code <- deparse(parse(text = code)[[1]])
  parse_data <- parse(text = code, keep.source = TRUE) %>%
    getParseData(includeText = TRUE)
  change_row <- parse_data$token == "LEFT_ASSIGN"
  parse_data$text[change_row] <- "="
  parse_data$token[change_row] <- "EQ_ASSIGN"
  hash <- parse_data[parse_data$token != "COMMENT" & parse_data$terminal, c("token", "text")] %>%
    c() %>%
    digest::digest()
}



lapply(list(assign_one, assign_two), hash_code)
#> [[1]]
#> [1] "96c34606528d64f2d258731a79837c1b"
#> 
#> [[2]]
#> [1] "96c34606528d64f2d258731a79837c1b"

^{Created on 2018-10-29 by the reprex package (v0.2.1)}

Note that this turns -> into <- in a valid way and then converts both to =, so it meets all requirements outlined above under (3). Credit for the -> reversal goes to r-lib/styler#409 (comment).

Third, allowing any other assignment than = inside drake::drake_plan() is wrong and I believe it should not only throw a warning, but in fact return a plain error, as the resulting plan is not useful anyways.

drake::drake_plan(g <- x)
#> Warning: Use `=` instead of `<-` or `->` to assign targets to commands in `drake_plan()`. For example, write `drake_plan(a = 1)` instead of `drake_plan(a <- 1)`. Arrows were used to declare these commands:
#>   g <- x
#> # A tibble: 1 x 2
#>   target         command
#>   <chr>          <chr>  
#> 1 drake_target_1 g <- x

^{Created on 2018-10-29 by the reprex package (v0.2.1)}

This is really an edge case and getting it right is probably quite hard. So I am not sure why we are trying to create a solution that can hash x <- 1 and 1 -> x to the same value.

Also, the solution hashes z = x <- 3 and z <- x = 3 to the same value. But if this is used within drake::drake_plan() this throws an error already, so I think it's irrelevant to this question (?)
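
The collision between z = x <- 3 and z <- x = 3 can be checked at the token level without digest. This is only a sketch: `assignment_blind_tokens()` is a hypothetical helper that mirrors the recoding step of hash_code() but returns the token text instead of a hash.

```r
# Hypothetical helper: terminal token text of `code`, with `<-` recoded as `=`.
assignment_blind_tokens <- function(code) {
  # Round trip through the parser to drop comments and reverse `->`.
  code <- deparse(parse(text = code, keep.source = FALSE)[[1]])
  info <- utils::getParseData(
    parse(text = code, keep.source = TRUE),
    includeText = TRUE
  )
  info <- info[info$terminal & info$token != "COMMENT", ]
  info$text[info$token == "LEFT_ASSIGN"] <- "="
  info$text
}

assignment_blind_tokens("z = x <- 3")
#> [1] "z" "=" "x" "=" "3"
identical(
  assignment_blind_tokens("z = x <- 3"),
  assignment_blind_tokens("z <- x = 3")
)
#> [1] TRUE
```

Both inputs only need to parse, not run, which is why the second one is usable here even though evaluating it would fail.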

wlandau · 2018-10-30T01:56:10Z

I really like hash_code() because it is tidy and explicit. I will introduce it into #563 and give it more thought.

I think this is unrelated to the API function drake_plan(). There, the expected format is drake_plan(target = command).

drake::drake_plan(y = g <- x)
#> # A tibble: 1 x 2
#>   target command
#>   <chr>  <chr>  
#> 1 y      g <- x

wlandau · 2018-10-30T02:08:07Z

Also, I think we can avoid serialization when we hash if we paste everything together first. Could save some time that we could reinvest in a long-ish hash algorithm.

standardize_code <- function(x){
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")] %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE)
}

wlandau · 2018-10-30T02:47:20Z

But it still seems to have problems.

library(magrittr)
standardize_code <- function(x){
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")] %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE)
}
standardize_code("y=sqrt(x=1)")
#> [1] "e133b7dbb0b12167bd18f31f82ecc79cdf9f6914c256eb5762d749502e5457a4"
standardize_code("y <- sqrt(x = 1)")
#> [1] "2043d80987b0787db97bc408d3bbad3b974e92e4106f3d2a258c8f2d46e5e014"

^{Created on 2018-10-29 by the reprex package (v0.2.1)}

It appears the EQ_ASSIGN gets labeled as EQ_SUB.

library(magrittr)
standardize_code <- function(x){
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")]
}
standardize_code("y=sqrt(x=1)")
#>                   token       text
#> 1  SYMBOL_FUNCTION_CALL expression
#> 2                   '('          (
#> 4            SYMBOL_SUB          y
#> 5                EQ_SUB          =
#> 6  SYMBOL_FUNCTION_CALL       sqrt
#> 7                   '('          (
#> 9            SYMBOL_SUB          x
#> 10               EQ_SUB          =
#> 11            NUM_CONST          1
#> 13                  ')'          )
#> 17                  ')'          )
standardize_code("y <- sqrt(x = 1)")
#>                   token       text
#> 1  SYMBOL_FUNCTION_CALL expression
#> 2                   '('          (
#> 4                SYMBOL          y
#> 5             EQ_ASSIGN          =
#> 7  SYMBOL_FUNCTION_CALL       sqrt
#> 8                   '('          (
#> 10           SYMBOL_SUB          x
#> 11               EQ_SUB          =
#> 12            NUM_CONST          1
#> 14                  ')'          )
#> 18                  ')'          )

^{Created on 2018-10-29 by the reprex package (v0.2.1)}

lorenzwalthert · 2018-10-30T07:52:05Z

You forgot to get rid of the expression term using subsetting, i.e. you have

  x <- deparse(parse(text = as.character(x), keep.source = FALSE)) %>%

but you need

  x <- deparse(parse(text = as.character(x), keep.source = FALSE)[[1]]) %>%

so

library(magrittr)
standardize_code <- function(x) {
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)[[1]]) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")] %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE)
}
standardize_code("y=sqrt(x=1)")
#> [1] "2f3905f7987e8b54c1cdad95475c8a754a591e5a124a43123df387337abb4866"
standardize_code("y <- sqrt(x = 1)")
#> [1] "2f3905f7987e8b54c1cdad95475c8a754a591e5a124a43123df387337abb4866"

^{Created on 2018-10-30 by the reprex package (v0.2.0.9000)}
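
For anyone skimming, the effect of the missing [[1]] is easiest to see by looking at the intermediate strings. A minimal sketch, using only base R:

```r
exprs <- parse(text = "y=sqrt(x=1)", keep.source = FALSE)

# Deparsing the whole expression vector wraps the code in expression(...),
# so a second parse sees `y =` as a named argument rather than an assignment.
deparse(exprs)
#> [1] "expression(y = sqrt(x = 1))"

# Deparsing the first element returns just the code itself.
deparse(exprs[[1]])
#> [1] "y = sqrt(x = 1)"
```

That wrapper is where the extra expression and EQ_SUB tokens in the earlier output came from.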

wlandau · 2018-10-30T09:58:30Z

Awesome, thanks! With that, we also need to handle empty strings.

standardize_code <- function(x){
  if (!length(x)){
    return(as.character(NA))
  }
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)[[1]]) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")] %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE)
}

And now, it passes drake's test suite. I have updated #563.
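
A note on empty input, as a sketch rather than what #563 actually does: the guard above covers zero-length vectors, which matters because parse() only uses text when it has length greater than zero. An empty string or a comment-only string is a different case, since parse() then returns a zero-length expression and [[1]] would fail. The helper `has_code()` below is hypothetical and shows one way to detect both cases.

```r
# Hypothetical helper (not part of drake): TRUE if `x` contains parseable code.
has_code <- function(x) {
  x <- as.character(x)
  # Check the length first so parse() is never called with zero-length text.
  length(x) > 0 && length(parse(text = x, keep.source = FALSE)) > 0
}

has_code(character(0))
#> [1] FALSE
has_code("")
#> [1] FALSE
has_code("# just a comment")
#> [1] FALSE
has_code("x <- 1")
#> [1] TRUE
```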

wlandau · 2018-10-30T12:46:48Z

A potential speed issue: #563 (comment)

wlandau added the topic: reproducibility label Oct 27, 2018

wlandau mentioned this issue Oct 28, 2018

Consider styler to replace formatR #201

Closed

wlandau added this to the Version 7.0.0 milestone Oct 28, 2018

This was referenced Oct 28, 2018

A base R replacement to formatR::tidy_source() #563

Merged

Assess the feasibility of CodeDepends for all the static code analysis #41

Closed

wlandau added topic: dependencies depends: a future release labels Oct 28, 2018

wlandau pushed a commit that referenced this issue Oct 30, 2018

Fix #562

3b7455e

wlandau mentioned this issue Oct 31, 2018

Review the code for non-constant-time storr operations #566

Closed

wlandau mentioned this issue Dec 6, 2018

Make users pause if the cache was created with an old version of drake. #602

Closed

wlandau pushed a commit that referenced this issue Dec 16, 2018

Fix #562

ae4f4b8

wlandau closed this as completed Dec 16, 2018

wlandau reopened this Dec 16, 2018

wlandau removed depends: a future release labels Jan 2, 2019

wlandau closed this as completed in d199fdb Jan 2, 2019
