Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A base R replacement for formatR::tidy_source() #562

Closed
wlandau opened this issue Oct 27, 2018 · 16 comments
Closed

A base R replacement for formatR::tidy_source() #562

wlandau opened this issue Oct 27, 2018 · 16 comments

Comments

@wlandau
Copy link
Member

wlandau commented Oct 27, 2018

Background

I am trying to trim down drake's package dependencies. I removed 7 just yesterday, bringing the count down from 22 to 15. I am not too concerned about base packages like utils, packages in r-lib like withr, or tidyverse packages like dplyr. And it seems infeasible to remove storr, igraph, or codetools. So as I continue on, I hope to find a base R replacement for formatR::tidy_source().

formatR::tidy_source() provides a means of standardizing a target's commands. This standardization removes comments, strange indentation, etc. before drake decides if a command changed since last make(). Clearly, if all we do is add spaces or comments, we don't want drake to rebuild the target.

Requirements

A base R replacement should

  1. Style the text's indentation and spacing in a consistent manner.
  2. Remove comments.
  3. Turn assigment =s and <-'s into ->.

Challenges

R's default parser will take care of (1) and (2), but (3) is tricky.

library(magrittr)
  parse(text = "z = {f('#') # comment
      
x <- 5
    }",
    keep.source = FALSE
  )[[1]] %>%
    deparse() %>%
    cat(sep = "\n")
#> z = {
#>     f("#")
#>     x <- 5
#> }

Created on 2018-10-28 by the reprex package (v0.2.1)
Created on 2018-10-28 by the reprex package (v0.2.1)

Rollout

This change will invalidate all targets in all workflows, so I am postponing the release until drake 7.0.0 (first half of 2019). We should also add a warning in assert_compatible_cache().

@wlandau
Copy link
Member Author

wlandau commented Oct 28, 2018

To elaborate: this is drake's current behavior.

drake:::standardize_command
#> function (x) 
#> {
#>     x <- ignore_ignore(x) %>% language_to_text
#>     formatR::tidy_source(source = NULL, comment = FALSE, blank = FALSE, 
#>         arrow = TRUE, brace.newline = FALSE, indent = 4, output = FALSE, 
#>         text = as.character(x), width.cutoff = 119)$text.tidy %>% 
#>         paste(collapse = "\n") %>% braces
#> }
#> <bytecode: 0x4631438>
#> <environment: namespace:drake>

Created on 2018-10-28 by the reprex package (v0.2.1)

@wlandau
Copy link
Member Author

wlandau commented Oct 28, 2018

Also cc @lorenzwalthert. Do you have advice on turning assignment = into <- using base R?

@wlandau
Copy link
Member Author

wlandau commented Oct 28, 2018

@s-fleck
Copy link

s-fleck commented Oct 28, 2018

have you looked at base R`s parse()? that's one step further then just reformatting the code

@hrbrmstr
Copy link

hrbrmstr commented Oct 28, 2018

library(magrittr)

raw_src <- "z = {f('#') # comment

x <- 5
y = 'test'
    }"

# so we can have some tasty parse data
first <- parse(text = raw_src, keep.source = TRUE)

# this makes a nice data frame of the tokenized R source including line and column positions of the source bits
src_info <- getParseData(first, TRUE)

# only care about those blasphemous = assignments
elements_with_equals_assignment <- subset(src_info, token == "EQ_ASSIGN")

# take the source and split it into lines
raw_src_lines <- strsplit(raw_src, "\n")[[1]]

# for as many instances in the data frame replace the = with <-
for (idx in 1:nrow(elements_with_equals_assignment)) {
  stringi::stri_sub(
    raw_src_lines[elements_with_equals_assignment[idx, "line1"]],
    elements_with_equals_assignment[idx, "col1"],
    elements_with_equals_assignment[idx, "col2"]
  ) <- "<-"
}

# put the lines back together and do the thing
parse(
  text = paste0(raw_src_lines, collapse="\n"),
  keep.source = FALSE
)[[1]] %>%
  deparse() %>%
  cat(sep = "\n")
## z <- {
##     f("#")
##     x <- 5
##     y <- "test"
## }

Base R's substring assignment won't layer in bigger width replacements but stringi's will. If stringi is one of those dependencies you're trying to get rid of (unlikely given your willingness to have the 57-package dependency tidyverse along for the ride) then you may want to convert that super-simple approach to some string slicing and dicing.

For folks who have put multiple = assignments on one line (via something like ;) you'll have to bulletproof this a bit by supplying all the positions on that line to stringi::stri_sub() (stringi::stri_sub() supports vectors or a two column matrix for the from/to) or keep track of the yourself in more manual string slicing and dicing.

@wlandau
Copy link
Member Author

wlandau commented Oct 28, 2018

Thank you so much, @hrbrmstr! Your demo of utils::getParseData() (which I did not know about before) was exactly what I needed to see!

I actually prefer not to depend on stringi because it takes a long time to compile, among other reasons. And as for tidyverse packages, I do not mean the package called tidyverse, just a select few like dplyr and rlang that are already in the DESCRIPTION file.

Anyway, if we go through each line backwards, we can handle equals signs using only base and utils.

standardize_code <- function(x){
  x <- as.character(x)
  parsed <- parse(text = x, keep.source = TRUE)
  info <- utils::getParseData(parsed, includeText = TRUE)
  lines <- replace_equals(lines = strsplit(x, "\n")[[1]], info = info)
  out <- parse(
    text = paste0(lines, collapse="\n"),
    keep.source = FALSE
  )[[1]]
  paste0(deparse(out), collapse = "\n")
}

replace_equals <- function(lines, info){
  equals <- info[info$token == "EQ_ASSIGN", ]
  for (line in unique(equals$line1)){
    line_info <- equals[equals$line1 == line, ]
    for (col in sort(line_info$col1, decreasing = TRUE)){
      lines[line] <- paste0(
        substr(x = lines[line], start = 0, stop = col - 1),
        "<-",
        substr(x = lines[line], start = col + 1, stop = nchar(lines[line]))
      )
    }
  }
  lines
}

y <- standardize_code("z = {f('#'); w = 123; char = \"str\"  # comment

x = y = z
`=`(zz, yy)
x <- 5
5 -> x
  y = 'test'  
z <- function(a = 1){
  sqrt(x = a)
}
function(b = mtcars){
  lm(mpg ~ wt, data = b)
} -> f
}")
cat(y)
#> z <- {
#>     f("#")
#>     w <- 123
#>     char <- "str"
#>     x <- y <- z
#>     zz = yy
#>     x <- 5
#>     x <- 5
#>     y <- "test"
#>     z <- function(a = 1) {
#>         sqrt(x = a)
#>     }
#>     function(b = mtcars) f <- {
#>         lm(mpg ~ wt, data = b)
#>     }
#> }

Created on 2018-10-28 by the reprex package (v0.2.1)

@wlandau
Copy link
Member Author

wlandau commented Oct 28, 2018

So standardize_command() will look like this:

standardize_command <- function(x) {
  ignore_ignore(x) %>%
    language_to_text() %>%
    standardize_code()
}

@wlandau
Copy link
Member Author

wlandau commented Oct 28, 2018

From this comment:

+1 but i could break it with '='(zz,yy) and zz = yy = 3

Above, x = y = z becomes x <- y <- z. We retain an equals sign for '='(zz, yy) but that special case is unlikely to crop up in user-side drake commands.

Another thing: the following code

function(b = mtcars){
  lm(mpg ~ wt, data = b)
} -> f

becomes

function(b = mtcars) f <- {
  lm(mpg ~ wt, data = b)
}

I am okay with that. For the purposes of drake:::standardize_command(), the code does not actually need to run. It just needs to be in a standardized format that only depends on what the return value is likely to be. Meaningful edits should change the standardized code, but purely stylistic edits shoild not.

@s-fleck
Copy link

s-fleck commented Oct 29, 2018

I wonder if there is any way to utilise the byte compiler for this? something like compile the function, calculate the md5 sum of the byte code, see if it matches the md5 of the last known version. I tried playing around with this yesterday but I could find no way to retrieve the actual byte code (just its memory address)

@lorenzwalthert
Copy link
Contributor

lorenzwalthert commented Oct 29, 2018

I am not sure I can follow, but thanks all involved people to work on this :-)

First, formatR has zero dependencies, so the benefit of replacing functionality from a widely used package with an internal solution for a non-trivial task (problems mentioned throughout the thread) does not seem that beneficial to me.

Second, I am not sure if styling is the way to go if the problem can be understood as caching. I.e. why not hashing the parse data? We already have digest as a first-level dependency, so here is my proposal:

library(magrittr)

assign_one <- "z = {f('#') # comment
x <- 5

y <- 'test'
z <- 4

x2 <- 'test2' 
}"

assign_two <- "z = {f('#') # comment X
x = 5

y <- 'test'
z = 4
'test2' -> x2
}"

hash_code <- function(code) {
  # reverse '->' to '<-' AND keep code valid
  code <- deparse(parse(text = code)[[1]])
  parse_data <- parse(text = code, keep.source = TRUE) %>%
    getParseData(includeText = TRUE)
  change_row <- parse_data$token == "LEFT_ASSIGN"
  parse_data$text[change_row] <- "="
  parse_data$token[change_row] <- "EQ_ASSIGN"
  hash <- parse_data[parse_data$token != "COMMENT" & parse_data$terminal, c("token", "text")] %>%
    c() %>%
    digest::digest()
}



lapply(list(assign_one, assign_two), hash_code)
#> [[1]]
#> [1] "96c34606528d64f2d258731a79837c1b"
#> 
#> [[2]]
#> [1] "96c34606528d64f2d258731a79837c1b"

Created on 2018-10-29 by the reprex package (v0.2.1)

Note that this turns -> into <- in a valid way and then converts both to =, so it meets all requirements outlined above under (3). Credit for reversing -> to r-lib/styler#409 (comment).

Third, allowing any other assignment than = inside drake::drake_plan() is wrong and I believe it should not only throw a warning, but in fact return a plain error, as the resulting plan is not useful anyways.

drake::drake_plan(g <- x)
#> Warning: Use `=` instead of `<-` or `->` to assign targets to commands in `drake_plan()`. For example, write `drake_plan(a = 1)` instead of `drake_plan(a <- 1)`. Arrows were used to declare these commands:
#>   g <- x
#> # A tibble: 1 x 2
#>   target         command
#>   <chr>          <chr>  
#> 1 drake_target_1 g <- x

Created on 2018-10-29 by the reprex package (v0.2.1)

This is really an edge case and getting it right is probably quite hard. So I am not sure why we are trying to create a solution that can hash x <- 1 and 1 -> x to the same value.

Also, the solution hashes z = x <- 3 and z <- x = 3 to the same value. But if this is used within drake::drake_plan() this throws an error already, so I think it's irrelevant to this question (?)

@wlandau
Copy link
Member Author

wlandau commented Oct 30, 2018

I really hash_code() because it is tidy and explicit. I will introduce it into #563 and give it more thought.

I think this is unrelated to the API function drake_plan(). There, the expected format is drake_plan(target = command).

drake::drake_plan(y = g <- x)
#> # A tibble: 1 x 2
#>   target command
#>   <chr>  <chr>  
#> 1 y      g <- x 

@wlandau
Copy link
Member Author

wlandau commented Oct 30, 2018

Also, I think we can avoid serialization when we hash if we paste everything together first. Could save some time that we could reinvest in a long-ish hash algorithm.

standardize_code <- function(x){
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")] %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE)
}

@wlandau
Copy link
Member Author

wlandau commented Oct 30, 2018

But it still seems to have problems.

library(magrittr)
standardize_code <- function(x){
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")] %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE)
}
standardize_code("y=sqrt(x=1)")
#> [1] "e133b7dbb0b12167bd18f31f82ecc79cdf9f6914c256eb5762d749502e5457a4"
standardize_code("y <- sqrt(x = 1)")
#> [1] "2043d80987b0787db97bc408d3bbad3b974e92e4106f3d2a258c8f2d46e5e014"

Created on 2018-10-29 by the reprex package (v0.2.1)

It appears the EQ_ASSIGN gets labeled as EQ_SUB.

library(magrittr)
standardize_code <- function(x){
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")]
}
standardize_code("y=sqrt(x=1)")
#>                   token       text
#> 1  SYMBOL_FUNCTION_CALL expression
#> 2                   '('          (
#> 4            SYMBOL_SUB          y
#> 5                EQ_SUB          =
#> 6  SYMBOL_FUNCTION_CALL       sqrt
#> 7                   '('          (
#> 9            SYMBOL_SUB          x
#> 10               EQ_SUB          =
#> 11            NUM_CONST          1
#> 13                  ')'          )
#> 17                  ')'          )
standardize_code("y <- sqrt(x = 1)")
#>                   token       text
#> 1  SYMBOL_FUNCTION_CALL expression
#> 2                   '('          (
#> 4                SYMBOL          y
#> 5             EQ_ASSIGN          =
#> 7  SYMBOL_FUNCTION_CALL       sqrt
#> 8                   '('          (
#> 10           SYMBOL_SUB          x
#> 11               EQ_SUB          =
#> 12            NUM_CONST          1
#> 14                  ')'          )
#> 18                  ')'          )

Created on 2018-10-29 by the reprex package (v0.2.1)

@lorenzwalthert
Copy link
Contributor

You forgot to get rid of the expression term using subsetting, i.e. you have

  x <- deparse(parse(text = as.character(x), keep.source = FALSE)) %>%

but you need

  x <- deparse(parse(text = as.character(x), keep.source = FALSE)[[1]]) %>%

so

library(magrittr)
standardize_code <- function(x) {
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)[[1]]) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")] %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE)
}
standardize_code("y=sqrt(x=1)")
#> [1] "2f3905f7987e8b54c1cdad95475c8a754a591e5a124a43123df387337abb4866"
standardize_code("y <- sqrt(x = 1)")
#> [1] "2f3905f7987e8b54c1cdad95475c8a754a591e5a124a43123df387337abb4866"

Created on 2018-10-30 by the reprex package (v0.2.0.9000)

wlandau pushed a commit that referenced this issue Oct 30, 2018
@wlandau
Copy link
Member Author

wlandau commented Oct 30, 2018

Awesome, thanks! With that, we also need to handle empty strings.

standardize_code <- function(x){
  if (!length(x)){
    return(as.character(NA))
  }
  x <- deparse(parse(text = as.character(x), keep.source = FALSE)[[1]]) %>%
    paste(collapse = "\n")
  info <- parse(text = x, keep.source = TRUE) %>%
    utils::getParseData(includeText = TRUE)
  change <- info$token == "LEFT_ASSIGN"
  info$text[change] <- "="
  info$token[change] <- "EQ_ASSIGN"
  info[info$token != "COMMENT" & info$terminal, c("token", "text")] %>%
    lapply(FUN = paste, collapse = " ") %>%
    paste(collapse = " >> ") %>%
    digest::digest(algo = "sha256", serialize = FALSE)
}

And now, it passes drake's test suite. I have updated #563.

@wlandau
Copy link
Member Author

wlandau commented Oct 30, 2018

A potential speed issue: #563 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants