Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substring duplication for dplyr::mutate #373

Closed
tonytonov opened this issue Mar 14, 2018 · 10 comments
Closed

Substring duplication for dplyr::mutate #373

tonytonov opened this issue Mar 14, 2018 · 10 comments

Comments

@tonytonov
Copy link

tonytonov commented Mar 14, 2018

I am seeing a strange issue for a particular combination of commands followed by mutate.
Session info:

 version  R version 3.4.3 (2017-11-30)
 system   x86_64, mingw32             
 ui       RStudio (1.1.423) 
 styler                 1.0.0

Initial code selection (truncated):

df <- bind_rows(df_a, df_d) %>%
  mutate("POP Name" = paste(symbol, "POP"))

After applying styler::style_text(text) over that code, I have it turned into:

df <- bind_rows(df_a, df_d) %>%
  mutate(mutate("POP Name" = paste(symbol, "POP")) = paste(symbol, "POP"))

The issue very inconsistent and only appears if this code is preceded by a specific set of character strings used as variables. I have a full reprex narrowed down, but I cannot share it publicly. If you are interested, please contact me via email and I'll send it.

@lorenzwalthert
Copy link
Collaborator

Thanks @tonytonov for reporting this. I assume you are running styler 1.0.1, which was just released on CRAN a few days ago.
I'd definitively like to have a look at a full reprex and I contacted you via email. We had a few issues related to bugs in the R parser, or that's at least what we thought it was. Most notably #100, #187, #259, #327.
Remotely related: In #216, we found that long strings get truncated when parsed, so we worked around that as well.

@tonytonov
Copy link
Author

Great, the reprex is sent. The issues you mention definitely look related, especially #216. There are exactly slightly more than 1000 characters in the preceding string, so I believe that's why I couldn't narrow the reprex even further down.

@lorenzwalthert
Copy link
Collaborator

lorenzwalthert commented Mar 14, 2018

Interestingly, I cannot reproduce the problem on my machine with the code you sent me. With styler 1.0.1, I get (output truncated):

df <-
  bind_rows(df_a, df_d) %>%
  mutate("POP Name" = paste(symbol, "POP"))

This seems strange since you pointed out in your email that the issue persists with styler 1.0.1

@lorenzwalthert
Copy link
Collaborator

sessioninfo::session_info(remotes::local_package_deps("../styler/"))
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 os       macOS Sierra 10.12.6        
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       Europe/Zurich               
 date     2018-03-14Packages ──────────────────────────────────────────────────────────────────────────────────────────────
 package    * version date       source        
 assertthat   0.2.0   2017-04-11 CRAN (R 3.4.0)
 backports    1.1.2   2017-12-13 CRAN (R 3.4.3)
 cli          1.0.0   2017-11-05 CRAN (R 3.4.2)
 crayon       1.3.4   2017-09-16 CRAN (R 3.4.1)
 enc          0.2.0   2018-03-03 CRAN (R 3.4.3)
 magrittr     1.5     2014-11-22 CRAN (R 3.4.0)
 pillar       1.2.1   2018-02-27 CRAN (R 3.4.3)
 purrr        0.2.4   2017-10-18 CRAN (R 3.4.2)
 rematch2     2.0.1   2017-06-20 CRAN (R 3.4.1)
 rlang        0.2.0   2018-02-20 CRAN (R 3.4.3)
 rprojroot    1.3-2   2018-01-03 CRAN (R 3.4.3)
 tibble       1.4.2   2018-01-22 CRAN (R 3.4.3)
 utf8         1.1.3   2018-01-03 CRAN (R 3.4.3)
 withr        2.1.1   2017-12-19 CRAN (R 3.4.3)

@lorenzwalthert
Copy link
Collaborator

I suggest you @tonytonov to first capture such a sessioninfo before changing any configurations of you system, so we know afterwards what has caused the problem (once we found it out).

@tonytonov
Copy link
Author

Replied via email; in addition to that, here's the current setup:

─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 os       Windows >= 8 x64            
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 tz       Europe/Moscow               
 date     2018-03-14                  

─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────
 package    * version date       source        
 assertthat   0.2.0   2017-04-11 CRAN (R 3.4.3)
 backports    1.1.2   2017-12-13 CRAN (R 3.4.3)
 cli          1.0.0   2017-11-05 CRAN (R 3.4.3)
 crayon       1.3.4   2017-09-16 CRAN (R 3.4.3)
 enc          0.2.0   2018-03-03 CRAN (R 3.4.3)
 magrittr   * 1.5     2014-11-22 CRAN (R 3.4.3)
 pillar       1.2.1   2018-02-27 CRAN (R 3.4.3)
 purrr        0.2.4   2017-10-18 CRAN (R 3.4.3)
 rematch2     2.0.1   2017-06-20 CRAN (R 3.4.3)
 rlang      * 0.2.0   2018-02-20 CRAN (R 3.4.3)
 rprojroot  * 1.3-2   2018-01-03 CRAN (R 3.4.3)
 styler       1.0.1   2018-03-08 CRAN (R 3.4.3)
 tibble       1.4.2   2018-01-22 CRAN (R 3.4.3)
 utf8         1.1.3   2018-01-03 CRAN (R 3.4.3)
 withr        2.1.1   2017-12-19 CRAN (R 3.4.3)

@lorenzwalthert
Copy link
Collaborator

lorenzwalthert commented Mar 14, 2018

Ok, I found the problem. When implementing #230 as a response to #216, we did assume that the parents of all STR_CONST is an expression expr that contains nothing but the string. In verify_str_txt(), we
then replace text of these token with text of their parent. We do this because text of expr is completely parsed, while text of STR_CONST is only parsed correctly if not exceeding a character limit descried in #216.
This works fine for the below example:

styler:::get_parse_data("call(a = 'b')")
#> # A tibble: 9 x 11
#>   line1  col1 line2  col2    id parent token   terminal text  pos_id short
#> * <int> <int> <int> <int> <int>  <int> <chr>   <lgl>    <chr>  <int> <chr>
#> 1     1     1     1    13    11      0 expr    FALSE    call…      1 call(
#> 2     1     1     1     4     1      3 SYMBOL… TRUE     call       2 call 
#> 3     1     1     1     4     3     11 expr    FALSE    call       3 call 
#> 4     1     5     1     5     2     11 '('     TRUE     (          4 (    
#> 5     1     6     1     6     4     11 SYMBOL… TRUE     a          5 a    
#> 6     1     8     1     8     5     11 EQ_SUB  TRUE     =          6 =    
#> 7     1    10     1    12     6      8 STR_CO… TRUE     'b'        7 'b'  
#> 8     1    10     1    12     8     11 expr    FALSE    'b'        8 'b'  
#> 9     1    13     1    13     7     11 ')'     TRUE     )          9 )

Created on 2018-03-14 by the reprex package (v0.2.0).

However, as the following example shows, we cannot assume that the parent only contains the text of STR_CONST.

styler:::get_parse_data("call('a' = 'b')")
#> # A tibble: 9 x 11
#>   line1  col1 line2  col2    id parent token   terminal text  pos_id short
#> * <int> <int> <int> <int> <int>  <int> <chr>   <lgl>    <chr>  <int> <chr>
#> 1     1     1     1    15    11      0 expr    FALSE    call…      1 call(
#> 2     1     1     1     4     1      3 SYMBOL… TRUE     call       2 call 
#> 3     1     1     1     4     3     11 expr    FALSE    call       3 call 
#> 4     1     5     1     5     2     11 '('     TRUE     (          4 (    
#> 5     1     6     1     8     4     11 STR_CO… TRUE     'a'        5 'a'  
#> 6     1    10     1    10     5     11 EQ_SUB  TRUE     =          6 =    
#> 7     1    12     1    14     6      8 STR_CO… TRUE     'b'        7 'b'  
#> 8     1    12     1    14     8     11 expr    FALSE    'b'        8 'b'  
#> 9     1    15     1    15     7     11 ')'     TRUE     )          9 )

Created on 2018-03-14 by the reprex package (v0.2.0).

The parent of 4 is 11, which contains the whole expression. If there is any long string in code, all text of STR_CONST will be replaced with the text of their parents. This is why your @tonytonov example fails.
Hence, we showed that the solution #230 works for quoted arguments, but not for quoted names of the arguments if a long string is present anywhere in the code. Hence, we must find another way of getting text of long strings.
I think we implemented the solution with the parent because there were concerns that extracting text is brittle (#216 (comment)). However, I think there is no way around this. One alternative could be to only replace text of STR_CONST for strings that actually need a replacement (instead of all), but that is not elegant because long quoted argument names would still cause malformatting.

For that reason, I think we can close this issue and repoen #216.

@lorenzwalthert
Copy link
Collaborator

@tonytonov I just merged #384 into master and I think this problem should now be solved via closing #216. Can you confirm this? The solution should also work for the long example you sent me via email.

@tonytonov
Copy link
Author

@lorenzwalthert Thanks for the update! I can no longer reproduce the issue using the development version. I assume this will be included into the next CRAN release? (presumably 1.0.2?)

@lorenzwalthert
Copy link
Collaborator

lorenzwalthert commented Apr 13, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants