dplyr::mutate shadow copy #5557

Open
kongdd opened this issue Dec 8, 2022 · 4 comments

@kongdd

kongdd commented Dec 8, 2022

After creating a new variable with mutate(y = x), modifying y by reference also modifies x.

library(data.table)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

d = data.table(x = 1:4) %>% mutate(y = x)
d[x == 1, y := NA_integer_]
print(d)
#>     x  y
#> 1: NA NA
#> 2:  2  2
#> 3:  3  3
#> 4:  4  4

sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.10      data.table_1.14.6
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.2.2    pillar_1.8.1      highr_0.9         R.methodsS3_1.8.2
#>  [5] R.utils_2.12.2    tools_4.2.2       digest_0.6.30     evaluate_0.18    
#>  [9] lifecycle_1.0.3   tibble_3.1.8      R.cache_0.16.0    pkgconfig_2.0.3  
#> [13] rlang_1.0.6       reprex_2.0.2      cli_3.4.1         DBI_1.1.3        
#> [17] rstudioapi_0.14   yaml_2.3.6        xfun_0.35         fastmap_1.1.0    
#> [21] withr_2.5.0       styler_1.8.1      stringr_1.5.0     knitr_1.41       
#> [25] generics_0.1.3    fs_1.5.2          vctrs_0.5.1       tidyselect_1.2.0 
#> [29] glue_1.6.2        R6_2.5.1          fansi_1.0.3       rmarkdown_2.18   
#> [33] purrr_0.3.5       magrittr_2.0.3    htmltools_0.5.3   assertthat_0.2.1 
#> [37] utf8_1.2.2        stringi_1.7.8     R.oo_1.25.0

Created on 2022-12-08 with reprex v2.0.2

@avimallu
Contributor

avimallu commented Dec 8, 2022

I think this happens because mutate tries to be smart here (and rightly so), which doesn't work out well for data.table:

library(dplyr)  # provides %>% and mutate()
library(lobstr)
library(data.table)
d = data.table(x = 1:4) %>% mutate(y = x)
obj_addrs(d)
#> [1] "0x26388b06688" "0x26388b069c8"

You'll notice that the addresses of the x and y columns are identical. A quick workaround, though a somewhat expensive one (particularly if d is large), is:

as.data.table(d)[x==1, y:=NA_integer_][]

There has been some discussion about detecting whether other packages have modified a data.table, such as in #5084; a similar fix that specifically checks whether column vectors have identical addresses may be needed.
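
A check along those lines might look like the following minimal sketch. `shared_columns()` is a hypothetical helper, not part of data.table; it relies only on data.table's exported `address()`. The table with a shared column is built by hand with `setDT()` on a list, on the assumption that this shares the vector in the same way as the `mutate(y = x)` result.

```r
library(data.table)

# Hypothetical helper (not part of data.table): return the names of columns
# whose underlying vector is also used by an earlier column.
shared_columns <- function(dt) {
  addrs <- vapply(
    seq_along(dt),
    function(j) address(dt[[j]]),  # memory address of column j
    character(1)
  )
  names(dt)[duplicated(addrs)]
}

# Assumption: list(x = v, y = v) stores the same vector twice, and setDT()
# converts the list in place without copying its columns.
v <- c(1L, 2L, 3L, 4L)
d <- list(x = v, y = v)
setDT(d)

# Expected to report "y" if the two columns still share one vector.
shared_columns(d)
```

A table whose columns were created independently should yield an empty character vector.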

@jangorecki
Member

Another way is to incorporate reference counting. This would be a big change and not easy to handle well. For example, the environment browser panel in the RStudio IDE creates references to objects in the global environment. AFAIK, when you create a variable in RStudio, it has two references instead of one.

@tlapak
Contributor

tlapak commented Feb 14, 2023

I have only skimmed the dplyr code, but I don't think this is on them; it seems to be just standard R behaviour. One way to resolve this without reference counting would be to check for duplicate references in the top-level list of the data.table and then create copies of those. (A different way would be to ask dplyr nicely to sort this out for us, but that's really just a band-aid, and I don't think they'd be too thrilled about it...)
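
The duplicate-reference idea could be sketched roughly as follows. `unshare_columns()` is a hypothetical name, not an existing data.table function, and this is only an illustration of the approach using the exported `address()`, `copy()` and `set()`; where such a check would actually live inside data.table is left open. The shared column is again built by hand, assuming `setDT()` does not copy list elements.

```r
library(data.table)

# Hypothetical helper (not part of data.table): walk the top-level list of
# columns and replace any column whose address was already seen with a copy.
unshare_columns <- function(dt) {
  seen <- character(0)
  for (j in seq_along(dt)) {
    a <- address(dt[[j]])
    if (a %in% seen) {
      # Install a fresh copy as column j, then record its new address.
      set(dt, j = j, value = copy(dt[[j]]))
      a <- address(dt[[j]])
    }
    seen <- c(seen, a)
  }
  invisible(dt)
}

# Assumption: both list elements refer to the same vector after setDT().
v <- c(1L, 2L, 3L, 4L)
d <- list(x = v, y = v)
setDT(d)

unshare_columns(d)
d[x == 1L, y := NA_integer_]  # now only y is modified
```

The cost is one extra copy per shared column, which is cheaper than copying the whole table when only a few columns are affected.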

@TysonStanley
Member

Yes, this does not look like a data.table issue, since the behavior is the same for data.frames.

library(dplyr)
library(lobstr)
library(data.table)

d = data.table(x = 1:4) %>% mutate(y = x)
obj_addrs(d)
#> [1] "0x107a85b48" "0x107a85b48"

df = data.frame(x = 1:4) %>% mutate(y = x)
obj_addrs(df)
#> [1] "0x1071cf9e0" "0x1071cf9e0"

Created on 2024-07-07 with reprex v2.1.0
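
For contrast, here is a minimal base R sketch of why the shared address is normally harmless for a data.frame: base R assignment follows copy-on-modify semantics, so a shared vector is copied before it is changed, whereas data.table's `:=` modifies the column in place. The data frame here is built by hand rather than with mutate, which is an assumption made to keep the example free of dependencies.

```r
# Base R only: y starts out as the same values (and possibly the same
# underlying vector) as x.
df <- data.frame(x = 1:4)
df$y <- df$x

# Base R copies y before modifying it, so x is left untouched.
df$y[df$x == 1L] <- NA

df$x  # still 1, 2, 3, 4
df$y  # first element is now NA
```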
