Inconsistent behaviour when adding columns by reference in chains #1525

Closed
ChristK opened this issue Feb 8, 2016 · 11 comments

Comments

@ChristK

ChristK commented Feb 8, 2016

See minimal example below

require(data.table) ##v1.9.6 or v1.9.7
dt <- data.table(a = 1:10, b = 1:10)

dt[a<5][b<4] ##works
#    a b
#1: 1 1
#2: 2 2
#3: 3 3

dt[a<5][b<4, c := 3][] ##Seems to work although line 4 shouldn't be there
#    a b  c
#1: 1 1  3
#2: 2 2  3
#3: 3 3  3
#4: 4 4 NA

dt ##PROBLEM: column c is missing
#     a  b
#1:  1  1
#2:  2  2
#3:  3  3
#4:  4  4
#5:  5  5
#6:  6  6
#7:  7  7
#8:  8  8
#9:  9  9
#10: 10 10

If this is intended behaviour could you please add a warning that "new columns have not been assigned" or something along the lines?

sessionInfo()
# R version 3.2.3 (2015-12-10)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1

# locale:
# [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
# [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] data.table_1.9.7 devtools_1.10.0 

# loaded via a namespace (and not attached):
# [1] httr_1.1.0     R6_2.1.2       tools_3.2.3    withr_1.0.1    rstudioapi_0.5
# [6] curl_0.9.5     memoise_1.0.0  knitr_1.12.3   git2r_0.13.1   digest_0.6.9  
# [11] chron_2.3-47  
@DavidArenburg
Member

It works; it just updates the subset, as you can see with temp <- dt[a<5][b<4, c := 3] ; temp. Btw, for your specific task, you could just do dt[a<5 & b<4, c := 3]
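
A minimal sketch of the two forms mentioned above, reusing the dt from the original report. The stopifnot() checks are added here for illustration and are not part of the thread; they reflect the behaviour described in this issue.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 1:10)

# Chained form: := acts on the temporary subset returned by dt[a < 5]
temp <- dt[a < 5][b < 4, c := 3]
print(temp)                      # the subset, now carrying the new column c
stopifnot(!"c" %in% names(dt))   # dt itself has no column c

# Single-step form: i selects rows of dt, so := updates dt in place
dt[a < 5 & b < 4, c := 3]
stopifnot("c" %in% names(dt))
```

Capturing the chained result in temp is the only way to keep the updated subset, since nothing else refers to it once the chain has finished.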

@ChristK
Author

ChristK commented Feb 8, 2016

@DavidArenburg Thanks. This is a minimal example of a more complicated task (dt[a<5 & b<4, c := 3] is not as straightforward in my full case). I was hoping I would be able to perform this operation without a copy of the original dt (and dt[a<5 & b<4, c := 3])

Anyway I now see the logic. I'm closing the issue as this is considered expected behaviour.

@ChristK ChristK closed this as completed Feb 8, 2016
@ChristK ChristK changed the title from "Add column by reference is not working in chains, when subsetting is used" to "Inconsistent behaviour when adding columns by reference in chains" Feb 8, 2016
@ChristK
Author

ChristK commented Feb 8, 2016

I am reopening this with an updated title. See updated minimal example

require(data.table) ##v1.9.6 or v1.9.7
dt <- data.table(a = 1:3)
dt[a<3, ][a<2, b := 1]
dt ## As the initial example, original dt remains unaltered
#     a
# 1: 1
# 2: 2
# 3: 3

However...

dt[a<3, c := 3][a<2, b := 1]
dt  ##dt has now been updated
#     a  c  b
# 1: 1  3  1
# 2: 2  3 NA
# 3: 3 NA NA

I think that for consistency both operations should behave the same

@ChristK ChristK reopened this Feb 8, 2016
@jangorecki
Member

They should not behave the same when the := (by reference) operator is used. If this operator is detected in the j argument of a data.table query, the dataset is altered in place; otherwise a copy is returned, which is exactly what happens in dt[a<3,]. This is quite a basic data.table concept, and the following vignette can be helpful: https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-reference-semantics.html If anything is not clear after reading the vignette, consider providing feedback about the potential gap.

@tdeenes
Member

tdeenes commented Feb 8, 2016

As an additional hint, consider this:

dt <- data.table(a = 1:10, b = 1:10)
address(dt)
# [1] "0x4e6ed00"
address(dt[a<3])
# [1] "0x5d553e0"
address(dt[a<3, c:= 1])
#  [1] "0x4e6ed00"

dt[a<3] is read as 'return a data.table object which contains the rows from dt where variable a is less than 3', whereas dt[a<3, c := 1] is read as 'for all rows of dt where variable a is less than 3, set variable c equal to 1'.

@ChristK
Author

ChristK commented Feb 8, 2016

@jangorecki Yes. However when := only appears later in the chain, the original table is not altered. So the first j of the chain is important for the behaviour of := later in the chain. I think the vignette needs to clarify this, specifically for chain operations. The example by @tdeenes is extremely helpful regarding why this is happening.

At the moment dt[a<4][b<3] is much faster than dt[a<4 & b<3] for large data.tables, so it is very tempting for the user to try dt[a<4][b<3, c := 1] instead of dt[a<4, c:=NA][b<3, c:=1].
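
One possible way to keep the staged filtering while still assigning into the original table is to carry row numbers instead of a subset. This is only a sketch and was not proposed in the thread; it assumes the `which = TRUE` argument of `[.data.table`, which returns the matching row numbers rather than the rows. The name idx is illustrative, and no speed claim is made for this approach.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 1:10)

# Stage 1: row numbers of dt that pass the first filter
idx <- dt[a < 4, which = TRUE]

# Stage 2: narrow those row numbers using the second condition
idx <- idx[dt$b[idx] < 3]

# Assign by reference on the original table at the surviving rows
dt[idx, c := 1]
print(dt)
```

Because the final call has `:=` in j and dt itself on the left, the new column lands on dt rather than on an intermediate subset.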

@franknarf1
Contributor

Ok, sounds more like you have a documentation request than a finding of "inconsistent behavior". One wrinkle is that DT[][, x := NA] still alters DT.
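
The wrinkle can be checked directly. This is a sketch of the behaviour as described in the comment above; the address() comparison is added here for illustration and its result is printed rather than assumed.

```r
library(data.table)

DT <- data.table(a = 1:3)

# Compare the memory address of DT with that of DT[] (empty i and j)
print(address(DT) == address(DT[]))

# A := chained after the empty brackets still lands on DT
DT[][, x := NA]
stopifnot("x" %in% names(DT))
```

This contrasts with DT[i] for a non-empty i, which returns a subset that is a separate object.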

@ChristK
Author

ChristK commented Feb 8, 2016

@franknarf1 Although I can now see how and why this is as it is, from my 'end-user' perspective, dt[a<4][b<3, c := 1] and dt[a<4, c:=NA][b<3, c:=1] should produce the same result. Therefore it is an inconsistency that, at least, needs to be clarified in the documentation. I will summarise this discussion and propose the addition of a couple of sentences to the semantics vignette in due course.

@jangorecki
Member

The := call returns the full dataset; it uses the i filter only to choose the rows for in-place assignment. Without :=, the expression is a query against the data.table, so i actually filters the resulting dataset.
A good example is:

dt[TRUE==FALSE, z := 1][] # returns the full dataset
dt[TRUE==FALSE] # returns a 0-row dataset
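
The same contrast can be expressed as row counts. This is a small sketch; the table and the variable names filtered and updated are illustrative.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 1:10)

# No := : i filters, and nothing matches, so the result has zero rows
filtered <- dt[TRUE == FALSE]
stopifnot(nrow(filtered) == 0L)

# With := : i only picks the rows to assign to; the whole table comes back
updated <- dt[TRUE == FALSE, z := 1]
stopifnot(nrow(updated) == 10L)
```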

@franknarf1
Contributor

Ok. I think I see your point. However, it's just basic R syntax that

X[i, ][, j]

is equivalent to

( X[i, ] )[, j]

and not to X[i,j]. One example is DF = data.frame(a = 1:3) comparing DF[3,]["a"] vs DF[3,"a"]. There is no SQL-like "read the whole query before evaluating" here.

Said another way, f(x)(y) is equivalent to ( f(x) )(y) and generally has no relation to f(x, y). Anyway, if an update to the vignette is needed, then maybe

Since := is available in j, we can combine it with i and by operations just like the aggregation operations we saw in the previous vignette.

can be extended to

Since := is available in j, we can combine it with i and by operations just like the aggregation operations we saw in the previous vignette. However, because DT[i] returns a subset, DT[i][, colA := valA] will not modify DT itself.
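
To illustrate the proposed sentence, here is a sketch showing that a chain is evaluated one bracket at a time, so a := in the second bracket only sees the intermediate subset. The column names and values are illustrative and do not come from the vignette.

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

# Chaining is sequential evaluation: these two are the same computation
chained  <- DT[a < 3][, sum(b)]
stepwise <- (DT[a < 3])[, sum(b)]
stopifnot(identical(chained, stepwise))

# The same holds for :=, so the new column goes to the intermediate subset
DT[a < 3][, colA := 0]
stopifnot(!"colA" %in% names(DT))
```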

Oh, and it might as well mention the new on parameter alongside by.

@arunsrinivasan
Member

This issue seems to boil down to a duplicate of #905. PRs are of course welcome.


6 participants