Inconsistent behaviour when adding columns by reference in chains #1525

ChristK · 2016-02-08T10:45:52Z

See minimal example below

require(data.table) ##v1.9.6 or v1.9.7
dt <- data.table(a = 1:10, b = 1:10)

dt[a<5][b<4] ##works
#    a b
#1: 1 1
#2: 2 2
#3: 3 3

dt[a<5][b<4, c := 3][] ##Seems to work although line 4 shouldn't be there
#    a b  c
#1: 1 1  3
#2: 2 2  3
#3: 3 3  3
#4: 4 4 NA

dt ##PROBLEM: column c is missing
#     a  b
#1:  1  1
#2:  2  2
#3:  3  3
#4:  4  4
#5:  5  5
#6:  6  6
#7:  7  7
#8:  8  8
#9:  9  9
#10: 10 10

If this is intended behaviour could you please add a warning that "new columns have not been assigned" or something along the lines?

sessionInfo()
# R version 3.2.3 (2015-12-10)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1

# locale:
# [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
# [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] data.table_1.9.7 devtools_1.10.0 

# loaded via a namespace (and not attached):
# [1] httr_1.1.0     R6_2.1.2       tools_3.2.3    withr_1.0.1    rstudioapi_0.5
# [6] curl_0.9.5     memoise_1.0.0  knitr_1.12.3   git2r_0.13.1   digest_0.6.9  
# [11] chron_2.3-47

DavidArenburg · 2016-02-08T11:02:08Z

It works; it just updates the subset, as you can see with temp <- dt[a<5][b<4, c := 3] ; temp. Btw, for your specific task, you could just do dt[a<5 & b<4, c := 3]

ChristK · 2016-02-08T11:09:54Z

@DavidArenburg Thanks. This is a minimal example of a more complicated task (dt[a<5 & b<4, c := 3] is not as straightforward in my full case). I was hoping I would be able to perform this operation without a copy of the original dt (and without resorting to dt[a<5 & b<4, c := 3]).

Anyway I now see the logic. I'm closing the issue as this is considered expected behaviour.

ChristK · 2016-02-08T12:01:41Z

I am reopening this with an updated title. See updated minimal example

require(data.table) ##v1.9.6 or v1.9.7
dt <- data.table(a = 1:3)
dt[a<3, ][a<2, b := 1]
dt ## As the initial example, original dt remains unaltered
#     a
# 1: 1
# 2: 2
# 3: 3

However...

dt[a<3, c := 3][a<2, b := 1]
dt  ##dt has now been updated
#     a  c  b
# 1: 1  3  1
# 2: 2  3 NA
# 3: 3 NA NA

I think that for consistency both operations should behave the same

jangorecki · 2016-02-08T12:18:12Z

They should not behave the same if the := (by reference) operator is used. If this operator is detected in the j argument of a data.table query, the dataset is altered in place, consistently; otherwise a copy is returned, which is exactly what happens in dt[a<3,]. This is quite a basic data.table concept. The following vignette can be helpful: https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-reference-semantics.html If anything is not clear after reading the vignette, consider providing feedback about the potential gap.

tdeenes · 2016-02-08T12:22:18Z

As an additional hint, consider this:

dt <- data.table(a = 1:10, b = 1:10)
address(dt)
# [1] "0x4e6ed00"
address(dt[a<3])
# [1] "0x5d553e0"
address(dt[a<3, c:= 1])
#  [1] "0x4e6ed00"

dt[a<3] is read as 'return a data.table object which contains the rows from dt where variable a is less than 3', whereas dt[a<3, c := 1] is read as 'for all rows of dt where variable a is less than 3, set variable c equal to 1'.
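A minimal sketch of the same point for the chained case, assuming data.table is installed. The name sub is only illustrative; it stands for the intermediate object that dt[a<5] creates in a chain.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 1:10)

# The first link of the chain returns a new data.table holding the matching rows.
sub <- dt[a < 5]

# := in the second link modifies `sub` by reference, not `dt`.
sub[b < 4, c := 3]

names(dt)   # "a" "b"
names(sub)  # "a" "b" "c"

# The two objects live at different addresses.
address(dt) != address(sub)  # TRUE
```

In dt[a<5][b<4, c := 3] the intermediate object is never bound to a name, so the new column is lost unless the result of the chain is assigned.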

ChristK · 2016-02-08T12:58:30Z

@jangorecki Yes. However when := only appears later in the chain, the original table is not altered. So the first j of the chain is important for the behaviour of := later in the chain. I think the vignette needs to clarify this, specifically for chain operations. The example by @tdeenes is extremely helpful regarding why this is happening.

At the moment dt[a<4][b<3] is much faster than dt[a<4 & b<3] for large datatables. It is very tempting for the user to try dt[a<4][b<3, c := 1] instead of dt[a<4, c:=NA][b<3, c:=1].
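One possible way to keep the two-step filtering while still updating dt in place is to carry row numbers through the steps and assign once at the end. This is only a sketch and makes no claim about speed; the names idx and keep are hypothetical, and which = TRUE is the data.table argument that returns the matching row numbers instead of the rows.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 1:10)

# Step 1: row numbers of dt that satisfy the first condition.
idx <- dt[a < 5, which = TRUE]

# Step 2: evaluate the second condition only on those rows.
keep <- dt[idx, b < 4]

# Step 3: assign by reference on the original table, using row numbers.
dt[idx[keep], c := 3]

dt$c  # 3 3 3 NA NA NA NA NA NA NA
```

Because the final call is dt[i, c := 3] on dt itself, the column is added to the original table rather than to a temporary subset.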

franknarf1 · 2016-02-08T13:56:51Z

Ok, sounds more like you have a documentation request than a finding of "inconsistent behavior". One wrinkle is that DT[][, x := NA] still alters DT.

ChristK · 2016-02-08T14:21:01Z

@franknarf1 Although I can now see how and why this is as it is, from my 'end-user' perspective, dt[a<4][b<3, c := 1] and dt[a<4, c:=NA][b<3, c:=1] should produce the same result. Therefore it is an inconsistency that, at least, needs to be clarified in the documentation. I will summarise this discussion and propose the addition of a couple of sentences to the semantics vignette in due course.

jangorecki · 2016-02-08T14:31:18Z

The := call returns the full dataset; it uses the i filter only to select the rows for in-place assignment. A call without :=, on the other hand, is a query against the data.table, so it actually filters the resulting dataset.
A good example is:

dt[TRUE==FALSE, z := 1][] # returns the full dataset
dt[TRUE==FALSE] # returns a 0-row dataset
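The difference can be checked by counting rows. A small sketch, assuming a 10-row dt as in the earlier examples; filtered and updated are illustrative names.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 1:10)

# Without :=, i filters the rows of the returned object.
filtered <- dt[TRUE == FALSE]

# With :=, i only selects the rows to assign to; the whole table is returned.
updated <- dt[TRUE == FALSE, z := 1]

nrow(filtered)  # 0
nrow(updated)   # 10
```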

franknarf1 · 2016-02-08T15:56:34Z

Ok. I think I see your point. However, it's just basic R syntax that

X[i, ][, j]

is equivalent to

( X[i, ] )[, j]

and not to X[i,j]. One example is DF = data.frame(a = 1:3) comparing DF[3,]["a"] vs DF[3,"a"]. There is no SQL-like "read the whole query before evaluating" here.

Said another way, f(x)(y) is equivalent to ( f(x) )(y) and generally has no relation to f(x, y). Anyway, if an update to the vignette is needed, then maybe

Since := is available in j, we can combine it with i and by operations just like the aggregation operations we saw in the previous vignette.

can be extended to

Since := is available in j, we can combine it with i and by operations just like the aggregation operations we saw in the previous vignette. However, because DT[i] returns a subset, DT[i][, colA := valA] will not modify DT itself.

Oh, and it might as well mention the new on parameter alongside by.

arunsrinivasan · 2016-02-08T17:54:19Z

This issue seems to boil down to a duplicate of #905. PRs are of course welcome.

ChristK closed this as completed Feb 8, 2016

ChristK changed the title ~~Add column by reference is not working in chains, when subsetting is used~~ Inconsistent behaviour when adding columns by reference in chains Feb 8, 2016

ChristK reopened this Feb 8, 2016

arunsrinivasan closed this as completed Feb 8, 2016

arunsrinivasan added duplicate documentation labels Feb 8, 2016
