Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understanding keyby changes in v1.9.8 #1958

Closed
PeteHaitch opened this issue Dec 8, 2016 · 7 comments
Closed

Understanding keyby changes in v1.9.8 #1958

PeteHaitch opened this issue Dec 8, 2016 · 7 comments

Comments

@PeteHaitch
Copy link

I encountered some failing unit tests in a package of mine that depends on data.table. I think this change in behaviour is due to point 30 at https://github.com/Rdatatable/data.table/blob/master/NEWS.md#new-features-1) and discussed in #1880. I don't mind changing my code to keep up-to-date with changes in data.table, but I don't understand exactly what is happening and would like to.

Here's a reproducible example illustrating how I used to get the desired behaviour in v1.9.6, how that no longer works as of v1.9.8 (or v1.10.0), and my workaround in v1.9.8 (or v1.10.0).

Advice on how these changes have affected me and what the 'data.table'-way to achieve my desired behaviour is much appreciated.

Using data.table v1.9.6

library(data.table)

# Test data
y <- data.table(seqnames = c("chr1", rep("chr2", 3), rep("chr1", 2), 
                             rep("chr3", 4)),
                strand = c("-", rep("+", 2), rep("*", 2), rep("+", 3), 
                           rep("-", 2)),
                pos1 = 1:10,
                pos2 = 2:11, 
                pos3 = 3:12)
y
#>     seqnames strand pos1 pos2 pos3
#>  1:     chr1      -    1    2    3
#>  2:     chr2      +    2    3    4
#>  3:     chr2      +    3    4    5
#>  4:     chr2      *    4    5    6
#>  5:     chr1      *    5    6    7
#>  6:     chr1      +    6    7    8
#>  7:     chr3      +    7    8    9
#>  8:     chr3      +    8    9   10
#>  9:     chr3      -    9   10   11
#> 10:     chr3      -   10   11   12
my_key <- c("seqnames", "strand", "pos1", "pos2", "pos3")

# Desired behaviour
# y is sorted by key (as desired) and grp is added prior to sort (as desired)
y[, grp := .GRP, keyby = my_key]
y
#>     seqnames strand pos1 pos2 pos3 grp
#>  1:     chr1      *    5    6    7   5
#>  2:     chr1      +    6    7    8   6
#>  3:     chr1      -    1    2    3   1
#>  4:     chr2      *    4    5    6   4
#>  5:     chr2      +    2    3    4   2
#>  6:     chr2      +    3    4    5   3
#>  7:     chr3      +    7    8    9   7
#>  8:     chr3      +    8    9   10   8
#>  9:     chr3      -    9   10   11   9
#> 10:     chr3      -   10   11   12  10

sessionInfo()
#> R version 3.3.1 Patched (2016-07-14 r70916)
#> Platform: x86_64-apple-darwin13.4.0 (64-bit)
#> Running under: OS X El Capitan (10.11.6)
#> 
#> locale:
#> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.9.6
#> 
#> loaded via a namespace (and not attached):
#>  [1] backports_1.0.4 magrittr_1.5    rprojroot_1.1   tools_3.3.1    
#>  [5] htmltools_0.3.5 Rcpp_0.12.8     stringi_1.1.2   rmarkdown_1.2  
#>  [9] knitr_1.15.1    stringr_1.1.0   digest_0.6.10   chron_2.3-47   
#> [13] evaluate_0.10

Using data.table v1.9.8 (or v1.10.0)

library(data.table)
#> 
#> This data.table install has not detected OpenMP support. It will work but slower in single threaded mode.
# Test data
y <- data.table(seqnames = c("chr1", rep("chr2", 3), rep("chr1", 2), 
                             rep("chr3", 4)),
                strand = c("-", rep("+", 2), rep("*", 2), rep("+", 3), 
                           rep("-", 2)),
                pos1 = 1:10,
                pos2 = 2:11, 
                pos3 = 3:12)
y
#>     seqnames strand pos1 pos2 pos3
#>  1:     chr1      -    1    2    3
#>  2:     chr2      +    2    3    4
#>  3:     chr2      +    3    4    5
#>  4:     chr2      *    4    5    6
#>  5:     chr1      *    5    6    7
#>  6:     chr1      +    6    7    8
#>  7:     chr3      +    7    8    9
#>  8:     chr3      +    8    9   10
#>  9:     chr3      -    9   10   11
#> 10:     chr3      -   10   11   12
yy <- copy(y)
my_key <- c("seqnames", "strand", "pos1", "pos2", "pos3")

# Undesired behaviour
# Changed behaviour in data.table v1.9.8
# y is sorted by key (as desired) but grp is added after sort (not desired)
y[, grp := .GRP, keyby = my_key]
y
#>     seqnames strand pos1 pos2 pos3 grp
#>  1:     chr1      *    5    6    7   1
#>  2:     chr1      +    6    7    8   2
#>  3:     chr1      -    1    2    3   3
#>  4:     chr2      *    4    5    6   4
#>  5:     chr2      +    2    3    4   5
#>  6:     chr2      +    3    4    5   6
#>  7:     chr3      +    7    8    9   7
#>  8:     chr3      +    8    9   10   8
#>  9:     chr3      -    9   10   11   9
#> 10:     chr3      -   10   11   12  10

# Desired behaviour
# y is sorted by key (as desired) and grp is added prior to sort (as desired)
y <- copy(yy)
y
#>     seqnames strand pos1 pos2 pos3
#>  1:     chr1      -    1    2    3
#>  2:     chr2      +    2    3    4
#>  3:     chr2      +    3    4    5
#>  4:     chr2      *    4    5    6
#>  5:     chr1      *    5    6    7
#>  6:     chr1      +    6    7    8
#>  7:     chr3      +    7    8    9
#>  8:     chr3      +    8    9   10
#>  9:     chr3      -    9   10   11
#> 10:     chr3      -   10   11   12
setkeyv(y[, grp := .GRP, by = my_key], my_key)
y
#>     seqnames strand pos1 pos2 pos3 grp
#>  1:     chr1      *    5    6    7   5
#>  2:     chr1      +    6    7    8   6
#>  3:     chr1      -    1    2    3   1
#>  4:     chr2      *    4    5    6   4
#>  5:     chr2      +    2    3    4   2
#>  6:     chr2      +    3    4    5   3
#>  7:     chr3      +    7    8    9   7
#>  8:     chr3      +    8    9   10   8
#>  9:     chr3      -    9   10   11   9
#> 10:     chr3      -   10   11   12  10

sessionInfo()
#> R version 3.3.1 Patched (2016-07-14 r70916)
#> Platform: x86_64-apple-darwin13.4.0 (64-bit)
#> Running under: OS X El Capitan (10.11.6)
#> 
#> locale:
#> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.9.8
#> 
#> loaded via a namespace (and not attached):
#>  [1] backports_1.0.4 magrittr_1.5    rprojroot_1.1   tools_3.3.1    
#>  [5] htmltools_0.3.5 Rcpp_0.12.8     stringi_1.1.2   rmarkdown_1.2  
#>  [9] knitr_1.15.1    stringr_1.1.0   digest_0.6.10   evaluate_0.10
@mattdowle
Copy link
Member

mattdowle commented Dec 9, 2016

Yes you're right. keyby= now run the groups in the sorted order. The act of evaluating j for each group is what increments .GRP.

News item 75 was :

keyby= now runs j in the order that the groups appear in the sorted result rather than first appearance order, #606. This only makes a difference in very rare usage where j does something depending on an earlier group's result, perhaps by using <<-. If j is required to be run in first appearance order, then use by= whose behaviour is unchanged. Now we have this option. No existing tests affected. New tests added.

The documentation for .GRP was and still is ambiguous, currently :

.GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.

That's still true for the new behaviour for keyby=, since it's the 1st group that's evaluated and returned in the result.

I'm looking at your new y[, grp := .GRP, by = my_key] and thinking yes that's now clearer in your code that you do indeed intend .GRP to represent the order of first appearance of each group in y. But that's only clear to us who know that by= keeps appearance order. I wish it were keepby= to be clearer as I proposed and still would like, yes as per #1880.

Going forward, yes I think I like your :

    setkeyv(y[, grp := .GRP, by = my_key], my_key)

Although it's longer than before, it's clearer (well, given that we know by= keeps appearance order). I then tried to chain it instead and was surprised that sole keyby= does nothing silently, it seems :

    y[, grp := .GRP, by = my_key][ keyby=my_key ]   # keyby unhelpfully ignored

Adding .SD seems to be necessary :

    y[, grp := .GRP, by = my_key][, .SD, keyby=my_key ]   # works and same as yours

We should look at making sole keyby= work. Any views out there people following?

To do what you need in one step though, how about introducing a new special symbol .GRPI? Since .I refers to original row numbers of x for each group, the I on the end of .GRPI is similar to that in that it would refer to the order that the group first appeared.

?.GRP would change as follows :

.I is an integer vector equal to seq_len(nrow(x)). While grouping, it holds for each item in the group, its row number in x. This is useful to subset in j and return row numbers; e.g. rowNums = DT[, .I[which.max(somecol)], by=grp].

.GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc. as j is evaluated for each group.

.GRPI is a single integer like .GRP but only useful in combination with keyby=. It contains the order that the group first appeared in x before the groups were sorted.

Note that .I[1] holds the row number where the group first appeared in the dataset. For each group, this is the same whether you use by= or keyby=.

Note that .GRP==.GRPI when using by= but .GRP!=.GRPI when using keyby= (unless the groups appear in sorted order anyway).

Then your code would be a one letter change to :

    y[, grp := .GRPI, keyby = my_key]

What do you and people following think? Perhaps .GRPI should only be available with keyby=. If used with by= it would be an error.

@eantonya
Copy link
Contributor

+1 for lone keyby working. Wasn't there already a FR about it?
-1 for adding .GRPI. I don't think it has enough value compared with existing solutions, to introduce another dot-blah symbol.

@mattdowle
Copy link
Member

mattdowle commented Dec 10, 2016

I've seen something somewhere I think about lone keyby -- can't find it now. Ok good.
On adding things, many things in data.table are of no value at all to most people, but it's something that a few people (like @PeteHaitch) might value. It's not like dot-blah symbols are up front getting in newbies way. They're like there if you ever need them. The other solutions are slower as they have to do the grouping and sorting all over again.

@franknarf1
Copy link
Contributor

franknarf1 commented Dec 10, 2016

Seems fine, but the name's confusing. Based on the name, I would expect .GRPI to be row counter within a group, that is 1:.N. My style would be the name .GRP0 for starting values.

This seems related to #1494 re having .GRP counter invariant to subsetting i in DT[i, .GRP, by].

Edit: Oh, and lone keyby is nice.

@mattdowle
Copy link
Member

mattdowle commented Dec 10, 2016

Good call, both on name and #1494.
On name, how about .GRPO rather than .GRP0. O for order. .I[1] already holds the row number on which the group first appears, although I'll have to go through #1494 again to check. .GRP0 feels a bit like .GRP[0] or .GRP[1] to me.

@jangorecki
Copy link
Member

lone keyby in #1105.
I prefer .GRPI over .GRPO or 0.

@PeteHaitch
Copy link
Author

Thanks for the careful explanation, Matt. I now properly understand what news item 75 means. I am happy to use setkeyv(y[, grp := .GRP, by = my_key], my_key) in my package code. I don't mind whether there is a new explicit symbol such as GRPO for this behaviour since my use case is internal package code and not user-facing code. I've no preference on the name of the new symbol and defer to you and the rest of the data.table developers.

Thanks for all your great work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants