
group_by performance: potential for easy and substantial improvement #4406

Closed
brodieG opened this issue Jun 5, 2019 · 6 comments · Fixed by #6297

Comments


brodieG commented Jun 5, 2019

I've been studying and writing about grouped operations over the past couple of
months, and one thing that has become apparent to me is that ordering the data
first can, in at least some circumstances, provide a massive performance
improvement. This is primarily because data.table's radix sort was contributed
to base R in 3.3.0, which makes order() much faster than it used to be.
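As a quick illustration of the radix sort's impact (assuming R >= 3.3.0, where
order() can dispatch to the radix method for input like this):

```r
set.seed(42)
grp <- sample(1e6, 1e7, replace = TRUE)

# radix (contributed by data.table) vs the older shell-sort method;
# both produce a valid ordering, radix is typically much faster here
system.time(o_radix <- order(grp, method = "radix"))
system.time(o_shell <- order(grp, method = "shell"))
identical(grp[o_radix], grp[o_shell])  # same sorted values either way
```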

I think dplyr could take advantage of this pretty easily as evidenced by:

# R3.6.0 w/ dplyr 0.8.1

set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)

system.time(a <- tib %>% group_by(grp) %>% summarise(sum(x)))
##   user  system elapsed 
## 10.369   0.311  10.793 
system.time({
  o <- order(grp)
  tibo <- tibble(grp=grp[o], x=x[o])
  b <- tibo %>% group_by(grp) %>% summarise(sum(x))
})
##   user  system elapsed 
##  3.120   0.328   3.474 
all.equal(a, b)
## [1] TRUE

The slow step is group_by.
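That can be checked by timing the grouping separately from the aggregation
(a rough breakdown on the same data as above):

```r
# group_by() itself accounts for most of the elapsed time;
# summarise() on an already-grouped tibble is comparatively cheap
system.time(g <- tib %>% group_by(grp))
system.time(g %>% summarise(sum(x)))
```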

I'm guessing much of the dplyr code was written before the data.table radix
sort became part of order(), so I imagine that may have guided some of the
original design decisions. It now seems like an easy win to add a pre-ordering
step to dplyr. There is a penalty for cases where the data is already ordered,
but it is small, and as written above the pre-ordering does incur some
additional memory usage.

I have not tested this thoroughly, so your mileage may vary across input types.
However the benefit is substantial enough in some cases that it may be worth
spending some time exploring the broader applicability of the change.

I am not familiar with C++, so I have not dug into the sources to narrow down
the problem, but I'm assuming it's a manifestation of the same
microarchitectural factors that affect split(), as discussed in this blog post.

This does not appear related to #4334 AFAICT from looking at memory usage via gc():

> library(tibble)
> x <- runif(1e7)
> grp <- sample(1e6, 1e7, replace=TRUE)
> tib <- tibble(grp, x)
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells   491743  26.3     939424  50.2         NA   939424  50.2
Vcells 15868290 121.1   25428491 194.1      16384 15915118 121.5
> system.time(a <- tib %>% group_by(grp) %>% summarise(sum(x)))
   user  system elapsed 
 11.301   0.404  11.866 
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells   496226  26.6    2478724 132.4         NA  1697148  90.7
Vcells 17377752 132.6   36793026 280.8      16384 25214184 192.4
> system.time({
+   o <- order(grp)
+   tibo <- tibble(grp=grp[o], x=x[o])
+   b <- tibo %>% group_by(grp) %>% summarise(sum(x))
+ })
   user  system elapsed 
  3.472   0.270   3.777 
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells   496253  26.6    1554928  83.1         NA  1943660 103.9
Vcells 38877692 296.7   67779293 517.2      16384 46702085 356.4
> object.size(tib)
120000984 bytes
@romainfrancois romainfrancois added this to the 0.9.0 milestone Jun 7, 2019
@romainfrancois

group_by() is due to be rebased on vctrs in the 0.9.* series, which should take care of this.


hadley commented Jun 29, 2020

In the short term, I suggest we expose the new vec_order() (now implemented with the same radix sort that data.table and order() use) through something like group_by_ordered() (deliberately long, since we'll want to change it in the future). That'll help us start to explore where ALTREP representations of indexing etc. will help.

@romainfrancois

The difference is less dramatic now.

set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)

bench::mark(
  not_ordered = tib %>% group_by(grp), 
  ordered = {
    o <- order(grp)
    tibo <- tibble(grp=grp[o], x=x[o])
    b <- tibo %>% group_by(grp)
  }, 
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 not_ordered    2.69s    2.69s     0.372     248MB     1.12
#> 2 ordered        1.37s    1.37s     0.729     400MB     1.46

Created on 2021-03-09 by the reprex package (v0.3.0)

From a quick profvis, it seems that vec_group_loc() really benefits from the data being ordered:

[profvis flame graph: not ordered]

[profvis flame graph: ordered]

I'm not sure which loop of SEXP vec_group_loc(SEXP x) benefits the most from the ordering.

@romainfrancois

If I casually replace vec_split_id_order() by this:

vec_split_id_order <- function(x) {
  # order first so that identical keys are adjacent when vec_group_loc() scans
  o <- vec_order(x)
  xo <- vec_slice(x, o)
  split_id <- vec_group_loc(xo)
  # translate the locations back to positions in the original, unordered x
  split_id$loc <- map(split_id$loc, function(.x) o[.x])
  split_id$loc <- new_list_of(split_id$loc, ptype = integer())

  # return the groups sorted by key
  vec_slice(split_id, vec_order(split_id$key))
}

it appears to be better on that example:

[profvis flame graph: modified vec_split_id_order()]

The map() call is expensive, but presumably it's a loop that could be done internally. We could imagine a version of SEXP vec_group_loc(SEXP x) that takes o as a second parameter, and maybe further takes advantage of the fact that x is already ordered? cc @DavisVaughan

@DavisVaughan

I suspect we will eventually want to replace vec_split_id_order() with something like the internal vctrs:::vec_order_locs(), which I designed exactly for this purpose. If I'm remembering right, a single vctrs:::vec_order_radix() call gives us everything we need to compute ordered groups, which vec_order_locs() takes advantage of.

It does order character vectors in the C locale, but I am unsure how much that matters for grouping purposes (more important for arrange()).
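Concretely, in the C locale strings compare byte by byte, so every uppercase
letter collates before every lowercase one, which differs from natural-language
collations. A minimal illustration:

```r
# In the C locale strings compare by byte value, so "B" (0x42)
# sorts before "a" (0x61); natural-language collations put "a" first
utf8ToInt("B") < utf8ToInt("a")
#> [1] TRUE
```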

set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)

dplyr_vec_order_locs <- dplyr:::vec_split_id_order
vctrs_vec_order_locs <- vctrs:::vec_order_locs

bench::mark(
  dplyr = {
    tmp <- dplyr_vec_order_locs(tib$grp)
    attributes(tmp$loc) <- NULL
    tmp
  },
  vctrs = vctrs_vec_order_locs(tib$grp),
  iterations = 20
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr         1.85s    1.85s     0.539     224MB    15.6 
#> 2 vctrs      214.99ms 217.44ms     4.45      189MB     5.44

Created on 2021-03-09 by the reprex package (v1.0.0)

@romainfrancois

Nice. Do you think it's worth experimenting with vctrs:::vec_order_locs() now?
