
group_by performance: potential for easy and substantial improvement #4406

Closed
brodieG opened this issue Jun 5, 2019 · 6 comments · Fixed by #6297

Comments


brodieG commented Jun 5, 2019

I've been studying and writing about grouped operations over the past couple of
months, and one thing that has become apparent to me is that ordering the data
first can, in at least some circumstances, provide a massive performance
improvement. This is primarily because data.table's radix sort was contributed
to base R in 3.3.0, which makes order() much faster than it used to be.
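As a quick illustration of the radix sort's impact (assuming R >= 3.3.0, where
order() can dispatch to the radix method for input like this):

```r
set.seed(42)
grp <- sample(1e6, 1e7, replace = TRUE)

# radix (contributed by data.table) vs the older shell-sort method;
# both produce a valid ordering, radix is typically much faster here
system.time(o_radix <- order(grp, method = "radix"))
system.time(o_shell <- order(grp, method = "shell"))
identical(grp[o_radix], grp[o_shell])  # same sorted values either way
```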

I think dplyr could take advantage of this pretty easily as evidenced by:

# R3.6.0 w/ dplyr 0.8.1

set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)

system.time(a <- tib %>% group_by(grp) %>% summarise(sum(x)))
##   user  system elapsed 
## 10.369   0.311  10.793 
system.time({
  o <- order(grp)
  tibo <- tibble(grp=grp[o], x=x[o])
  b <- tibo %>% group_by(grp) %>% summarise(sum(x))
})
##   user  system elapsed 
##  3.120   0.328   3.474 
all.equal(a, b)
## [1] TRUE

The slow step is group_by.
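That can be checked by timing the grouping separately from the aggregation
(a rough breakdown on the same data as above):

```r
# group_by() itself accounts for most of the elapsed time;
# summarise() on an already-grouped tibble is comparatively cheap
system.time(g <- tib %>% group_by(grp))
system.time(g %>% summarise(sum(x)))
```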

I'm guessing much of the dplyr code was written before the data.table radix
sort became part of order(), so I imagine that may have guided some of the
original design decisions. It now seems like an easy win to add a pre-ordering
step to dplyr. There is a penalty for cases where the data is already ordered,
but it is small, and as written above the pre-ordering does incur some
additional memory usage.

I have not tested this thoroughly, so your mileage may vary across input types.
However the benefit is substantial enough in some cases that it may be worth
spending some time exploring the broader applicability of the change.

I am not familiar with C++, so I have not dug into the sources to narrow down
the problem, but I'm assuming it's a manifestation of the same
microarchitectural factors that affect split(), as discussed in this blog post.

This does not appear related to #4334 AFAICT from looking at memory usage via gc():

> library(tibble)
> x <- runif(1e7)
> grp <- sample(1e6, 1e7, replace=TRUE)
> tib <- tibble(grp, x)
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells   491743  26.3     939424  50.2         NA   939424  50.2
Vcells 15868290 121.1   25428491 194.1      16384 15915118 121.5
> system.time(a <- tib %>% group_by(grp) %>% summarise(sum(x)))
   user  system elapsed 
 11.301   0.404  11.866 
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells   496226  26.6    2478724 132.4         NA  1697148  90.7
Vcells 17377752 132.6   36793026 280.8      16384 25214184 192.4
> system.time({
+   o <- order(grp)
+   tibo <- tibble(grp=grp[o], x=x[o])
+   b <- tibo %>% group_by(grp) %>% summarise(sum(x))
+ })
   user  system elapsed 
  3.472   0.270   3.777 
> gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells   496253  26.6    1554928  83.1         NA  1943660 103.9
Vcells 38877692 296.7   67779293 517.2      16384 46702085 356.4
> object.size(tib)
120000984 bytes
@romainfrancois romainfrancois added this to the 0.9.0 milestone Jun 7, 2019
@romainfrancois

group_by() is due to be rebased on vctrs in the 0.9.* series, which should take care of this.


hadley commented Jun 29, 2020

In the short term, I suggest we expose the new vec_order() (now implemented with the same radix sort that data.table and order() use) through something like group_by_ordered() (deliberately long, since we'll want to change it in the future). That'll help us start to explore where ALTREP representations of indexing etc. will help.

@romainfrancois

The difference is less dramatic now.

set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)

bench::mark(
  not_ordered = tib %>% group_by(grp), 
  ordered = {
    o <- order(grp)
    tibo <- tibble(grp=grp[o], x=x[o])
    b <- tibo %>% group_by(grp)
  }, 
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 not_ordered    2.69s    2.69s     0.372     248MB     1.12
#> 2 ordered        1.37s    1.37s     0.729     400MB     1.46

Created on 2021-03-09 by the reprex package (v0.3.0)

From a quick profvis, it seems that vec_group_loc() really benefits from the data being ordered:

[profvis flame graph: not ordered]

[profvis flame graph: ordered]

I'm not sure which loop of SEXP vec_group_loc(SEXP x) benefits the most from the ordering.

@romainfrancois

If I casually replace vec_split_id_order() by this:

vec_split_id_order <- function(x) {
  # order first so that identical keys are adjacent when vec_group_loc() scans
  o <- vec_order(x)
  xo <- vec_slice(x, o)
  split_id <- vec_group_loc(xo)
  # translate the locations back to positions in the original, unordered x
  split_id$loc <- map(split_id$loc, function(.x) o[.x])
  split_id$loc <- new_list_of(split_id$loc, ptype = integer())

  # return the groups sorted by key
  vec_slice(split_id, vec_order(split_id$key))
}

it appears to be better on that example:

[profvis flame graph: modified vec_split_id_order()]

The map() call is expensive, but presumably it's a loop that could be done internally. We could imagine a version of SEXP vec_group_loc(SEXP x) that takes o as a second parameter, and maybe further takes advantage of the fact that x is already ordered? cc @DavisVaughan

@DavisVaughan

I suspect we will eventually want to replace vec_split_id_order() with something like the internal vctrs:::vec_order_locs(), which I designed exactly for this purpose. If I'm remembering right, a single vctrs:::vec_order_radix() call gives us everything we need to compute ordered groups, which vec_order_locs() takes advantage of.

It does order character vectors in the C locale, but I am unsure how much that matters for grouping purposes (more important for arrange()).
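Concretely, in the C locale strings compare byte by byte, so every uppercase
letter collates before every lowercase one, which differs from natural-language
collations. A minimal illustration:

```r
# In the C locale strings compare by byte value, so "B" (0x42)
# sorts before "a" (0x61); natural-language collations put "a" first
utf8ToInt("B") < utf8ToInt("a")
#> [1] TRUE
```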

set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)

dplyr_vec_order_locs <- dplyr:::vec_split_id_order
vctrs_vec_order_locs <- vctrs:::vec_order_locs

bench::mark(
  dplyr = {
    tmp <- dplyr_vec_order_locs(tib$grp)
    attributes(tmp$loc) <- NULL
    tmp
  },
  vctrs = vctrs_vec_order_locs(tib$grp),
  iterations = 20
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr         1.85s    1.85s     0.539     224MB    15.6 
#> 2 vctrs      214.99ms 217.44ms     4.45      189MB     5.44

Created on 2021-03-09 by the reprex package (v1.0.0)

@romainfrancois

Nice. Do you think it's worth experimenting with vctrs:::vec_order_locs() now?
