-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
group_by performance: potential for easy and substantial improvement #4406
Comments
|
In the short term, I suggest we expose the new |
The difference is less dramatic now. set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)
bench::mark(
not_ordered = tib %>% group_by(grp),
ordered = {
o <- order(grp)
tibo <- tibble(grp=grp[o], x=x[o])
b <- tibo %>% group_by(grp)
},
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 not_ordered 2.69s 2.69s 0.372 248MB 1.12
#> 2 ordered 1.37s 1.37s 0.729 400MB 1.46 Created on 2021-03-09 by the reprex package (v0.3.0) From a quick profvis, it seems that (not ordered): (ordered): I'm not sure which loop of |
If I casually replace vec_split_id_order <- function(x) {
o <- vec_order(x)
xo <- vec_slice(x, o)
split_id <- vec_group_loc(xo)
split_id$loc <- map(split_id$loc, function(.x) o[.x])
split_id$loc <- new_list_of(split_id$loc, ptype = integer())
vec_slice(split_id, vec_order(split_id$key))
} it appears to be better on that example: with the |
I suspect we will eventually want to replace It does order character vectors in the C locale, but I am unsure how much that matters for grouping purposes (more important for set.seed(42)
library(dplyr)
library(tibble)
x <- runif(1e7)
grp <- sample(1e6, 1e7, replace=TRUE)
tib <- tibble(grp, x)
dplyr_vec_order_locs <- dplyr:::vec_split_id_order
vctrs_vec_order_locs <- vctrs:::vec_order_locs
bench::mark(
dplyr = {
tmp <- dplyr_vec_order_locs(tib$grp)
attributes(tmp$loc) <- NULL
tmp
},
vctrs = vctrs_vec_order_locs(tib$grp),
iterations = 20
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 1.85s 1.85s 0.539 224MB 15.6
#> 2 vctrs 214.99ms 217.44ms 4.45 189MB 5.44 Created on 2021-03-09 by the reprex package (v1.0.0) |
Nice, do you think it's worth experimenting using |
I've been studying and writing about grouped operations over the past couple of
months, and one thing that has become apparent to me is that ordering the data
can in at least some circumstances provide a massive performance improvement.
This is primarily due to
data.table
sharing their radix sort with base R in3.3.0 which makes it much faster than it used to be.
I think
dplyr
could take advantage of this pretty easily as evidenced by:The slow step is
group_by
.I'm guessing much of the dplyr code was written prior to the
data.table
radixsort becoming part of
order
, so I imagine that may have guided some of theoriginal design decisions. It seems now that it is an easy win to add a
pre-order to
dplyr
. There is a penalty for cases where the data is alreadyordered, but it is small, and as written above there is some additional memory
usage.
I have not tested this thoroughly, so your mileage may vary across input types.
However the benefit is substantial enough in some cases that it may be worth
spending some time exploring the broader applicability of the change.
I am not familiar with C++, so I have not dug into the sources to narrow down
the problem, but I'm assuming it's a manifestation of similar microarchitectural
factors that affect
split
that are discussed in this blog post.This does not appear related to #4334 AFAICT from looking at memory usage via
gc()
:The text was updated successfully, but these errors were encountered: