-
Notifications
You must be signed in to change notification settings - Fork 978
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for a function to handle many small groups in some situations #4284
Comments
frank is probably slow for many groups because of parallelism, do you get same timings when using single thread? |
Similar to #3739 |
This is a neat idea although it begs the question: does set assignment need equivalent to GForce? Maybe SForce. Here's a similar version for
I am on the development version with 1.12.9 and it has a hit in performance for groupings like this. But using the internal
The current implementation of |
@jangorecki, as requested, here are the timings with
@shrektan for
timings:
@ColeMiller1, we can order quickly using |
@chinsoon12 I have a hackish method working - right now it only works without assignment and uses Rcpp. I will work on getting it more robust and try to do a PR. My C is not that good, though... library(data.table) #Dev Version
set.seed(0L)
nr <- 1e6L#8
DT <- data.table(ID=rep(1L:(nr/2L), each=4L), VAL=1L:nr)
laggrpv <- function(g, v, N) {
ix <- which(diff(g)!=0L) + 1L
f <- replace(rep(NA_integer_, N), ix, v[ix-1L])
k <- nafill(nafill(f, "locf"), fill=0)
}
mtd1 <- function() {
DT[, .(CS = {
cs <- cumsum(as.double(VAL))
k <- laggrpv(ID, cs, .N)
cs - k
})]
}
bench::mark(
DT[, cumsum(VAL), ID], ##using hack method
mtd1(),
DT[, (cumsum(VAL)), ID], ##hack way doesn't see past ()
check = FALSE
)
### A tibble: 3 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt>
##1 DT[, cumsum(VAL), ID] 45.3ms 53.5ms 19.0 28.6MB
##2 mtd1() 97.2ms 123.8ms 8.34 200.3MB
##3 DT[, (cumsum(VAL)), ID] 435.8ms 449.1ms 2.23 26.7MB
all.equal(DT[, cumsum(VAL), ID]$V1, mtd1()[[1L]])
##[1] TRUE |
When using
by
with many small groups, runtime can be slow. For some situations likecumsum
andfrank
, we could perform it on the whole vector first and use the last value from the previous group to thus avoiding the use ofby
. Hence, proposing to add a function that carries the value from previous group to the next group (similar in spirit torleid
androwid
). I suspect it might help with some recursive algo.R implementation:
Timing code for
cumsum
:timings:
Timing code for
frank(..., ties.method="dense")
timing:
The text was updated successfully, but these errors were encountered: