split.data.table method #1389
Comments
You were asking about use cases in the linked SO question. One that seems reasonable is grouping data into a list so it is easier to send to workers, such as mclapply, or lapply redirecting to Rserve processes. Below is my proposal. It is much more flexible than the data.frame split method and by default produces recursive nested lists. Try it; feedback is welcome. If you are happy with it, I can prepare a PR.
library(data.table)
# x        data.table
# f        use the `by` argument instead (unlike data.frame); taken as `by` only when `by` is missing
# drop     logical, default FALSE; FALSE keeps the `by` columns in the resulting data.tables (unlike data.frame)
# by       character vector of column names to split on
# flatten  logical, default FALSE; FALSE returns a recursive nested list with data.tables as leaves
# ...      ignored
split.data.table <- function(x, f, drop = FALSE, by, flatten = FALSE, ...){
  if(missing(by) && !missing(f)) by = f
  stopifnot(!missing(by), is.character(by), is.logical(drop), is.logical(flatten), !".ll" %in% names(x), by %in% names(x), !"nm" %in% by)
  if(!flatten){
    .by = by[1L]
    tmp = x[, list(.ll=list(.SD)), by = .by, .SDcols = if(drop) setdiff(names(x), .by) else names(x)]
    setattr(ll <- tmp$.ll, "names", tmp[[.by]])
    if(length(by) > 1L) return(lapply(ll, split.data.table, drop = drop, by = by[-1L])) else return(ll)
  } else {
    tmp = x[, list(.ll=list(.SD)), by=by, .SDcols = if(drop) setdiff(names(x), by) else names(x)]
    setattr(ll <- tmp$.ll, 'names', tmp[, .(nm = paste(.SD, collapse = ".")), by = by, .SDcols = by]$nm)
    return(ll)
  }
}
set.seed(123)
dt = data.table(x1 = rep(letters[1:2], 6), x2 = rep(letters[3:5], 4), x3 = rep(letters[5:8], 3), y = rnorm(12))
str(split.data.table(dt, by = "x1"))
str(split.data.table(dt, by = "x1", drop = TRUE))
str(split.data.table(dt, by = "x1", flatten = TRUE)) # doesn't change anything when length(by)==1L
str(split.data.table(dt, by = "x2"))
str(split.data.table(dt, by = "x2", drop = TRUE))
str(split.data.table(dt, by = "x3"))
str(split.data.table(dt, by = "x3", drop = TRUE))
str(split.data.table(dt, by = c("x1","x2")))
str(split.data.table(dt, by = c("x1","x2"), drop = TRUE))
str(split.data.table(dt, by = c("x1","x2"), flatten = TRUE))
str(split.data.table(dt, by = c("x1","x2"), drop = TRUE, flatten = TRUE))
str(split.data.table(dt, by = c("x1","x2","x3")))
str(split.data.table(dt, by = c("x1","x2","x3"), drop = TRUE))
str(split.data.table(dt, by = c("x1","x2","x3"), flatten = TRUE))
str(split.data.table(dt, by = c("x1","x2","x3"), drop = TRUE, flatten = TRUE))
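To make the worker use case mentioned at the top of this comment concrete, here is a minimal sketch in base R only. It assumes a plain data.frame stand-in for the data.table and a hypothetical per-group job called `summarise_group`; neither is part of the proposal. It only shows why a list with one element per group is convenient to hand to lapply-style workers.

```r
# Plain data.frame stand-in for the data.table used in the examples above.
set.seed(123)
df <- data.frame(x1 = rep(letters[1:2], 6), y = rnorm(12))

# One list element per group; each element is what a single worker receives.
groups <- split(df, df$x1)

# Hypothetical per-group job (an assumption for illustration only).
summarise_group <- function(d) data.frame(n = nrow(d), mean_y = mean(d$y))

# lapply() runs sequentially; parallel::mclapply() takes the same first two
# arguments and could be swapped in on platforms that support forking.
res <- lapply(groups, summarise_group)
str(res)
```

Because every list element is self-contained, the same call shape should carry over to any lapply-like dispatcher.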
One of the unit tests, 1639.138, was not as tight as it should be. A minor thing, fixed in 9d2d710.
Need a split.data.table method.
- split.data.frame - f argument.
- by argument.
- split.data.frame for drop=FALSE
- drop argument.
- length(by) > 2L.
- keyby argument returning sorted but unkeyed list(s), but data.tables on leaves would be keyed, unless drop=TRUE.
@arunsrinivasan I've edited your post, feel free to correct/extend that list.
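For reference on the split.data.frame comparisons in this thread, here is a small base R sketch of what the data.frame method does today. It uses only base R; the toy data frame and the variable names are assumptions for illustration. It shows two things: splitting on several columns gives a flat list with "."-joined names (similar in spirit to flatten = TRUE in the proposal), and base `drop` concerns empty factor-level combinations rather than removal of the grouping columns.

```r
df <- data.frame(
  x1 = rep(letters[1:2], 6),
  x2 = rep(letters[3:5], 4),
  y  = seq_len(12)
)

# Splitting on several columns at once gives a flat list whose names join
# the group values with ".".
flat <- split(df, list(df$x1, df$x2))
names(flat)

# The grouping columns stay in every piece.
names(flat[[1L]])

# In base R, `drop` is about empty factor-level combinations, not about
# removing the grouping columns.
sub <- df[df$x2 != "e", ]
f <- list(factor(sub$x1), factor(sub$x2, levels = c("c", "d", "e")))
n_keep <- length(split(sub, f))               # empty combinations are kept
n_drop <- length(split(sub, f, drop = TRUE))  # empty combinations are removed
c(n_keep, n_drop)
```

This is why the proposal describes its `drop` and nested-list behaviour as "unlike data.frame".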