split.data.table method #1389

arunsrinivasan · 2015-10-11T18:57:04Z

Need a split.data.table method.

split by factor, the same as split.data.frame - f argument.
split by column names given as a character vector - by argument.
allow empty elements in the resulting list, the same as split.data.frame with drop=FALSE.
allow keeping or dropping the columns on which we split - drop argument.
support recursive split into nested lists for length(by) > 1L.
support keyby argument returning sorted but unkeyed list(s); the data.tables on the leaves would be keyed, unless drop=TRUE.
update this SO post after.

@arunsrinivasan I've edited your post, feel free to correct/extend that list.
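
For reference, the split.data.frame behaviour that the list above points to (a factor passed as f, and drop = FALSE keeping empty elements) can be seen in base R. This is only a minimal sketch on made-up data; the data frame df and its columns g and v are assumptions for illustration, not anything from data.table.

```r
# Hypothetical toy data: factor `g` has an unused level "c".
df <- data.frame(
  g = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  v = 1:3
)

# drop = FALSE (the default) keeps one element per factor level,
# so the unused level "c" yields a zero-row data.frame.
s_keep <- split(df, df$g, drop = FALSE)

# drop = TRUE omits levels that do not occur in the data.
s_drop <- split(df, df$g, drop = TRUE)

names(s_keep)
nrow(s_keep$c)
names(s_drop)
```

Note that in split.data.frame the drop argument controls empty levels, whereas the list above also uses the name drop for removing the split columns, so the two meanings would need to be reconciled in a data.table method.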


jangorecki · 2015-12-11T16:02:27Z

You asked about use cases in the linked SO post. One that seems reasonable is grouping data into a list so it is easier to send to workers, e.g. via mclapply, or via lapply redirecting to Rserve processes. Below is my proposal. It is much more flexible than the data.frame split method and by default produces recursively nested lists. Try it out; feedback is welcome, and if you are happy with it I can prepare a PR.

library(data.table)
# x        data.table to split
# f        used as `by` only when `by` is missing; prefer `by` - unlike data.frame
# drop     logical, default FALSE: keep the `by` columns in the resulting data.tables - unlike data.frame
# by       character vector of column names to split on
# flatten  logical, default FALSE: return a recursively nested list with data.tables as leaves
# ...      ignored
split.data.table <- function(x, f, drop = FALSE, by, flatten = FALSE, ...){
    if(missing(by) && !missing(f)) by = f
    stopifnot(!missing(by), is.character(by), is.logical(drop), is.logical(flatten), !".ll" %in% names(x), by %in% names(x), !"nm" %in% by)
    if(!flatten){
        .by = by[1L]
        tmp = x[, list(.ll=list(.SD)), by = .by, .SDcols = if(drop) setdiff(names(x), .by) else names(x)]
        setattr(ll <- tmp$.ll, "names", tmp[[.by]])
        if(length(by) > 1L) return(lapply(ll, split.data.table, drop = drop, by = by[-1L])) else return(ll)
    } else {
        tmp = x[, list(.ll=list(.SD)), by=by, .SDcols = if(drop) setdiff(names(x), by) else names(x)]
        setattr(ll <- tmp$.ll, 'names', tmp[, .(nm = paste(.SD, collapse = ".")), by = by, .SDcols = by]$nm)
        return(ll)
    }
}

set.seed(123)
dt = data.table(x1 = rep(letters[1:2], 6), x2 = rep(letters[3:5], 4), x3 = rep(letters[5:8], 3), y = rnorm(12))

str(split.data.table(dt, by = "x1"))
str(split.data.table(dt, by = "x1", drop = TRUE))
str(split.data.table(dt, by = "x1", flatten = TRUE)) # doesn't change anything when length(by)==1L

str(split.data.table(dt, by = "x2"))
str(split.data.table(dt, by = "x2", drop = TRUE))

str(split.data.table(dt, by = "x3"))
str(split.data.table(dt, by = "x3", drop = TRUE))

str(split.data.table(dt, by = c("x1","x2")))
str(split.data.table(dt, by = c("x1","x2"), drop = TRUE))
str(split.data.table(dt, by = c("x1","x2"), flatten = TRUE))
str(split.data.table(dt, by = c("x1","x2"), drop = TRUE, flatten = TRUE))

str(split.data.table(dt, by = c("x1","x2","x3")))
str(split.data.table(dt, by = c("x1","x2","x3"), drop = TRUE))
str(split.data.table(dt, by = c("x1","x2","x3"), flatten = TRUE))
str(split.data.table(dt, by = c("x1","x2","x3"), drop = TRUE, flatten = TRUE))
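
The worker use case mentioned above can be sketched roughly as follows. This assumes an installed data.table version that ships a split method accepting a by argument; the helper summarise_chunk and the reduced example table are hypothetical and only for illustration. The released method's other arguments and defaults may differ from the proposal above, so only by is used here.

```r
library(data.table)

set.seed(123)
# Hypothetical reduced version of the example table above.
dt <- data.table(x1 = rep(letters[1:2], 6), y = rnorm(12))

# One data.table per distinct value of x1, named by that value.
chunks <- split(dt, by = "x1")

# Hypothetical per-chunk work that a worker process would do.
summarise_chunk <- function(d) d[, .(n = .N, mean_y = mean(y))]

# lapply runs sequentially; parallel::mclapply(chunks, summarise_chunk)
# could be swapped in on platforms that support forking.
res <- lapply(chunks, summarise_chunk)

# Recombine the results, restoring the group label from the list names.
out <- rbindlist(res, idcol = "x1")
out
```

Splitting first keeps each worker's input self-contained, at the cost of materialising one data.table per group.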

jangorecki · 2016-03-19T15:07:38Z

One of the unit tests, 1639.138, was not as tight as it should be. A minor thing, fixed in 9d2d710.

arunsrinivasan added the feature request label Oct 11, 2015

arunsrinivasan added this to the v1.9.8 milestone Oct 11, 2015

arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Nov 2, 2015

jangorecki mentioned this issue Dec 29, 2015

R crash after split function on data.table #1481

Closed

arunsrinivasan modified the milestones: v1.9.8, v2.0.0 Jan 19, 2016

arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Mar 4, 2016

jangorecki self-assigned this Mar 10, 2016

jangorecki mentioned this issue Mar 11, 2016

[.data.table accept both by and keyby non-missing #1104

Closed

jangorecki closed this as completed in 5f7a435 Mar 19, 2016

mattdowle modified the milestones: v1.9.8, v1.9.10 Nov 23, 2016
