split.data.table method #1389

Closed
7 tasks done
arunsrinivasan opened this issue Oct 11, 2015 · 2 comments

@arunsrinivasan
Member

Need a split.data.table method.

  • split by factor the same as split.data.frame - f argument.
  • split by reference to column names as character vector - by argument.
  • allow to produce empty elements in list, same as split.data.frame for drop=FALSE
  • allow to keep or drop field on which we split - drop argument.
  • support recursive split into nested lists for length(by) > 1L.
  • support keyby argument returning sorted but unkeyed list(s), but data.tables on leaves would be keyed, unless drop=TRUE.
  • update this SO post after.

@arunsrinivasan I've edited your post, feel free to correct/extend that list.
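
For reference, the drop=FALSE behaviour of split.data.frame mentioned in the list above can be seen in base R alone. This is only a minimal sketch: the toy data frame `df` and its columns `g` and `v` are made up for illustration and do not come from the issue.

```r
# Toy data: factor `g` has an unused level "c" (hypothetical example data)
df <- data.frame(g = factor(c("a", "b", "a"), levels = c("a", "b", "c")),
                 v = 1:3)

# drop = FALSE is the default: unused levels yield empty list elements
s_keep <- split(df, df$g)
names(s_keep)   # "a" "b" "c"
nrow(s_keep$c)  # 0

# drop = TRUE removes elements for levels that do not occur
s_drop <- split(df, df$g, drop = TRUE)
names(s_drop)   # "a" "b"
```

Note that in split.data.frame `drop` controls empty levels, which is a different meaning from the "keep or drop the split column" sense also listed above.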

@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Oct 11, 2015
@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Nov 2, 2015
@jangorecki
Member

You were asking about use cases in the linked SO post. One that seems reasonable is grouping data into a list so it is easier to send to workers, such as mclapply, or lapply redirecting to Rserve processes. Below is my proposal: it is much more flexible than the data.frame split method and by default produces recursive nested lists. Try it below; feedback is welcome. If you are happy with it, I can prepare a PR.

library(data.table)
# x       data.table
# f       unlike the data.frame method, use the `by` argument instead; `f` is taken as `by` only when `by` is missing
# drop    logical, default FALSE, keeps the `by` columns in the resulting data.tables - unlike data.frame
# by      character vector of column names on which to split into lists
# flatten logical, default FALSE, returns a recursive nested list with data.tables as leaves
# ...     ignored
split.data.table <- function(x, f, drop = FALSE, by, flatten = FALSE, ...){
    if(missing(by) && !missing(f)) by = f
    stopifnot(!missing(by), is.character(by), is.logical(drop), is.logical(flatten), !".ll" %in% names(x), by %in% names(x), !"nm" %in% by)
    if(!flatten){
        .by = by[1L]
        tmp = x[, list(.ll=list(.SD)), by = .by, .SDcols = if(drop) setdiff(names(x), .by) else names(x)]
        setattr(ll <- tmp$.ll, "names", tmp[[.by]])
        if(length(by) > 1L) return(lapply(ll, split.data.table, drop = drop, by = by[-1L])) else return(ll)
    } else {
        tmp = x[, list(.ll=list(.SD)), by=by, .SDcols = if(drop) setdiff(names(x), by) else names(x)]
        setattr(ll <- tmp$.ll, 'names', tmp[, .(nm = paste(.SD, collapse = ".")), by = by, .SDcols = by]$nm)
        return(ll)
    }
}

set.seed(123)
dt = data.table(x1 = rep(letters[1:2], 6), x2 = rep(letters[3:5], 4), x3 = rep(letters[5:8], 3), y = rnorm(12))

str(split.data.table(dt, by = "x1"))
str(split.data.table(dt, by = "x1", drop = TRUE))
str(split.data.table(dt, by = "x1", flatten = TRUE)) # doesn't change anything when length(by)==1L

str(split.data.table(dt, by = "x2"))
str(split.data.table(dt, by = "x2", drop = TRUE))

str(split.data.table(dt, by = "x3"))
str(split.data.table(dt, by = "x3", drop = TRUE))

str(split.data.table(dt, by = c("x1","x2")))
str(split.data.table(dt, by = c("x1","x2"), drop = TRUE))
str(split.data.table(dt, by = c("x1","x2"), flatten = TRUE))
str(split.data.table(dt, by = c("x1","x2"), drop = TRUE, flatten = TRUE))

str(split.data.table(dt, by = c("x1","x2","x3")))
str(split.data.table(dt, by = c("x1","x2","x3"), drop = TRUE))
str(split.data.table(dt, by = c("x1","x2","x3"), flatten = TRUE))
str(split.data.table(dt, by = c("x1","x2","x3"), drop = TRUE, flatten = TRUE))
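
The worker use case mentioned above could look roughly like the sketch below. It assumes an installed data.table version that already ships a `split` method accepting `by` (this issue is what requested it), and the per-group summary function is purely illustrative. `mc.cores = 1L` is chosen only so the sketch also runs on Windows, where forking is not available.

```r
library(data.table)
library(parallel)

set.seed(123)
dt <- data.table(x1 = rep(letters[1:2], 6), y = rnorm(12))

# One data.table per value of x1; assumes split() dispatches to a
# data.table method that supports the `by` argument
chunks <- split(dt, by = "x1")

# Send each chunk to a worker; the summary below is a hypothetical task
res <- mclapply(chunks,
                function(d) d[, .(n = .N, mean_y = mean(y))],
                mc.cores = 1L)

names(res)  # one result per group, named after the x1 values
```

On platforms that support forking, raising `mc.cores` distributes the chunks across processes without changing the rest of the code.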

@jangorecki
Member

One of the unit tests, 1639.138, was not as tight as it should be; a minor thing, fixed in 9d2d710.

@mattdowle mattdowle modified the milestones: v1.9.8, v1.9.10 Nov 23, 2016