How do you use capture_melt_multiple in this example? #18

skanskan · 2020-11-03T17:12:29Z

I need to convert from wide format to long format large datasets like this one I show in a toy example:

nn <- 2
nid <- 5
namdat <- CJ(letters[1:nn], paste0(rep(letters[1:nn],1:nn),"_20",sprintf("%02d",sequence(1:nn))) )[,paste0(V1,V2)]
namsip <- CJ(letters[1:nn],letters[1:nn])[,paste0("Z",V1,V2)]
DT <- data.table(ID=1:nid)[,c(namdat) := .(runif(nid))][,c(namsip) := .(sample(letters, nid, replace = T))]

   ID    aa_2001    ab_2001    ab_2002    ba_2001    bb_2001    bb_2002 Zaa Zab Zba Zbb
1:  1 0.18852270 0.18852270 0.18852270 0.18852270 0.18852270 0.18852270   c   c   c   c
2:  2 0.44315259 0.44315259 0.44315259 0.44315259 0.44315259 0.44315259   z   z   z   z
3:  3 0.01910837 0.01910837 0.01910837 0.01910837 0.01910837 0.01910837   d   d   d   d
4:  4 0.73697807 0.73697807 0.73697807 0.73697807 0.73697807 0.73697807   e   e   e   e
5:  5 0.07062321 0.07062321 0.07062321 0.07062321 0.07062321 0.07062321   w   w   w   w

If I want to do it with melt and dcast I would use this:

idvars <- grep("_20[0-9][0-9]$",names(DT) , invert = TRUE, value=T)   
temp <- melt(DT, id.vars = idvars)                                  
temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable),  yyear = sub('.*_', '', variable), variable = NULL)]  
temp[,var:=factor(var, levels=unique(var))]
dcast( temp,   ... ~ var, value.var='value' )

    ID Zaa Zab Zba Zbb yyear         aa         ab         ba         bb
 1:  1   c   c   c   c  2001 0.18852270 0.18852270 0.18852270 0.18852270
 2:  1   c   c   c   c  2002         NA 0.18852270         NA 0.18852270
 3:  2   z   z   z   z  2001 0.44315259 0.44315259 0.44315259 0.44315259
 4:  2   z   z   z   z  2002         NA 0.44315259         NA 0.44315259
 5:  3   d   d   d   d  2001 0.01910837 0.01910837 0.01910837 0.01910837
 6:  3   d   d   d   d  2002         NA 0.01910837         NA 0.01910837
 7:  4   e   e   e   e  2001 0.73697807 0.73697807 0.73697807 0.73697807
 8:  4   e   e   e   e  2002         NA 0.73697807         NA 0.73697807
 9:  5   w   w   w   w  2001 0.07062321 0.07062321 0.07062321 0.07062321
10:  5   w   w   w   w  2002         NA 0.07062321         NA 0.07062321

But I can't find the proper syntax with capture_melt_multiple.
I have tried different options such as capture_melt_multiple(DT, column=".*","_", dim=".*", fill=TRUE) and also using field. But I'm always getting errors.

My intention later is to benchmark this tool in order to find the fastest way to convert wide to long for large nn, for example nn=25, and nid=10000.

The text was updated successfully, but these errors were encountered:

tdhock · 2020-11-17T03:27:35Z

Actually the syntax is good but "missing input columns" is not yet supported, I'm waiting for data table to merge my PR, see #12
If you want to try the new/experimental code:

> remotes::install_github(c("Rdatatable/data.table@fix4027", "tdhock/nc@multiple-fill"))

Skipping install of 'data.table' from a github remote, the SHA1 (854f1897) has not changed since last install.
  Use `force = TRUE` to force installation
Skipping install of 'nc' from a github remote, the SHA1 (bbc49eca) has not changed since last install.
  Use `force = TRUE` to force installation
> nc::capture_melt_multiple(DT, column=".*","_", dim=".*", fill=TRUE)
    ID Zaa Zab Zba Zbb  dim        aa        ab        ba        bb
 1:  1   z   z   z   z 2001 0.7256901 0.7256901 0.7256901 0.7256901
 2:  2   p   p   p   p 2001 0.4719875 0.4719875 0.4719875 0.4719875
 3:  3   b   b   b   b 2001 0.4384963 0.4384963 0.4384963 0.4384963
 4:  4   u   u   u   u 2001 0.9501824 0.9501824 0.9501824 0.9501824
 5:  5   o   o   o   o 2001 0.3648079 0.3648079 0.3648079 0.3648079
 6:  1   z   z   z   z 2002        NA 0.7256901        NA 0.7256901
 7:  2   p   p   p   p 2002        NA 0.4719875        NA 0.4719875
 8:  3   b   b   b   b 2002        NA 0.4384963        NA 0.4384963
 9:  4   u   u   u   u 2002        NA 0.9501824        NA 0.9501824
10:  5   o   o   o   o 2002        NA 0.3648079        NA 0.3648079
> nc::capture_melt_multiple(DT, column=".*","_", dim=".*", fill=FALSE)
Error in (function (subject.names, match.dt, no.match, fill = TRUE)  : 
  need dim=same count for each value, but have: 2001=4 2002=2; please change pattern, edit input column names, or use fill=TRUE to output missing values

tdhock · 2020-11-17T03:29:14Z

Also I did some benchmarks as well, varying the number of rows/columns, see Figures 4 and 5 in https://raw.githubusercontent.com/tdhock/nc-article/master/RJwrapper.pdf
Any other comments/feedback about that paper (currently in peer review) would be appreciated as well.
See also Rdatatable/data.table#4731 (comment) for some figures comparing computation time of various data table methods.

tdhock closed this as completed Jan 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How do you use capture_melt_multiple in this example? #18

How do you use capture_melt_multiple in this example? #18

skanskan commented Nov 3, 2020 •

edited

Loading

tdhock commented Nov 17, 2020

tdhock commented Nov 17, 2020 •

edited

Loading

How do you use capture_melt_multiple in this example? #18

How do you use capture_melt_multiple in this example? #18

Comments

skanskan commented Nov 3, 2020 • edited Loading

tdhock commented Nov 17, 2020

tdhock commented Nov 17, 2020 • edited Loading

skanskan commented Nov 3, 2020 •

edited

Loading

tdhock commented Nov 17, 2020 •

edited

Loading