Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How do you use capture_melt_multiple in this example? #18

Closed
skanskan opened this issue Nov 3, 2020 · 2 comments
Closed

How do you use capture_melt_multiple in this example? #18

skanskan opened this issue Nov 3, 2020 · 2 comments

Comments

@skanskan
Copy link

skanskan commented Nov 3, 2020

I need to convert from wide format to long format large datasets like this one I show in a toy example:

nn <- 2
nid <- 5
namdat <- CJ(letters[1:nn], paste0(rep(letters[1:nn],1:nn),"_20",sprintf("%02d",sequence(1:nn))) )[,paste0(V1,V2)]
namsip <- CJ(letters[1:nn],letters[1:nn])[,paste0("Z",V1,V2)]
DT <- data.table(ID=1:nid)[,c(namdat) := .(runif(nid))][,c(namsip) := .(sample(letters, nid, replace = T))]
   ID    aa_2001    ab_2001    ab_2002    ba_2001    bb_2001    bb_2002 Zaa Zab Zba Zbb
1:  1 0.18852270 0.18852270 0.18852270 0.18852270 0.18852270 0.18852270   c   c   c   c
2:  2 0.44315259 0.44315259 0.44315259 0.44315259 0.44315259 0.44315259   z   z   z   z
3:  3 0.01910837 0.01910837 0.01910837 0.01910837 0.01910837 0.01910837   d   d   d   d
4:  4 0.73697807 0.73697807 0.73697807 0.73697807 0.73697807 0.73697807   e   e   e   e
5:  5 0.07062321 0.07062321 0.07062321 0.07062321 0.07062321 0.07062321   w   w   w   w

If I want to do it with melt and dcast I would use this:

idvars <- grep("_20[0-9][0-9]$",names(DT) , invert = TRUE, value=T)   
temp <- melt(DT, id.vars = idvars)                                  
temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable),  yyear = sub('.*_', '', variable), variable = NULL)]  
temp[,var:=factor(var, levels=unique(var))]
dcast( temp,   ... ~ var, value.var='value' )
    ID Zaa Zab Zba Zbb yyear         aa         ab         ba         bb
 1:  1   c   c   c   c  2001 0.18852270 0.18852270 0.18852270 0.18852270
 2:  1   c   c   c   c  2002         NA 0.18852270         NA 0.18852270
 3:  2   z   z   z   z  2001 0.44315259 0.44315259 0.44315259 0.44315259
 4:  2   z   z   z   z  2002         NA 0.44315259         NA 0.44315259
 5:  3   d   d   d   d  2001 0.01910837 0.01910837 0.01910837 0.01910837
 6:  3   d   d   d   d  2002         NA 0.01910837         NA 0.01910837
 7:  4   e   e   e   e  2001 0.73697807 0.73697807 0.73697807 0.73697807
 8:  4   e   e   e   e  2002         NA 0.73697807         NA 0.73697807
 9:  5   w   w   w   w  2001 0.07062321 0.07062321 0.07062321 0.07062321
10:  5   w   w   w   w  2002         NA 0.07062321         NA 0.07062321

But I can't find the proper syntax with capture_melt_multiple.
I have tried different options such as capture_melt_multiple(DT, column=".*","_", dim=".*", fill=TRUE) and also using field. But I'm always getting errors.

My intention later is to benchmark this tool in order to find the fastest way to convert wide to long for large nn, for example nn=25, and nid=10000.

@tdhock
Copy link
Owner

tdhock commented Nov 17, 2020

Actually the syntax is good but "missing input columns" is not yet supported, I'm waiting for data table to merge my PR, see #12
If you want to try the new/experimental code:

> remotes::install_github(c("Rdatatable/data.table@fix4027", "tdhock/nc@multiple-fill"))

Skipping install of 'data.table' from a github remote, the SHA1 (854f1897) has not changed since last install.
  Use `force = TRUE` to force installation
Skipping install of 'nc' from a github remote, the SHA1 (bbc49eca) has not changed since last install.
  Use `force = TRUE` to force installation
> nc::capture_melt_multiple(DT, column=".*","_", dim=".*", fill=TRUE)
    ID Zaa Zab Zba Zbb  dim        aa        ab        ba        bb
 1:  1   z   z   z   z 2001 0.7256901 0.7256901 0.7256901 0.7256901
 2:  2   p   p   p   p 2001 0.4719875 0.4719875 0.4719875 0.4719875
 3:  3   b   b   b   b 2001 0.4384963 0.4384963 0.4384963 0.4384963
 4:  4   u   u   u   u 2001 0.9501824 0.9501824 0.9501824 0.9501824
 5:  5   o   o   o   o 2001 0.3648079 0.3648079 0.3648079 0.3648079
 6:  1   z   z   z   z 2002        NA 0.7256901        NA 0.7256901
 7:  2   p   p   p   p 2002        NA 0.4719875        NA 0.4719875
 8:  3   b   b   b   b 2002        NA 0.4384963        NA 0.4384963
 9:  4   u   u   u   u 2002        NA 0.9501824        NA 0.9501824
10:  5   o   o   o   o 2002        NA 0.3648079        NA 0.3648079
> nc::capture_melt_multiple(DT, column=".*","_", dim=".*", fill=FALSE)
Error in (function (subject.names, match.dt, no.match, fill = TRUE)  : 
  need dim=same count for each value, but have: 2001=4 2002=2; please change pattern, edit input column names, or use fill=TRUE to output missing values

@tdhock
Copy link
Owner

tdhock commented Nov 17, 2020

Also I did some benchmarks as well, varying the number of rows/columns, see Figures 4 and 5 in https://raw.githubusercontent.com/tdhock/nc-article/master/RJwrapper.pdf
Any other comments/feedback about that paper (currently in peer review) would be appreciated as well.
See also Rdatatable/data.table#4731 (comment) for some figures comparing computation time of various data table methods.

@tdhock tdhock closed this as completed Jan 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants