Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

melt.data.table should offer variable to match on the name, rather than the number #3396

Closed
HughParsonage opened this issue Feb 13, 2019 · 10 comments · Fixed by #4731
Closed
Labels
enhancement feature request reshape dcast melt top request One of our most-requested issues
Milestone

Comments

@HughParsonage
Copy link
Member

Currently, using patterns makes the variable column an integer, which is both a bit awkward ...

library(data.table)
DT <- 
  data.table(x = 1:5, 
             y_No = 101:105,
             `y_$`= 1011:1015,
             z_No = 201:205,
             `z_$` = 301:305)
DT_melt <- 
  melt.data.table(DT, 
                  id.vars = "x",
                  measure.vars = patterns("Dollar" = "\\$$", 
                                          "Number" = "No$"))
DT_melt
#>     x variable Dollar Number
#>  1: 1        1   1011    101
#>  2: 2        1   1012    102
#>  3: 3        1   1013    103
#>  4: 4        1   1014    104
#>  5: 5        1   1015    105
#>  6: 1        2    301    201
#>  7: 2        2    302    202
#>  8: 3        2    303    203
#>  9: 4        2    304    204
#> 10: 5        2    305    205
# Intended
DT_melt[, variable_new := c("y", "z")[variable]][]
#>     x variable Dollar Number variable_new
#>  1: 1        1   1011    101            y
#>  2: 2        1   1012    102            y
#>  3: 3        1   1013    103            y
#>  4: 4        1   1014    104            y
#>  5: 5        1   1015    105            y
#>  6: 1        2    301    201            z
#>  7: 2        2    302    202            z
#>  8: 3        2    303    203            z
#>  9: 4        2    304    204            z
#> 10: 5        2    305    205            z

... and potentially error-prone, e.g.

library(data.table)
DT <- 
  data.table(x = 1:5, 
             y_No = 101:105,
             `y_$`= 1011:1015,
             z_No = 201:205,
             `z_$` = 301:305,
             `w1_$` = 401:405,
             w2_No = 501:505)
melt.data.table(DT, 
                id.vars = "x",
                measure.vars = patterns("Dollar" = "\\$$", 
                                        "Number" = "No$"))
#>     x variable Dollar Number
#>  1: 1        1   1011    101
#>  2: 2        1   1012    102
#>  3: 3        1   1013    103
#>  4: 4        1   1014    104
#>  5: 5        1   1015    105
#>  6: 1        2    301    201
#>  7: 2        2    302    202
#>  8: 3        2    303    203
#>  9: 4        2    304    204
#> 10: 5        2    305    205
#> 11: 1        3    401    501    # incorrect
#> 12: 2        3    402    502    # incorrect
#> 13: 3        3    403    503    # incorrect
#> 14: 4        3    404    504    # incorrect
#> 15: 5        3    405    505    # incorrect

Created on 2019-02-13 by the reprex package (v0.2.1)

Can offer a PR if this is a worthwhile feature. Otherwise can just attempt the bugfix.

@MichaelChirico
Copy link
Member

I'm starting at this and not seeing what's incorrect?

And maybe duplicate of #2551? The functionality is certainly useful...

@HughParsonage
Copy link
Member Author

I think it's the same as that one though it didn't appear in the searches and I think my reprex is clearer.

I think the result of the second melt should be:

#>     x variable Dollar Number
#>  1: 1        1   1011    101
#>  2: 2        1   1012    102
#>  3: 3        1   1013    103
#>  4: 4        1   1014    104
#>  5: 5        1   1015    105
#>  6: 1        2    301    201
#>  7: 2        2    302    202
#>  8: 3        2    303    203
#>  9: 4        2    304    204
#> 10: 5        2    305    205
#> 11: 1        3    401     NA
#> 12: 2        3    402     NA
#> 13: 3        3    403     NA
#> 14: 4        3    404     NA
#> 15: 5        3    405     NA
#> 16: 1        4     NA    501
#> 17: 2        4     NA    502
#> 18: 3        4     NA    503
#> 19: 4        4     NA    504
#> 20: 5        4     NA    505

Created on 2019-02-13 by the reprex package (v0.2.1)

@arunsrinivasan
Copy link
Member

variable col is factor, not integer. And I believe this is clearly documented? I agree having the levels right would make sense. But patterns is for convenience. The original idea of being able to melt on any columns provided as a list is the original purpose of melt. There's no reason why patterns should automatically match measure vars automatically. Perhaps this could be done using an additional argument. The default functionality should be as it is.

@HughParsonage
Copy link
Member Author

Arun, happy to leave the existing functionality as-is (was planning to), but I do think that the second example is a bug: there should be four levels, not three.

@arunsrinivasan
Copy link
Member

arunsrinivasan commented Feb 17, 2019

@HughParsonage patterns doesn't implicitly match on column names.. it just resolves columns into a list of column names/indices for every pattern provided. So I wouldn't classify it as a bug.

What you require could be achieved by extending the functionality of patterns (or even another function), and I think it's extremely useful too..

For e.g.,

melt(dt, id="a", measure=patterns("b$", "c$", fill=TRUE))

or strict_patterns(...) or something like that. What do you think?

@franknarf1
Copy link
Contributor

Regarding syntax for applying strict matching, I guess it would be necessary to identify which part of the string to match across the patterns. This could be done in a capture group that additionally could be used to populate the variable factor column's levels, I imagine, like:

melt(DT, id="x", measure=patterns2(Dollar = "^(.+)_\\$$", Number = "^(.+)_No$"))

melt(df, id="id", measure=patterns2(purchased = "^brand(.+)$", price = "^price(.+)$")

(The latter example from #3487 highlighted here to show the connection I see between the two issues to the OP @jsams )

So the two (.*) would have to capture the same content across entries of patterns2(...) to appear on the same row in the end.

@tdhock
Copy link
Member

tdhock commented Sep 29, 2020

hi this is related to #4027 and fixed by #4720, not merged into master yet but can be used via

remotes::install_github(c("Rdatatable/data.table@fix4027", "tdhock/nc@multiple-fill"))
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#>   Use `force = TRUE` to force installation
#> Skipping install of 'nc' from a github remote, the SHA1 (11b61f8e) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
DT <- 
  data.table(x = 1:5, 
             y_No = 101:105,
             `y_$`= 1011:1015,
             z_No = 201:205,
             `z_$` = 301:305)
nc::capture_melt_multiple(
  DT,
  variable_new="[yz]",
  "_",
  column=".*", function(x)ifelse(x=="$", "Dollar", "Number"))
#>     x variable_new Dollar Number
#>  1: 1            y   1011    101
#>  2: 2            y   1012    102
#>  3: 3            y   1013    103
#>  4: 4            y   1014    104
#>  5: 5            y   1015    105
#>  6: 1            z    301    201
#>  7: 2            z    302    202
#>  8: 3            z    303    203
#>  9: 4            z    304    204
#> 10: 5            z    305    205
DT2 <- 
  data.table(x = 1:5, 
             y_No = 101:105,
             `y_$`= 1011:1015,
             z_No = 201:205,
             `z_$` = 301:305,
             `w1_$` = 401:405,
             w2_No = 501:505)  
nc::capture_melt_multiple(
  DT2,
  variable_new=".*",
  "_",
  column=".*", function(x)ifelse(x=="$", "Dollar", "Number"),
  fill=TRUE)
#>     x variable_new Dollar Number
#>  1: 1           w1    401     NA
#>  2: 2           w1    402     NA
#>  3: 3           w1    403     NA
#>  4: 4           w1    404     NA
#>  5: 5           w1    405     NA
#>  6: 1           w2     NA    501
#>  7: 2           w2     NA    502
#>  8: 3           w2     NA    503
#>  9: 4           w2     NA    504
#> 10: 5           w2     NA    505
#> 11: 1            y   1011    101
#> 12: 2            y   1012    102
#> 13: 3            y   1013    103
#> 14: 4            y   1014    104
#> 15: 5            y   1015    105
#> 16: 1            z    301    201
#> 17: 2            z    302    202
#> 18: 3            z    303    203
#> 19: 4            z    304    204
#> 20: 5            z    305    205

@tdhock tdhock closed this as completed Sep 29, 2020
@jangorecki
Copy link
Member

Let's wait till it is merged

@jangorecki jangorecki reopened this Sep 29, 2020
@tdhock
Copy link
Member

tdhock commented Oct 1, 2020

hi again the solution using only the new data.table::melt (without nc/regex) looks like

remotes::install_github("Rdatatable/data.table@fix4027")
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
DT2 <- 
  data.table(x = 1:5, 
             y_No = 101:105,
             `y_$`= 1011:1015,
             z_No = 201:205,
             `z_$` = 301:305,
             `w1_$` = 401:405,
             w2_No = 501:505)  
DT2.tall <- melt(DT2, measure=list(
  Dollar=c("w1_$", NA, "y_$", "z_$"),
  Number=c(NA, "w2_No", "y_No", "z_No")))
DT2.tall[, variable_new := c("w1", "w2", "y", "z")[variable] ]
DT2.tall
#>     x variable Dollar Number variable_new
#>  1: 1        1    401     NA           w1
#>  2: 2        1    402     NA           w1
#>  3: 3        1    403     NA           w1
#>  4: 4        1    404     NA           w1
#>  5: 5        1    405     NA           w1
#>  6: 1        2     NA    501           w2
#>  7: 2        2     NA    502           w2
#>  8: 3        2     NA    503           w2
#>  9: 4        2     NA    504           w2
#> 10: 5        2     NA    505           w2
#> 11: 1        3   1011    101            y
#> 12: 2        3   1012    102            y
#> 13: 3        3   1013    103            y
#> 14: 4        3   1014    104            y
#> 15: 5        3   1015    105            y
#> 16: 1        4    301    201            z
#> 17: 2        4    302    202            z
#> 18: 3        4    303    203            z
#> 19: 4        4    304    204            z
#> 20: 5        4    305    205            z

@tdhock
Copy link
Member

tdhock commented Oct 7, 2020

pure data.table solution using #4731

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
DT2 <- 
  data.table(x = 1:5, 
             y_No = 101:105,
             `y_$`= 1011:1015,
             z_No = 201:205,
             `z_$` = 301:305,
             `w1_$` = 401:405,
             w2_No = 501:505)  
melt(DT2, measure.vars=measure(
  variable_new, value.name=function(x)c("$"="Dollar","No"="Number")[x]))
#>     x variable_new Number Dollar
#>  1: 1            y    101   1011
#>  2: 2            y    102   1012
#>  3: 3            y    103   1013
#>  4: 4            y    104   1014
#>  5: 5            y    105   1015
#>  6: 1            z    201    301
#>  7: 2            z    202    302
#>  8: 3            z    203    303
#>  9: 4            z    204    304
#> 10: 5            z    205    305
#> 11: 1           w1     NA    401
#> 12: 2           w1     NA    402
#> 13: 3           w1     NA    403
#> 14: 4           w1     NA    404
#> 15: 5           w1     NA    405
#> 16: 1           w2    501     NA
#> 17: 2           w2    502     NA
#> 18: 3           w2    503     NA
#> 19: 4           w2    504     NA
#> 20: 5           w2    505     NA

@mattdowle mattdowle added this to the 1.14.1 milestone May 9, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement feature request reshape dcast melt top request One of our most-requested issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants