melt with custom variable columns using variable_table attribute #4731
Conversation
I expected that the new method should be as fast as, or faster than, existing approaches, so I ran some timings to verify that empirically (source code for timings experiments/plots). The approaches I considered, named as in the figures, are: tidyr, the existing data.table methods old.set and old.join, and the new methods from this PR, new.var_tab, new.pattern, and new.sep.
The first comparison figure plots computation time as a function of the number of rows in the input. It is clear from the WHO data (above right) that (1) all data.table methods are faster than tidyr, (2) the new data.table methods are slightly faster than the old.join method for large data, and (3) new.pattern is slower than new.var_tab by less than 0.01 seconds, indicating that the overhead of calling measure(...) is very small. The iris data (above left) additionally show that (4) there is not much difference between the new.pattern and new.sep methods, and (5) both are comparable to the old.set method.

The second figure below plots computation time as a function of the number of columns in the input. Similar trends are evident in this comparison, and we can also see that old.join is slower than old.set.

Overall these timings provide convincing evidence that the new code is at least as fast as the existing data.table methods, and sometimes slightly faster. Additionally, we see that the convenience of the measure(...) function costs only a very small amount of computation time.
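The full timing scripts are in the linked repository; as a rough sketch of the kind of comparison involved (the setup and method labels below are simplified assumptions, not the linked code, and measure() requires the branch installed as shown later in this thread):

library(data.table)
iris.dt = data.table(iris)
microbenchmark::microbenchmark(
  new.measure = melt(iris.dt, measure.vars = measure(part, dim, sep = ".")),
  tidyr = tidyr::pivot_longer(iris, cols = -Species,
                              names_to = c("part", "dim"), names_sep = "[.]"),
  times = 100
)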
Codecov Report

@@            Coverage Diff             @@
##           master    #4731      +/-   ##
==========================================
+ Coverage   99.44%   99.45%   +0.01%
==========================================
  Files          73       73
  Lines       14469    14612     +143
==========================================
+ Hits        14388    14532     +144
+ Misses         81       80       -1
Example with iris data:

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (7a73a777) has not changed since last install.
#> Use `force = TRUE` to force installation
library(data.table)
iris.dt = data.table(iris)[c(1,150)]
melt(iris.dt, measure.vars=measure(part, value.name, sep="."))
#> Species part Length Width
#> 1: setosa Sepal 5.1 3.5
#> 2: virginica Sepal 5.9 3.0
#> 3: setosa Petal 1.4 0.2
#> 4: virginica Petal 5.1 1.8
melt(iris.dt, measure.vars=measure(value.name, dim, sep="."))
#> Species dim Sepal Petal
#> 1: setosa Length 5.1 1.4
#> 2: virginica Length 5.9 5.1
#> 3: setosa Width 3.5 0.2
#> 4: virginica Width 3.0 1.8
melt(iris.dt, measure.vars=measure(part, dim, sep="."))
#> Species part dim value
#> 1: setosa Sepal Length 5.1
#> 2: virginica Sepal Length 5.9
#> 3: setosa Sepal Width 3.5
#> 4: virginica Sepal Width 3.0
#> 5: setosa Petal Length 1.4
#> 6: virginica Petal Length 5.1
#> 7: setosa Petal Width 0.2
#> 8: virginica Petal Width 1.8
I don't have time right now to go through this in any depth, but the user functionality looks fantastic. I think it is intuitive and lets users access the speed/efficiency more simply, without (I assume) any breaking changes.
There are lots of conflicts in tests.Rraw due to the test numbers. When this PR was initiated, the test numbers I used were new, but they now conflict with test numbers added in other PRs that have since been merged into master. Going forward, what is the recommended way to choose new test numbers so as to avoid conflicts with other PRs? I checked the Contributing wiki page but did not see any mention of this issue.
That's a pain point with our test numbering system; we don't really have a workaround. But nothing is needed on your end -- if that's the only conflict, it will be fixed by Matt during merge.
Closes #2551
Closes #2575
Closes #3396
Background: there are a bunch of issues (#3396, #2575, #2551) involving melt on a data.table with column names that each encode more than one piece of information, e.g. the iris column name Sepal.Length encodes a flower part (Sepal) and a measurement dimension (Length).
In these situations we would like output columns with the names shown in parentheses above, but with the current melt the variable column is either the full column name (if there is one output value column) or an integer (if there are multiple output value columns), as in the sketch below.
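For concreteness, a minimal sketch of the status quo, assuming the same two-row iris subset used in the example earlier in this thread:

library(data.table)
iris.dt = data.table(iris)[c(1,150)]

# One output value column: variable holds the full column names,
# e.g. Sepal.Length, which still fuses the two pieces of information.
melt(iris.dt, measure.vars = 1:4)

# Multiple output value columns (a named list of measure columns):
# variable is just an integer index (1 or 2 here).
melt(iris.dt, measure.vars = list(Length = c(1,3), Width = c(2,4)))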
Comparison: tidyr::pivot_longer can be used with its names_sep/names_pattern/names_transform arguments to get a more useful molten/tall output table in these cases.
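For reference, a hedged sketch of tidyr equivalents of the iris examples earlier in the thread (the argument values here are my assumptions; names_sep splits each column name into its pieces, and the special ".value" keyword plays the role of value.name):

library(tidyr)

# Single value column: both name pieces become output columns.
pivot_longer(iris, cols = -Species,
             names_to = c("part", "dim"), names_sep = "[.]")

# Multiple value columns via the special ".value" keyword.
pivot_longer(iris, cols = -Species,
             names_to = c("part", ".value"), names_sep = "[.]")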
Proposal: The goal of this PR is to fix these issues and provide melt feature parity with tidyr::pivot_longer.
The solution involves a new function, measure(), which lets us do this if we want a single output value column, and we can use the special value.name keyword if we want multiple output value columns. Note that in the code referenced above (reconstructed in the sketch below) the output does not include the variable column at all; instead, the more relevant letter/number columns are output.
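The original example code was not captured in this thread; below is a hedged reconstruction with a hypothetical table whose column names a_1, a_2, b_1, b_2 each encode a letter and a number:

library(data.table)
DT = data.table(id = 1:2, a_1 = 3:4, a_2 = 5:6, b_1 = 7:8, b_2 = 9:10)

# Single output value column: the name pieces become letter/number
# columns, and number = as.integer converts that column from character
# to integer (the type conversion feature discussed below).
melt(DT, measure.vars = measure(letter, number = as.integer, sep = "_"))

# Multiple output value columns via the special value.name keyword:
# one value column per letter (a and b); no variable column is output.
melt(DT, measure.vars = measure(value.name, number = as.integer, sep = "_"))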
Some points to discuss:
- Is measure a good name for this new function, or do we want to call it something else?
- The measure function allows specification of type conversion, e.g. the as.integer in the sketch above (similar to the names_transform argument of tidyr::pivot_longer).
- Do we keep the value.name keyword (for consistency with the value.name argument), or do we change it to something else to avoid confusion?
- The measure() function returns a vector/list with a special variable_table attribute, which is a data.table with columns letter/number. There is new fmelt C code which recognizes this attribute and uses it to create the desired output. Is variable_table a good name for this attribute, or shall we call it something else?
- There was previously an internal function do_patterns, and I renamed it to eval_with_cols in order for it to support adding a cols argument to measure as well as patterns. A nice new feature is that it will work with ANY user-provided function that has an argument named cols, e.g. the sketch below.
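The example code for this point was not captured above; here is a hedged sketch of the described feature, where my.measure.vars is a hypothetical user-defined helper and melt is described as filling in its cols argument with the column names of the table being melted:

library(data.table)
DT = data.table(id = 1:2, a_1 = 3:4, a_2 = 5:6)
my.measure.vars = function(cols) {
  # cols is supplied by melt; return the measure columns to use.
  grep("_", cols, value = TRUE)
}
melt(DT, measure.vars = my.measure.vars())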
I know this is a really big PR --- please tell me if there is anything I can do to help make it easier to review.
Thanks.